Table recognition

 

ImageTableDetector

ImageTableDetector is a deep learning model for detecting tables in an image. It is based on CascadeTabNet, which uses a Cascade Mask Region-based CNN with a High-Resolution Network backbone (Cascade Mask R-CNN HRNet).

Input Columns

| Param name | Type | Default | Column Data Description |
| --- | --- | --- | --- |
| inputCol | string | image | image struct (Image schema) |

Parameters

| Param name | Type | Default | Description |
| --- | --- | --- | --- |
| scoreThreshold | float | 0.9 | Score threshold for output regions. |
| applyCorrection | boolean | false | Enable correction of results. |
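
To illustrate what a score threshold such as scoreThreshold does, the sketch below filters candidate regions by confidence in plain Python. It is not the detector's implementation: the `Region` tuple, its field names, and `filter_regions` are hypothetical names used only for this illustration, and treating the threshold as inclusive (`>=`) is an assumption.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    """Hypothetical candidate region (not the library's coordinate schema)."""
    x: float
    y: float
    width: float
    height: float
    score: float


def filter_regions(regions: List[Region], score_threshold: float = 0.9) -> List[Region]:
    """Keep only regions whose score reaches the threshold (assumed inclusive)."""
    return [r for r in regions if r.score >= score_threshold]


candidates = [
    Region(10, 20, 300, 120, score=0.97),
    Region(15, 400, 280, 90, score=0.62),
]
kept = filter_regions(candidates)
print(kept)
```

Lowering the threshold keeps more regions at the cost of more false positives; raising it does the opposite.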

Output Columns

| Param name | Type | Default | Column Data Description |
| --- | --- | --- | --- |
| outputCol | string | table_regions | array of Coordinates (see ocr_structures#coordinate-schema) |

Example:

import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.ocr.transformers._
import com.johnsnowlabs.ocr.OcrContext.implicits._

val imagePath = "path to image"

// Read image file as binary file and convert it to an image column
val df = spark.read
  .format("binaryFile")
  .load(imagePath)
  .asImage("image")

// Define transformer to detect tables
val table_detector = ImageTableDetector
  .pretrained("general_model_table_detection_v2")
  .setInputCol("image")
  .setOutputCol("table_regions")

// Draw the detected regions on the image
val draw_regions = new ImageDrawRegions()
  .setInputCol("image")
  .setInputRegionsCol("table_regions")
  .setOutputCol("image_with_regions")

// asImage already created the image column, so no BinaryToImage stage is needed
val pipeline = new Pipeline()
pipeline.setStages(Array(
  table_detector,
  draw_regions
))

val modelPipeline = pipeline.fit(spark.emptyDataFrame)

val data = modelPipeline.transform(df)

data.storeImage("image_with_regions")

from pyspark.ml import PipelineModel
from sparkocr.transformers import *
from sparkocr.utils import display_images

imagePath = "path to image"

# Read image file as binary file
df = spark.read \
    .format("binaryFile") \
    .load(imagePath)

binary_to_image = BinaryToImage() \
    .setInputCol("content") \
    .setOutputCol("image")

# Define transformer to detect tables
table_detector = ImageTableDetector \
  .pretrained("general_model_table_detection_v2", "en", "clinical/ocr") \
  .setInputCol("image") \
  .setOutputCol("table_regions")

draw_regions = ImageDrawRegions() \
  .setInputCol("image") \
  .setInputRegionsCol("table_regions") \
  .setOutputCol("image_with_regions")


pipeline = PipelineModel(stages=[
    binary_to_image,
    table_detector,
    draw_regions
])

data = pipeline.transform(df)

display_images(data, "image_with_regions")

Output:

image

ImageTableCellDetector

ImageTableCellDetector detects cells in an image that contains a table. It relies on an image processing algorithm that detects horizontal and vertical lines.

The current implementation supports two algorithms for extracting cells:

  • CellDetectionAlgos.CONTOURS works only for bordered tables.
  • CellDetectionAlgos.MORPHOPS works for bordered, borderless and combined tables.
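
The line-based idea behind cell detection can be sketched in plain Python: once the positions of the vertical and horizontal lines are known, each pair of neighbouring lines bounds a cell. This is only an illustration under strong assumptions (a fully bordered table whose lines span the whole table, and boxes laid out as [x, y, width, height]); `cells_from_lines` is a hypothetical helper, not part of the library, and the actual CONTOURS and MORPHOPS implementations are not shown in this document.

```python
from typing import List


def cells_from_lines(xs: List[int], ys: List[int]) -> List[List[List[int]]]:
    """Build rows of [x, y, width, height] boxes from line positions.

    xs: x positions of vertical lines, ys: y positions of horizontal lines.
    """
    xs, ys = sorted(xs), sorted(ys)
    rows = []
    # Each pair of neighbouring horizontal lines bounds one table row
    for top, bottom in zip(ys, ys[1:]):
        # Each pair of neighbouring vertical lines bounds one cell in that row
        row = [[left, top, right - left, bottom - top]
               for left, right in zip(xs, xs[1:])]
        rows.append(row)
    return rows


grid = cells_from_lines(xs=[15, 240, 420], ys=[17, 70, 123])
for row in grid:
    print(row)
```

Three vertical and three horizontal lines yield a 2 x 2 grid of cells. Borderless tables have no such lines to start from, which is why they need a different approach.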

Input Columns

| Param name | Type | Default | Column Data Description |
| --- | --- | --- | --- |
| inputCol | string | image | image struct (Image schema) |

Parameters

| Param name | Type | Default | Description |
| --- | --- | --- | --- |
| algoType | CellDetectionAlgos | CellDetectionAlgos.CONTOURS | Algorithm used to detect cells. |
| algoParams | string | row_treshold=0.05,row_treshold_wide=1.0,row_min_wide=5,column_treshold=0.05,column_treshold_wide=5,column_min_wide=5 | Parameters of the ‘MORPHOPS’ cell detection algorithm. |
| drawDetectedLines | boolean | false | Draw the detected lines on the output image. |
| keepOriginalLines | boolean | false | Keep the original lines on the output image. |
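
The algoParams value is a comma-separated list of key=value pairs. The sketch below shows one way to override a single value and rebuild the string in plain Python. `parse_algo_params` and `format_algo_params` are hypothetical helpers written for this illustration, not library functions; the keys (including their `treshold` spelling) are copied from the default shown above.

```python
from typing import Dict


def parse_algo_params(params: str) -> Dict[str, str]:
    """Split a 'key=value,key=value' string into an ordered dict of strings."""
    parsed = {}
    for item in params.split(","):
        key, sep, value = item.strip().partition("=")
        if sep:  # skip empty or malformed entries
            parsed[key.strip()] = value.strip()
    return parsed


def format_algo_params(params: Dict[str, str]) -> str:
    """Join the pairs back into the 'key=value,key=value' form."""
    return ",".join(f"{key}={value}" for key, value in params.items())


DEFAULTS = ("row_treshold=0.05,row_treshold_wide=1.0,row_min_wide=5,"
            "column_treshold=0.05,column_treshold_wide=5,column_min_wide=5")

params = parse_algo_params(DEFAULTS)
params["row_treshold"] = "0.1"  # override one value, keep the rest
algo_params = format_algo_params(params)
print(algo_params)
```

The resulting string has the same shape as the value passed to setAlgoParams in the Python example for this transformer.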

Output Columns

| Param name | Type | Default | Column Data Description |
| --- | --- | --- | --- |
| outputCol | string | cells | array of coordinates of cells |
| outputImageCol | string | output_image | output image |

Example:

import com.johnsnowlabs.ocr.transformers._
import com.johnsnowlabs.ocr.OcrContext.implicits._

val imagePath = "path to image"

// Read image file as binary file
val df = spark.read
  .format("binaryFile")
  .load(imagePath)
  .asImage("image")

// Define transformer to detect cells
val transformer = new ImageTableCellDetector()
  .setInputCol("image")
  .setOutputCol("cells")

val data = transformer.transform(df)

data.select("cells").show()

from pyspark.ml import PipelineModel
from sparkocr.transformers import *

imagePath = "path to image"

# Read image file as binary file
df = spark.read \
    .format("binaryFile") \
    .load(imagePath)

binary_to_image = BinaryToImage() \
    .setInputCol("content") \
    .setOutputCol("image")

# Define transformer to detect cells
transformer = ImageTableCellDetector() \
  .setInputCol("image") \
  .setOutputCol("cells") \
  .setAlgoParams("row_treshold=0.05")

pipeline = PipelineModel(stages=[
    binary_to_image,
    transformer
])

data = pipeline.transform(df)
data.select("cells").show()

Image:

image

Output:

+----------------------------------------------------------+
| cells                                                    |
+----------------------------------------------------------+
| [[[[15, 17, 224, 53]], [[241, 17, 179, 53]], [[423, 17,  |
| 194, 53]], [[619, 17, 164, 53]] ....                     |
+----------------------------------------------------------+

ImageCellsToTextTable

ImageCellsToTextTable runs OCR on the cell regions of an image and returns the recognized text in outputCol as a TableContainer structure.

Input Columns

| Param name | Type | Default | Column Data Description |
| --- | --- | --- | --- |
| inputCol | string | image | image struct (Image schema) |
| cellsCol | string | cells | Array of cells |

Parameters

| Param name | Type | Default | Description |
| --- | --- | --- | --- |
| strip | bool | true | Strip output text. |
| margin | int | 1 | Margin of cells in pixels. |
| pageSegMode | PageSegmentationMode | AUTO | Page segmentation mode. |
| ocrEngineMode | EngineMode | LSTM_ONLY | OCR engine mode. |
| language | Language | Language.ENG | Language. |
| ocrParams | array of strings | [] | Array of OCR params in key=value format. |
| pdfCoordinates | bool | false | Transform coordinates in positions to PDF points. |
| modelData | string | | Path to the local model data. |
| modelType | ModelType | ModelType.BASE | Model type. |
| downloadModelData | bool | false | Download model data from JSL S3. |
| outputFormat | TableOutputFormat | TableOutputFormat.TABLE | Output format. |
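
As background for pdfCoordinates, a PDF point is 1/72 of an inch, so a length in pixels can be scaled to points once the image resolution is known. The sketch below shows only that scaling. The 300 dpi resolution, the [x, y, width, height] box layout, and both function names are assumptions for illustration; the library may apply further adjustments (for example to the page origin) that are not covered here.

```python
from typing import List


def pixels_to_points(value_px: float, dpi: float) -> float:
    """Convert a pixel length to PDF points (1 point = 1/72 inch)."""
    return value_px * 72.0 / dpi


def box_to_points(box: List[float], dpi: float) -> List[float]:
    """Scale every component of an assumed [x, y, width, height] box."""
    return [pixels_to_points(v, dpi) for v in box]


# Assumed resolution of the source image; use the real value for your data
box_pt = box_to_points([15, 17, 224, 53], dpi=300)
print(box_pt)
```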

Output Columns

Param name Type Default Column Data Description
outputCol string table Recognized text as TableContainer

Example:

import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.functions.{col, explode}
import com.johnsnowlabs.ocr.transformers._
import com.johnsnowlabs.ocr.OcrContext.implicits._

val imagePath = "path to image"

// Read image file as binary file
val df = spark.read
  .format("binaryFile")
  .load(imagePath)
  .asImage("image")

// Define transformer to detect cells
val cell_detector = new ImageTableCellDetector()
  .setInputCol("image")
  .setOutputCol("cells")

val table_recognition = new ImageCellsToTextTable()
  .setInputCol("image")
  .setOutputCol("tables")
  .setMargin(2)

// Define pipeline
val pipeline = new Pipeline()
pipeline.setStages(Array(
  cell_detector,
  table_recognition
))

val modelPipeline = pipeline.fit(spark.emptyDataFrame)

val results = modelPipeline.transform(df)

results.select("tables")
  .withColumn("cells", explode(col("tables.chunks")))
  .select((0 until 7).map(i => col("cells")(i).getField("chunkText").alias(s"col$i")): _*)
  .show(false)

from pyspark.ml import PipelineModel
import pyspark.sql.functions as f
from sparkocr.transformers import *
from sparkocr.enums import *

imagePath = "path to image"

# Read image file as binary file
df = spark.read \
    .format("binaryFile") \
    .load(imagePath)

binary_to_image = BinaryToImage()
binary_to_image.setImageType(ImageType.TYPE_BYTE_GRAY)
binary_to_image.setInputCol("content")

cell_detector = ImageTableCellDetector()
cell_detector.setInputCol("image")
cell_detector.setOutputCol("cells")
cell_detector.setKeepInput(True)

table_recognition = ImageCellsToTextTable()
table_recognition.setInputCol("image")
table_recognition.setCellsCol('cells')
table_recognition.setMargin(2)
table_recognition.setStrip(True)
table_recognition.setOutputCol('table')

pipeline = PipelineModel(stages=[
    binary_to_image,
    cell_detector,
    table_recognition
])

results = pipeline.transform(df)

results.select("table") \
    .withColumn("cells", f.explode(f.col("table.chunks"))) \
    .select([f.col("cells")[i].getField("chunkText").alias(f"col{i}") for i in
             range(0, 7)]) \
    .show(20, False)

Image:

image

Output:

+----------+-------+--------+------+-----+---------+-------+
|col0      |col1   |col2    |col3  |col4 |col5     |col6   |
+----------+-------+--------+------+-----+---------+-------+
|Order Date|Region |Rep     |Item  |Units|Unit Cost|Total  |
|1/23/10   |Ontario|Kivell  |Binder|50   |$19.99   |$999.50|
|2/9/10    |Ontario|Jardine |Pencil|36   |$4.99    |$179.64|
|2/26/10   |Ontario|Gill    |Pen   |27   |$19.99   |$539.73|
|3/15/10   |Alberta|Sorvino |Pencil|56   |$2.99    |$167.44|
|4/1/10    |Quebec |Jones   |Binder|60   |$4.99    |$299.40|
|4/18/10   |Ontario|Andrews |Pencil|75   |$1.99    |$149.25|
|5/5/10    |Ontario|Jardine |Pencil|90   |$4.99    |$449.10|
|5/22/10   |Alberta|Thompson|Pencil|32   |$1.99    |$63.68 |
+----------+-------+--------+------+-----+---------+-------+