ImageTableDetector
ImageTableDetector is a DL model for detecting tables in images.
It is based on CascadeTabNet, which uses a Cascade Mask Region-based CNN with a High-Resolution Network backbone (Cascade Mask R-CNN HRNet).
Input Columns
| Param name | Type | Default | Column Data Description | 
|---|---|---|---|
| inputCol | string | image | image struct (Image schema) | 
Parameters
| Param name | Type | Default | Description | 
|---|---|---|---|
| scoreThreshold | float | 0.9 | Score threshold for output regions. | 
| applyCorrection | boolean | false | Enable correction of results. | 
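Both parameters can be set on the pretrained model before it is used in a pipeline. A minimal sketch, assuming setter names that mirror the parameter names (setScoreThreshold, setApplyCorrection):
from sparkocr.transformers import ImageTableDetector
# Keep only regions scored at 0.8 or higher and enable correction of the results.
# setScoreThreshold / setApplyCorrection are assumed setter names for the parameters above.
table_detector = ImageTableDetector \
    .pretrained("general_model_table_detection_v2", "en", "clinical/ocr") \
    .setInputCol("image") \
    .setOutputCol("table_regions") \
    .setScoreThreshold(0.8) \
    .setApplyCorrection(True)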
Output Columns
| Param name | Type | Default | Column Data Description | 
|---|---|---|---|
| outputCol | string | table_regions | array of [Coordinates](ocr_structures#coordinate-schema) | 
Example:
import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.ocr.transformers._
import com.johnsnowlabs.ocr.OcrContext.implicits._
val imagePath = "path to image"
// Read image file as binary file
val df = spark.read
  .format("binaryFile")
  .load(imagePath)
  .asImage("image")
// Define transformer for table detection
val table_detector = ImageTableDetector
  .pretrained("general_model_table_detection_v2")
  .setInputCol("image")
  .setOutputCol("table_regions")
val draw_regions = new ImageDrawRegions()
  .setInputCol("image")
  .setInputRegionsCol("table_regions")
  .setOutputCol("image_with_regions")
// Define pipeline
val pipeline = new Pipeline()
pipeline.setStages(Array(
  table_detector,
  draw_regions
))
val modelPipeline = pipeline.fit(df)
val data = modelPipeline.transform(df)
data.storeImage("image_with_regions")
from pyspark.ml import PipelineModel
from sparkocr.transformers import *
imagePath = "path to image"
# Read image file as binary file
df = spark.read \
    .format("binaryFile") \
    .load(imagePath)
binary_to_image = BinaryToImage() \
    .setInputCol("content") \
    .setOutputCol("image")
# Define transformer for table detection
table_detector = ImageTableDetector \
  .pretrained("general_model_table_detection_v2", "en", "clinical/ocr") \
  .setInputCol("image") \
  .setOutputCol("table_regions")
draw_regions = ImageDrawRegions() \
  .setInputCol("image") \
  .setInputRegionsCol("table_regions") \
  .setOutputCol("image_with_regions")
pipeline = PipelineModel(stages=[
    binary_to_image,
    table_detector,
    draw_regions
])
data = pipeline.transform(df)
display_images(data, "image_with_regions")
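Besides rendering the regions with display_images, the raw coordinates can also be inspected directly with standard Spark calls; a small sketch:
# Show the detected table regions (an array of Coordinates structs) for each image.
data.select("table_regions").show(truncate=False)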
Output:

ImageTableCellDetector
ImageTableCellDetector detects cells in a table image. It is based on an
image-processing algorithm that detects horizontal and vertical lines.
The current implementation supports a few algorithms for extracting cells (a configuration sketch follows the list):
- CellDetectionAlgos.CONTOURS works only for bordered tables.
- CellDetectionAlgos.MORPHOPS works for bordered, borderless and combined tables.
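Which algorithm is used is controlled by the algoType parameter. A minimal sketch, assuming a setAlgoType setter for that parameter and that the CellDetectionAlgos enum is importable from sparkocr.enums:
from sparkocr.transformers import ImageTableCellDetector
from sparkocr.enums import CellDetectionAlgos  # assumed location of the enum
# Use MORPHOPS, which also handles borderless and combined tables.
# setAlgoType is an assumed setter name for the algoType parameter.
cell_detector = ImageTableCellDetector() \
    .setInputCol("image") \
    .setOutputCol("cells") \
    .setAlgoType(CellDetectionAlgos.MORPHOPS)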
Input Columns
| Param name | Type | Default | Column Data Description | 
|---|---|---|---|
| inputCol | string | image | image struct (Image schema) | 
Parameters
| Param name | Type | Default | Description | 
|---|---|---|---|
| algoType | CellDetectionAlgos | CellDetectionAlgos.CONTOURS | Algorithm for cell detection. | 
| algoParams | string | row_treshold=0.05, row_treshold_wide=1.0, row_min_wide=5, column_treshold=0.05, column_treshold_wide=5, column_min_wide=5 | Parameters of the 'MORPHOPS' cell detection algorithm. | 
| drawDetectedLines | boolean | false | Draw detected lines on the output image. | 
| keepOriginalLines | boolean | false | Keep the original image content in the output image. | 
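For debugging the line detection itself, the detected lines can be drawn onto the output image. A minimal sketch, assuming setter names that mirror the drawDetectedLines, keepOriginalLines, and outputImageCol parameters above:
# Draw the detected horizontal and vertical lines on top of the original image content.
# setDrawDetectedLines / setKeepOriginalLines / setOutputImageCol are assumed setter names.
cell_detector = ImageTableCellDetector() \
    .setInputCol("image") \
    .setOutputCol("cells") \
    .setOutputImageCol("output_image") \
    .setDrawDetectedLines(True) \
    .setKeepOriginalLines(True)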
Output Columns
| Param name | Type | Default | Column Data Description | 
|---|---|---|---|
| outputCol | string | cells | array of coordinates of cells | 
| outputImageCol | string | output_image | output image | 
Example:
import com.johnsnowlabs.ocr.transformers._
import com.johnsnowlabs.ocr.OcrContext.implicits._
val imagePath = "path to image"
// Read image file as binary file
val df = spark.read
  .format("binaryFile")
  .load(imagePath)
  .asImage("image")
// Define transformer for cell detection
val transformer = new ImageTableCellDetector()
  .setInputCol("image")
  .setOutputCol("cells")
val data = transformer.transform(df)
data.select("cells").show()
from pyspark.ml import PipelineModel
from sparkocr.transformers import *
imagePath = "path to image"
# Read image file as binary file
df = spark.read \
    .format("binaryFile") \
    .load(imagePath)
binary_to_image = BinaryToImage() \
    .setInputCol("content") \
    .setOutputCol("image")
# Define transformer for cell detection
transformer = ImageTableCellDetector() \
  .setInputCol("image") \
  .setOutputCol("cells") \
  .setAlgoParams("row_treshold=0.05")
pipeline = PipelineModel(stages=[
    binary_to_image,
    transformer
])
data = pipeline.transform(df)
data.select("cells").show()
Image:

Output:
+----------------------------------------------------------+
| cells                                                    |
+----------------------------------------------------------+
|[[[[15, 17, 224, 53]], [[241, 17, 179, 53]], [[423, 17,  |
| 194, 53]], [[619, 17, 164, 53]] ....                     |
+----------------------------------------------------------+
ImageCellsToTextTable
ImageCellsToTextTable runs OCR on the cell regions of an image and returns the recognized text
to outputCol as a TableContainer structure.
Input Columns
| Param name | Type | Default | Column Data Description | 
|---|---|---|---|
| inputCol | string | image | image struct (Image schema) | 
| cellsCol | string | cells | Array of cells | 
Parameters
| Param name | Type | Default | Description | 
|---|---|---|---|
| strip | bool | true | Strip output text. | 
| margin | integer | 1 | Margin of cells in pixels. | 
| pageSegMode | PageSegmentationMode | AUTO | Page segmentation mode. | 
| ocrEngineMode | EngineMode | LSTM_ONLY | OCR engine mode. | 
| language | Language | Language.ENG | Language. | 
| ocrParams | array of strings | [] | Array of OCR params in key=value format. | 
| pdfCoordinates | bool | false | Transform coordinates in positions to PDF points. | 
| modelData | string | | Path to the local model data. | 
| modelType | ModelType | ModelType.BASE | Model type | 
| downloadModelData | bool | false | Download model data from JSL S3 | 
| outputFormat | TableOutputFormat | TableOutputFormat.TABLE | Output format | 
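The OCR stage can be tuned with the Tesseract-related parameters above. A minimal sketch, assuming setter names that mirror the parameter names (setPageSegMode, setLanguage, setOcrParams) and that the corresponding enums are importable from sparkocr.enums:
from sparkocr.transformers import ImageCellsToTextTable
from sparkocr.enums import PageSegmentationMode, Language  # assumed location of the enums
# Recognize cell text with explicit OCR settings; unspecified parameters keep the defaults above.
table_recognition = ImageCellsToTextTable() \
    .setInputCol("image") \
    .setCellsCol("cells") \
    .setOutputCol("table") \
    .setPageSegMode(PageSegmentationMode.AUTO) \
    .setLanguage(Language.ENG) \
    .setOcrParams(["preserve_interword_spaces=1"])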
Output Columns
| Param name | Type | Default | Column Data Description | 
|---|---|---|---|
| outputCol | string | table | Recognized text as TableContainer | 
Example:
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.functions.{col, explode}
import com.johnsnowlabs.ocr.transformers._
import com.johnsnowlabs.ocr.OcrContext.implicits._
val imagePath = "path to image"
// Read image file as binary file
val df = spark.read
  .format("binaryFile")
  .load(imagePath)
  .asImage("image")
// Define transformer for cell detection
val cell_detector = new ImageTableCellDetector()
  .setInputCol("image")
  .setOutputCol("cells")
val table_recognition = new ImageCellsToTextTable()
  .setInputCol("image")
  .setOutputCol("tables")
  .setMargin(2)
// Define pipeline
val pipeline = new Pipeline()
pipeline.setStages(Array(
  cell_detector,
  table_recognition
))
val modelPipeline = pipeline.fit(spark.emptyDataFrame)
val results = modelPipeline.transform(df)
results.select("tables")
  .withColumn("cells", explode(col("tables.chunks")))
  .select((0 until 7).map(i => col("cells")(i).getField("chunkText").alias(s"col$i")): _*)
  .show(false)
from pyspark.ml import PipelineModel
import pyspark.sql.functions as f
from sparkocr.transformers import *
imagePath = "path to image"
# Read image file as binary file
df = spark.read \
    .format("binaryFile") \
    .load(imagePath)
binary_to_image = BinaryToImage()
binary_to_image.setImageType(ImageType.TYPE_BYTE_GRAY)
binary_to_image.setInputCol("content")
cell_detector = ImageTableCellDetector()
cell_detector.setInputCol("image")
cell_detector.setOutputCol("cells")
cell_detector.setKeepInput(True)
table_recognition = ImageCellsToTextTable()
table_recognition.setInputCol("image")
table_recognition.setCellsCol('cells')
table_recognition.setMargin(2)
table_recognition.setStrip(True)
table_recognition.setOutputCol('table')
pipeline = PipelineModel(stages=[
    binary_to_image,
    cell_detector,
    table_recognition
])
result = pipeline.transform(df)
results.select("table") \
    .withColumn("cells", f.explode(f.col("table.chunks"))) \
    .select([f.col("cells")[i].getField("chunkText").alias(f"col{i}") for i in
             range(0, 7)]) \
    .show(20, False)
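For downstream processing it can be handy to pull the recognized cells into a pandas DataFrame on the driver. A small sketch built on the same chunks/chunkText fields used in the query above (the fixed range of seven columns matches this particular table):
import pyspark.sql.functions as f
# Collect the recognized table into a pandas DataFrame (runs on the driver).
pdf = result.select(f.explode(f.col("table.chunks")).alias("cells")) \
    .select([f.col("cells")[i].getField("chunkText").alias(f"col{i}") for i in range(0, 7)]) \
    .toPandas()
print(pdf.head())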
Image:

Output:
+----------+-------+--------+------+-----+---------+-------+
|col0      |col1   |col2    |col3  |col4 |col5     |col6   |
+----------+-------+--------+------+-----+---------+-------+
|Order Date|Region |Rep     |Item  |Units|Unit Cost|Total  |
|1/23/10   |Ontario|Kivell  |Binder|50   |$19.99   |$999.50|
|2/9/10    |Ontario|Jardine |Pencil|36   |$4.99    |$179.64|
|2/26/10   |Ontario|Gill    |Pen   |27   |$19.99   |$539.73|
|3/15/10   |Alberta|Sorvino |Pencil|56   |$2.99    |$167.44|
|4/1/10    |Quebec |Jones   |Binder|60   |$4.99    |$299.40|
|4/18/10   |Ontario|Andrews |Pencil|75   |$1.99    |$149.25|
|5/5/10    |Ontario|Jardine |Pencil|90   |$4.99    |$449.10|
|5/22/10   |Alberta|Thompson|Pencil|32   |$1.99    |$63.68 |
+----------+-------+--------+------+-----+---------+-------+