hOCR for table recognition

Description

Table structure recognition based on hOCR output from a Tesseract-based OCR architecture.

Tesseract has been trained on a variety of datasets to improve its recognition capabilities, including images of text in many languages and scripts, with different font styles, sizes, and orientations. Training consists of feeding the engine a large number of images together with their corresponding ground-truth text, allowing it to learn the patterns and characteristics of different text styles. One of the most important datasets used in training Tesseract is the UNLV dataset, which contains over 400,000 images of text in different languages, scripts, and font styles; it is widely used in the OCR community and has been instrumental in improving Tesseract's accuracy. Other datasets used in training Tesseract include the ICDAR dataset, the IIIT-HWS dataset, and the RRC-GV-WS dataset.

In addition to these datasets, Tesseract also uses a technique called adaptive training, where the engine can continuously improve its recognition capabilities by learning from new images and text. This allows Tesseract to adapt to new text styles and languages, and improve its overall accuracy.
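Each stage of the pipeline below that emits or consumes hOCR works with an XHTML document in which every recognized word carries its bounding box in a `title` attribute. A minimal standard-library sketch of reading that structure (the sample hOCR string is illustrative, not actual Tesseract output):

```python
# Minimal sketch: extracting word boxes from an hOCR document using only
# the Python standard library. The sample string below is illustrative,
# not real Tesseract output.
import re
import xml.etree.ElementTree as ET

HOCR = """<div class='ocr_page'>
  <span class='ocrx_word' title='bbox 10 10 60 30; x_wconf 95'>TOTAL</span>
  <span class='ocrx_word' title='bbox 70 10 130 30; x_wconf 92'>63,020</span>
</div>"""

def parse_hocr_words(hocr: str):
    """Return (text, (x0, y0, x1, y1)) for every ocrx_word element."""
    root = ET.fromstring(hocr)
    words = []
    for span in root.iter("span"):
        if span.get("class") != "ocrx_word":
            continue
        m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", span.get("title", ""))
        if m:
            words.append((span.text, tuple(int(v) for v in m.groups())))
    return words

print(parse_hocr_words(HOCR))
# [('TOTAL', (10, 10, 60, 30)), ('63,020', (70, 10, 130, 30))]
```

These per-word boxes are what allow table structure to be recovered downstream: cells are located by coordinates, not by reading order.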

Predicted Entities


How to use

from pyspark.ml import PipelineModel
from sparkocr.transformers import *
from sparkocr.enums import *  # ImageType, OcrOutputFormat, Color

binary_to_image = BinaryToImage() \
    .setInputCol("content") \
    .setOutputCol("image")

table_detector = ImageTableDetector.pretrained("general_model_table_detection_v2", "en", "clinical/ocr") \
    .setInputCol("image") \
    .setOutputCol("table_regions")

splitter = ImageSplitRegions() \
    .setInputCol("image") \
    .setInputRegionsCol("table_regions") \
    .setOutputCol("table_image") \
    .setDropCols("image") \
    .setImageType(ImageType.TYPE_BYTE_GRAY) \
    .setExplodeCols([])

text_detector = ImageTextDetectorV2.pretrained("image_text_detector_v2", "en", "clinical/ocr") \
    .setInputCol("image") \
    .setOutputCol("text_regions") \
    .setWithRefiner(True)

draw_regions = ImageDrawRegions() \
    .setInputCol("image") \
    .setInputRegionsCol("text_regions") \
    .setOutputCol("image_with_regions") \
    .setRectColor(Color.green) \
    .setRotated(True)

img_to_hocr = ImageToTextV2.pretrained("ocr_small_printed", "en", "clinical/ocr") \
    .setInputCols(["image", "text_regions"]) \
    .setUsePandasUdf(False) \
    .setOutputFormat(OcrOutputFormat.HOCR) \
    .setOutputCol("hocr") \
    .setGroupImages(False)

hocr_to_table = HocrToTextTable() \
    .setInputCol("hocr") \
    .setRegionCol("table_regions") \
    .setOutputCol("tables")

pipeline = PipelineModel(stages=[
    binary_to_image,
    table_detector,
    splitter,
    text_detector,
    draw_regions,
    img_to_hocr,
    hocr_to_table
])

imagePath = "data/tab_images_hocr_1/table4_1.jpg"
image_df = spark.read.format("binaryFile").load(imagePath)

result = pipeline.transform(image_df).cache()
val binary_to_image = new BinaryToImage() 
    .setInputCol("content")  
    .setOutputCol("image") 

val table_detector = ImageTableDetector
    .pretrained("general_model_table_detection_v2", "en", "clinical/ocr")
    .setInputCol("image")
    .setOutputCol("table_regions")

val splitter = new ImageSplitRegions() 
    .setInputCol("image") 
    .setInputRegionsCol("table_regions") 
    .setOutputCol("table_image") 
    .setDropCols("image") 
    .setImageType(ImageType.TYPE_BYTE_GRAY) 
    .setExplodeCols(Array())

val text_detector = ImageTextDetectorV2
    .pretrained("image_text_detector_v2", "en", "clinical/ocr")
    .setInputCol("image")
    .setOutputCol("text_regions")
    .setWithRefiner(true)

val draw_regions = new ImageDrawRegions() 
    .setInputCol("image") 
    .setInputRegionsCol("text_regions") 
    .setOutputCol("image_with_regions") 
    .setRectColor(Color.green) 
    .setRotated(true)

val img_to_hocr = ImageToTextV2
    .pretrained("ocr_small_printed", "en", "clinical/ocr")
    .setInputCols(Array("image", "text_regions"))
    .setOutputFormat(OcrOutputFormat.HOCR)
    .setOutputCol("hocr")
    .setGroupImages(false)

val hocr_to_table = new HocrToTextTable() 
    .setInputCol("hocr") 
    .setRegionCol("table_regions") 
    .setOutputCol("tables")

val pipeline = new Pipeline().setStages(Array(
    binary_to_image,
    table_detector,
    splitter,
    text_detector,
    draw_regions,
    img_to_hocr,
    hocr_to_table))

val imagePath = "data/tab_images_hocr_1/table4_1.jpg"
val image_df = spark.read.format("binaryFile").load(imagePath)

val result = pipeline.fit(image_df).transform(image_df).cache()
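Conceptually, the HocrToTextTable stage at the end of the pipeline assigns each recognized word box to a table cell by its coordinates relative to the detected table region. A rough standalone illustration of that idea in plain Python (the coordinates and column edges are invented for the sketch; this is not the library's actual algorithm):

```python
# Rough illustration of coordinate-based cell assignment, the idea behind
# HocrToTextTable. Boxes and column edges are invented for the sketch.
from collections import defaultdict

# (text, x0, y0) for a few word boxes inside a detected table region.
words = [
    ("MATERIAL", 200, 10), ("LABOR", 400, 10), ("TOTAL", 600, 10),
    ("BUILDINGS", 0, 50), ("29,380", 200, 50),
    ("33,640", 400, 50), ("63,020", 600, 50),
]

def to_grid(words, col_edges, row_height):
    """Bucket each word into a (row, col) cell by comparing x0 to the
    column edges and dividing y0 by an assumed uniform row height."""
    grid = defaultdict(list)
    for text, x0, y0 in words:
        col = sum(1 for edge in col_edges if x0 >= edge)
        row = y0 // row_height
        grid[(row, col)].append(text)
    return grid

grid = to_grid(words, col_edges=[150, 350, 550], row_height=40)
print(grid[(0, 3)])  # header cell in the last column -> ['TOTAL']
```

In the real pipeline the column and row boundaries come from the detected text regions rather than fixed edges, which is why accurate text detection (ImageTextDetectorV2) matters for table quality.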

Example

Input image

Screenshot

Output image

Screenshot

Output text

text_regions	table_image	pagenum	modificationTime	path	table_regions	length	image	image_with_regions	hocr	tables	exception	table_index
[{0, 0, 566.32025...	{file:/content/ta...	0	2023-01-23 08:21:...	file:/content/tab...	{0, 0, 40.0, 0.0,...	172124	{file:/content/ta...	{file:/content/ta...	<?xml version="1....	{0, 0, 0.0, 0.0,...	null	0
Filename: table4_1.jpg
Page:     0
Table:    0
col0	col1	col2	col3
0		MATERIAL	LABOR	TOTAL
1	SURFACE FACILITIES	None	None	None
2	BUILDINGS AND STRUCTURES	29,380	33,640	63,020
3	MAJOR EQUIPMENT	46,350	4,570	50,920
4	BULK MATERIAL	29,040	16,410	45,450
5	SITE DEVELOPMENT	7,570	4,730	12,300
6	SHAFTS AND HOISTS	None	None	None
7	MAJOR EQUIPMENT	24,500	8,300	32,800
8	SHAFTS AND LINING	58,100	31,400	89,500
9	UNDERGROUND FACILITIES	None	None	None
10	EXCAVATIONS AND STRUCTURES	2,510	4,510	7,020
11	MAJOR EQUIPMENT	3,170	220	3,390
12	BULK MATERIAL	1,960	1,470	3,430
13	MINING	None	None	None
14	MAJOR EQUIPMENT	64,700		64,700
15	MINE CONSTRUCTION	582,330	655,640	1,237,970
16	BACKFULLING	None	None	None
17	MINE BACKFILLING	102,300	116,000	218,300
18	SHAFT SEALING	90	710	200
19	TOTAL FIELD COSTS	952.000	877.000	1,829,000
20	ARCHITECT-ENGINEER SERVICES			53,000
21	OWNER'S COSTS			218,000
22	CONTINGENCY			534.0001
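The extracted cells are plain strings, and the output above shows typical OCR artifacts (e.g. "952.000", where a thousands comma was misread as a period). A small hypothetical cleanup helper for downstream numeric analysis; it is not part of Visual NLP:

```python
# Hypothetical helper for normalizing OCR'd numbers such as "29,380" or
# "952.000" (comma misread as a period). Not part of Visual NLP.
def parse_ocr_number(cell: str):
    if not cell:
        return None
    cleaned = cell.replace(",", "").strip()
    # Treat a period followed by exactly three digits as a misread
    # thousands separator rather than a decimal point.
    if "." in cleaned:
        head, _, tail = cleaned.partition(".")
        if len(tail) == 3 and tail.isdigit():
            cleaned = head + tail
    try:
        return int(cleaned)
    except ValueError:
        return None

print(parse_ocr_number("29,380"))   # 29380
print(parse_ocr_number("952.000"))  # 952000
print(parse_ocr_number(""))         # None
```

Ambiguous cells (such as "534.0001" above) fall through to None rather than being silently coerced, so they can be flagged for manual review.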

Model Information

Model Name: hocr_table_recognition
Type: ocr
Compatibility: Visual NLP 4.2.4+
License: Licensed
Edition: Official
Language: en