hOCR for table recognition

Description

Table structure recognition based on hOCR output from a Tesseract-based OCR architecture.

Tesseract has been trained on a variety of datasets to improve its recognition capabilities, including images of text in many languages and scripts, with different font styles, sizes, and orientations. Training consists of feeding the engine a large number of images together with their corresponding ground-truth text, allowing it to learn the patterns and characteristics of different text styles. One of the most important datasets used in training Tesseract is the UNLV dataset, which contains over 400,000 images of text in different languages, scripts, and font styles; it is widely used in the OCR community and has been instrumental in improving Tesseract's accuracy. Other datasets used in training Tesseract include the ICDAR dataset, the IIIT-HWS dataset, and the RRC-GV-WS dataset.

In addition to these datasets, Tesseract also uses a technique called adaptive training, where the engine can continuously improve its recognition capabilities by learning from new images and text. This allows Tesseract to adapt to new text styles and languages, and improve its overall accuracy.
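Each stage of the pipeline below that emits or consumes hOCR works with an XHTML document in which every recognized word carries its bounding box in a `title` attribute. A minimal standard-library sketch of reading that structure (the sample hOCR string is illustrative, not actual Tesseract output):

```python
# Minimal sketch: extracting word boxes from an hOCR document using only
# the Python standard library. The sample string below is illustrative,
# not real Tesseract output.
import re
import xml.etree.ElementTree as ET

HOCR = """<div class='ocr_page'>
  <span class='ocrx_word' title='bbox 10 10 60 30; x_wconf 95'>TOTAL</span>
  <span class='ocrx_word' title='bbox 70 10 130 30; x_wconf 92'>63,020</span>
</div>"""

def parse_hocr_words(hocr: str):
    """Return (text, (x0, y0, x1, y1)) for every ocrx_word element."""
    root = ET.fromstring(hocr)
    words = []
    for span in root.iter("span"):
        if span.get("class") != "ocrx_word":
            continue
        m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", span.get("title", ""))
        if m:
            words.append((span.text, tuple(int(v) for v in m.groups())))
    return words

print(parse_hocr_words(HOCR))
# [('TOTAL', (10, 10, 60, 30)), ('63,020', (70, 10, 130, 30))]
```

These per-word boxes are what allow table structure to be recovered downstream: cells are located by coordinates, not by reading order.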

Predicted Entities


How to use

from pyspark.ml import PipelineModel
from sparkocr.transformers import *
from sparkocr.enums import *  # ImageType, OcrOutputFormat, Color

binary_to_image = BinaryToImage() \
    .setInputCol("content") \
    .setOutputCol("image")

table_detector = ImageTableDetector.pretrained("general_model_table_detection_v2", "en", "clinical/ocr") \
    .setInputCol("image") \
    .setOutputCol("table_regions")

splitter = ImageSplitRegions() \
    .setInputCol("image") \
    .setInputRegionsCol("table_regions") \
    .setOutputCol("table_image") \
    .setDropCols("image") \
    .setImageType(ImageType.TYPE_BYTE_GRAY) \
    .setExplodeCols([])

text_detector = ImageTextDetectorV2.pretrained("image_text_detector_v2", "en", "clinical/ocr") \
    .setInputCol("image") \
    .setOutputCol("text_regions") \
    .setWithRefiner(True)

draw_regions = ImageDrawRegions() \
    .setInputCol("image") \
    .setInputRegionsCol("text_regions") \
    .setOutputCol("image_with_regions") \
    .setRectColor(Color.green) \
    .setRotated(True)

img_to_hocr = ImageToTextV2.pretrained("ocr_small_printed", "en", "clinical/ocr") \
    .setInputCols(["image", "text_regions"]) \
    .setUsePandasUdf(False) \
    .setOutputFormat(OcrOutputFormat.HOCR) \
    .setOutputCol("hocr") \
    .setGroupImages(False)

hocr_to_table = HocrToTextTable() \
    .setInputCol("hocr") \
    .setRegionCol("table_regions") \
    .setOutputCol("tables")

pipeline = PipelineModel(stages=[
    binary_to_image,
    table_detector,
    splitter,
    text_detector,
    draw_regions,
    img_to_hocr,
    hocr_to_table
])

imagePath = "data/tab_images_hocr_1/table4_1.jpg"
image_df = spark.read.format("binaryFile").load(imagePath)

result = pipeline.transform(image_df).cache()
val binary_to_image = new BinaryToImage() 
    .setInputCol("content")  
    .setOutputCol("image") 

val table_detector = ImageTableDetector
    .pretrained("general_model_table_detection_v2", "en", "clinical/ocr")
    .setInputCol("image")
    .setOutputCol("table_regions")

val splitter = new ImageSplitRegions() 
    .setInputCol("image") 
    .setInputRegionsCol("table_regions") 
    .setOutputCol("table_image") 
    .setDropCols("image") 
    .setImageType(ImageType.TYPE_BYTE_GRAY) 
    .setExplodeCols(Array())

val text_detector = ImageTextDetectorV2
    .pretrained("image_text_detector_v2", "en", "clinical/ocr")
    .setInputCol("image")
    .setOutputCol("text_regions")
    .setWithRefiner(true)

val draw_regions = new ImageDrawRegions() 
    .setInputCol("image") 
    .setInputRegionsCol("text_regions") 
    .setOutputCol("image_with_regions") 
    .setRectColor(Color.green) 
    .setRotated(true)

val img_to_hocr = ImageToTextV2
    .pretrained("ocr_small_printed", "en", "clinical/ocr")
    .setInputCols(Array("image", "text_regions"))
    .setOutputFormat(OcrOutputFormat.HOCR)
    .setOutputCol("hocr")
    .setGroupImages(false)

val hocr_to_table = new HocrToTextTable() 
    .setInputCol("hocr") 
    .setRegionCol("table_regions") 
    .setOutputCol("tables")

val pipeline = new Pipeline().setStages(Array(
    binary_to_image,
    table_detector,
    splitter,
    text_detector,
    draw_regions,
    img_to_hocr,
    hocr_to_table))

val imagePath = "data/tab_images_hocr_1/table4_1.jpg"
val image_df = spark.read.format("binaryFile").load(imagePath)

val result = pipeline.fit(image_df).transform(image_df).cache()
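Conceptually, the HocrToTextTable stage at the end of the pipeline assigns each recognized word box to a table cell by its coordinates relative to the detected table region. A rough standalone illustration of that idea in plain Python (the coordinates and column edges are invented for the sketch; this is not the library's actual algorithm):

```python
# Rough illustration of coordinate-based cell assignment, the idea behind
# HocrToTextTable. Boxes and column edges are invented for the sketch.
from collections import defaultdict

# (text, x0, y0) for a few word boxes inside a detected table region.
words = [
    ("MATERIAL", 200, 10), ("LABOR", 400, 10), ("TOTAL", 600, 10),
    ("BUILDINGS", 0, 50), ("29,380", 200, 50),
    ("33,640", 400, 50), ("63,020", 600, 50),
]

def to_grid(words, col_edges, row_height):
    """Bucket each word into a (row, col) cell by comparing x0 to the
    column edges and dividing y0 by an assumed uniform row height."""
    grid = defaultdict(list)
    for text, x0, y0 in words:
        col = sum(1 for edge in col_edges if x0 >= edge)
        row = y0 // row_height
        grid[(row, col)].append(text)
    return grid

grid = to_grid(words, col_edges=[150, 350, 550], row_height=40)
print(grid[(0, 3)])  # header cell in the last column -> ['TOTAL']
```

In the real pipeline the column and row boundaries come from the detected text regions rather than fixed edges, which is why accurate text detection (ImageTextDetectorV2) matters for table quality.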

Example

Input image

Screenshot

Output image

Screenshot

Output text

text_regions	table_image	pagenum	modificationTime	path	table_regions	length	image	image_with_regions	hocr	tables	exception	table_index
[{0, 0, 566.32025...	{file:/content/ta...	0	2023-01-23 08:21:...	file:/content/tab...	{0, 0, 40.0, 0.0,...	172124	{file:/content/ta...	{file:/content/ta...	<?xml version="1....	{0, 0, 0.0, 0.0,...	null	0
Filename: table4_1.jpg
Page:     0
Table:    0
col0	col1	col2	col3
0		MATERIAL	LABOR	TOTAL
1	SURFACE FACILITIES	None	None	None
2	BUILDINGS AND STRUCTURES	29,380	33,640	63,020
3	MAJOR EQUIPMENT	46,350	4,570	50,920
4	BULK MATERIAL	29,040	16,410	45,450
5	SITE DEVELOPMENT	7,570	4,730	12,300
6	SHAFTS AND HOISTS	None	None	None
7	MAJOR EQUIPMENT	24,500	8,300	32,800
8	SHAFTS AND LINING	58,100	31,400	89,500
9	UNDERGROUND FACILITIES	None	None	None
10	EXCAVATIONS AND STRUCTURES	2,510	4,510	7,020
11	MAJOR EQUIPMENT	3,170	220	3,390
12	BULK MATERIAL	1,960	1,470	3,430
13	MINING	None	None	None
14	MAJOR EQUIPMENT	64,700		64,700
15	MINE CONSTRUCTION	582,330	655,640	1,237,970
16	BACKFULLING	None	None	None
17	MINE BACKFILLING	102,300	116,000	218,300
18	SHAFT SEALING	90	710	200
19	TOTAL FIELD COSTS	952.000	877.000	1,829,000
20	ARCHITECT-ENGINEER SERVICES			53,000
21	OWNER'S COSTS			218,000
22	CONTINGENCY			534.0001
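The extracted cells are plain strings, and the output above shows typical OCR artifacts (e.g. "952.000", where a thousands comma was misread as a period). A small hypothetical cleanup helper for downstream numeric analysis; it is not part of Visual NLP:

```python
# Hypothetical helper for normalizing OCR'd numbers such as "29,380" or
# "952.000" (comma misread as a period). Not part of Visual NLP.
def parse_ocr_number(cell: str):
    if not cell:
        return None
    cleaned = cell.replace(",", "").strip()
    # Treat a period followed by exactly three digits as a misread
    # thousands separator rather than a decimal point.
    if "." in cleaned:
        head, _, tail = cleaned.partition(".")
        if len(tail) == 3 and tail.isdigit():
            cleaned = head + tail
    try:
        return int(cleaned)
    except ValueError:
        return None

print(parse_ocr_number("29,380"))   # 29380
print(parse_ocr_number("952.000"))  # 952000
print(parse_ocr_number(""))         # None
```

Ambiguous cells (such as "534.0001" above) fall through to None rather than being silently coerced, so they can be flagged for manual review.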

Model Information

Model Name: hocr_table_recognition
Type: ocr
Compatibility: Visual NLP 4.2.4+
License: Licensed
Edition: Official
Language: en