Spark OCR release notes

5.3.2

Release date: 15-05-2024

Visual NLP 5.3.2 Release Notes 🕶️

We are glad to announce that Visual NLP 5.3.2 has been released! 📢📢📢

Highlights 🔴

OCR Metrics against Cloud Providers: Textract and GCP.
LightPipeline support for Table Recognition and Clustering.
PositionFinder supports entities spanning multiple lines.
Other Changes.

OCR Metrics against Cloud Providers: Textract and GCP

The following metrics for the Text Detection and Text Recognition tasks were collected on the FUNSD dataset. The final column is the average F score across the two tasks.

Detection	Recognition	Detection Precision	Detection Recall	Recognition Precision	Recognition Recall	Avg. F Score
Google OCR	Google OCR	0.3528	0.7776	0.8889	0.8823	0.6854
Amazon Textract	Amazon Textract	0.5284	0.8534	0.8236	0.8539	0.7455
ImageTextDetector (memOpt)	ImageToTextV2 (base checkpoint)	0.6199	0.9044	0.9354	0.9331	0.8349
ImageTextDetector (memOpt)	ImageToTextV2 (large checkpoint)	0.6199	0.9044	0.9457	0.9426	0.8398
ImageTextDetectorV2	ImageToTextV2 (base checkpoint)	0.598	0.9046	0.9354	0.9331	0.8271
ImageTextDetectorV2	ImageToTextV2 (large checkpoint)	0.598	0.9046	0.9457	0.9426	0.8320
ImageTextDetector (memOpt)	ImageToText	0.6199	0.9044	0.464	0.4654	0.6001
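
The notes do not spell out how the last column is computed. The sketch below assumes that the F score of each task is the harmonic mean of its precision and recall (F1) and that the final value is the plain mean of the two. The function names are illustrative and are not part of Visual NLP.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def avg_f_score(det_p: float, det_r: float, rec_p: float, rec_r: float) -> float:
    """Assumed definition: mean of the detection and recognition F scores."""
    return (f_score(det_p, det_r) + f_score(rec_p, rec_r)) / 2


# Values taken from the ImageTextDetector (memOpt) + ImageToTextV2 (base checkpoint) row.
score = avg_f_score(0.6199, 0.9044, 0.9354, 0.9331)
print(round(score, 4))
```

Under this assumption the computed values agree with the Avg. F Score column to within rounding of the published precision and recall figures.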

Not only are the scores of the ImageToTextV2-based pipelines higher than those of the cloud providers, but the cost is also lower (*):

Service	Cost (USD)
Amazon	120
Azure	30
Google	43.5
JSL	17.6

(*) JSL costs were estimated assuming a Databricks setup.

LightPipeline support for Table Recognition and Clustering

Table Extraction and Clustering pipelines can now be used as LightPipelines. To do so, create the LightPipeline as usual; the example in this section uses PretrainedPipeline.

Full Example here.

New LightPipeline.fromBinary() method: it allows in-memory binary buffers to be used as inputs to Visual NLP pipelines.

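# Load the pretrained table extraction pipeline
lp = PretrainedPipeline("digital_pdf_table_extractor")
# Run it on a local image and keep the result
result = lp.fromLocalPath("page_with_tables.png")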
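
For in-memory inputs, a helper along the following lines may be useful. This is only a sketch: it assumes that fromBinary() accepts a single bytes buffer, which these notes do not confirm, and the helper name annotate_from_memory is hypothetical.

```python
from pathlib import Path


def annotate_from_memory(light_pipeline, image_path):
    """Read a file into memory and pass the raw bytes to the pipeline.

    `light_pipeline` is any object exposing fromBinary(); the single-argument
    call below is an assumption, not a documented signature.
    """
    data = Path(image_path).read_bytes()
    return light_pipeline.fromBinary(data)
```

With the pipeline from the example above, the call would look like annotate_from_memory(lp, "page_with_tables.png"). The same pattern applies when the bytes come from a network response or a database instead of a file.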

For other examples please check this notebook.

PositionFinder

In previous versions, PositionFinder did not work properly when an entity spanned multiple lines.

In 5.3.2, the expected bounding boxes for the entity are returned. Keep in mind that, as before, more than one bounding box will be returned for such an entity, and all of them will share the same chunk_id.
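
Because the boxes of a multi-line entity share a chunk_id, downstream code can regroup them by that field. The sketch below works on plain dictionaries; apart from chunk_id, the field names (x, y, width, height) and the sample values are hypothetical, since the output schema is not shown in these notes.

```python
from collections import defaultdict


def group_boxes_by_chunk(boxes):
    """Collect bounding boxes that belong to the same entity chunk."""
    grouped = defaultdict(list)
    for box in boxes:
        grouped[box["chunk_id"]].append(box)
    return dict(grouped)


# Hypothetical sample: chunk 0 wraps onto a second line, chunk 1 fits on one line.
boxes = [
    {"chunk_id": 0, "x": 410, "y": 120, "width": 150, "height": 22},
    {"chunk_id": 0, "x": 40, "y": 146, "width": 95, "height": 22},
    {"chunk_id": 1, "x": 60, "y": 300, "width": 80, "height": 22},
]
grouped = group_boxes_by_chunk(boxes)
for chunk_id, chunk_boxes in grouped.items():
    print(chunk_id, len(chunk_boxes))
```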

Other Changes

A CVE related to commons-compress was resolved.
Bug fix: ImageToPdf no longer renders images outside page boundaries.
