Spark OCR release notes

5.0.1

Release date: 21-09-2023

We are glad to announce that Visual NLP 5.0.1 has been released! 🚀🚀🚀 New features, new models, bug fixes, and more! 📢📢📢

🚨 New Features

New Dit-based Text Detection Model: continuing our commitment to empower Text Extraction and De-identification pipelines, we are delivering a new model for text detection. It was trained on the FUNSD dataset, and it is used in the same way as other related models:
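```python
from sparkocr.transformers import *

# Load the pretrained Dit-based text detector and wire it into a pipeline stage
text_detector = ImageTextDetector \
    .pretrained("image_text_detector_dit", "en", "clinical/ocr") \
    .setInputCol("image") \
    .setOutputCol("region") \
    .setScoreThreshold(0.5)
```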

It is currently the best performing model on the FUNSD dataset, achieving an accuracy of 94% versus 78.7% for the Craft detector, and is recommended for De-identification and Text Extraction pipelines.
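To make the role of the score threshold in the snippet above more concrete, here is a minimal, library-independent sketch of confidence-based filtering of detected regions. `Region` and `filter_by_score` are hypothetical names used only for illustration and are not part of the Visual NLP API; whether the real detector keeps regions scoring exactly at the threshold is also an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    """Hypothetical detected text region with a confidence score."""
    x: int
    y: int
    width: int
    height: int
    score: float


def filter_by_score(regions: List[Region], threshold: float = 0.5) -> List[Region]:
    """Keep regions whose score is at least `threshold` (inclusive by assumption)."""
    return [r for r in regions if r.score >= threshold]


# Illustrative detections, not real model output
detections = [
    Region(10, 20, 120, 18, score=0.92),
    Region(15, 60, 80, 16, score=0.41),
    Region(200, 20, 60, 18, score=0.50),
]

kept = filter_by_score(detections, threshold=0.5)
print([r.score for r in kept])
```

Raising the threshold trades recall for precision: fewer low-confidence regions reach the downstream OCR or de-identification stages, at the risk of dropping faint text.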

Dit-based VisualDocumentClassifierV3 now supports fine-tuning: check the new tutorial and notebook on how to fine-tune the Dit-based VisualDocumentClassifierV3 on the RVL-CDIP dataset using a Docker image.
New Pretrained Pipeline for Table Extraction: this new pipeline, digital_pdf_table_extractor, extracts tables from digital PDFs. Check an end-to-end example in this notebook.
New notebook explaining how to run inference on RvlCdip with VisualDocumentClassifierV3 on Databricks: check this new notebook to see how you can process the entire RVL-CDIP dataset in a few minutes using auto-scaling in Databricks.
New RvlCdipReader to help read both the training and test parts of the RvlCdip document classification dataset. Check this notebook for an example.

🪲 Bug Fixes

Lilt NER training no longer relies on a downloadable metrics script: all metric computation can now be handled offline.
A bug in data consumption for VisualDocumentNer Lilt models was fixed: this bug affected data ingestion during fine-tuning, and with it the quality of the resulting models.
Serialization issues preventing ImageTableDetector and HocrToTextTable from working properly in a pipeline were fixed.
PositionFinder has improved error reporting logic.
ImageToText errors on macOS were fixed.
