Spark OCR release notes

4.3.3

Release date: 14-03-2023

We’re glad to announce that Visual NLP 😎 4.3.3 has been released.

Highlights

New parameter keepOriginalEncoding in PdfToHocr.
New Yolo-based table and form detector.
Memory consumption in VisualQuestionAnswering and ImageTableDetector models has been improved.
Fixes in AlabReader.
Fixes in HocrToTextTable.

New parameter keepOriginalEncoding in PdfToHocr

PdfToHocr can now return either an ASCII-normalized version of the characters present in the PDF (keepOriginalEncoding=False) or the original Unicode characters (keepOriginalEncoding=True).

Figure: source PDF.

Figure: PdfToHocr output with keepOriginalEncoding=True (original encoding kept).

Figure: PdfToHocr output with keepOriginalEncoding=False (ASCII-normalized).
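
To make the difference between the two settings concrete, here is a minimal, self-contained Python sketch of what ASCII normalization of Unicode text can look like. It does not use Visual NLP and is not how PdfToHocr is implemented; the helper names ascii_normalize and render_text, and the choice of NFKD decomposition, are assumptions made for illustration only.

```python
import unicodedata


def ascii_normalize(text: str) -> str:
    """Illustrative ASCII normalization (an assumption, not PdfToHocr's code).

    NFKD splits accented letters into a base letter plus combining marks
    and expands compatibility characters such as ligatures. Encoding to
    ASCII with errors="ignore" then drops whatever is still non-ASCII.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def render_text(text: str, keep_original_encoding: bool) -> str:
    """Hypothetical stand-in for the effect of the keepOriginalEncoding flag."""
    return text if keep_original_encoding else ascii_normalize(text)


if __name__ == "__main__":
    sample = "Café naïve"
    # Original Unicode characters are returned unchanged.
    print(render_text(sample, keep_original_encoding=True))
    # Accents are stripped, leaving plain ASCII letters.
    print(render_text(sample, keep_original_encoding=False))
```

Keeping the original encoding is usually preferable when the extracted text must match the PDF exactly; the normalized form can be easier to feed into downstream components that expect plain ASCII.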

New Yolo-based Table and Form detector

This new model distinguishes between forms and tables, so you can apply different downstream processing to each.

A full usage example is available in this notebook.
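
As a rough sketch of what applying different downstream processing could look like, the Python snippet below routes detected regions by label. The Region class, its fields, the label strings "table" and "form", and the score threshold are all hypothetical; they are not the detector's actual output schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Region:
    # Hypothetical detection record; not the detector's real output format.
    label: str                       # assumed to be "table" or "form"
    score: float                     # assumed confidence in [0, 1]
    bbox: Tuple[int, int, int, int]  # assumed (x, y, width, height)


def route_regions(regions: List[Region], min_score: float = 0.5) -> Dict[str, List[Region]]:
    """Group confident detections by label so each group can be processed separately."""
    routed: Dict[str, List[Region]] = {"table": [], "form": []}
    for region in regions:
        # Skip low-confidence detections and labels we do not handle.
        if region.score >= min_score and region.label in routed:
            routed[region.label].append(region)
    return routed


if __name__ == "__main__":
    detections = [
        Region("table", 0.92, (10, 20, 300, 150)),
        Region("form", 0.81, (10, 200, 300, 400)),
        Region("table", 0.30, (400, 20, 100, 50)),  # below threshold, dropped
    ]
    routed = route_regions(detections)
    # Tables could go to table extraction, forms to key-value extraction.
    print(len(routed["table"]), len(routed["form"]))
```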

Memory consumption in VisualQuestionAnswering and ImageTableDetector models has been improved

Memory usage has been made more garbage-collector (GC) friendly. In practice, large jobs are more stable and less likely to be restarted after exhausting resources.

Fixes in AlabReader

AlabReader has received bug fixes and performance improvements.

Fixes in HocrToTextTable

HocrToTextTable now better handles corner cases in which the last rows of tables were missed.

This release of Visual NLP is compatible with version 4.3.1 of Spark NLP and version 4.3.1 of Spark NLP for Healthcare.
