Spark OCR release notes

 

3.12.0

Release date: 14-04-2022

Overview

We’re glad to announce that Spark OCR 3.12.0 has been released! This release comes with new models for Handwritten Text Recognition, Spark 3.2 support, bug fixes, and notebook examples.

New Features

  • Added to the ImageTextDetectorV2:
    • New parameter ‘mergeIntersects’: merges the bounding boxes of detected text regions when multiple boxes that belong to the same text line overlap.
    • New parameter ‘forceProcessing’: forces processing of the results so that they are not recomputed in pipelines where different transformers consume the same results.
    • New feature: the sizeThreshold parameter sets the expected size of the recognized text. Text size is now detected automatically when sizeThreshold is set to -1.
  • Added to the ImageToTextV2:
    • New parameter ‘usePandasUdf’: support PandasUdf to allow batch processing internally.
    • New support for formatted output and HOCR:

ocr.setOutputFormat(OcrOutputFormat.HOCR)
ocr.setOutputFormat(OcrOutputFormat.FORMATTED_TEXT)
  • Support for Spark 3.2:
    • We added support for the latest Spark version; check the installation instructions below.
    • Known problems & workarounds:

SPARK-38330: S3 access issues. A workaround is to apply the following setting:

//Scala
spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")

#Python
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.path.style.access", "true")

SPARK-37577: changes in the default behavior of the query optimizer. This is already handled in the start() function; if you start the context manually, set the following Spark properties:

#Python
spark.conf.set("spark.sql.optimizer.expression.nestedPruning.enabled", False)
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", False)
  • Improved documentation on the website.
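
To illustrate what the ‘mergeIntersects’ parameter of ImageTextDetectorV2 is described as doing, here is a minimal sketch of merging overlapping bounding boxes. It is not the library's implementation: the (x, y, width, height) box format and all function names are assumptions made for this example, and the sketch merges any overlapping boxes without checking that they belong to the same text line.

```python
from typing import List, Tuple

# Hypothetical box format for this sketch: (x, y, width, height).
Box = Tuple[int, int, int, int]


def intersects(a: Box, b: Box) -> bool:
    """Return True when two axis-aligned boxes overlap with positive area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def union(a: Box, b: Box) -> Box:
    """Return the smallest box that contains both input boxes."""
    x0 = min(a[0], b[0])
    y0 = min(a[1], b[1])
    x1 = max(a[0] + a[2], b[0] + b[2])
    y1 = max(a[1] + a[3], b[1] + b[3])
    return (x0, y0, x1 - x0, y1 - y0)


def merge_intersecting(boxes: List[Box]) -> List[Box]:
    """Merge boxes until no two boxes in the result overlap."""
    merged: List[Box] = []
    for box in sorted(boxes):
        current = box
        changed = True
        while changed:
            changed = False
            for other in merged:
                if intersects(current, other):
                    # Absorb the overlapping box, then rescan the result list.
                    merged.remove(other)
                    current = union(current, other)
                    changed = True
                    break
        merged.append(current)
    return merged


detected = [(0, 0, 10, 5), (8, 1, 10, 5), (30, 0, 5, 5)]
print(merge_intersecting(detected))
```

The first two boxes overlap and collapse into one region, while the third stays separate.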

New Models

  • ocr_small_printed: Text recognition small model for printed text based on ImageToTextV2
  • ocr_small_handwritten: Text recognition small model for handwritten text based on ImageToTextV2
  • ocr_base_handwritten: Text recognition base model for handwritten text based on ImageToTextV2

Bug Fixes

  • display_table() function failing to display tables coming from digital PDFs.

New notebooks

3.11.0

Release date: 28-02-2022

Overview

We are glad to announce that Spark OCR 3.11.0 has been released! This release comes with new models, new features, bug fixes, and notebook examples.

New Features

  • Added ImageTextDetectorV2 Python Spark-OCR Transformer for detecting printed and handwritten text using CRAFT architecture with Refiner Net.
  • Added ImageTextRecognizerV2 Python Spark-OCR Transformer for recognizing printed and handwritten text based on Deep Learning Transformer Architecture.
  • Added FormRelationExtractor for detecting relations between key and value entities in forms.
  • Added the capability of fine-tuning VisualDocumentNerV2 models for key-value pair extraction.

New Models

  • ImageTextDetectorV2: extends the ImageTextDetectorV1 character-level text detection model with a refiner net architecture.
  • ImageTextRecognizerV2: Text recognition for printed text based on the Deep Learning Transformer Architecture.

New notebooks

3.10.0

Release date: 10-01-2022

Overview

Form recognition using LayoutLMv2 and text detection.

New Features

New Models

  • LayoutLMv2 fine-tuned on FUNSD dataset
  • Text detection model based on CRAFT architecture

New notebooks

3.9.1

Release date: 02-11-2021

Overview

Added preserving of original file formatting

Enhancements

New and updated notebooks

3.9.0

Release date: 20-10-2021

Overview

Improve visualization and support Spark NLP.

New Features

Enhancements

  • Added postprocessing to the ImageTableDetector
  • Added Spark NLP by default to the Spark session in the start function
  • Changed the default value of the ignoreResolution param in ImageToText
  • Updated the license validator. Added support for floating licenses and for setting AWS keys from the license.
  • Added ‘whiteList’ param to the VisualDocumentNER

New and updated notebooks

3.8.0

Release date: 14-09-2021

Overview

Support Microsoft PPT and PPTX documents.

New Features

  • Added PptToPdf transformer for converting PPT and PPTX slides to a PDF document.
  • Added PptToTextTable transformer for extracting tables from PPT and PPTX slides.

New and updated notebooks

3.7.0

Release date: 30-08-2021

Overview

Improve table recognition and render OCR results to the PDF with original image

New Features

  • Added ImageToTextPdf transformer for storing recognized text in a searchable PDF together with the original image
  • Added PdfAssembler for assembling a multipage PDF document from single-page PDF documents

Enhancements

New and updated notebooks

3.6.0

Release date: 05-08-2021

Overview

Handwritten detection and visualization improvement.

New Features

  • Added ImageHandwrittenDetector for detecting ‘signature’, ‘date’, ‘name’, ‘title’, ‘address’ and other handwritten text.
  • Added rendering of labels and scores in ImageDrawRegions.
  • Added the possibility to scale an image to a fixed size in ImageScaler while keeping the original aspect ratio.
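
Scaling to a fixed size while keeping the original ratio comes down to picking one scale factor for both dimensions. The sketch below shows only that arithmetic; the function name and parameters are hypothetical and are not the ImageScaler API.

```python
from typing import Tuple


def fit_within(width: int, height: int,
               target_width: int, target_height: int) -> Tuple[int, int]:
    """Scale (width, height) to fit inside the target box, preserving aspect ratio."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    # Use the smaller factor so that neither side exceeds the target.
    scale = min(target_width / width, target_height / height)
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(2000, 1000, 800, 800))  # landscape page
print(fit_within(500, 1000, 800, 800))   # portrait page
```

The side that does not reach its target can then be padded if a strictly fixed output shape is required; padding is left out of this sketch.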

Enhancements

  • Support for the new version of pip when installing the Python package
  • Added support for string labels in detectors
  • Added automatic inference of the input shape for detector models
  • New license validator

Bugfixes

  • Fixed the display of BGR images in display functions

New and updated notebooks

3.5.0

Release date: 15-07-2021

Overview

Improve table detection and table recognition.

For more details, see Extract Tabular Data from PDF in Spark OCR.

New Features

Enhancements

New notebooks

3.4.0

Release date: 30-06-2021

Overview

Signature Detection in image-based documents.

For more details, see Signature Detection in Spark OCR.

New Features

New notebooks

3.3.0

Release date: 14-06-2021

Overview

Table detection and recognition for scanned documents.

For table detection we added ImageTableDetector. It is based on CascadeTabNet, which uses a Cascade Mask Region-based CNN High-Resolution Network (Cascade Mask R-CNN HRNet). The model was pre-trained on the COCO dataset and fine-tuned on the ICDAR 2019 competition dataset for table detection. It demonstrates state-of-the-art results on ICDAR 2013 and TableBank, and top results on ICDAR 2019.

For more details, see Table Detection & Extraction in Spark OCR.

New Features

New notebooks

3.2.0

Release date: 28-05-2021

Overview

Multi-modal visual document understanding, built on the LayoutLM architecture. It achieves new state-of-the-art accuracy in several downstream tasks, including form understanding and receipt understanding.

New Features

  • VisualDocumentNER is a deep learning model for named entity recognition (NER) that uses text and layout data. A model pre-trained on the SROIE dataset is currently available.

Enhancements

  • Added support for reading the license from the SPARK_OCR_LICENSE environment variable.
  • Updated dependencies and synced Spark versions with Spark NLP.

Bugfixes

  • Fixed an issue where some ImageReaderSpi plugins were unavailable in the fat jar.

New notebooks

3.1.0

Release date: 16-04-2021

Overview

Image processing on GPU. It is 3.5 times faster than on CPU.

For more details, see GPU image preprocessing in Spark OCR.

New Features

  • GPUImageTransformer with support for scaling, erosion, dilation, and Otsu and Huang thresholding.
  • Added display_images util function for displaying images from Spark DataFrame in Jupyter notebooks.

Enhancements

Bug fixes

  • Fixed issue with extra dependencies in start function

New notebooks

3.0.0

Release date: 02-04-2021

Overview

We are very excited to release Spark OCR 3.0.0!

Spark OCR 3.0.0 extends support to the Apache Spark 3.0.x and 3.1.x major releases on Scala 2.12 with both Hadoop 2.7 and 3.2. We will support all 4 major Apache Spark and PySpark releases: 2.3.x, 2.4.x, 3.0.x, and 3.1.x.

Spark OCR now supports TensorFlow models. The first model is VisualDocumentClassifier.

New Features

  • Support for Apache Spark and PySpark 3.0.x on Scala 2.12
  • Support for Apache Spark and PySpark 3.1.x on Scala 2.12
  • Support for new Databricks runtimes:
    • Databricks 7.3
    • Databricks 7.3 ML GPU
    • Databricks 7.4
    • Databricks 7.4 ML GPU
    • Databricks 7.5
    • Databricks 7.5 ML GPU
    • Databricks 7.6
    • Databricks 7.6 ML GPU
    • Databricks 8.0
    • Databricks 8.0 ML (there is no GPU in 8.0)
    • Databricks 8.1
  • Support 2x new EMR 6.x:
    • EMR 6.1.0 (Apache Spark 3.0.0 / Hadoop 3.2.1)
    • EMR 6.2.0 (Apache Spark 3.0.1 / Hadoop 3.2.1)
  • VisualDocumentClassifier model for classifying documents using text and layout data.
  • Added support for the Vietnamese language.

New notebooks

1.11.0

Release date: 25-02-2021

Overview

Support for the German, French, Spanish and Russian languages. Improved PositionsFinder and ImageToText for better de-identification support.

New Features

  • Loading model data from S3 in ImageToText.
  • Added support for the German, French, Spanish and Russian languages in ImageToText.
  • Added different OCR model types: Base, Best, Fast in ImageToText.

Enhancements

  • Added space symbols to the output positions in the ImageToText transformer.
  • Removed python-levenshtein from the dependencies to simplify installation.

Bugfixes

  • Fixed issue with extracting coordinates in ImageToText.
  • Fixed loading model data on cluster in yarn mode.

New notebooks

1.10.0

Release date: 20-01-2021

Overview

Support for Microsoft DOCX documents.

New Features

  • Added DocToText transformer for extracting text from DOCX documents.
  • Added DocToTextTable transformer for extracting table data from DOCX documents.
  • Added DocToPdf transformer for converting DOCX documents to PDF format.

Bugfixes

  • Fixed issue with loading model data on some cluster configurations

1.9.0

Release date: 11-12-2020

Overview

Extended the FoundationOne report parser and added support for the HOCR output format.
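
hOCR is an HTML-based format in which each recognized element carries its bounding box in the title attribute. As a rough illustration of how such output can be consumed, the sketch below pulls words and boxes out of an hOCR fragment using only the Python standard library. The sample markup is hand-written for this example and is not actual Spark OCR output; the class and function names are hypothetical.

```python
import re
from html.parser import HTMLParser
from typing import List, Optional, Tuple

BBox = Tuple[int, ...]
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs from elements with class 'ocrx_word'."""

    def __init__(self) -> None:
        super().__init__()
        self.words: List[Tuple[str, BBox]] = []
        self._bbox: Optional[BBox] = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if attr.get("class") == "ocrx_word":
            match = BBOX_RE.search(attr.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())

    def handle_data(self, data):
        # Text that follows a word tag belongs to that word.
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None

    def handle_endtag(self, tag):
        # Forget a pending box if the word element was empty.
        if tag == "span":
            self._bbox = None


def parse_hocr_words(hocr: str) -> List[Tuple[str, BBox]]:
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Hand-written sample, not real OCR output.
SAMPLE_HOCR = (
    "<span class='ocr_line' title='bbox 36 92 210 116'>"
    "<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Spark</span> "
    "<span class='ocrx_word' title='bbox 110 92 210 116; x_wconf 93'>OCR</span>"
    "</span>"
)

for text, bbox in parse_hocr_words(SAMPLE_HOCR):
    print(text, bbox)
```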

New Features

1.8.0

Release date: 20-11-2020

Overview

Performance optimization for processing multipage PDF documents. Support for up to 10k pages per document.

New Features

  • Added ImageAdaptiveBinarizer Scala transformer with support for:
    • Gaussian local thresholding
    • Otsu thresholding
    • Sauvola local thresholding
  • Added the possibility to split a PDF into smaller documents to optimize processing in PdfToImage.
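
Of the thresholding methods listed above, Otsu's is the simplest to show in a few lines: it picks the gray level that maximizes the variance between the background and foreground classes. The following is a plain-Python sketch of that textbook algorithm on a flat list of 8-bit pixel values. It is not the ImageAdaptiveBinarizer implementation, and the function names are chosen for this example only.

```python
from typing import List, Sequence


def otsu_threshold(pixels: Sequence[int]) -> int:
    """Return the gray level t (0..255) that maximizes between-class variance.

    Pixels with value <= t form the background class.
    """
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(histogram))

    best_threshold = 0
    best_variance = -1.0
    weight_bg = 0
    sum_bg = 0
    for level in range(256):
        weight_bg += histogram[level]
        sum_bg += level * histogram[level]
        if weight_bg == 0:
            continue  # no background pixels yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already in the background class
        mean_bg = sum_bg / weight_bg
        mean_fg = (total_sum - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance = variance
            best_threshold = level
    return best_threshold


def binarize(pixels: Sequence[int], threshold: int) -> List[int]:
    """Map pixels above the threshold to 255 and the rest to 0."""
    return [255 if value > threshold else 0 for value in pixels]


sample = [10, 12, 11, 200, 205, 198]
threshold = otsu_threshold(sample)
print(threshold, binarize(sample, threshold))
```

Otsu computes one global threshold per image. The Gaussian and Sauvola methods in the list are local: they compute a threshold per pixel neighborhood, which tends to cope better with uneven illumination.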

Enhancements

  • Added binarization in PdfToImage to optimize memory usage.
  • Added the pdfCoordinates param to the ImageToText transformer.
  • Added the ‘total_pages’ field to the PdfToImage transformer.
  • Added different splitting strategies to the PdfToImage transformer.
  • Simplified paging in PdfToImage when it runs with splitting into small PDFs.
  • Added params to PdfToText for disabling extra functionality.
  • Added the master_url param to the Python start function.
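
The idea behind splitting a large PDF into smaller documents can be sketched as computing page ranges that are then processed independently. The example below uses a fixed number of pages per chunk, which is an assumption for illustration; the release notes do not describe the splitting strategies PdfToImage actually offers, and the function name is hypothetical.

```python
from typing import List, Tuple


def split_page_ranges(total_pages: int, pages_per_chunk: int) -> List[Tuple[int, int]]:
    """Return half-open, zero-based (start, end) page ranges covering the document."""
    if total_pages < 0:
        raise ValueError("total_pages must not be negative")
    if pages_per_chunk <= 0:
        raise ValueError("pages_per_chunk must be positive")
    return [
        (start, min(start + pages_per_chunk, total_pages))
        for start in range(0, total_pages, pages_per_chunk)
    ]


# A 10-page document split into chunks of at most 4 pages.
print(split_page_ranges(10, 4))
```

Each range can be rendered on its own, so memory use is bounded by the chunk size rather than by the full document length.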

1.7.0

Release date: 22-09-2020

Overview

Support Spark 2.3.3.

Bugfixes

  • Restored reading of JPEG2000 images

1.6.0

Release date: 05-09-2020

Overview

Support parsing data from tables for selectable PDFs.

New Features

1.5.0

Release date: 22-07-2020

Overview

FoundationOne report parsing support.

Enhancements

  • Optimized memory usage during image processing

New Features

1.4.0

Release date: 23-06-2020

Overview

Added support for the DICOM format and improved support for image morphological operations.

Enhancements

  • Updated the start function. Improved support for Spark NLP internal.
  • ImageMorphologyOpening and ImageErosion are removed.
  • Improved existing transformers to support de-identification of DICOM documents.
  • Added the possibility to draw filled rectangles in ImageDrawRegions.

New Features

  • Support for reading and writing DICOM documents.
  • Added ImageMorphologyOperation transformer, which supports erosion, dilation, opening and closing operations.
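
The four morphological operations named above are closely related: opening is erosion followed by dilation, and closing is dilation followed by erosion. The sketch below shows them on a small binary image with a 3x3 square structuring element. It is a plain-Python illustration, not the ImageMorphologyOperation transformer; the function names and the border handling (pixels outside the image are ignored) are assumptions.

```python
from typing import Callable, List

Image = List[List[int]]  # binary image: 0 = background, 1 = foreground


def _apply(image: Image, reducer: Callable[[List[int]], int]) -> Image:
    """Apply a reducer over each pixel's 3x3 neighborhood, ignoring out-of-bounds pixels."""
    height, width = len(image), len(image[0])
    result: Image = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            row.append(reducer(window))
        result.append(row)
    return result


def erode(image: Image) -> Image:
    """A pixel stays set only if its whole neighborhood is set."""
    return _apply(image, min)


def dilate(image: Image) -> Image:
    """A pixel becomes set if any pixel in its neighborhood is set."""
    return _apply(image, max)


def opening(image: Image) -> Image:
    """Erosion then dilation: removes small specks."""
    return dilate(erode(image))


def closing(image: Image) -> Image:
    """Dilation then erosion: fills small holes."""
    return erode(dilate(image))


noisy = [
    [1, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
for row in opening(noisy):
    print(row)
```

Opening drops the isolated pixel in the top-left corner and keeps the 3x3 block intact.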

Bugfixes

  • Fixed issue in ImageToText related to the extraction of coordinates.

1.3.0

Release date: 22-05-2020

Overview

New functionality for de-identification.

Enhancements

  • Renamed TesseractOCR to ImageToText.
  • Simplified installation.
  • Added a license check from the SPARK_NLP_LICENSE env variable.

New Features

  • Support for storing with binaryFormat. Added support for storing image and PDF files.
  • Support for selectable PDFs in the TextToPdf transformer.
  • Added UpdateTextPosition transformer.

1.2.0

Release date: 08-04-2020

Overview

Improved support for Databricks and for processing selectable PDFs.

Enhancements

  • Adapted Spark OCR to run on Databricks.
  • Added rewriting of positions in ImageToText when it runs together with PdfToText.
  • Added the ‘positionsCol’ param to ImageToText.
  • Improved support for Spark NLP. Changed the start function.

New Features

  • Added the showImage implicit to Dataframe for displaying images in Scala Databricks notebooks.
  • Added the display_images function for displaying images in Python Databricks notebooks.
  • Added propagation of the selectable PDF file in TextToPdf. Added the ‘inputContent’ param to TextToPdf.

1.1.2

Release date: 09-03-2020

Overview

Minor improvements and fixes

Enhancements

  • Improved messages during license validation

Bugfixes

  • Fixed dependencies issue

1.1.1

Release date: 06-03-2020

Overview

Integration with license server.

Enhancements

  • Added license validation. The license can be set in the following ways:
    • Environment variable. Set variable ‘JSL_OCR_LICENSE’.
    • System property. Set property ‘jsl.sparkocr.settings.license’.
    • Application.conf file. Set property ‘jsl.sparkocr.settings.license’.
  • Added automatic license renewal using the JSL license server.

1.1.0

Release date: 03-03-2020

Overview

This release contains improvements for preprocessing images before running OCR and adds the possibility to store results as PDF to keep the original formatting.

New Features

  • Added automatic calculation of the maximum size of objects to remove in ImageRemoveObjects. This avoids removing periods and damaging symbols that contain dots (i, !, ?). Added the minSizeFont param to the ImageRemoveObjects transformer to activate this functionality.
  • Added the ocrParams parameter to the ImageToText transformer for setting any OCR params.
  • Added extraction of font size in ImageToText
  • Added TextToPdf transformer for rendering text with positions to a PDF file.

Enhancements

  • Added a resolution setting to ImageToText, and added the ignoreResolution param (default: true) to the ImageToText transformer for backward compatibility.
  • Added parsing of resolution from image metadata in the BinaryToImage transformer.
  • Added storing of resolution in the PdfToImage transformer.
  • Added a resolution field to the Image schema.
  • Updated the ‘start’ function to set the ‘PYSPARK_PYTHON’ env variable.
  • Improved auto-scaling/skew correction:
    • improved access to image values
    • removed unnecessary copies of images
    • added more test cases
    • improved auto-correlation in auto-scaling.

1.0.0

Release date: 12-02-2020

Overview

Spark NLP OCR functionality was reimplemented as a set of Spark ML transformers and moved to a separate Spark OCR library.

New Features

  • Added extraction of the coordinates of each symbol in ImageToText
  • Added ImageDrawRegions transformer
  • Added ImageToPdf transformer
  • Added ImageMorphologyOpening transformer
  • Added ImageRemoveObjects transformer
  • Added ImageAdaptiveThresholding transformer

Enhancements

  • Reimplemented the main functionality as Spark ML transformers
  • Moved DrawRectangle functionality to the PdfDrawRegions transformer
  • Added the ‘start’ function with support for SparkMonitor initialization
  • Moved PositionFinder to Spark OCR

Bugfixes

  • Fixed bug with transforming complex pdf to image