Spark OCR release notes

 

1.6.0

Overview

Added support for parsing table data from selectable PDFs.

New Features

1.5.0

Release date: 22-07-2020

Overview

Added support for parsing FoundationOne reports.

Enhancements

  • Optimized memory usage during image processing

New Features

1.4.0

Release date: 23-06-2020

Overview

Added support for the DICOM format and improved support for image morphological operations.

Enhancements

  • Updated the start function and improved support for Spark NLP internal.
  • Removed ImageMorphologyOpening and ImageErosion; erosion and opening are available in the ImageMorphologyOperation transformer.
  • Improved existing transformers to support de-identification of DICOM documents.
  • Added the ability to draw filled rectangles to ImageDrawRegions.

New Features

  • Added support for reading and writing DICOM documents.
  • Added the ImageMorphologyOperation transformer, which supports erosion, dilation, opening, and closing operations.
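
For readers unfamiliar with the four operations named above, here is a minimal pure-Python sketch of what they do to a binary image. It is an illustration only and does not use or reproduce the ImageMorphologyOperation transformer; the 3x3 square kernel, the background border handling, and all function names are assumptions made for this example.

```python
from typing import List

Image = List[List[int]]  # binary image: 1 = foreground, 0 = background


def _apply(img: Image, want_all: bool) -> Image:
    """Slide a 3x3 square kernel over the image.

    Pixels outside the image are treated as background (an assumption
    of this sketch). want_all=True gives erosion, False gives dilation.
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                img[y + dy][x + dx] if 0 <= y + dy < h and 0 <= x + dx < w else 0
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            out[y][x] = int(all(window)) if want_all else int(any(window))
    return out


def erosion(img: Image) -> Image:
    # A pixel survives only if its whole 3x3 neighbourhood is foreground.
    return _apply(img, want_all=True)


def dilation(img: Image) -> Image:
    # A pixel is set if any pixel in its 3x3 neighbourhood is foreground.
    return _apply(img, want_all=False)


def opening(img: Image) -> Image:
    # Erosion followed by dilation: removes small specks.
    return dilation(erosion(img))


def closing(img: Image) -> Image:
    # Dilation followed by erosion: fills small holes.
    return erosion(dilation(img))
```

In OCR preprocessing, opening is typically used to drop isolated noise pixels, and closing to fill small gaps inside characters.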

Bugfixes

  • Fixed an issue in ImageToText related to coordinate extraction.

1.3.0

Release date: 22-05-2020

Overview

New functionality for de-identification.

Enhancements

  • Renamed TesseractOCR to ImageToText.
  • Simplified installation.
  • Added a license check from the SPARK_NLP_LICENSE environment variable.

New Features

  • Added support for storing with binaryFormat, including image and PDF files.
  • Added support for selectable PDFs in the TextToPdf transformer.
  • Added the UpdateTextPosition transformer.

1.2.0

Release date: 08-04-2020

Overview

Improved support for Databricks and for processing selectable PDFs.

Enhancements

  • Adapted Spark OCR to run on Databricks.
  • Added rewriting of positions in ImageToText when it runs together with PdfToText.
  • Added the ‘positionsCol’ param to ImageToText.
  • Improved support for Spark NLP and changed the start function.

New Features

  • Added the showImage implicit to Dataframe for displaying images in Scala Databricks notebooks.
  • Added the display_images function for displaying images in Python Databricks notebooks.
  • Added propagation of the selectable PDF file in TextToPdf, along with an ‘inputContent’ param for ‘TextToPdf’.

1.1.2

Release date: 09-03-2020

Overview

Minor improvements and fixes

Enhancements

  • Improved messages during license validation

Bugfixes

  • Fixed a dependency issue

1.1.1

Release date: 06-03-2020

Overview

Integration with license server.

Enhancements

  • Added license validation. The license can be set in the following ways:
    • Environment variable. Set variable ‘JSL_OCR_LICENSE’.
    • System property. Set property ‘jsl.sparkocr.settings.license’.
    • Application.conf file. Set property ‘jsl.sparkocr.settings.license’.
  • Added automatic license renewal using the JSL license server.
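
The three license sources listed above can be pictured as a simple lookup chain. The sketch below is illustrative only: the release notes do not state which source takes precedence, so the order used here (environment variable, then system property, then application.conf) is an assumption, and `resolve_license` is a hypothetical helper, not part of Spark OCR. System properties and application.conf entries are modelled as plain dictionaries.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "JSL_OCR_LICENSE"                  # name taken from the release notes
PROPERTY = "jsl.sparkocr.settings.license"   # name taken from the release notes


def resolve_license(
    env: Optional[Mapping[str, str]] = None,
    system_properties: Optional[Mapping[str, str]] = None,
    app_conf: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the first non-empty license value found, or None.

    The precedence order is an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    candidates = (
        env.get(ENV_VAR),
        (system_properties or {}).get(PROPERTY),
        (app_conf or {}).get(PROPERTY),
    )
    for value in candidates:
        if value:
            return value
    return None
```

For example, `resolve_license(env={}, app_conf={PROPERTY: "my-key"})` falls through the first two sources and returns the value from the configuration mapping.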

1.1.0

Release date: 03-03-2020

Overview

This release improves image preprocessing before OCR and adds the ability to store results as PDF to keep the original formatting.

New Features

  • Added automatic calculation of the maximum size of objects to remove in ImageRemoveObjects. This avoids removing periods and damaging symbols that contain dots (i, !, ?). Added the minSizeFont param to the ImageRemoveObjects transformer to activate this functionality.
  • Added the ocrParams parameter to the ImageToText transformer for setting any OCR params.
  • Added font size extraction in ImageToText
  • Added the TextToPdf transformer for rendering text with positions to a PDF file.

Enhancements

  • Added a resolution setting to ImageToText, along with an ignoreResolution param (default: true) for backward compatibility.
  • Added parsing of resolution from image metadata in the BinaryToImage transformer.
  • Added storing of resolution in the PdfToImage transformer.
  • Added resolution field to Image schema.
  • Updated the ‘start’ function to set the ‘PYSPARK_PYTHON’ env variable.
  • Improved auto-scaling/skew correction:
    • improved access to image values
    • removed unnecessary copies of images
    • added more test cases
    • improved auto-correlation in auto-scaling.
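
To see why image resolution matters for font size extraction, recall that a typographic point is 1/72 of an inch, so a glyph height in pixels only maps to a point size once the resolution (DPI) is known. The sketch below shows that standard conversion; it is not taken from the ImageToText implementation, and `pixels_to_points` is a hypothetical helper name.

```python
POINTS_PER_INCH = 72  # standard typographic point: 1/72 inch


def pixels_to_points(pixel_height: float, dpi: float) -> float:
    """Convert a height in pixels to typographic points at a given DPI."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return pixel_height * POINTS_PER_INCH / dpi
```

A 50-pixel glyph corresponds to 12 pt at 300 DPI but to 50 pt at 72 DPI, which is why a wrong or ignored resolution can distort the reported font size.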

1.0.0

Release date: 12-02-2020

Overview

Spark NLP OCR functionality was reimplemented as a set of Spark ML transformers and moved to a separate Spark OCR library.

New Features

  • Added extraction of the coordinates of each symbol in ImageToText
  • Added ImageDrawRegions transformer
  • Added ImageToPdf transformer
  • Added ImageMorphologyOpening transformer
  • Added ImageRemoveObjects transformer
  • Added ImageAdaptiveThresholding transformer

Enhancements

  • Reimplemented main functionality as Spark ML transformers
  • Moved DrawRectangle functionality to the PdfDrawRegions transformer
  • Added ‘start’ function with support for SparkMonitor initialization
  • Moved PositionFinder to Spark OCR

Bugfixes

  • Fixed a bug with transforming complex PDFs to images