This release improves image preprocessing before running OCR and adds the ability to store results as PDF to keep the original formatting.
- Added automatic calculation of the maximum size of objects to remove in ImageRemoveObjects. This improvement avoids removing ‘.’ and affecting symbols that contain dots. This functionality is activated through the ImageRemoveObjects transformer.
- The TesseractOcr transformer can now be given any Tesseract params.
- Added font size extraction.
- Added the TextToPdf transformer for rendering text with positions to a PDF file.
- Added a resolution setting in TesseractOcr. Also added the ignoreResolution param, which has a default value, to the TesseractOcr transformer for backward compatibility.
- Added parsing of resolution from image metadata.
- Added storing of resolution.
- Added resolution field to Image schema.
- Updated the ‘start’ function to set the ‘PYSPARK_PYTHON’ environment variable.
- Improved auto-scaling/skew correction:
  - improved access to image values
  - removed unnecessary copies of images
  - added more test cases
  - improved auto-correlation in auto-scaling.
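
The ‘start’ change listed above concerns ‘PYSPARK_PYTHON’, the environment variable PySpark reads to choose the Python executable. The following is a minimal sketch of that idea only, not the library's actual ‘start’ implementation; the helper name `configure_pyspark_python` and its parameters are hypothetical.

```python
import os
import sys


def configure_pyspark_python(python_path=None, env=None):
    """Set PYSPARK_PYTHON so PySpark uses a known Python executable.

    Hypothetical helper for illustration; it is not part of Spark OCR.
    """
    # Allow a plain dict to be passed in (handy for testing);
    # default to the real process environment.
    env = os.environ if env is None else env
    if python_path is not None:
        # An explicitly requested interpreter always wins.
        env["PYSPARK_PYTHON"] = python_path
    else:
        # Otherwise keep any value already set, falling back to the
        # interpreter that is running this code.
        env.setdefault("PYSPARK_PYTHON", sys.executable)
    return env["PYSPARK_PYTHON"]
```

Calling such a helper before the Spark session is created matters, because the variable is read when the Python processes are launched.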
Release date: 12-02-2020
Spark NLP OCR functionality was reimplemented as a set of Spark ML transformers and moved to a separate Spark OCR library.
- Added extraction of the coordinates of each symbol in TesseractOCR
- Added ImageDrawRegions transformer
- Added ImageToPdf transformer
- Added ImageMorphologyOpening transformer
- Added ImageRemoveObjects transformer
- Added ImageAdaptiveThresholding transformer
- Reimplemented main functionality as Spark ML transformers
- Moved DrawRectangle functionality to PdfDrawRegions transformer
- Added ‘start’ function with support for SparkMonitor initialization
- Moved PositionFinder to Spark OCR
- Fixed a bug when transforming complex PDFs to images