Support for Microsoft DOCX documents.
- Added DocToText transformer for extracting text from DOCX documents.
- Added DocToTextTable transformer for extracting table data from DOCX documents.
- Added DocToPdf transformer for converting DOCX documents to PDF format.
- Fixed an issue with loading model data on some cluster configurations.
Release date: 11-12-2020
Extended the FoundationOne report parser and added support for the HOCR output format.
- Added ImageToHocr transformer for recognizing text in images and storing it in HOCR format.
- Added parsing of gene lists from the ‘Appendix’ in the FoundationOneReportParser transformer.
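
For readers who have not worked with HOCR before: it is an HTML-based format in which each recognized element carries its bounding box in a `title` attribute. The sketch below reads word boxes from a small hand-written fragment using only the Python standard library. The sample markup, `HocrWordParser` and `parse_hocr_words` are illustrative assumptions, not output from ImageToHocr or part of the Spark OCR API.

```python
from html.parser import HTMLParser
import re

# Hand-written sample in HOCR style (an assumption for illustration,
# not actual ImageToHocr output).
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 200 100'>
  <span class='ocr_line' title='bbox 10 10 190 40'>
    <span class='ocrx_word' title='bbox 10 10 80 40; x_wconf 96'>Spark</span>
    <span class='ocrx_word' title='bbox 90 10 190 40; x_wconf 91'>OCR</span>
  </span>
</div>
"""

# Matches the "bbox x0 y0 x1 y1" property inside a title attribute.
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Hypothetical helper: collects (text, bbox) for each ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "ocrx_word":
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())

    def handle_data(self, data):
        # Only text that directly follows an ocrx_word start tag is kept.
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None


def parse_hocr_words(hocr):
    """Return a list of (word, (x0, y0, x1, y1)) tuples."""
    parser = HocrWordParser()
    parser.feed(hocr)
    return parser.words
```

Calling `parse_hocr_words(SAMPLE_HOCR)` yields one tuple per word, which is usually enough to overlay boxes on the source image or to rebuild reading order.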
Release date: 20-11-2020
Optimized performance for processing multipage PDF documents. Supports up to 10k pages per document.
- Added ImageAdaptiveBinarizer Scala transformer with support for:
  - Gaussian local thresholding
  - Otsu thresholding
  - Sauvola local thresholding
- Added the option to split a PDF into smaller documents to optimize processing in PdfToImage.
- Added binarization in PdfToImage to optimize memory usage.
- Added ‘pdfCoordinates’ param to the ImageToText transformer.
- Added ‘total_pages’ field to the PdfToImage transformer.
- Added different splitting strategies to the PdfToImage transformer.
- Simplified paging in PdfToImage when running it with splitting into small PDFs.
- Added params to PdfToText for disabling extra functionality.
- Added ‘master_url’ param to the Python start function.
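
To make one of the thresholding methods listed above concrete, here is a minimal pure-Python sketch of Otsu thresholding, which picks the gray level that maximizes the variance between the two resulting classes. The function names `otsu_threshold` and `binarize` are assumptions made for this example; how ImageAdaptiveBinarizer implements or parameterizes the method is not described in these notes.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that maximizes between-class variance.

    pixels: iterable of ints in [0, levels). Values <= t form class 0.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in class 0 (values <= t)
    sum0 = 0  # sum of intensities in class 0
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # no pixels at or below t yet
        w1 = total - w0
        if w1 == 0:
            break  # every pixel is already in class 0
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2.
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Map each pixel to 1 (above threshold) or 0 (at or below it)."""
    return [1 if p > threshold else 0 for p in pixels]
```

Otsu computes one global threshold per image, whereas the Gaussian and Sauvola variants compute a threshold per neighborhood, which tends to cope better with uneven illumination.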
Release date: 22-09-2020
Support for Spark 2.3.3.
- Restored reading of JPEG2000 images.
Release date: 05-09-2020
Support for parsing table data from selectable PDFs.
- Added PdfToTextTable transformer for extracting tables from PDF documents, page by page.
- Added ImageCropper transformer for cropping images.
- Added ImageBrandsToText transformer for detecting text in defined areas.
Release date: 22-07-2020
FoundationOne report parsing support.
- Optimized memory usage during image processing
- Added FoundationOneReportParser, which supports parsing patient info, genomic and biomarker findings.
Release date: 23-06-2020
Added support for the DICOM format and improved support for image morphological operations.
- Updated the start function. Improved support for Spark NLP internal.
- Improved existing transformers to support de-identification of DICOM documents.
- Added the option to draw filled rectangles in ImageDrawRegions.
- Added support for reading and writing DICOM documents.
- Added ImageMorphologyOperation transformer, which supports erosion, dilation, opening and closing operations.
- Fixed an issue in ImageToText related to extracting coordinates.
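
The four morphological operations named above can be sketched on a binary image in a few lines of pure Python. This is a generic illustration with a fixed 3x3 square structuring element; the function names are assumptions for this example and do not reflect the parameters or implementation of ImageMorphologyOperation.

```python
def _window(img, r, c):
    """Values in the 3x3 window around (r, c); out-of-bounds cells are skipped."""
    rows, cols = len(img), len(img[0])
    return [
        img[rr][cc]
        for rr in range(r - 1, r + 2)
        for cc in range(c - 1, c + 2)
        if 0 <= rr < rows and 0 <= cc < cols
    ]


def erode(img):
    """A pixel stays 1 only if every pixel in its window is 1."""
    return [[min(_window(img, r, c)) for c in range(len(img[0]))]
            for r in range(len(img))]


def dilate(img):
    """A pixel becomes 1 if any pixel in its window is 1."""
    return [[max(_window(img, r, c)) for c in range(len(img[0]))]
            for r in range(len(img))]


def opening(img):
    """Erosion followed by dilation: removes small isolated foreground specks."""
    return dilate(erode(img))


def closing(img):
    """Dilation followed by erosion: fills small holes in the foreground."""
    return erode(dilate(img))
```

In OCR preprocessing, opening is typically used to remove speckle noise, while closing reconnects broken strokes before recognition.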
Release date: 22-05-2020
New functionality for the de-identification problem.
- Renamed TesseractOCR to ImageToText.
- Simplified installation.
- Added check license from
- Added storing support for binaryFormat. Added support for storing image and PDF files.
- Added support for selectable PDFs in the TextToPdf transformer.
- Added UpdateTextPosition transformer.
Release date: 08-04-2020
Improved support for Databricks and processing of selectable PDFs.
- Adapted Spark OCR to run on Databricks.
- Added rewriting of positions in ImageToText when it runs together with PdfToText.
- Added ‘positionsCol’ param to ImageToText.
- Improved support for Spark NLP. Changed the start function.
- Added showImage implicit to DataFrame for displaying images in Scala Databricks notebooks.
- Added display_images function for displaying images in Python Databricks notebooks.
- Added propagation of the selectable PDF file in TextToPdf. Added ‘inputContent’ param to ‘TextToPdf’.
Release date: 09-03-2020
Minor improvements and fixes
- Improved messages during license validation
- Fixed dependencies issue
Release date: 06-03-2020
Integration with license server.
- Added license validation. The license can be set in the following ways:
  - Environment variable. Set variable ‘JSL_OCR_LICENSE’.
  - System property. Set property ‘jsl.sparkocr.settings.license’.
  - Application.conf file. Set property ‘jsl.sparkocr.settings.license’.
- Added automatic license renewal using the JSL license server.
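
As a small illustration of the environment-variable option listed above, the following sketch sets ‘JSL_OCR_LICENSE’ from Python. The value shown is a placeholder, and the assumption that the variable has to be in place before the Spark session is started is ours; these notes do not say when the license is read.

```python
import os

# Placeholder value (an assumption for illustration): substitute your own license string.
LICENSE_PLACEHOLDER = "<your-license-key>"

# Assumption: set this before starting the Spark session so the
# license validation can see it.
os.environ["JSL_OCR_LICENSE"] = LICENSE_PLACEHOLDER
```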
Release date: 03-03-2020
This release contains improvements to image preprocessing before running OCR and adds the option to store results as PDF to keep the original formatting.
- Added auto calculation of the maximum size of objects to remove in ImageRemoveObjects. This improvement avoids removing ‘.’ and affecting symbols with dots. Added a param to the ImageRemoveObjects transformer to activate this functionality.
- Added a param to the ImageToText transformer for setting any OCR params.
- Added extraction font size in
- Added TextToPdf transformer for rendering text with positions to a PDF file.
- Added setting of resolution in ImageToText, and added the ignoreResolution param (with a default value) to the ImageToText transformer for backward compatibility.
- Added parsing resolution from image metadata in
- Added storing resolution in
- Added resolution field to Image schema.
- Updated the ‘start’ function to set the ‘PYSPARK_PYTHON’ env variable.
- Improved auto-scaling/skew correction:
  - improved access to image values
  - removed unnecessary copies of images
  - added more test cases
  - improved auto-correlation in auto-scaling.
Release date: 12-02-2020
Spark NLP OCR functionality was reimplemented as a set of Spark ML transformers and moved to a separate Spark OCR library.
- Added extraction of coordinates for each symbol in ImageToText
- Added ImageDrawRegions transformer
- Added ImageToPdf transformer
- Added ImageMorphologyOpening transformer
- Added ImageRemoveObjects transformer
- Added ImageAdaptiveThresholding transformer
- Reimplemented main functionality as Spark ML transformers
- Moved DrawRectangle functionality to PdfDrawRegions transformer
- Added ‘start’ function with support for SparkMonitor initialization
- Moved PositionFinder to Spark OCR
- Fixed a bug with transforming complex PDFs to images