Improved table detection and table recognition.
For more details, see Extract Tabular Data from PDF in Spark OCR.
- Added a new method to ImageTableCellDetector that supports borderless tables and combined tables.
- Added Wolf and Singh adaptive binarization methods to ImageAdaptiveThresholding.
- Added the ability to use different image types as input for ImageTableDetector.
- Added display_pdf and display_images_horizontal util functions.
- Tables Recognition from PDF
- Pdf de-identification on Databricks
- Dicom de-identification on Databricks
Signature Detection in image-based documents.
For more details, see Signature Detection in Spark OCR.
- ImageSignatureDetector is a DL model for detecting signatures in an image.
Table detection and recognition for scanned documents.
For table detection we added ImageTableDetector. It is based on CascadeTabNet, which uses a Cascade mask Region-based CNN High-Resolution Network (Cascade mask R-CNN HRNet). The model was pre-trained on the COCO dataset and fine-tuned on the ICDAR 2019 competition dataset for table detection. It demonstrates state-of-the-art results on ICDAR 2013 and TableBank, and top results on ICDAR 2019.
For more details, see Table Detection & Extraction in Spark OCR.
- ImageTableDetector is a DL model for detecting tables in an image.
- ImageTableCellDetector is a transformer for detecting cell regions in a table image.
- ImageCellsToTextTable is a transformer for extracting text from the detected cells.
Multi-modal visual document understanding, built on the LayoutLM architecture. It achieves new state-of-the-art accuracy in several downstream tasks, including form understanding and receipt understanding.
- VisualDocumentNER is a DL model for the NER problem using text and layout data. A model pre-trained on the SROIE dataset is currently available.
- Added support for the SPARK_OCR_LICENSE env key for reading the license.
- Updated dependencies and synced Spark versions with Spark NLP.
- Fixed an issue where some ImageReaderSpi plugins were unavailable in the fat jar.
Image processing on GPU. It is 3 to 5 times faster than on CPU.
For more details, see GPU image preprocessing in Spark OCR.
- GPUImageTransformer with support for scaling, erosion, dilation, Otsu and Huang thresholding.
- Added display_images util function for displaying images from a Spark DataFrame in Jupyter notebooks.
- Improved display_image util function.
- Fixed issue with extra dependencies in start function
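The Otsu thresholding listed for GPUImageTransformer picks a single global threshold that maximizes the between-class variance of the grayscale histogram. The following is a minimal pure-Python sketch of that idea, not Spark OCR's GPU implementation; the names `otsu_threshold` and `binarize` are hypothetical and used only for illustration.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold for a sequence of 8-bit grayscale values."""
    # Build the 256-bin intensity histogram.
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels with intensity <= t
    sum_bg = 0  # sum of intensities of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue          # no background pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break             # no foreground pixels left
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance (up to a constant factor 1 / total**2).
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(pixels, threshold):
    """Map pixels above the threshold to 255 and the rest to 0."""
    return [255 if p > threshold else 0 for p in pixels]
```

A caller would compute the threshold once per image and then apply `binarize` to all of its pixels.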
We are very excited to release Spark OCR 3.0.0!
Spark OCR 3.0.0 extends support to the Apache Spark 3.0.x and 3.1.x major releases on Scala 2.12 with both Hadoop 2.7 and 3.2. We will support all 4 major Apache Spark and PySpark releases: 2.3.x, 2.4.x, 3.0.x, and 3.1.x.
Spark OCR now supports TensorFlow models. The first model is VisualDocumentClassifier.
- Support for Apache Spark and PySpark 3.0.x on Scala 2.12
- Support for Apache Spark and PySpark 3.1.x on Scala 2.12
- Support for new Databricks runtimes:
- Databricks 7.3
- Databricks 7.3 ML GPU
- Databricks 7.4
- Databricks 7.4 ML GPU
- Databricks 7.5
- Databricks 7.5 ML GPU
- Databricks 7.6
- Databricks 7.6 ML GPU
- Databricks 8.0
- Databricks 8.0 ML (there is no GPU in 8.0)
- Databricks 8.1
- Support for 2 new EMR 6.x releases:
- EMR 6.1.0 (Apache Spark 3.0.0 / Hadoop 3.2.1)
- EMR 6.2.0 (Apache Spark 3.0.1 / Hadoop 3.2.1)
- VisualDocumentClassifier model for classifying documents using text and layout data.
- Added support for the Vietnamese language.
Support for German, French, Spanish and Russian languages. Improved PositionsFinder and ImageToText to better support de-identification.
- Loading model data from S3 in ImageToText.
- Added support for German, French, Spanish and Russian languages in ImageToText.
- Added different OCR model types: Base, Best, Fast in ImageToText.
- Added space symbols to the output positions in the ImageToText transformer.
- Removed python-levenshtein from dependencies to simplify installation.
- Fixed issue with extracting coordinates in ImageToText.
- Fixed loading model data on cluster in yarn mode.
Support for Microsoft DOCX documents.
- Added DocToText transformer for extracting text from DOCX documents.
- Added DocToTextTable transformer for extracting table data from DOCX documents.
- Added DocToPdf transformer for converting DOCX documents to PDF format.
- Fixed issue with loading model data on some cluster configurations
Release date: 11-12-2020
Extended the FoundationOne report parser and added support for the HOCR output format.
- Added ImageToHocr transformer for recognizing text in an image and storing it in HOCR format.
- Added parsing of gene lists from ‘Appendix’ in the FoundationOneReportParser transformer.
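hOCR stores recognized text as HTML in which each element carries its bounding box in the `title` attribute (for example `bbox x0 y0 x1 y1`). As an illustration of how such output could be consumed, here is a small sketch that extracts words and their boxes with Python's standard `html.parser`. The sample markup follows the general hOCR convention and is an assumption; the exact markup produced by ImageToHocr may differ, and `HocrWordParser` and `parse_bbox` are hypothetical names.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from an hOCR title such as 'bbox 1 2 3 4; x_wconf 95'."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(v) for v in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs for every span with class 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._bbox = parse_bbox(attrs.get("title") or "")

    def handle_data(self, data):
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))

    def handle_endtag(self, tag):
        if tag == "span":
            self._bbox = None


# Assumed sample in the general hOCR style, not actual ImageToHocr output.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 600 800'>
  <span class='ocr_line' title='bbox 36 92 210 116'>
    <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Spark</span>
    <span class='ocrx_word' title='bbox 110 92 210 116; x_wconf 93'>OCR</span>
  </span>
</div>
"""

parser = HocrWordParser()
parser.feed(SAMPLE)
words = parser.words
```

After `feed`, `words` holds one entry per recognized word together with its pixel coordinates.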
Release date: 20-11-2020
Optimized performance for processing multipage PDF documents. Supports up to 10k pages per document.
- Added ImageAdaptiveBinarizer Scala transformer with support for:
- Gaussian local thresholding
- Otsu thresholding
- Sauvola local thresholding
- Added the ability to split a PDF into small documents to optimize processing in PdfToImage.
- Added applying binarization in PdfToImage to optimize memory usage.
- Added pdfCoordinates param to the ImageToText transformer.
- Added ‘total_pages’ field to the PdfToImage transformer.
- Added different splitting strategies to the PdfToImage transformer.
- Simplified paging in PdfToImage when running it with splitting into small PDFs.
- Added params to PdfToText for disabling extra functionality.
- Added master_url param to the Python start function.
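Splitting a large PDF into smaller documents can be pictured as cutting the page range into fixed-size chunks that are processed independently. The following is only a sketch of that arithmetic under the assumption of a fixed chunk size; `split_page_ranges`, `pages_per_chunk` and `global_page_number` are hypothetical names, not PdfToImage parameters, and the actual splitting strategies in PdfToImage may differ.

```python
def split_page_ranges(total_pages, pages_per_chunk):
    """Return half-open (start, end) page ranges that cover 0..total_pages."""
    if total_pages < 0:
        raise ValueError("total_pages must be non-negative")
    if pages_per_chunk <= 0:
        raise ValueError("pages_per_chunk must be positive")
    return [
        (start, min(start + pages_per_chunk, total_pages))
        for start in range(0, total_pages, pages_per_chunk)
    ]


def global_page_number(chunk_start, local_index):
    """Map a page index inside a chunk back to its index in the original PDF."""
    return chunk_start + local_index
```

Keeping the chunk start alongside each small document is one way to restore the original page numbering after the chunks have been processed separately.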
Release date: 22-09-2020
Support for Spark 2.3.3.
- Restored reading of JPEG2000 images
Release date: 05-09-2020
Support for parsing data from tables in selectable PDFs.
- Added PdfToTextTable transformer for extracting tables from each page of a PDF document.
- Added ImageCropper transformer for cropping images.
- Added ImageBrandsToText transformer for detecting text in defined areas.
Release date: 22-07-2020
FoundationOne report parsing support.
- Optimized memory usage during image processing
- Added FoundationOneReportParser, which supports parsing patient info, genomic and biomarker findings.
Release date: 23-06-2020
Added support for the Dicom format and improved support for image morphological operations.
- Updated start function. Improved support for Spark NLP internal.
- Improved existing transformers to support de-identification of Dicom documents.
- Added the ability to draw filled rectangles in ImageDrawRegions.
- Added support for reading and writing Dicom documents.
- Added ImageMorphologyOperation transformer, which supports erosion, dilation, opening and closing operations.
- Fixed issue in ImageToText related to coordinate extraction.
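The four morphological operations named above are related: opening is erosion followed by dilation (removes small specks), and closing is dilation followed by erosion (fills small holes). The sketch below illustrates this on a binary image stored as nested lists, assuming a 3x3 square structuring element and treating pixels outside the image as background. It is not the ImageMorphologyOperation implementation, and all function names are hypothetical.

```python
def _pixel(img, y, x):
    """Return img[y][x], treating everything outside the image as background (0)."""
    if 0 <= y < len(img) and 0 <= x < len(img[0]):
        return img[y][x]
    return 0


def _filter3x3(img, combine):
    """Apply `combine` (min or max) over the 3x3 neighbourhood of every pixel."""
    return [
        [
            combine(_pixel(img, y + dy, x + dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1))
            for x in range(len(img[0]))
        ]
        for y in range(len(img))
    ]


def erode(img):
    """A pixel stays 1 only if its whole 3x3 neighbourhood is 1."""
    return _filter3x3(img, min)


def dilate(img):
    """A pixel becomes 1 if any pixel in its 3x3 neighbourhood is 1."""
    return _filter3x3(img, max)


def opening(img):
    """Erosion followed by dilation: removes specks smaller than the element."""
    return dilate(erode(img))


def closing(img):
    """Dilation followed by erosion: fills holes smaller than the element."""
    return erode(dilate(img))
```

In an OCR pipeline, opening is typically used to remove isolated noise pixels and closing to reconnect broken strokes before recognition.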
Release date: 22-05-2020
New functionality for the de-identification problem.
- Renamed TesseractOCR to ImageToText.
- Simplified installation.
- Added check license from
- Added storing support for binaryFormat. Added support for storing image and PDF files.
- Added support for selectable PDFs in the TextToPdf transformer.
- Added UpdateTextPosition transformer.
Release date: 08-04-2020
Improved support for Databricks and for processing selectable PDFs.
- Adapted Spark OCR to run on Databricks.
- Added rewriting positions in ImageToText when run together with PdfToText.
- Added ‘positionsCol’ param to ImageToText.
- Improved support for Spark NLP. Changed start function.
- Added showImage implicit to Dataframe for displaying images in Scala Databricks notebooks.
- Added display_images function for displaying images in Python Databricks notebooks.
- Added propagation of the selectable PDF file in TextToPdf. Added ‘inputContent’ param to ‘TextToPdf’.
Release date: 09-03-2020
Minor improvements and fixes
- Improved messages during license validation
- Fixed dependencies issue
Release date: 06-03-2020
Integration with license server.
- Added license validation. The license can be set in the following ways:
- Environment variable. Set variable ‘JSL_OCR_LICENSE’.
- System property. Set property ‘jsl.sparkocr.settings.license’.
- Application.conf file. Set property ‘jsl.sparkocr.settings.license’.
- Added automatic license renewal using the JSL license server.
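As a small illustration of the first option, the license can be exported to the environment before the session is started. This is a sketch that assumes only the variable and property names given in the list above; `export_license` and `license_jvm_option` are hypothetical helpers and not part of Spark OCR. The second helper merely formats the standard JVM `-D` flag for the system property.

```python
import os

LICENSE_ENV_VAR = "JSL_OCR_LICENSE"                 # name from the list above
LICENSE_PROPERTY = "jsl.sparkocr.settings.license"  # name from the list above


def export_license(license_key, environ=None):
    """Hypothetical helper: store the license key in the environment variable."""
    env = os.environ if environ is None else environ
    key = (license_key or "").strip()
    if not key:
        raise ValueError("license key is empty")
    env[LICENSE_ENV_VAR] = key
    return key


def license_jvm_option(license_key):
    """Hypothetical helper: format the key as a JVM system property flag."""
    return "-D{}={}".format(LICENSE_PROPERTY, license_key.strip())
```

Passing a plain dictionary as `environ` keeps the helper testable without touching the real process environment.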
Release date: 03-03-2020
This release contains improvements to image preprocessing before running OCR and adds the ability to store results to PDF to keep the original formatting.
- Added auto calculation of the maximum size of objects to remove in ImageRemoveObjects. This improvement avoids removing . and affecting symbols with dots (
ImageRemoveObjects transformer for activating this functionality.
ImageToText transformer for setting any OCR params.
- Added extraction of font size in
- Added TextToPdf transformer for rendering text with positions to a PDF file.
- Added setting of resolution in ImageToText, and added the ignoreResolution param with default
ImageToText transformer for backward compatibility.
- Added parsing of resolution from image metadata in
- Added storing of resolution in
- Added resolution field to Image schema.
- Updated ‘start’ function to set the ‘PYSPARK_PYTHON’ env variable.
- Improved auto-scaling/skew correction:
- improved access to image values
- removed unnecessary copies of images
- added more test cases
- improved auto-correlation in auto-scaling.
Release date: 12-02-2020
Spark NLP OCR functionality was reimplemented as a set of Spark ML transformers and moved to a separate Spark OCR library.
- Added extraction of the coordinates of each symbol in ImageToText
- Added ImageDrawRegions transformer
- Added ImageToPdf transformer
- Added ImageMorphologyOpening transformer
- Added ImageRemoveObjects transformer
- Added ImageAdaptiveThresholding transformer
- Reimplemented main functionality as Spark ML transformers
- Moved DrawRectangle functionality to PdfDrawRegions transformer
- Added ‘start’ function with support for SparkMonitor initialization
- Moved PositionFinder to Spark OCR
- Fixed bug with transforming complex pdf to image