Release date: 14-04-2022
We’re glad to announce that Spark OCR 3.12.0 has been released! This release comes with new models for Handwritten Text Recognition, Spark 3.2 support, bug fixes, and notebook examples.
- Added to the ImageTextDetectorV2:
- New parameter ‘mergeIntersects’: merge bounding boxes corresponding to detected text regions, when multiple bounding boxes that belong to the same text line overlap.
- New parameter ‘forceProcessing’: now you can force processing of the results to avoid repeating the computation of results in pipelines where the same results are consumed by different transformers.
- New feature: sizeThreshold parameter sets the expected size for the recognized text. From now on, text size will be automatically detected when sizeThreshold is set to -1.
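To make the idea behind ‘mergeIntersects’ concrete, here is a minimal pure-Python sketch of merging overlapping bounding boxes. It is only an illustration of the general technique and not the ImageTextDetectorV2 implementation; the box layout (x_min, y_min, x_max, y_max) and the helper names boxes_intersect, merge_two and merge_intersecting are assumptions made for this example.

```python
from typing import List, Tuple

# Hypothetical box layout for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def boxes_intersect(a: Box, b: Box) -> bool:
    """Return True when two axis-aligned boxes share a region of positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def merge_two(a: Box, b: Box) -> Box:
    """Return the smallest axis-aligned box that contains both inputs."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_intersecting(boxes: List[Box]) -> List[Box]:
    """Repeatedly merge overlapping boxes until no two boxes intersect."""
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        result: List[Box] = []
        for box in merged:
            for i, kept in enumerate(result):
                if boxes_intersect(box, kept):
                    # Grow the kept box so it also covers the new one.
                    result[i] = merge_two(box, kept)
                    changed = True
                    break
            else:
                # No overlap with anything kept so far: keep the box as-is.
                result.append(box)
        merged = result
    return merged


if __name__ == "__main__":
    line_fragments = [(0, 0, 10, 5), (8, 1, 20, 6), (30, 0, 40, 5)]
    print(merge_intersecting(line_fragments))
```

The outer loop repeats because a freshly merged box can start to overlap a box that was previously separate.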
- Added to the ImageToTextV2:
- New parameter ‘usePandasUdf’: support PandasUdf to allow batch processing internally.
- New support for formatted output and HOCR:
ocr.setOutputFormat(OcrOutputFormat.HOCR)
ocr.setOutputFormat(OcrOutputFormat.FORMATTED_TEXT)
- Support for Spark 3.2:
- We added support for the latest Spark version, check installation instructions below.
- Known problems & workarounds:
SPARK-38330: S3 access issues. As a workaround, use the following settings:
//Scala
spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
#Python
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.path.style.access", "true")
SPARK-37577: changes in the default behavior of the query optimizer. This is already handled in the start() function; if you start the context manually, set the following Spark properties:
#Python
spark.conf.set("spark.sql.optimizer.expression.nestedPruning.enabled", False)
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", False)
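When the Spark session is created manually, it can help to keep both workarounds in one place. The following is a small sketch, assuming an existing session object named spark; the property names come from the notes above, while the dictionary names and the apply_settings helper are hypothetical and not part of Spark OCR.

```python
from typing import Callable, Dict

# Property names are taken from the known problems listed above.
# The values are passed as strings, which Spark parses as booleans.
SPARK_32_SQL_WORKAROUNDS: Dict[str, str] = {
    "spark.sql.optimizer.expression.nestedPruning.enabled": "false",
    "spark.sql.optimizer.nestedSchemaPruning.enabled": "false",
}

S3_WORKAROUNDS: Dict[str, str] = {
    "fs.s3a.path.style.access": "true",
}


def apply_settings(setter: Callable[[str, str], None], settings: Dict[str, str]) -> None:
    """Call setter(key, value) for every entry in settings."""
    for key, value in settings.items():
        setter(key, value)


# Usage with a manually started session (requires a running SparkSession named `spark`):
#   apply_settings(spark.conf.set, SPARK_32_SQL_WORKAROUNDS)
#   apply_settings(spark.sparkContext._jsc.hadoopConfiguration().set, S3_WORKAROUNDS)
```

Passing the setter as a callable keeps the helper independent of PySpark, so the same function serves both the SQL configuration and the Hadoop configuration.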
- Improved documentation on the website.
- ocr_small_printed: Text recognition small model for printed text based on ImageToTextV2
- ocr_small_handwritten: Text recognition small model for handwritten text based on ImageToTextV2
- ocr_base_handwritten: Text recognition base model for handwritten text based on ImageToTextV2
- display_table() function failing to display tables coming from digital PDFs.
- SparkOcrImageToTextV2OutputFormats.ipynb, different output formats for ImageToTextV2.
Release date: 28-02-2022
We are glad to announce that Spark OCR 3.11.0 has been released! This release comes with new models, new features, bug fixes, and notebook examples.
- Added ImageTextDetectorV2 Python Spark-OCR Transformer for detecting printed and handwritten text using CRAFT architecture with Refiner Net.
- Added ImageTextRecognizerV2 Python Spark-OCR Transformer for recognizing printed and handwritten text based on Deep Learning Transformer Architecture.
- Added FormRelationExtractor for detecting relations between key and value entities in forms.
- Added the capability of fine tuning VisualDocumentNerV2 models for key-value pairs extraction.
- ImageTextDetectorV2: this extends the ImageTextDetectorV1 character level text detection model with a refiner net architecture.
- ImageTextRecognizerV2: Text recognition for printed text based on the Deep Learning Transformer Architecture.
- Visual Document NER v2
- Creating a REST API with Synapse to extract text from images, SparkOcrRestApi
- Creating a REST API with Synapse to extract text from PDFs, SparkOcrRestApiPdf
Release date: 10-01-2022
Form recognition using LayoutLMv2 and text detection.
- Added VisualDocumentNERv2 transformer
- Added DL based ImageTextDetector transformer
- Support rotated regions in ImageSplitRegions
- Support rotated regions in ImageDrawRegions
- LayoutLMv2 fine-tuned on FUNSD dataset
- Text detection model based on CRAFT architecture
Release date: 02-11-2021
Added preserving of original file formatting
- Added keepLayout param to the ImageToText
New and updated notebooks
Release date: 20-10-2021
Improve visualization and support Spark NLP.
- Added HocrTokenizer
- Added HocrDocumentAssembler
- Added ImageDrawAnnotations
- Added support for the Arabic language in ImageToText and ImageToHocr
- Added postprocessing to the ImageTableDetector
- Added Spark NLP by default to spark session in start function
- Changed default value for ignoreResolution param in ImageToText
- Updated license-validator. Added support for floating licenses and for setting AWS keys from the license.
- Added ‘whiteList’ param to the VisualDocumentNER
New and updated notebooks
Release date: 14-09-2021
Support Microsoft PPT and PPTX documents.
- Added PptToPdf transformer for converting PPT and PPTX slides to the PDF document.
- Added PptToTextTable transformer for extracting tables from PPT and PPTX slides.
New and updated notebooks
Release date: 30-08-2021
Improve table recognition and render OCR results to the PDF with original image
- Added ImageToTextPdf transformer for storing recognized text to the searchable PDF with original image
- Added PdfAssembler for assembling multipage PDF document from single page PDF documents
- Added support for DBFS for storing models. This allows using models on Databricks.
- Improved ImageTableCellDetector algorithms
- Added params for tuning ImageTableCellDetector algorithms
- Added possibility to render detected lines to the original image in ImageTableCellDetector
- Added support to store recognized results to CSV in ImageCellsToTextTable
- Added display_table and display_tables functions
- Added display_pdf_file function for displaying pdf in embedded pdf viewer
- Updated license validator
New and updated notebooks
- Process multiple page scanned PDF (New)
- Image Table Detection example
- Image Cell Recognition example
- Image Table Recognition
- Tables Recognition from PDF
Release date: 05-08-2021
Handwritten detection and visualization improvement.
- Added ImageHandwrittenDetector for detecting ‘signature’, ‘date’, ‘name’, ‘title’, ‘address’ and other handwritten text.
- Added rendering labels and scores in ImageDrawRegions.
- Added possibility to scale image to fixed size in ImageScaler with keeping original ratio.
- Support new version of pip for installing python package
- Added support for string labels in detectors
- Added automatic inference of the input shape for detector models
- New license validator
- Fixed displaying BGR images in display functions
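Scaling an image to a fixed size while keeping the original ratio comes down to applying one common scale factor to both dimensions. The sketch below shows that arithmetic only; fit_within is a hypothetical helper written for this example and does not reflect how ImageScaler is implemented or configured.

```python
from typing import Tuple


def fit_within(width: int, height: int, target_width: int, target_height: int) -> Tuple[int, int]:
    """Return the largest size that fits in the target box and keeps the aspect ratio."""
    if min(width, height, target_width, target_height) <= 0:
        raise ValueError("all dimensions must be positive")
    # Use the smaller of the two ratios so neither side exceeds the target.
    scale = min(target_width / width, target_height / height)
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    # A 2000x1000 landscape page scaled into a 500x500 box.
    print(fit_within(2000, 1000, 500, 500))
```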
New and updated notebooks
Release date: 15-07-2021
Improve table detection and table recognition.
For more details, see Extract Tabular Data from PDF in Spark OCR.
- Added a new method to ImageTableCellDetector that supports borderless tables and combined tables.
- Added Wolf and Singh adaptive binarization methods to the ImageAdaptiveThresholding.
- Added possibility to use different types of images as input for ImageTableDetector.
- Added display_pdf and display_images_horizontal util functions.
- Tables Recognition from PDF
- Pdf de-identification on Databricks
- Dicom de-identification on Databricks
Release date: 30-06-2021
Signature Detection in image-based documents.
For more details, see Signature Detection in Spark OCR.
- ImageSignatureDetector is a DL model for detecting signatures in images.
Release date: 14-06-2021
Table detection and recognition for scanned documents.
For table detection we added ImageTableDetector. It is based on CascadeTabNet, which uses a Cascade mask Region-based CNN High-Resolution Network (Cascade mask R-CNN HRNet). The model was pre-trained on the COCO dataset and fine-tuned on the ICDAR 2019 competition dataset for table detection. It demonstrates state-of-the-art results on ICDAR 2013 and TableBank, and top results on ICDAR 2019.
For more details, see Table Detection & Extraction in Spark OCR.
- ImageTableDetector is a DL model for detecting tables in images.
- ImageTableCellDetector is a transformer for detecting cell regions in the table image.
- ImageCellsToTextTable is a transformer for extracting text from the detected cells.
Release date: 28-05-2021
Multi-modal visual document understanding, built on the LayoutLM architecture. It achieves new state-of-the-art accuracy in several downstream tasks, including form understanding and receipt understanding.
- VisualDocumentNER is a DL model for the NER problem using text and layout data. A pre-trained model on the SROIE dataset is currently available.
- Added support for the SPARK_OCR_LICENSE env key for reading the license.
- Update dependencies and sync Spark versions with Spark NLP.
- Fixed an issue that some ImageReaderSpi plugins are unavailable in the fat jar.
Release date: 16-04-2021
Image processing on GPU. It is 3.5 times faster than on CPU.
For more details, see GPU image preprocessing in Spark OCR.
- GPUImageTransformer with support for scaling, erosion, dilation, and Otsu and Huang thresholding.
- Added display_images util function for displaying images from Spark DataFrame in Jupyter notebooks.
- Improved display_image util function.
- Fixed issue with extra dependencies in start function
Release date: 02-04-2021
We are very excited to release Spark OCR 3.0.0!
Spark OCR 3.0.0 extends support to the Apache Spark 3.0.x and 3.1.x major releases on Scala 2.12 with both Hadoop 2.7 and 3.2. We will support all 4 major Apache Spark and PySpark releases: 2.3.x, 2.4.x, 3.0.x, and 3.1.x.
Spark OCR now supports TensorFlow models. The first model is VisualDocumentClassifier.
- Support for Apache Spark and PySpark 3.0.x on Scala 2.12
- Support for Apache Spark and PySpark 3.1.x on Scala 2.12
- Support 9x new Databricks runtimes:
- Databricks 7.3
- Databricks 7.3 ML GPU
- Databricks 7.4
- Databricks 7.4 ML GPU
- Databricks 7.5
- Databricks 7.5 ML GPU
- Databricks 7.6
- Databricks 7.6 ML GPU
- Databricks 8.0
- Databricks 8.0 ML (there is no GPU in 8.0)
- Databricks 8.1
- Support 2x new EMR 6.x:
- EMR 6.1.0 (Apache Spark 3.0.0 / Hadoop 3.2.1)
- EMR 6.2.0 (Apache Spark 3.0.1 / Hadoop 3.2.1)
- VisualDocumentClassifier model for classifying documents using text and layout data.
- Added support for the Vietnamese language.
Release date: 25-02-2021
Support for German, French, Spanish and Russian languages. Improved PositionsFinder and ImageToText for better de-identification support.
- Loading model data from S3 in ImageToText.
- Added support for German, French, Spanish and Russian languages in ImageToText.
- Added different OCR model types: Base, Best, Fast in ImageToText.
- Added spaces symbols to the output positions in the ImageToText transformer.
- Eliminated python-Levenshtein from dependencies to simplify installation.
- Fixed issue with extracting coordinates in ImageToText.
- Fixed loading model data on cluster in yarn mode.
Release date: 20-01-2021
Support Microsoft Docx documents.
- Added DocToText transformer for extracting text from DOCX documents.
- Added DocToTextTable transformer for extracting table data from DOCX documents.
- Added DocToPdf transformer for converting DOCX documents to PDF format.
- Fixed issue with loading model data on some cluster configurations
Release date: 11-12-2020
Extension of FoundationOne report parser and support HOCR output format.
- Added ImageToHocr transformer for recognizing text from an image and storing it in HOCR format.
- Added parsing gene lists from ‘Appendix’ in FoundationOneReportParser transformer.
Release date: 20-11-2020
Performance optimization for processing multipage PDF documents. Support for up to 10k pages per document.
- Added ImageAdaptiveBinarizer Scala transformer with support for:
- Gaussian local thresholding
- Otsu thresholding
- Sauvola local thresholding
- Added possibility to split a PDF into small documents to optimize processing in PdfToImage.
- Added applying binarization in PdfToImage to optimize memory usage.
- Added pdfCoordinates param to the ImageToText transformer.
- Added ‘total_pages’ field to the PdfToImage transformer.
- Added different splitting strategies to the PdfToImage transformer.
- Simplified paging in PdfToImage when running it with splitting into small PDFs.
- Added params to the PdfToText for disabling extra functionality.
- Added master_url param to the Python start function.
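Otsu thresholding, one of the methods listed for ImageAdaptiveBinarizer in this release, picks a single global threshold that best separates dark and light pixels. The sketch below is the textbook algorithm in plain Python over a flat list of 8-bit grayscale values. It is offered only as an illustration and makes no claim about how the Scala transformer is implemented; otsu_threshold is a hypothetical name.

```python
from typing import Sequence


def otsu_threshold(pixels: Sequence[int]) -> int:
    """Return the level t in 0..255 that maximizes between-class variance.

    Pixels with value <= t form one class, pixels with value > t the other.
    """
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(histogram))

    best_threshold = 0
    best_variance = -1.0
    weight_low = 0  # number of pixels with value <= level
    sum_low = 0     # sum of those pixel values

    for level in range(256):
        weight_low += histogram[level]
        sum_low += level * histogram[level]
        if weight_low == 0:
            continue  # no pixels in the low class yet
        weight_high = total - weight_low
        if weight_high == 0:
            break  # every pixel is already in the low class
        mean_low = sum_low / weight_low
        mean_high = (total_sum - sum_low) / weight_high
        # Between-class variance, up to a constant factor of 1 / total**2.
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_variance = variance
            best_threshold = level
    return best_threshold


if __name__ == "__main__":
    sample = [10] * 50 + [200] * 50  # dark text pixels and light background pixels
    t = otsu_threshold(sample)
    binary = [1 if p > t else 0 for p in sample]
    print(t, sum(binary))
```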
Release date: 22-09-2020
Support Spark 2.3.3.
- Restored reading of JPEG2000 images
Release date: 05-09-2020
Support parsing data from tables for selectable PDFs.
- Added PdfToTextTable transformer for extracting tables from each page of a PDF document.
- Added ImageCropper transformer for cropping images.
- Added ImageBrandsToText transformer for detecting text in defined areas.
Release date: 22-07-2020
FoundationOne report parsing support.
- Optimized memory usage during image processing
- Added FoundationOneReportParser, which supports parsing patient info, genomic and biomarker findings.
Release date: 23-06-2020
Added support for the Dicom format and improved support for image morphological operations.
- Updated start function. Improved support for Spark NLP internal.
- Improved existing transformers to support de-identification of Dicom documents.
- Added possibility to draw filled rectangles to ImageDrawRegions.
- Support reading and writing Dicom documents.
- Added ImageMorphologyOperation transformer, which supports erosion, dilation, opening and closing operations.
- Fixed issue in ImageToText related to coordinate extraction.
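For readers unfamiliar with the four morphological operations named above, here is a small pure-Python sketch on a binary image stored as a list of rows. It shows the standard definitions with a square structuring element and is not the ImageMorphologyOperation implementation. The function names, the non-empty image assumption, and the choice to clip windows at the image border are all assumptions made for this example.

```python
from typing import Callable, List

Image = List[List[int]]  # binary image: 1 = foreground, 0 = background


def _apply(image: Image, radius: int, combine: Callable[[List[int]], int]) -> Image:
    """Combine each pixel's square neighbourhood, clipping windows at the border."""
    height, width = len(image), len(image[0])  # assumes a non-empty image
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = [
                image[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            out[y][x] = combine(window)
    return out


def erode(image: Image, radius: int = 1) -> Image:
    """A pixel stays foreground only if its whole neighbourhood is foreground."""
    return _apply(image, radius, min)


def dilate(image: Image, radius: int = 1) -> Image:
    """A pixel becomes foreground if any neighbour is foreground."""
    return _apply(image, radius, max)


def opening(image: Image, radius: int = 1) -> Image:
    """Erosion followed by dilation: removes small specks."""
    return dilate(erode(image, radius), radius)


def closing(image: Image, radius: int = 1) -> Image:
    """Dilation followed by erosion: fills small holes."""
    return erode(dilate(image, radius), radius)


if __name__ == "__main__":
    speck = [[0] * 5 for _ in range(5)]
    speck[2][2] = 1  # a single noise pixel
    print(opening(speck))
```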
Release date: 22-05-2020
New functionality for the de-identification problem.
- Renamed TesseractOCR to ImageToText.
- Simplified installation.
- Added check license from
- Support storing for binaryFormat. Added support for storing Image and PDF files.
- Support selectable pdf for TextToPdf transformer.
- Added UpdateTextPosition transformer.
Release date: 08-04-2020
Improved support for Databricks and processing of selectable PDFs.
- Adapted Spark OCR to run on Databricks.
- Added rewriting of positions in ImageToText when run together with PdfToText.
- Added ‘positionsCol’ param to ImageToText.
- Improved support for Spark NLP. Changed start function.
- Added showImage implicit to Dataframe for displaying images in Scala Databricks notebooks.
- Added display_images function for displaying images in Python Databricks notebooks.
- Added propagation of the selectable PDF file in TextToPdf. Added ‘inputContent’ param to ‘TextToPdf’.
Release date: 09-03-2020
Minor improvements and fixes
- Improved messages during license validation
- Fixed dependencies issue
Release date: 06-03-2020
Integration with license server.
- Added license validation. License can be set in the following ways:
- Environment variable. Set variable ‘JSL_OCR_LICENSE’.
- System property. Set property ‘jsl.sparkocr.settings.license’.
- Application.conf file. Set property ‘jsl.sparkocr.settings.license’.
- Added automatic license renewal using the JSL license server.
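Of the options listed above, the environment variable is the easiest to set from Python. The sketch below uses only the standard library; the variable name ‘JSL_OCR_LICENSE’ comes from this release note, while set_ocr_license and the placeholder key are hypothetical. Since child processes inherit the environment at launch, the variable presumably needs to be set before the Spark session is started.

```python
import os


def set_ocr_license(license_key: str, variable: str = "JSL_OCR_LICENSE") -> None:
    """Store the license key in the given environment variable of this process."""
    if not license_key:
        raise ValueError("license key must be a non-empty string")
    os.environ[variable] = license_key


if __name__ == "__main__":
    # Placeholder value: replace it with a real license key.
    set_ocr_license("<your-license-key>")
    print("JSL_OCR_LICENSE" in os.environ)
```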
Release date: 03-03-2020
This release contains improvements to image preprocessing before running OCR and adds the ability to store results to PDF to keep the original formatting.
- Added auto calculation maximum size of objects for removing in
ImageRemoveObjects. This improvement avoids to remove
.and affect symbols with dots (
ImageRemoveObjects transformer to activate this functionality.
ImageToText transformer to set any OCR params.
- Added extraction font size in
TextToPdf transformer for rendering text with positions to a PDF file.
- Added setting resolution in
ImageToText. And added
ignoreResolution param with default
ImageToText transformer for backward compatibility.
- Added parsing resolution from image metadata in
- Added storing resolution in
- Added resolution field to Image schema.
- Updated ‘start’ function to set the ‘PYSPARK_PYTHON’ env variable.
- Improve auto-scaling/skew correction:
- improved access to images values
- removing unnecessary copies of images
- adding more test cases
- improving auto-correlation in auto-scaling.
Release date: 12-02-2020
Spark NLP OCR functionality was reimplemented as a set of Spark ML transformers and moved to the separate Spark OCR library.
- Added extraction of coordinates for each symbol in ImageToText
- Added ImageDrawRegions transformer
- Added ImageToPdf transformer
- Added ImageMorphologyOpening transformer
- Added ImageRemoveObjects transformer
- Added ImageAdaptiveThresholding transformer
- Reimplemented main functionality as Spark ML transformers
- Moved DrawRectangle functionality to PdfDrawRegions transformer
- Added ‘start’ function with support for SparkMonitor initialization
- Moved PositionFinder to Spark OCR
- Fixed bug with transforming complex pdf to image