Spark NLP release notes 1.8.0

 

1.8.0

Release date: 20-11-2020

Overview

Optimized performance for processing multipage PDF documents. Documents of up to 10k pages are now supported.

New Features

  • Added ImageAdaptiveBinarizer Scala transformer with support for:
    • Gaussian local thresholding
    • Otsu thresholding
    • Sauvola local thresholding
  • Added the option to split a PDF into smaller documents to optimize processing in PdfToImage.
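
For readers unfamiliar with the thresholding methods listed above, here is a minimal pure-Python sketch of two of them, Otsu and Sauvola. It is not the ImageAdaptiveBinarizer implementation and does not use the Spark OCR API. The function names, the flat list-of-pixels input, and the default values `k=0.2` and `r=128.0` are assumptions made for illustration only.

```python
def otsu_threshold(pixels):
    """Global threshold: pick the gray level (0-255) that maximizes
    the between-class variance of background and foreground."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue  # no background pixels yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # every pixel is already background
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def sauvola_threshold(window, k=0.2, r=128.0):
    """Local threshold for one window of pixels:
    T = mean * (1 + k * (std / r - 1))."""
    n = len(window)
    mean = sum(window) / n
    std = (sum((p - mean) ** 2 for p in window) / n) ** 0.5
    return mean * (1 + k * (std / r - 1))


def binarize(pixels, threshold):
    """Map pixels above the threshold to 255 and the rest to 0."""
    return [255 if p > threshold else 0 for p in pixels]
```

Otsu computes a single threshold for the whole image, while Sauvola computes one threshold per local window, which tends to cope better with uneven illumination on scanned pages.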

Enhancements

  • Added binarization in PdfToImage to reduce memory usage.
  • Added pdfCoordinates param to the ImageToText transformer.
  • Added ‘total_pages’ field to the PdfToImage transformer.
  • Added different splitting strategies to the PdfToImage transformer.
  • Simplified paging in PdfToImage when running it with splitting into small PDFs.
  • Added params to PdfToText for disabling extra functionality.
  • Added master_url param to the python start function.
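
The idea behind splitting a large PDF and keeping page numbering consistent can be sketched as follows. This is an illustrative fixed-size strategy only, not the splitting logic used by PdfToImage. The helper names `split_page_ranges` and `global_page_number` and the zero-based page indexing are hypothetical.

```python
def split_page_ranges(total_pages, pages_per_chunk):
    """Return half-open (start, end) page ranges that cover a document
    of `total_pages` pages in chunks of at most `pages_per_chunk`."""
    if total_pages < 0 or pages_per_chunk <= 0:
        raise ValueError("total_pages must be >= 0 and pages_per_chunk > 0")
    return [
        (start, min(start + pages_per_chunk, total_pages))
        for start in range(0, total_pages, pages_per_chunk)
    ]


def global_page_number(ranges, chunk_index, local_page):
    """Translate a page index inside one chunk back to its index
    in the original document."""
    start, end = ranges[chunk_index]
    if not 0 <= local_page < end - start:
        raise IndexError("local_page is outside the chunk")
    return start + local_page


ranges = split_page_ranges(total_pages=10, pages_per_chunk=4)
print(ranges)  # [(0, 4), (4, 8), (8, 10)]
```

Carrying the total page count alongside each chunk lets downstream stages recover the original page position after the chunks have been processed independently.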
