Spark OCR release notes 4.0.2

4.0.2

Release date: 12-09-2022

We are glad to announce that Spark OCR 4.0.2 has been released! This release comes with new features, fixes and more!

VisualDocumentClassifierV2 is now trainable! Continuing the effort to make our most useful models easily trainable, we added training capabilities to this annotator.
Added support for Simplified Chinese.
Added a new PdfToForm annotator, capable of extracting forms from digital PDFs. It differs from the previously introduced VisualDocumentNER annotator in that it works only on digital documents, as opposed to the scanned forms handled by VisualDocumentNER. The two annotators are complementary.

We added two new notebooks for VisualDocumentClassifierV2: a preprocessing notebook, useful when you’re dealing with large datasets, and a fine-tuning notebook.
We added a new sample notebook showing how to extract forms from digital PDF documents.
We added a new sample notebook explaining how to use Simplified Chinese OCR.