NLP models are great at processing digital text, but many real-world applications use documents with more complex formats. For example, healthcare systems often include visual lab results, sequencing reports, clinical trial forms, and other scanned documents. Using an NLP-only approach for document understanding discards layout and style information, which can be vital for document image understanding. New advances in multi-modal learning allow models to learn from both the text in documents (via NLP) and the visual layout (via computer vision).
We provide multi-modal visual document understanding, built on Spark OCR and based on the LayoutLM architecture. It achieves new state-of-the-art accuracy on several downstream tasks, including form understanding (from 70.7 to 79.3), receipt understanding (from 94.0 to 95.2), and document image classification (from 93.1 to 94.4).
See also the webinar: Visual Document Understanding with Multi-Modal Image & Text Mining in Spark OCR 3
VisualDocumentClassifier
VisualDocumentClassifier
is a DL model for document classification using text and layout data.
A pretrained model is currently available, trained on the Tobacco3482 dataset, which contains 3482 images belonging
to 10 classes (Resume, News, Note, Advertisement, Scientific, Report, Form, Letter, Email, and Memo).
Input Columns
Param name | Type | Default | Column Data Description |
---|---|---|---|
inputCol | string | hocr | Column name with the HOCR of the document. |
Parameters
Param name | Type | Default | Description |
---|---|---|---|
maxSentenceLength | int | 128 | Maximum sentence length. |
caseSensitive | boolean | false | Determines whether the model is case sensitive. |
confidenceThreshold | float | 0f | Confidence threshold. |
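These parameters follow the usual Spark ML setter convention. Below is a minimal sketch of overriding them; setCaseSensitive and setConfidenceThreshold are assumed setter names derived from the parameter names above:
from sparkocr.transformers import *
# Sketch: configure the classifier. setCaseSensitive and
# setConfidenceThreshold are assumed from the parameter names above.
classifier = VisualDocumentClassifier() \
    .pretrained("visual_document_classifier_tobacco3482", "en", "clinical/ocr") \
    .setMaxSentenceLength(128) \
    .setCaseSensitive(False) \
    .setConfidenceThreshold(0.6)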
Output Columns
Param name | Type | Default | Column Data Description |
---|---|---|---|
labelCol | string | label | Name of output column with the predicted label. |
confidenceCol | string | confidence | Name of output column with confidence. |
Example:
import com.johnsnowlabs.ocr.transformers.*
import com.johnsnowlabs.ocr.OcrContext.implicits._
import org.apache.spark.ml.Pipeline
val imagePath = "path to image"
// Read image file as binary file
val df = spark.read
.format("binaryFile")
.load(imagePath)
.asImage("image")
val imageToHocr = new ImageToHocr()
.setInputCol("image")
.setOutputCol("hocr")
val visualDocumentClassifier = VisualDocumentClassifier
.pretrained("visual_document_classifier_tobacco3482", "en", "clinical/ocr")
.setMaxSentenceLength(128)
.setInputCol("hocr")
.setLabelCol("label")
.setConfidenceCol("conf")
val pipeline = new Pipeline()
pipeline.setStages(Array(
imageToHocr,
visualDocumentClassifier
))
val modelPipeline = pipeline.fit(df)
val result = modelPipeline.transform(df)
result.select("label").show()
from pyspark.ml import PipelineModel
from sparkocr.transformers import *
imagePath = "path to image"
# Read image file as binary file
df = spark.read \
    .format("binaryFile") \
    .load(imagePath)
binary_to_image = BinaryToImage() \
.setInputCol("content") \
.setOutputCol("image")
ocr = ImageToHocr() \
.setInputCol("image") \
.setOutputCol("hocr")
document_classifier = VisualDocumentClassifier() \
.pretrained("visual_document_classifier_tobacco3482", "en", "clinical/ocr") \
.setMaxSentenceLength(128) \
.setInputCol("hocr") \
.setLabelCol("label") \
.setConfidenceCol("conf")
# Define pipeline
pipeline = PipelineModel(stages=[
binary_to_image,
ocr,
document_classifier,
])
result = pipeline.transform(df)
result.select("label").show()
Output:
+------+
| label|
+------+
|Letter|
+------+
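Because the example maps the confidence column to conf, the predicted label and its confidence can be shown side by side:
# Show each prediction together with its confidence score
result.select("label", "conf").show()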
VisualDocumentNER
VisualDocumentNER
is a DL model for named entity recognition (NER) on documents, using text and layout data.
A pretrained model is currently available, trained on the SROIE dataset, which contains 1000 whole
scanned receipt images.
Input Columns
Param name | Type | Default | Column Data Description |
---|---|---|---|
inputCol | string | hocr | Column name with the HOCR of the document. |
Parameters
Param name | Type | Default | Description |
---|---|---|---|
maxSentenceLength | int | 512 | Maximum sentence length. |
caseSensitive | boolean | false | Determines whether the model is case sensitive. |
whiteList | Array[String] | | Whitelist of output labels. |
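As a sketch, the whitelist could be used to keep only company entities from the SROIE model; the setWhiteList setter name is an assumption derived from the whiteList parameter above, and the label names are taken from the sample output further below:
from sparkocr.transformers import *
# Sketch: keep only company labels. setWhiteList is an assumed setter
# name; B-COMPANY appears in the sample output below.
ner = VisualDocumentNer() \
    .pretrained("visual_document_NER_SROIE0526", "en", "public/ocr/models") \
    .setInputCol("hocr") \
    .setWhiteList(["B-COMPANY", "I-COMPANY"])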
Output Columns
Param name | Type | Default | Column Data Description |
---|---|---|---|
outputCol | string | entities | Name of the output column with entity Annotations. |
Example:
import com.johnsnowlabs.ocr.transformers.*
import com.johnsnowlabs.ocr.OcrContext.implicits._
import org.apache.spark.ml.Pipeline
val imagePath = "path to image"
// Read image file as binary file
val df = spark.read
.format("binaryFile")
.load(imagePath)
.asImage("image")
val imageToHocr = new ImageToHocr()
.setInputCol("image")
.setOutputCol("hocr")
val visualDocumentNER = VisualDocumentNER
.pretrained("visual_document_NER_SROIE0526", "en", "public/ocr/models")
.setMaxSentenceLength(512)
.setInputCol("hocr")
val pipeline = new Pipeline()
pipeline.setStages(Array(
imageToHocr,
visualDocumentNER
))
val modelPipeline = pipeline.fit(df)
val result = modelPipeline.transform(df)
result.select("entities").show()
from pyspark.ml import PipelineModel
from sparkocr.transformers import *
imagePath = "path to image"
# Read image file as binary file
df = spark.read \
    .format("binaryFile") \
    .load(imagePath)
binary_to_image = BinaryToImage() \
.setInputCol("content") \
.setOutputCol("image")
ocr = ImageToHocr() \
.setInputCol("image") \
.setOutputCol("hocr")
document_ner = VisualDocumentNer() \
.pretrained("visual_document_NER_SROIE0526", "en", "public/ocr/models") \
.setMaxSentenceLength(512) \
.setInputCol("hocr") \
.setOutputCol("entities")
# Define pipeline
pipeline = PipelineModel(stages=[
binary_to_image,
ocr,
document_ner,
])
result = pipeline.transform(df)
result.select("entities").show()
Output:
+-------------------------------------------------------------------------+
|entities |
+-------------------------------------------------------------------------+
|[[entity, 0, 0, O, [word -> 0£0, token -> 0£0], []], [entity, 0, 0, |
| B-COMPANY, [word -> AEON, token -> aeon], []], [entity, 0, 0, B-COMPANY,|
| [word -> CO., token -> co], ... |
+-------------------------------------------------------------------------+
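Each element of the entities array is an Annotation struct. Assuming the schema visible in the sample above (a result field with the label and a metadata map with the source word), the array can be flattened with plain PySpark:
from pyspark.sql import functions as f
# One row per entity: the predicted label and the originating word
result.select(f.explode("entities").alias("e")) \
    .select(f.col("e.result").alias("label"),
            f.col("e.metadata")["word"].alias("word")) \
    .show(truncate=False)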
VisualDocumentNER
VisualDocumentNER
is the main entry point to transformer-based models for document NER. An example of a VisualDocumentNER task is the detection of keys and values, as in the FUNSD dataset.
These keys and values represent the structure of a form, and are typically “connected” to each other by using a FormRelationExtractor model.
Some other VisualDocumentNER models are trained without this post-processing stage in mind, and consider entities in isolation. Some sample entities would be names, places, or medications, where the goal is not to connect the entities to other entities, but to use those entities individually.
VisualDocumentNER follows the same architecture as VisualDocumentClassifier: it receives ‘visual tokens’, that is, tokens with coordinates in HOCR format, along with images to inform the model.
Check the Models Hub for available models.
Input Columns
Param name | Type | Default | Column Data Description |
---|---|---|---|
inputCols | Array[String] | | Column names for the document tokens and image. |
Parameters
Param name | Type | Default | Description |
---|---|---|---|
maxSentenceLength | int | 512 | Maximum sentence length. |
whiteList | Array[String] | | Whitelist of output labels. |
Output Columns
Param name | Type | Default | Column Data Description |
---|---|---|---|
outputCol | string | entities | Name of the output column with entity Annotations. |
Example:
import com.johnsnowlabs.ocr.transformers.*
import com.johnsnowlabs.ocr.OcrContext.implicits._
import org.apache.spark.ml.Pipeline
val imagePath = "path to image"
val dataFrame = spark.read.format("binaryFile").load(imagePath)
val bin2imTransformer = new BinaryToImage()
bin2imTransformer.setImageType(ImageType.TYPE_3BYTE_BGR)
val ocr = new ImageToHocr()
.setInputCol("image")
.setOutputCol("hocr")
.setIgnoreResolution(false)
.setOcrParams(Array("preserve_interword_spaces=0"))
val tokenizer = new HocrTokenizer()
.setInputCol("hocr")
.setOutputCol("token")
val visualDocumentNER = VisualDocumentNER
.pretrained("lilt_roberta_funsd_v1", "en", "clinical/ocr")
.setInputCols(Array("token", "image"))
val pipeline = new Pipeline()
.setStages(Array(
bin2imTransformer,
ocr,
tokenizer,
visualDocumentNER
))
val results = pipeline
.fit(dataFrame)
.transform(dataFrame)
.select("entities")
.cache()
results.show()
from pyspark.ml import PipelineModel
from pyspark.sql import functions as f
from sparkocr.transformers import *
imagePath = "path to image"
# Read image file as binary file
df = spark.read \
    .format("binaryFile") \
    .load(imagePath)
binToImage = BinaryToImage() \
.setInputCol("content") \
.setOutputCol("image")
ocr = ImageToHocr()\
.setInputCol("image")\
.setOutputCol("hocr")\
.setIgnoreResolution(False)\
.setOcrParams(["preserve_interword_spaces=0"])
tokenizer = HocrTokenizer()\
.setInputCol("hocr")\
.setOutputCol("token")
ner = VisualDocumentNer()\
.pretrained("lilt_roberta_funsd_v1", "en", "clinical/ocr")\
.setInputCols(["token", "image"])\
.setOutputCol("entities")
pipeline = PipelineModel(stages=[
binToImage,
ocr,
tokenizer,
ner
])
result = pipeline.transform(df)
path_array = f.split(result["path"], "/")
result.withColumn("filename", path_array.getItem(f.size(path_array) - 1)) \
.withColumn("exploded_entities", f.explode("entities")) \
.select("filename", "exploded_entities") \
.show(truncate=False)
Output sample:
+---------+-------------------------------------------------------------------------------------------------------------------------+
|filename |exploded_entities |
+---------+-------------------------------------------------------------------------------------------------------------------------+
|form1.jpg|[entity, 0, 6, i-answer, [x -> 1027, y -> 89, height -> 19, confidence -> 96, word -> Version:, width -> 90], []] |
|form1.jpg|[entity, 25, 35, b-header, [x -> 407, y -> 190, height -> 37, confidence -> 96, word -> Institution, width -> 241], []] |
|form1.jpg|[entity, 37, 40, i-header, [x -> 667, y -> 190, height -> 37, confidence -> 96, word -> Name, width -> 130], []] |
|form1.jpg|[entity, 42, 52, b-question, [x -> 498, y -> 276, height -> 19, confidence -> 96, word -> Institution, width -> 113], []]|
|form1.jpg|[entity, 54, 60, i-question, [x -> 618, y -> 276, height -> 19, confidence -> 96, word -> Address, width -> 89], []] |
+---------+-------------------------------------------------------------------------------------------------------------------------+
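A quick way to see how many entities of each type were predicted, again assuming the Annotation schema shown above, is a plain PySpark aggregation:
from pyspark.sql import functions as f
# Count predicted entities per label
result.select(f.explode("entities").alias("e")) \
    .groupBy(f.col("e.result").alias("label")) \
    .count() \
    .show()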
FormRelationExtractor
FormRelationExtractor
detects relations between the keys and values found by VisualDocumentNER.
It can detect relations only for key/value pairs on the same line.
Input Columns
Param name | Type | Default | Column Data Description |
---|---|---|---|
inputCol | String | | Column name with entity Annotations. |
Parameters
Param name | Type | Default | Description |
---|---|---|---|
lineTolerance | int | 15 | Line tolerance in pixels: the assumed spacing between lines, used for grouping text regions into lines. |
keyPattern | String | question | Pattern of entity name for keys in form. |
valuePattern | String | answer | Pattern of entity name for values in form. |
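The defaults already match FUNSD-style models, whose entity names contain question and answer. As a sketch, the patterns and the line tolerance could be overridden as follows; the setter names are assumptions derived from the parameter names above:
from sparkocr.transformers import *
# Sketch: setLineTolerance, setKeyPattern and setValuePattern are
# assumed setter names derived from the parameter table above.
rel_extractor = FormRelationExtractor() \
    .setInputCol("entities") \
    .setOutputCol("relations") \
    .setLineTolerance(20) \
    .setKeyPattern("question") \
    .setValuePattern("answer")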
Output Columns
Param name | Type | Default | Column Data Description |
---|---|---|---|
outputCol | string | relations | Name of output column with relation Annotations. |
Example:
import com.johnsnowlabs.ocr.transformers.*
import com.johnsnowlabs.ocr.OcrContext.implicits._
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.functions.{col, explode}
val imagePath = "path to image"
val dataFrame = spark.read.format("binaryFile").load(imagePath)
val bin2imTransformer = new BinaryToImage()
bin2imTransformer.setImageType(ImageType.TYPE_3BYTE_BGR)
val ocr = new ImageToHocr()
.setInputCol("image")
.setOutputCol("hocr")
.setIgnoreResolution(false)
.setOcrParams(Array("preserve_interword_spaces=0"))
val tokenizer = new HocrTokenizer()
.setInputCol("hocr")
.setOutputCol("token")
val visualDocumentNER = VisualDocumentNER
.pretrained("lilt_roberta_funsd_v1", "en", "clinical/ocr")
.setInputCols(Array("token", "image"))
val relExtractor = new FormRelationExtractor()
.setInputCol("entities")
.setOutputCol("relations")
val pipeline = new Pipeline()
.setStages(Array(
bin2imTransformer,
ocr,
tokenizer,
visualDocumentNER,
relExtractor
))
val results = pipeline
.fit(dataFrame)
.transform(dataFrame)
.select("relations")
.cache()
results.select(explode(col("relations"))).show(3, false)
from pyspark.ml import PipelineModel
from pyspark.sql.functions import explode
from sparkocr.transformers import *
imagePath = "path to image"
# Read image file as binary file
df = spark.read \
    .format("binaryFile") \
    .load(imagePath)
binToImage = BinaryToImage() \
.setInputCol("content") \
.setOutputCol("image")
ocr = ImageToHocr()\
.setInputCol("image")\
.setOutputCol("hocr")\
.setIgnoreResolution(False)\
.setOcrParams(["preserve_interword_spaces=0"])
tokenizer = HocrTokenizer()\
.setInputCol("hocr")\
.setOutputCol("token")
ner = VisualDocumentNer()\
.pretrained("lilt_roberta_funsd_v1", "en", "clinical/ocr")\
.setInputCols(["token", "image"])\
.setOutputCol("entities")
rel_extractor = FormRelationExtractor() \
.setInputCol("entities") \
.setOutputCol("relations")
pipeline = PipelineModel(stages=[
binToImage,
ocr,
tokenizer,
ner,
rel_extractor
])
result = pipeline.transform(df)
result.select(explode("relations")).show(3, False)
Output sample:
+---------------------------------------------------------------------+
|col |
+---------------------------------------------------------------------+
|[relation, 112, 134, Name: Dribbler, bbb, [bbox1 -> 58 478 69 19, ...|
|[relation, 136, 161, Study Date: 12-09-2006, 6:34, [bbox1 -> 431 ... |
|[relation, 345, 361, BP: 120 80 mmHg, [bbox1 -> 790 478 30 19, ... |
+---------------------------------------------------------------------+