Visual document understanding

NLP models are great at processing digital text, but many real-world applications use documents with more complex formats. For example, healthcare systems often include visual lab results, sequencing reports, clinical trial forms, and other scanned documents. When we use an NLP-only approach for document understanding, we lose layout and style information, which can be vital for document image understanding. New advances in multi-modal learning allow models to learn from both the text in documents (via NLP) and the visual layout (via computer vision).

Spark OCR provides multi-modal visual document understanding based on the LayoutLM architecture. It achieves new state-of-the-art accuracy on several downstream tasks, including form understanding (from 70.7 to 79.3), receipt understanding (from 94.0 to 95.2), and document image classification (from 93.1 to 94.4).

See also the webinar: Visual Document Understanding with Multi-Modal Image & Text Mining in Spark OCR 3.

VisualDocumentClassifier

VisualDocumentClassifier is a DL model for document classification using text and layout data. A pretrained model is currently available, trained on the Tobacco3482 dataset, which contains 3,482 images belonging to 10 classes (Resume, News, Note, Advertisement, Scientific, Report, Form, Letter, Email, and Memo).

Input Columns

| Param name | Type | Default | Column Data Description |
| --- | --- | --- | --- |
| inputCol | string | hocr | Column name with HOCR of the document |

Parameters

| Param name | Type | Default | Description |
| --- | --- | --- | --- |
| maxSentenceLength | int | 128 | Maximum sentence length. |
| caseSensitive | boolean | false | Determines whether the model is case sensitive. |
| confidenceThreshold | float | 0f | Confidence threshold. |
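
These parameters have Spark-style setters on the transformer. A minimal sketch of tuning them, assuming setter names that mirror the parameter names above (setMaxSentenceLength is confirmed by the example below; the other two setters and the 0.5 threshold are illustrative assumptions):

from sparkocr.transformers import *

# setCaseSensitive and setConfidenceThreshold are assumed from the parameter
# table above; 0.5 is an illustrative threshold, not a recommended value
document_classifier = VisualDocumentClassifier() \
    .pretrained("visual_document_classifier_tobacco3482", "en", "clinical/ocr") \
    .setMaxSentenceLength(128) \
    .setCaseSensitive(False) \
    .setConfidenceThreshold(0.5)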

Output Columns

| Param name | Type | Default | Column Data Description |
| --- | --- | --- | --- |
| labelCol | string | label | Name of output column with the predicted label. |
| confidenceCol | string | confidence | Name of output column with confidence. |

Example:

Scala:

import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.ocr.transformers._
import com.johnsnowlabs.ocr.OcrContext.implicits._

val imagePath = "path to image"

// Read image file as binary file
val df = spark.read
  .format("binaryFile")
  .load(imagePath)
  .asImage("image")

val imageToHocr = new ImageToHocr()
  .setInputCol("image")
  .setOutputCol("hocr")

val visualDocumentClassifier = VisualDocumentClassifier
  .pretrained("visual_document_classifier_tobacco3482", "en", "clinical/ocr")
  .setMaxSentenceLength(128)
  .setInputCol("hocr")
  .setLabelCol("label")
  .setConfidenceCol("conf")

val pipeline = new Pipeline()
pipeline.setStages(Array(
  imageToHocr,
  visualDocumentClassifier
))

val modelPipeline = pipeline.fit(df)

val result = modelPipeline.transform(df)
result.select("label").show()

Python:

from pyspark.ml import PipelineModel
from sparkocr.transformers import *

imagePath = "path to image"

# Read image file as binary file
df = spark.read \
    .format("binaryFile") \
    .load(imagePath)

binary_to_image = BinaryToImage() \
    .setInputCol("content") \
    .setOutputCol("image")

ocr = ImageToHocr() \
    .setInputCol("image") \
    .setOutputCol("hocr")

document_classifier = VisualDocumentClassifier() \
  .pretrained("visual_document_classifier_tobacco3482", "en", "clinical/ocr") \
  .setMaxSentenceLength(128) \
  .setInputCol("hocr") \
  .setLabelCol("label") \
  .setConfidenceCol("conf")

# Define pipeline
pipeline = PipelineModel(stages=[
    binary_to_image,
    ocr,
    document_classifier
])

result = pipeline.transform(df)
result.select("label").show()

Output:

+------+
| label|
+------+
|Letter|
+------+
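
Because the pipeline above sets setConfidenceCol("conf"), the prediction confidence can be inspected alongside the label:

result.select("label", "conf").show()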

VisualDocumentNER

VisualDocumentNER is a DL model for named entity recognition on documents, using text and layout data. A pretrained model is currently available, trained on the SROIE dataset, which contains 1,000 whole scanned receipt images.

Input Columns

| Param name | Type | Default | Column Data Description |
| --- | --- | --- | --- |
| inputCol | string | hocr | Column name with HOCR of the document |

Parameters

| Param name | Type | Default | Description |
| --- | --- | --- | --- |
| maxSentenceLength | int | 512 | Maximum sentence length. |
| caseSensitive | boolean | false | Determines whether the model is case sensitive. |
| whiteList | Array[String] |  | Whitelist of output labels. |
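
For example, to restrict the output to company entities only, the whitelist can be set when loading the model. A sketch: the setWhiteList setter name is assumed from the parameter table, and the label names follow the SROIE output sample below:

document_ner = VisualDocumentNer() \
    .pretrained("visual_document_NER_SROIE0526", "en", "public/ocr/models") \
    .setWhiteList(["B-COMPANY", "I-COMPANY"])  # assumed setter; labels from the sample output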

Output Columns

| Param name | Type | Default | Column Data Description |
| --- | --- | --- | --- |
| outputCol | string | entities | Name of output column with entities Annotation. |

Example:

Scala:

import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.ocr.transformers._
import com.johnsnowlabs.ocr.OcrContext.implicits._

val imagePath = "path to image"

// Read image file as binary file
val df = spark.read
  .format("binaryFile")
  .load(imagePath)
  .asImage("image")

val imageToHocr = new ImageToHocr()
  .setInputCol("image")
  .setOutputCol("hocr")

val visualDocumentNER = VisualDocumentNER
  .pretrained("visual_document_NER_SROIE0526", "en", "public/ocr/models")
  .setMaxSentenceLength(512)
  .setInputCol("hocr")

val pipeline = new Pipeline()

pipeline.setStages(Array(
  imageToHocr,
  visualDocumentNER
))

val modelPipeline = pipeline.fit(df)
val result = modelPipeline.transform(df)

result.select("entities").show()

Python:

from pyspark.ml import PipelineModel
from sparkocr.transformers import *

imagePath = "path to image"

# Read image file as binary file
df = spark.read \
    .format("binaryFile") \
    .load(imagePath)

binary_to_image = BinaryToImage() \
    .setInputCol("content") \
    .setOutputCol("image")
ocr = ImageToHocr() \
    .setInputCol("image") \
    .setOutputCol("hocr")

document_ner = VisualDocumentNer() \
  .pretrained("visual_document_NER_SROIE0526", "en", "public/ocr/models") \
  .setMaxSentenceLength(512) \
  .setInputCol("hocr") \
  .setOutputCol("entities")

# Define pipeline
pipeline = PipelineModel(stages=[
    binary_to_image,
    ocr,
    document_ner,
])

result = pipeline.transform(df)
result.select("entities").show()

Output:

+-------------------------------------------------------------------------+
|entities                                                                 |
+-------------------------------------------------------------------------+
|[[entity, 0, 0, O, [word -> 0£0, token -> 0£0], []], [entity, 0, 0,      |
| B-COMPANY, [word -> AEON, token -> aeon], []], [entity, 0, 0, B-COMPANY,|
| [word -> CO., token -> co], ...                                         |
+-------------------------------------------------------------------------+

VisualDocumentNER

VisualDocumentNER is the main entry point to transformer-based models for document NER. An example of a VisualDocumentNER task is the detection of keys and values, as in the FUNSD dataset. These keys and values represent the structure of a form and are typically “connected” to each other using a FormRelationExtractor model. Other VisualDocumentNER models are trained without this post-processing stage in mind and consider entities in isolation: sample entities would be names, places, or medications, where the goal is not to connect entities to each other but to use them individually. VisualDocumentNER follows the same architecture as VisualDocumentClassifier, receiving “visual tokens”, that is, tokens with coordinates in HOCR format, along with images to inform the model. Check the Models Hub for available models.

Input Columns

| Param name | Type | Default | Column Data Description |
| --- | --- | --- | --- |
| inputCols | Array[String] |  | Column names for tokens of the document and image |

Parameters

| Param name | Type | Default | Description |
| --- | --- | --- | --- |
| maxSentenceLength | int | 512 | Maximum sentence length. |
| whiteList | Array[String] |  | Whitelist of output labels. |

Output Columns

| Param name | Type | Default | Column Data Description |
| --- | --- | --- | --- |
| outputCol | string | entities | Name of output column with entities Annotation. |

Example:

Scala:

import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.ocr.transformers._
import com.johnsnowlabs.ocr.OcrContext.implicits._

val imagePath = "path to image"

val dataFrame = spark.read.format("binaryFile").load(imagePath)

val bin2imTransformer = new BinaryToImage()
bin2imTransformer.setImageType(ImageType.TYPE_3BYTE_BGR)

val ocr = new ImageToHocr()
  .setInputCol("image")
  .setOutputCol("hocr")
  .setIgnoreResolution(false)
  .setOcrParams(Array("preserve_interword_spaces=0"))

val tokenizer = new HocrTokenizer()
  .setInputCol("hocr")
  .setOutputCol("token")

val visualDocumentNER = VisualDocumentNER
  .pretrained("lilt_roberta_funsd_v1", "en", "clinical/ocr")
  .setInputCols(Array("token", "image"))

val pipeline = new Pipeline()
  .setStages(Array(
    bin2imTransformer,
    ocr,
    tokenizer,
    visualDocumentNER
  ))

val results = pipeline
  .fit(dataFrame)
  .transform(dataFrame)
  .select("entities")
  .cache()

results.show(truncate = false)

Python:

from pyspark.ml import PipelineModel
from pyspark.sql import functions as f
from sparkocr.transformers import *

imagePath = "path to image"

# Read image file as binary file
df = spark.read \
    .format("binaryFile") \
    .load(imagePath)

binToImage = BinaryToImage() \
    .setInputCol("content") \
    .setOutputCol("image")

ocr = ImageToHocr()\
    .setInputCol("image")\
    .setOutputCol("hocr")\
    .setIgnoreResolution(False)\
    .setOcrParams(["preserve_interword_spaces=0"])

tokenizer = HocrTokenizer()\
    .setInputCol("hocr")\
    .setOutputCol("token")

ner = VisualDocumentNer()\
    .pretrained("lilt_roberta_funsd_v1", "en", "clinical/ocr")\
    .setInputCols(["token", "image"])\
    .setOutputCol("entities")

pipeline = PipelineModel(stages=[
    binToImage,
    ocr,
    tokenizer,
    ner
    ])

result = pipeline.transform(df)
# Derive a filename column from the input path provided by the binaryFile reader
path_array = f.split(result.path, '/')
result.withColumn('filename', path_array.getItem(f.size(path_array) - 1)) \
    .withColumn("exploded_entities", f.explode("entities")) \
    .select("filename", "exploded_entities") \
    .show(truncate=False)

Output sample:

+---------+-------------------------------------------------------------------------------------------------------------------------+
|filename |exploded_entities                                                                                                        |
+---------+-------------------------------------------------------------------------------------------------------------------------+
|form1.jpg|[entity, 0, 6, i-answer, [x -> 1027, y -> 89, height -> 19, confidence -> 96, word -> Version:, width -> 90], []]        |
|form1.jpg|[entity, 25, 35, b-header, [x -> 407, y -> 190, height -> 37, confidence -> 96, word -> Institution, width -> 241], []]  |
|form1.jpg|[entity, 37, 40, i-header, [x -> 667, y -> 190, height -> 37, confidence -> 96, word -> Name, width -> 130], []]         |
|form1.jpg|[entity, 42, 52, b-question, [x -> 498, y -> 276, height -> 19, confidence -> 96, word -> Institution, width -> 113], []]|
|form1.jpg|[entity, 54, 60, i-question, [x -> 618, y -> 276, height -> 19, confidence -> 96, word -> Address, width -> 89], []]     |
+---------+-------------------------------------------------------------------------------------------------------------------------+
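
To keep only particular entity types for downstream processing, the exploded annotations can be filtered on their label, which is stored in the annotation's result field (a sketch, assuming the annotation schema visible in the sample above):

from pyspark.sql import functions as f

# Keep only the question/answer entities, using the FUNSD-style labels
# shown in the sample output above
entities = result.select(f.explode("entities").alias("e"))
entities.filter(f.col("e.result").isin(
        "b-question", "i-question", "b-answer", "i-answer")) \
    .show(truncate=False)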

FormRelationExtractor

FormRelationExtractor detects relations between the keys and values found by VisualDocumentNER.

It can detect relations only for key/value pairs on the same line.

Input Columns

| Param name | Type | Default | Column Data Description |
| --- | --- | --- | --- |
| inputCol | String |  | Column name for entities Annotation |

Parameters

| Param name | Type | Default | Description |
| --- | --- | --- | --- |
| lineTolerance | int | 15 | Line tolerance in pixels: the assumed spacing between lines, used for grouping text regions into lines. |
| keyPattern | String | question | Pattern of entity name for keys in the form. |
| valuePattern | String | answer | Pattern of entity name for values in the form. |
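
For forms with looser line spacing or differently named entity labels, these parameters can be adjusted when building the extractor (a sketch: the setter names are assumed to mirror the parameter names above):

# setLineTolerance, setKeyPattern and setValuePattern are assumed setters
# mirroring the parameter table above; 25 px is only an illustrative value
rel_extractor = FormRelationExtractor() \
    .setInputCol("entities") \
    .setOutputCol("relations") \
    .setLineTolerance(25) \
    .setKeyPattern("question") \
    .setValuePattern("answer")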

Output Columns

| Param name | Type | Default | Column Data Description |
| --- | --- | --- | --- |
| outputCol | string | relations | Name of output column with relation Annotations. |

Example:

Scala:

import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.functions.{col, explode}
import com.johnsnowlabs.ocr.transformers._
import com.johnsnowlabs.ocr.OcrContext.implicits._

val imagePath = "path to image"

val dataFrame = spark.read.format("binaryFile").load(imagePath)

val bin2imTransformer = new BinaryToImage()
bin2imTransformer.setImageType(ImageType.TYPE_3BYTE_BGR)

val ocr = new ImageToHocr()
  .setInputCol("image")
  .setOutputCol("hocr")
  .setIgnoreResolution(false)
  .setOcrParams(Array("preserve_interword_spaces=0"))

val tokenizer = new HocrTokenizer()
  .setInputCol("hocr")
  .setOutputCol("token")

val visualDocumentNER = VisualDocumentNER
  .pretrained("lilt_roberta_funsd_v1", "en", "clinical/ocr")
  .setInputCols(Array("token", "image"))

val relExtractor = new FormRelationExtractor()
  .setInputCol("entities")
  .setOutputCol("relations")

val pipeline = new Pipeline()
  .setStages(Array(
    bin2imTransformer,
    ocr,
    tokenizer,
    visualDocumentNER,
    relExtractor
  ))

val results = pipeline
  .fit(dataFrame)
  .transform(dataFrame)
  .select("relations")
  .cache()

results.select(explode(col("relations"))).show(3, false)

Python:

from pyspark.ml import PipelineModel
from pyspark.sql.functions import explode
from sparkocr.transformers import *

imagePath = "path to image"

# Read image file as binary file
df = spark.read \
    .format("binaryFile") \
    .load(imagePath)

binToImage = BinaryToImage() \
    .setInputCol("content") \
    .setOutputCol("image")

ocr = ImageToHocr()\
    .setInputCol("image")\
    .setOutputCol("hocr")\
    .setIgnoreResolution(False)\
    .setOcrParams(["preserve_interword_spaces=0"])

tokenizer = HocrTokenizer()\
    .setInputCol("hocr")\
    .setOutputCol("token")

ner = VisualDocumentNer()\
    .pretrained("lilt_roberta_funsd_v1", "en", "clinical/ocr")\
    .setInputCols(["token", "image"])\
    .setOutputCol("entities")

rel_extractor = FormRelationExtractor() \
    .setInputCol("entities") \
    .setOutputCol("relations")

pipeline = PipelineModel(stages=[
    binToImage,
    ocr,
    tokenizer,
    ner,
    rel_extractor
    ])

result = pipeline.transform(df)
result.select(explode("relations")).show(3, False)

Output sample:

+---------------------------------------------------------------------+
|col                                                                  |
+---------------------------------------------------------------------+
|[relation, 112, 134, Name: Dribbler, bbb, [bbox1 -> 58 478 69 19, ...|
|[relation, 136, 161, Study Date: 12-09-2006, 6:34, [bbox1 -> 431 ... |
|[relation, 345, 361, BP: 120 80 mmHg, [bbox1 -> 790 478 30 19, ...   |
+---------------------------------------------------------------------+
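
Each relation annotation carries the joined key/value text in its result field, as the sample above shows, so the extracted pairs can be read off as plain strings (a sketch, assuming the same annotation schema):

from pyspark.sql import functions as f

# Collect just the "key: value" text of each extracted relation
result.select(f.explode("relations").alias("r")) \
    .select(f.col("r.result").alias("key_value")) \
    .show(truncate=False)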