LiLT model fine-tuned on FUNSD for Key-Value Recognition

Description

The LiLT model, introduced in the paper “LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding” by Jiapeng Wang, Lianwen Jin, and Kai Ding, is a language-independent model for structured document understanding. It combines any pre-trained RoBERTa text encoder with a lightweight layout Transformer, providing LayoutLM-like document understanding capabilities for a wide range of languages.
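To make the text-plus-layout design concrete, here is a minimal sketch of running LiLT outside Spark OCR. It assumes the public Hugging Face checkpoint SCUT-DLVCLab/lilt-roberta-en-base and a transformers version with LiLT support (4.24+); the words and bounding boxes are illustrative values, not taken from this model:

import torch
from transformers import AutoTokenizer, AutoModel

# checkpoint name and library support are assumptions, not part of this model card
tokenizer = AutoTokenizer.from_pretrained(
    "SCUT-DLVCLab/lilt-roberta-en-base", add_prefix_space=True)
model = AutoModel.from_pretrained("SCUT-DLVCLab/lilt-roberta-en-base")

words = ["Institution", "Name", ":"]
# one x0, y0, x1, y1 box per word, normalized to the 0-1000 range LiLT expects
word_boxes = [[204, 95, 324, 113], [333, 95, 398, 113], [401, 95, 406, 113]]

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
# expand word-level boxes to token level; special tokens get a zero box
bbox = torch.tensor([[word_boxes[i] if i is not None else [0, 0, 0, 0]
                      for i in enc.word_ids()]])
output = model(**enc, bbox=bbox)  # text and layout streams are fused internally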

This model was fine-tuned on the FUNSD dataset for key-value recognition. The Form Understanding in Noisy Scanned Documents (FUNSD) dataset comprises 199 fully annotated scanned forms that reflect real-world conditions: the documents are noisy and vary widely in appearance, which makes form understanding (FoUn) a challenging task. FUNSD can be used for many tasks, including text detection, optical character recognition, spatial layout analysis, and entity labeling/linking; a simplified annotation entry is sketched below.
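Each FUNSD annotation file is a JSON document whose "form" list holds one entry per semantic entity. A simplified, illustrative entry (the field names follow the published format; the values here are invented for the example):

funsd_entity = {
    "id": 0,
    "text": "Institution Name:",
    "label": "question",
    "box": [407, 190, 797, 227],  # x0, y0, x1, y1 in pixels
    "words": [
        {"text": "Institution", "box": [407, 190, 648, 227]},
        {"text": "Name:", "box": [667, 190, 797, 227]},
    ],
    "linking": [[0, 1]],  # this question (id 0) links to an answer entity (id 1)
}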

In the paper's abstract, the authors note that structured document understanding has become increasingly important for intelligent document processing, and that most existing models can only handle the specific language (typically English) found in their pretraining data. LiLT addresses this limitation: it can be pretrained on structured documents from a single language and then fine-tuned on other languages using off-the-shelf monolingual or multilingual pre-trained textual models. Experiments on eight languages show that LiLT achieves competitive or even superior performance on widely used downstream benchmarks, so the benefits of pretraining on document layout structure carry over across languages.

Predicted Entities

other, b-header, i-header, b-question, i-question, b-answer, i-answer. The labels follow the BIO scheme: a b- prefix marks the first token of an entity and an i- prefix marks a continuation token, as the sketch below illustrates.
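As an illustration of how the BIO tags combine into entity spans, here is a minimal, hypothetical helper (not part of Spark OCR) applied to tokens from the output table further down:

def group_bio(tagged_tokens):
    """Group (token, tag) pairs into (label, text) entity spans."""
    entities, current = [], None
    for token, tag in tagged_tokens:
        if tag.startswith("b-"):                # b-* opens a new span
            current = [tag[2:], [token]]
            entities.append(current)
        elif tag.startswith("i-") and current and current[0] == tag[2:]:
            current[1].append(token)            # i-* extends the open span
        else:                                   # "other" or an orphan i-* tag
            current = None
    return [(label, " ".join(tokens)) for label, tokens in entities]

print(group_bio([("study", "b-question"), ("id", "b-question"), (":", "i-question")]))
# [('question', 'study'), ('question', 'id :')]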

How to use

# Python
# Assumes a Spark session with Spark OCR (Visual NLP) attached; the wildcard
# imports below follow the Spark OCR documentation.
import pkg_resources
from pyspark.ml import PipelineModel
from sparkocr.transformers import *
from sparkocr.enums import *

binary_to_image = BinaryToImage() \
    .setOutputCol("image") \
    .setImageType(ImageType.TYPE_3BYTE_BGR)

img_to_hocr = ImageToHocr()\
    .setInputCol("image")\
    .setOutputCol("hocr")\
    .setIgnoreResolution(False)\
    .setOcrParams(["preserve_interword_spaces=0"])

tokenizer = HocrTokenizer()\
    .setInputCol("hocr")\
    .setOutputCol("tokens")

doc_ner = VisualDocumentNer()\
    .pretrained("lilt_roberta_funsd_v1", "en", "clinical/ocr")\
    .setInputCols(["tokens", "image"])\
    .setOutputCol("entities")

draw = ImageDrawAnnotations() \
    .setInputCol("image") \
    .setInputChunksCol("entities") \
    .setOutputCol("image_with_annotations") \
    .setFontSize(10) \
    .setLineWidth(4)\
    .setRectColor(Color.red)

# OCR pipeline
pipeline = PipelineModel(stages=[
    binary_to_image,
    img_to_hocr,
    tokenizer,
    doc_ner,
    draw
])

test_image_path = pkg_resources.resource_filename('sparkocr', 'resources/ocr/forms/form1.jpg')
bin_df = spark.read.format("binaryFile").load(test_image_path)

results = pipeline.transform(bin_df).cache()
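To inspect the recognized entities (this is how the output table below was produced), explode the entities column with standard PySpark functions; deriving filename from the binaryFile path column is a sketch:

from pyspark.sql.functions import col, element_at, explode, split

results \
    .withColumn("filename", element_at(split(col("path"), "/"), -1)) \
    .withColumn("exploded_entities", explode(col("entities"))) \
    .select("filename", "exploded_entities") \
    .show(truncate=False)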

// Scala
// The imports below are reasonable assumptions for standard Spark plus the
// Spark OCR transformers; adjust to your distribution if needed.
import java.awt.Color
import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.ocr.transformers._

val binary_to_image = new BinaryToImage()
    .setOutputCol("image")
    .setImageType(ImageType.TYPE_3BYTE_BGR)

val img_to_hocr = new ImageToHocr()
    .setInputCol("image")
    .setOutputCol("hocr")
    .setIgnoreResolution(false)
    .setOcrParams(Array("preserve_interword_spaces=0"))

val tokenizer = new HocrTokenizer()
    .setInputCol("hocr")
    .setOutputCol("tokens")

val doc_ner = VisualDocumentNer()
    .pretrained("lilt_roberta_funsd_v1", "en", "clinical/ocr")
    .setInputCols(Array("tokens", "image"))
    .setOutputCol("entities")

val draw = new ImageDrawAnnotations()
    .setInputCol("image")
    .setInputChunksCol("entities")
    .setOutputCol("image_with_annotations")
    .setFontSize(10)
    .setLineWidth(4)
    .setRectColor(Color.red)

val pipeline = new Pipeline().setStages(Array(
    binary_to_image,
    img_to_hocr,
    tokenizer,
    doc_ner,
    draw))

// pkg_resources is Python-only; in Scala, reference the sample image path directly
val test_image_path = "resources/ocr/forms/form1.jpg"
val bin_df = spark.read.format("binaryFile").load(test_image_path)

// fit is a no-op here because every stage is already a trained transformer
val results = pipeline.fit(bin_df).transform(bin_df).cache()

Example

Input image


Output image


Output text

+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------+
|filename |exploded_entities                                                                                                                                      |
+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------+
|form1.jpg|{named_entity, 0, 7, b-question, {x -> 1027, y -> 89, height -> 19, confidence -> 106, token -> version, ocr_confidence -> 96, width -> 90}, []}       |
|form1.jpg|{named_entity, 7, 8, i-question, {x -> 1027, y -> 89, height -> 19, confidence -> 67, token -> :, ocr_confidence -> 96, width -> 90}, []}              |
|form1.jpg|{named_entity, 10, 15, b-question, {x -> 1020, y -> 128, height -> 23, confidence -> 95, token -> study, ocr_confidence -> 96, width -> 61}, []}       |
|form1.jpg|{named_entity, 15, 17, b-question, {x -> 1088, y -> 128, height -> 19, confidence -> 77, token -> id, ocr_confidence -> 96, width -> 29}, []}          |
|form1.jpg|{named_entity, 17, 18, i-question, {x -> 1088, y -> 128, height -> 19, confidence -> 41, token -> :, ocr_confidence -> 96, width -> 29}, []}           |
|form1.jpg|{named_entity, 20, 31, b-question, {x -> 407, y -> 190, height -> 37, confidence -> 86, token -> institution, ocr_confidence -> 95, width -> 241}, []} |
|form1.jpg|{named_entity, 31, 35, i-question, {x -> 667, y -> 190, height -> 37, confidence -> 44, token -> name, ocr_confidence -> 95, width -> 130}, []}        |
|form1.jpg|{named_entity, 35, 46, b-question, {x -> 498, y -> 276, height -> 19, confidence -> 102, token -> institution, ocr_confidence -> 96, width -> 113}, []}|
|form1.jpg|{named_entity, 46, 53, b-question, {x -> 618, y -> 276, height -> 19, confidence -> 67, token -> address, ocr_confidence -> 96, width -> 89}, []}      |
|form1.jpg|{named_entity, 53, 64, b-question, {x -> 454, y -> 339, height -> 19, confidence -> 103, token -> institution, ocr_confidence -> 96, width -> 114}, []}|
|form1.jpg|{named_entity, 64, 71, b-question, {x -> 575, y -> 339, height -> 19, confidence -> 82, token -> address, ocr_confidence -> 96, width -> 90}, []}      |
|form1.jpg|{named_entity, 71, 75, b-question, {x -> 673, y -> 339, height -> 19, confidence -> 98, token -> line, ocr_confidence -> 96, width -> 43}, []}         |
|form1.jpg|{named_entity, 75, 76, i-question, {x -> 724, y -> 339, height -> 19, confidence -> 72, token -> #, ocr_confidence -> 96, width -> 24}, []}            |
|form1.jpg|{named_entity, 76, 77, i-question, {x -> 724, y -> 339, height -> 19, confidence -> 65, token -> 2, ocr_confidence -> 96, width -> 24}, []}            |
|form1.jpg|{named_entity, 77, 86, b-question, {x -> 497, y -> 402, height -> 23, confidence -> 88, token -> telephone, ocr_confidence -> 91, width -> 117}, []}   |
|form1.jpg|{named_entity, 86, 87, b-question, {x -> 622, y -> 402, height -> 19, confidence -> 64, token -> &, ocr_confidence -> 91, width -> 14}, []}            |
|form1.jpg|{named_entity, 87, 92, b-question, {x -> 645, y -> 402, height -> 19, confidence -> 88, token -> email, ocr_confidence -> 95, width -> 60}, []}        |
|form1.jpg|{named_entity, 92, 96, b-question, {x -> 58, y -> 478, height -> 19, confidence -> 108, token -> name, ocr_confidence -> 92, width -> 69}, []}         |
|form1.jpg|{named_entity, 96, 97, i-question, {x -> 58, y -> 478, height -> 19, confidence -> 84, token -> :, ocr_confidence -> 92, width -> 69}, []}             |
|form1.jpg|{named_entity, 97, 101, b-answer, {x -> 143, y -> 478, height -> 22, confidence -> 108, token -> dribbler, ocr_confidence -> 91, width -> 92}, []}     |
+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------+
only showing top 20 rows

Model Information

Model Name: lilt_roberta_funsd_v1
Type: ocr
Compatibility: Visual NLP 4.0.0+
License: Licensed
Edition: Official
Language: en

References

LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding, arXiv:2202.13669
FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents, https://guillaumejaume.github.io/FUNSD/