Document Layout Analysis optimized with DIT

Description

DiT, as proposed in the paper “DiT: Self-supervised Pre-training for Document Image Transformer” authored by Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei, applies the self-supervised learning objective from BEiT (BERT pre-training of Image Transformers) to a dataset comprising 42 million document images. This approach has demonstrated cutting-edge performance specifically in tasks related to document layout analysis, with a focus on the PubLayNet dataset. PubLayNet consists of over 360,000 document images constructed by automatically parsing PubMed XML files. The utilization of DiT proves highly effective in understanding and interpreting the layout structures within diverse document images.

Predicted Entities

Live Demo Open in Colab Download

How to use

binary_to_image = BinaryToImage()
binary_to_image.setImageType(ImageType.TYPE_3BYTE_BGR)

dit_layout = DocumentLayoutAnalyzerDit \
    .pretrained("publaynet_dit_base_mrcnn_jsl", "en", "clinical/ocr") \
    .setInputCol("image") \
    .setOutputCol("regions") \
    .setScoreThreshold(0.5)

draw_regions = ImageDrawRegions() \
    .setInputCol("image") \
    .setInputRegionsCol("regions") \
    .setOutputCol("image_with_regions") \
    .setRectColor(Color.green)

# OCR pipeline
pipeline = PipelineModel(stages=[
    binary_to_image,
    dit_layout,
    draw_regions
])

imagePath = "/content/images"
image_df = spark.read.format("binaryFile").load(imagePath)

result = pipeline.transform(image_df).cache()
val binary_to_image = BinaryToImage().setImageType(ImageType.TYPE_3BYTE_BGR)

val dit_layout = ImageLayoutAnalyzerDit
    .pretrained("publaynet_dit_base_mrcnn", "en", "clinical/ocr")
    .setInputCol("image")
    .setOutputCol("regions")
    .setScoreThreshold(0.5)

val draw_regions = ImageDrawRegions()
    .setInputCol("image")
    .setInputRegionsCol("regions")
    .setOutputCol("image_with_regions")
    .setRectColor(Color.green)

# OCR pipeline
val pipeline = new PipelineModel().setStages(Array(
    binary_to_image,
    dit_layout,
    draw_regions))

val imagePath = "/content/images"
val image_df = spark.read.format("binaryFile").load(imagePath)

val result = pipeline.transform(image_df).cache()

Example

Input:

Screenshot

Output:

Screenshot

Model Information

Model Name: publaynet_dit_base_mrcnn
Type: ocr
Compatibility: Visual NLP 5.1.0+
License: Licensed
Edition: Official
Language: en