Description
DiT was proposed in "DiT: Self-supervised Pre-training for Document Image Transformer" by Junlong Li, Yiheng Xu, Tengchao Lv, Lei Cui, Cha Zhang, and Furu Wei. DiT applies the self-supervised objective of BEiT (BERT Pre-Training of Image Transformers) to 42 million document images. This model was fine-tuned for document image classification on the RVL-CDIP dataset, a collection of 400,000 images, each belonging to one of 16 classes.
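As a rough illustration of the BEiT-style objective, the sketch below performs only the input-corruption step of masked image modeling: it hides a random subset of image patches, and a model would then be trained to predict the content of the hidden patches. The 14 x 14 patch grid, the 40% mask ratio, and the function name `random_patch_mask` are assumptions chosen for illustration, not values taken from this model card. Uniform random sampling is also a simplification and may differ from the masking scheme used in the paper.

```python
import random

def random_patch_mask(grid_size=14, mask_ratio=0.4, seed=0):
    """Return a grid_size x grid_size grid of booleans; True marks a masked patch.

    grid_size and mask_ratio are illustrative assumptions, not model settings.
    """
    rng = random.Random(seed)
    num_patches = grid_size * grid_size
    num_masked = int(num_patches * mask_ratio)
    # Pick the flat indices of the patches to hide, without repetition
    masked = set(rng.sample(range(num_patches), num_masked))
    return [
        [(r * grid_size + c) in masked for c in range(grid_size)]
        for r in range(grid_size)
    ]

mask = random_patch_mask()
num_masked = sum(cell for row in mask for cell in row)
print(f"{num_masked} of {14 * 14} patches masked")
```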
The abstract from the paper is the following: Image Transformer has recently achieved significant progress for natural image understanding, either using supervised (ViT, DeiT, etc.) or self-supervised (BEiT, MAE, etc.) pre-training techniques. In this paper, we propose DiT, a self-supervised pre-trained Document Image Transformer model using large-scale unlabeled text images for Document AI tasks, which is essential since no supervised counterparts ever exist due to the lack of human labeled document images. We leverage DiT as the backbone network in a variety of vision-based Document AI tasks, including document image classification, document layout analysis, as well as table detection. Experiment results have illustrated that the self-supervised pre-trained DiT model achieves new state-of-the-art results on these downstream tasks, e.g. document image classification (91.11 → 92.69), document layout analysis (91.0 → 94.9) and table detection (94.23 → 96.55).
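To make the improvements quoted in the abstract easier to compare, the short sketch below computes the absolute gain for each task from the reported before and after scores. The dictionary layout and variable names are illustrative. The three tasks may be scored with different metrics, so the gains are expressed simply as points.

```python
# (previous best, DiT) scores as quoted in the abstract
reported = {
    "document image classification": (91.11, 92.69),
    "document layout analysis": (91.0, 94.9),
    "table detection": (94.23, 96.55),
}

# Absolute improvement per task, rounded to avoid floating-point noise
gains = {task: round(dit - base, 2) for task, (base, dit) in reported.items()}

for task, gain in gains.items():
    print(f"{task}: +{gain:.2f} points")
```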
Predicted Entities
label
How to use
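# Python
# Assumes `spark` is an active Spark session with Visual NLP (Spark OCR) loaded
import pkg_resources
from pyspark.ml import PipelineModel
from sparkocr.transformers import *
from sparkocr.enums import *

# Convert the raw file bytes into an image column
binary_to_image = BinaryToImage() \
    .setInputCol("content") \
    .setOutputCol("image") \
    .setImageType(ImageType.TYPE_3BYTE_BGR)

# Load the pretrained DiT document classifier
doc_class = VisualDocumentClassifierV3() \
    .pretrained("dit_base_finetuned_rvlcdip", "en", "clinical/ocr") \
    .setInputCols(["image"]) \
    .setOutputCol("label")

# OCR pipeline
pipeline = PipelineModel(stages=[
    binary_to_image,
    doc_class
])

# Sample image bundled with the sparkocr package
test_image_path = pkg_resources.resource_filename('sparkocr', 'resources/ocr/visualdoc/00556614_00556648.tif')
bin_df = spark.read.format("binaryFile").load(test_image_path).limit(50)

results = pipeline.transform(bin_df).cache()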

// Scala
import org.apache.spark.ml.Pipeline

val binary_to_image = new BinaryToImage()
  .setInputCol("content")
  .setOutputCol("image")
  .setImageType(ImageType.TYPE_3BYTE_BGR)

val doc_class = VisualDocumentClassifierV3()
  .pretrained("dit_base_finetuned_rvlcdip", "en", "clinical/ocr")
  .setInputCols(Array("image"))
  .setOutputCol("label")

// OCR pipeline
val pipeline = new Pipeline().setStages(Array(
  binary_to_image,
  doc_class))

// pkg_resources is Python-only; this placeholder path must point to the sample image
val test_image_path = "path/to/00556614_00556648.tif"
val bin_df = spark.read.format("binaryFile").load(test_image_path).limit(50)

// Both stages are transformers, so fit() only assembles the PipelineModel
val results = pipeline.fit(bin_df).transform(bin_df).cache()
Example
Input: document image (not shown)
Output:
+-------+
|label |
+-------+
|invoice|
+-------+
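Once `results` has been computed, the predictions can be summarized per class. The sketch below is plain Python and uses a hypothetical list of dictionaries, `rows`, to stand in for what `results.select("label").collect()` would return. The helper name `count_labels` and the sample label `letter` are illustrative assumptions; only `invoice` appears in the output above.

```python
from collections import Counter

def count_labels(rows, label_col="label"):
    """Count how many documents were assigned to each predicted class."""
    return Counter(row[label_col] for row in rows)

# Hypothetical stand-in for results.select("label").collect()
rows = [{"label": "invoice"}, {"label": "letter"}, {"label": "invoice"}]

counts = count_labels(rows)
for label, n in counts.most_common():
    print(f"{label}: {n}")
```

PySpark `Row` objects also support `row["label"]` lookups, so the same helper should work on collected rows. For large datasets, `results.groupBy("label").count()` avoids collecting every row to the driver.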
Model Information
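+---------------+----------------------------+
| Model Name    | dit_base_finetuned_rvlcdip |
| Type          | ocr                        |
| Compatibility | Visual NLP 4.0.0+          |
| License       | Licensed                   |
| Edition       | Official                   |
| Language      | en                         |
| Size          | 319.6 MB                   |
+---------------+----------------------------+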
References
IIT-CDIP, RVL-CDIP