Description
This is a Visual NER, a model trained on the top of LayoutLMV2 to detect regions in Tickets. This model can be used after, for example, the Binary Image Classifier of Tickets, available at https://nlp.johnsnowlabs.com/2022/09/07/finvisualclf_vit_tickets_en.html
Predicted Entities
COMPANY
, DATE
, AMOUNT
, NAME
, NUM
, UNITPRICE
, CNT
, DISCOUNTPRICE
, PRICE
, ITEMSUBTOTAL
, VATyn
, SUBTOTAL
, TOTALDISCOUNT
, SERVICEPRICE
, OTHERSVCPRICE
, TAX
, TOTAL
, CASH
, CHANGE
, CREDITCARD
, EMONEY
How to use
binary_to_image = BinaryToImage()\
.setOutputCol("image") \
.setImageType(ImageType.TYPE_3BYTE_BGR)
img_to_hocr = ImageToHocr()\
.setInputCol("image")\
.setOutputCol("hocr")\
.setIgnoreResolution(False)\
.setOcrParams(["preserve_interword_spaces=0"])
tokenizer = HocrTokenizer()\
.setInputCol("hocr")\
.setOutputCol("token")
doc_ner = VisualDocumentNerV21()\
.pretrained("visualner_receipts", "en", "clinical/ocr")\
.setInputCols(["token", "image"])\
.setOutputCol("entities")
draw = ImageDrawAnnotations() \
.setInputCol("image") \
.setInputChunksCol("entities") \
.setOutputCol("image_with_annotations") \
.setFontSize(10) \
.setLineWidth(4)\
.setRectColor(Color.red)
# OCR pipeline
pipeline = PipelineModel(stages=[
binary_to_image,
img_to_hocr,
tokenizer,
doc_ner,
draw
])
import pkg_resources
bin_df = spark.read.format("binaryFile").load('data/t01.jpg')
bin_df.show()
results = pipeline.transform(bin_df).cache()
res = results.collect()
## since pyspark2.3 doesn't have element_at, 'getItem' is involked
path_array = f.split(results['path'], '/')
# from pyspark2.4
# results.withColumn("filename", f.element_at(f.split("path", "/"), -1)) \
results.withColumn('filename', path_array.getItem(f.size(path_array)- 1)) \
.withColumn("exploded_entities", f.explode("entities")) \
.select("filename", "exploded_entities") \
.show(truncate=False)
Results
+----------+-------------------------------------------------------------------------------------------------------------------------------------------+
|filename |exploded_entities |
+----------+-------------------------------------------------------------------------------------------------------------------------------------------+
|test0.jpeg|{named_entity, 24, 24, UNITPRICE-B, {confidence -> 95, width -> 66, x -> 306, y -> 229, word -> #010029, token -> #, height -> 17}, []} |
|test0.jpeg|{named_entity, 32, 35, NAME-B, {confidence -> 91, width -> 38, x -> 200, y -> 250, word -> Sale, token -> sale, height -> 17}, []} |
|test0.jpeg|{named_entity, 37, 37, OTHERS, {confidence -> 91, width -> 8, x -> 249, y -> 253, word -> #, token -> #, height -> 15}, []} |
|test0.jpeg|{named_entity, 39, 47, NUM-B, {confidence -> 96, width -> 83, x -> 270, y -> 252, word -> 143710882, token -> 143710882, height -> 17}, []}|
|test0.jpeg|{named_entity, 49, 52, NAME-B, {confidence -> 96, width -> 37, x -> 191, y -> 274, word -> Team, token -> team, height -> 17}, []} |
|test0.jpeg|{named_entity, 66, 68, CNT-B, {confidence -> 88, width -> 28, x -> 82, y -> 296, word -> Jan, token -> jan, height -> 16}, []} |
|test0.jpeg|{named_entity, 114, 114, OTHERS, {confidence -> 63, width -> 27, x -> 229, y -> 323, word -> ***, token -> *, height -> 13}, []} |
+----------+-------------------------------------------------------------------------------------------------------------------------------------------+
Model Information
Model Name: | visualner_receipts |
Type: | ocr |
Compatibility: | Visual NLP 4.0.0+ |
License: | Licensed |
Edition: | Official |
Language: | xx |
Size: | 744.4 MB |
References
CORD