Text cleaner v1

Description

A model for cleaning images that contain text. It is based on a text detection model with additional post-processing.
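
The source does not say what the post-processing does. As an illustration only, here is one plausible way a text-detection result could be used to clean an image: keep the pixels inside the detected text boxes and paint everything else with a background color. The function `keep_text_regions`, the box format, and the `padding` and `min_size` arguments are all hypothetical. They are loosely inspired by the `setPadding` and `setSizeThreshold` options used below, whose exact semantics are not documented here.

```python
from typing import List, Tuple

# Hypothetical box format: (left, top, right, bottom), right/bottom exclusive.
Box = Tuple[int, int, int, int]


def keep_text_regions(image: List[List[int]], boxes: List[Box],
                      padding: int = 0, min_size: int = 0,
                      background: int = 255) -> List[List[int]]:
    """Return a copy of a grayscale image where only padded text boxes survive.

    This is an illustrative sketch, not the model's actual post-processing.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    # Start from a blank page filled with the background value.
    cleaned = [[background] * width for _ in range(height)]
    for left, top, right, bottom in boxes:
        # Ignore detections narrower or shorter than min_size (likely noise).
        if (right - left) < min_size or (bottom - top) < min_size:
            continue
        # Grow the box by `padding` pixels, clamped to the image borders.
        y0, y1 = max(0, top - padding), min(height, bottom + padding)
        x0, x1 = max(0, left - padding), min(width, right + padding)
        for y in range(y0, y1):
            cleaned[y][x0:x1] = image[y][x0:x1]
    return cleaned
```

In this sketch, speckles and scan artifacts that fall outside every detected box are replaced by the background value, while the text pixels themselves are copied unchanged.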

Predicted Entities

corrected_image.


How to use

# Python
from pyspark.ml import PipelineModel
from sparkocr.transformers import *

# Render each PDF page as an image
pdf_to_image = PdfToImage() \
    .setInputCol("content") \
    .setOutputCol("image") \
    .setResolution(300)

# OCR on the original (noisy) image, kept for comparison
ocr = ImageToText() \
    .setInputCol("image") \
    .setOutputCol("text") \
    .setConfidenceThreshold(70) \
    .setIgnoreResolution(False)

# Clean the image with the pretrained model
cleaner = ImageTextCleaner \
    .pretrained("text_cleaner_v1", "en", "clinical/ocr") \
    .setInputCol("image") \
    .setOutputCol("corrected_image") \
    .setMedianBlur(0) \
    .setSizeThreshold(10) \
    .setTextThreshold(0.3) \
    .setLinkThreshold(0.2) \
    .setPadding(5) \
    .setBinarize(False)

# OCR on the cleaned image
ocr_corrected = ImageToText() \
    .setInputCol("corrected_image") \
    .setOutputCol("corrected_text") \
    .setConfidenceThreshold(70) \
    .setIgnoreResolution(False)

pipeline = PipelineModel(stages=[
    pdf_to_image,
    ocr,
    cleaner,
    ocr_corrected
])

pdf_example = 'data/pdfs/noised.pdf'
pdf_example_df = spark.read.format("binaryFile").load(pdf_example).cache()

results = pipeline.transform(pdf_example_df).cache()

// Scala
import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.ocr.transformers._

// Render each PDF page as an image
val pdf_to_image = new PdfToImage()
    .setInputCol("content")
    .setOutputCol("image")
    .setResolution(300)

// OCR on the original (noisy) image, kept for comparison
val ocr = new ImageToText()
    .setInputCol("image")
    .setOutputCol("text")
    .setConfidenceThreshold(70)
    .setIgnoreResolution(false)

// Clean the image with the pretrained model
val cleaner = ImageTextCleaner
    .pretrained("text_cleaner_v1", "en", "clinical/ocr")
    .setInputCol("image")
    .setOutputCol("corrected_image")
    .setMedianBlur(0)
    .setSizeThreshold(10)
    .setTextThreshold(0.3)
    .setLinkThreshold(0.2)
    .setPadding(5)
    .setBinarize(false)

// OCR on the cleaned image
val ocr_corrected = new ImageToText()
    .setInputCol("corrected_image")
    .setOutputCol("corrected_text")
    .setConfidenceThreshold(70)
    .setIgnoreResolution(false)

// PipelineModel has no public constructor in Scala, so build a Pipeline and fit it
val pipeline = new Pipeline().setStages(Array(
    pdf_to_image,
    ocr,
    cleaner,
    ocr_corrected))

val pdf_example = "data/pdfs/noised.pdf"
val pdf_example_df = spark.read.format("binaryFile").load(pdf_example).cache()

// All stages are transformers, so fit() only wraps them in a PipelineModel
val results = pipeline.fit(pdf_example_df).transform(pdf_example_df).cache()
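
The pipeline produces two OCR outputs per page: `text` from the original image and `corrected_text` from the cleaned one. A quick way to see what the cleaner changed is to compare the two line by line. The helper below is a minimal sketch that uses only the Python standard library; `summarize_cleaning` is a hypothetical name, and the commented loop assumes `results` holds one row per page.

```python
import difflib


def summarize_cleaning(raw_text: str, cleaned_text: str) -> dict:
    """Compare OCR output before and after cleaning (hypothetical helper)."""
    # Drop empty lines and surrounding whitespace so only real content is compared.
    raw_lines = [ln.strip() for ln in raw_text.splitlines() if ln.strip()]
    cleaned_lines = [ln.strip() for ln in cleaned_text.splitlines() if ln.strip()]
    # Line-level similarity: 1.0 means identical line sequences.
    matcher = difflib.SequenceMatcher(None, raw_lines, cleaned_lines)
    return {
        "raw_lines": len(raw_lines),
        "cleaned_lines": len(cleaned_lines),
        "similarity": round(matcher.ratio(), 3),
        # Lines that OCR recovered only after the image was cleaned.
        "only_in_cleaned": [ln for ln in cleaned_lines if ln not in raw_lines],
    }


# With the pipeline above (assumption: one row per page):
# for row in results.select("text", "corrected_text").collect():
#     print(summarize_cleaning(row.text, row.corrected_text))
```

A low similarity together with many `only_in_cleaned` lines suggests the cleaner had a large effect on that page. It does not by itself show that the new text is correct, so a manual check against the source document is still needed.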

Example

Input image

Screenshot

Output image

Screenshot

Output text

Detected text:
 

 

Sample specifications written by
 , BLEND CASING RECASING

- OLD GOLD STRAIGHT Tobacco Blend

Control for Sample No. 5030

Cigarettes:

OLD GOLD STRAIGHT

 

John H. M. Bohlken

FINAL FLAVOR MENTHOL FLAVOR

Tars and Nicotine, Taste Panel, Burning Time, Gas Phase Analysis,
Benzo (A) Pyrene Analyses — T/C -CF~ O.C S51: Fee -

Written by -- John H. M. Bohlken
Original to -Mr. C. L. Tucker, dr.
Copies to ---Dr. A. W. Spears

C

~

Model Information

Model Name: text_cleaner_v1
Type: ocr
Compatibility: Visual NLP 4.1.0+
License: Licensed
Edition: Official
Language: en
Size: 77.1 MB