Text cleaner v1

Description

This model is designed to enhance image quality by reducing noise and imperfections while preserving the original content. It is based on a text detection framework but incorporates additional post-processing techniques to refine and clean the image. By effectively identifying and minimizing distortions, the model improves clarity and ensures a more visually accurate representation of the original image.

Predicted Entities

corrected_image.

Live Demo Open in Colab Download Copy S3 URI

How to use

pdf_to_image = PdfToImage() \
    .setInputCol("content") \
    .setOutputCol("image") \
    .setResolution(300)

ocr = ImageToText() \
    .setInputCol("image") \
    .setOutputCol("text") \
    .setConfidenceThreshold(70) \
    .setIgnoreResolution(False)

cleaner = ImageTextCleaner \
    .pretrained("text_cleaner_v1", "en", "clinical/ocr") \
    .setInputCol("image") \
    .setOutputCol("corrected_image") \
    .setMedianBlur(0) \
    .setSizeThreshold(10) \
    .setTextThreshold(0.3) \
    .setLinkThreshold(0.2) \
    .setPadding(5) \
    .setBinarize(False)

ocr_corrected = ImageToText() \
    .setInputCol("corrected_image") \
    .setOutputCol("corrected_text") \
    .setConfidenceThreshold(70) \
    .setIgnoreResolution(False)

pipeline = PipelineModel(stages=[
    pdf_to_image,
    ocr,
    cleaner,
    ocr_corrected
])

pdf_example = 'data/pdfs/noised.pdf'
pdf_example_df = spark.read.format("binaryFile").load(pdf_example).cache()

results = pipeline.transform(pdf_example_df).cache()

val pdf_to_image = new PdfToImage() 
    .setInputCol("content") 
    .setOutputCol("image") 
    .setResolution(300)

val ocr = new ImageToText() 
    .setInputCol("image") 
    .setOutputCol("text") 
    .setConfidenceThreshold(70) 
    .setIgnoreResolution(False)

val cleaner = ImageTextCleaner 
    .pretrained("text_cleaner_v1", "en", "clinical/ocr") 
    .setInputCol("image") 
    .setOutputCol("corrected_image") 
    .setMedianBlur(0) 
    .setSizeThreshold(10) 
    .setTextThreshold(0.3) 
    .setLinkThreshold(0.2) 
    .setPadding(5) 
    .setBinarize(False)

val ocr_corrected = new ImageToText() 
    .setInputCol("corrected_image") 
    .setOutputCol("corrected_text") 
    .setConfidenceThreshold(70) 
    .setIgnoreResolution(False)

val pipeline = new PipelineModel().setStages(Array(
    pdf_to_image, 
    ocr, 
    cleaner, 
    ocr_corrected))

val pdf_example = "data/pdfs/noised.pdf"
val pdf_example_df = spark.read.format("binaryFile").load(pdf_example).cache()

val results = pipeline.transform(pdf_example_df).cache()

Example

Input image

Output image

Output text

Detected text:
 
Sample specifications written by
 , BLEND CASING RECASING

- OLD GOLD STRAIGHT Tobacco Blend

Control for Sample No. 5030

Cigarettes:

OLD GOLD STRAIGHT

John H. M. Bohlken

FINAL FLAVOR MENTHOL FLAVOR

Tars and Nicotine, Taste Panel, Burning Time, Gas Phase Analysis,
Benzo (A) Pyrene Analyses — T/C -CF~ O.C S51: Fee -

Written by -- John H. M. Bohlken
Original to -Mr. C. L. Tucker, dr.
Copies to ---Dr. A. W. Spears

C

~

Model Information

Model Name:	text_cleaner_v1
Type:	ocr
Compatibility:	Visual NLP 4.1.0+
License:	Licensed
Edition:	Official
Language:	en
Size:	77.1 MB

PREVIOUSОcr small for printed text

NEXTVisual NER on 10K Filings (SEC)