Description
Model for cleaning image with text. It is based on text detection model with extra post-processing.
Predicted Entities
Live Demo Open in Colab Copy S3 URI
How to use
pdf_to_image = PdfToImage() \
.setInputCol("content") \
.setOutputCol("image") \
.setResolution(300)
ocr = ImageToText() \
.setInputCol("image") \
.setOutputCol("text") \
.setConfidenceThreshold(70) \
.setIgnoreResolution(False)
cleaner = ImageTextCleaner \
.pretrained("text_cleaner_v1", "en", "clinical/ocr") \
.setInputCol("image") \
.setOutputCol("corrected_image") \
.setMedianBlur(0) \
.setSizeThreshold(10) \
.setTextThreshold(0.3) \
.setLinkThreshold(0.2) \
.setPadding(5) \
.setBinarize(False)
ocr_corrected = ImageToText() \
.setInputCol("corrected_image") \
.setOutputCol("corrected_text") \
.setConfidenceThreshold(70) \
.setIgnoreResolution(False)
pipeline = PipelineModel(stages=[
pdf_to_image,
ocr,
cleaner,
ocr_corrected
])
pdf_example = 'data/pdfs/noised.pdf'
pdf_example_df = spark.read.format("binaryFile").load(pdf_example).cache()
results = pipeline.transform(pdf_example_df).cache()
val pdf_to_image = new PdfToImage()
.setInputCol("content")
.setOutputCol("image")
.setResolution(300)
val ocr = new ImageToText()
.setInputCol("image")
.setOutputCol("text")
.setConfidenceThreshold(70)
.setIgnoreResolution(False)
val cleaner = ImageTextCleaner
.pretrained("text_cleaner_v1", "en", "clinical/ocr")
.setInputCol("image")
.setOutputCol("corrected_image")
.setMedianBlur(0)
.setSizeThreshold(10)
.setTextThreshold(0.3)
.setLinkThreshold(0.2)
.setPadding(5)
.setBinarize(False)
val ocr_corrected = new ImageToText()
.setInputCol("corrected_image")
.setOutputCol("corrected_text")
.setConfidenceThreshold(70)
.setIgnoreResolution(False)
val pipeline = new PipelineModel().setStages(Array(
pdf_to_image,
ocr,
cleaner,
ocr_corrected))
val pdf_example = "data/pdfs/noised.pdf"
val pdf_example_df = spark.read.format("binaryFile").load(pdf_example).cache()
val results = pipeline.transform(pdf_example_df).cache()
Example
Output text
Detected text:
Sample specifications written by
, BLEND CASING RECASING
- OLD GOLD STRAIGHT Tobacco Blend
Control for Sample No. 5030
Cigarettes:
OLD GOLD STRAIGHT
John H. M. Bohlken
FINAL FLAVOR MENTHOL FLAVOR
Tars and Nicotine, Taste Panel, Burning Time, Gas Phase Analysis,
Benzo (A) Pyrene Analyses — T/C -CF~ O.C S51: Fee -
Written by -- John H. M. Bohlken
Original to -Mr. C. L. Tucker, dr.
Copies to ---Dr. A. W. Spears
C
~
Model Information
Model Name: | text_cleaner_v1 |
Type: | ocr |
Compatibility: | Visual NLP 4.1.0+ |
License: | Licensed |
Edition: | Official |
Language: | en |
Size: | 77.1 MB |