Description
This is a pretrained pipeline designed to remove noise from printed documents, improving the clarity and readability of the text for more accurate Optical Character Recognition (OCR). The model effectively filters out background noise, smudges, and other visual distortions commonly found in scanned or photographed documents, ensuring that the text is clean and well-defined for optimal OCR performance.
By preprocessing the document images to reduce noise, this pipeline significantly enhances the accuracy of text extraction, reducing the likelihood of errors caused by imperfections in the original document. It is an essential tool for improving OCR reliability in environments where document quality may vary, making it invaluable for tasks such as document digitization, automated content extraction, and data analysis.
Predicted Entities
Live Demo Open in Colab Download
How to use
pdf_pipeline = PretrainedPipeline('mixed_scanned_digital_pdf_image_cleaner', 'en', 'clinical/ocr')
pdf_path = '/content/pdfs/'
pdf_example_df = spark.read.format("binaryFile").load(pdf_path).cache()
result = pdf_pipeline.transform(pdf_example_df)
val pdf_pipeline = new PretrainedPipeline("mixed_scanned_digital_pdf_image_cleaner", "en", "clinical/ocr")
val pdf_path = "/content/pdfs/"
val pdf_example_df = spark.read.format("binaryFile").load(pdf_path).cache()
val result = pdf_pipeline.transform(pdf_example_df)
Example
Input
Output
Sample specifications written by
, BLEND CASING RECASING
- OLD GOLD STRAIGHT Tobacco Blend
Control for Sample No. 5030
Cigarettes:
OLD GOLD STRAIGHT
John H. M. Bohlken
FINAL FLAVOR MENTHOL FLAVOR
Tars and Nicotine, Taste Panel, Burning Time, Gas Phase Analysis,
Benzo (A) Pyrene Analyses — T/C -CF~ O.C S51: Fee -
Written by -- John H. M. Bohlken
Original to -Mr. C. L. Tucker, dr.
Copies to ---Dr. A. W. Spears
C
~
Model Information
Model Name: | mixed_scanned_digital_pdf_image_cleaner |
Type: | pipeline |
Compatibility: | Visual NLP 5.0.2+ |
License: | Licensed |
Edition: | Official |
Language: | en |