Pretrained Pipeline for Reading and Removing Noise in Mixed Scanned and Digital PDF Documents

Description

Pretrained pipeline designed to remove noise in input printed documents, enhancing OCR readability for more accurate text extraction.

Predicted Entities

Live Demo Open in Colab Download

How to use

pdf_pipeline = PretrainedPipeline('mixed_scanned_digital_pdf_image_cleaner', 'en', 'clinical/ocr')

pdf_path = '/content/pdfs/'
pdf_example_df = spark.read.format("binaryFile").load(pdf_path).cache()

result = pdf_pipeline.transform(pdf_example_df)
val pdf_pipeline = new PretrainedPipeline("mixed_scanned_digital_pdf_image_cleaner", "en", "clinical/ocr")

val pdf_path = "/content/pdfs/"
val pdf_example_df = spark.read.format("binaryFile").load(pdf_path).cache()

val result = pdf_pipeline.transform(pdf_example_df)

Example

Input

Screenshot

Output

Screenshot

Sample specifications written by
 , BLEND CASING RECASING

- OLD GOLD STRAIGHT Tobacco Blend

Control for Sample No. 5030

Cigarettes:

OLD GOLD STRAIGHT

 

John H. M. Bohlken

FINAL FLAVOR MENTHOL FLAVOR

Tars and Nicotine, Taste Panel, Burning Time, Gas Phase Analysis,
Benzo (A) Pyrene Analyses — T/C -CF~ O.C S51: Fee -

Written by -- John H. M. Bohlken
Original to -Mr. C. L. Tucker, dr.
Copies to ---Dr. A. W. Spears

C

~

Model Information

Model Name: mixed_scanned_digital_pdf_image_cleaner
Type: pipeline
Compatibility: Visual NLP 5.0.2+
License: Licensed
Edition: Official
Language: en