Pretrained Pipeline for Reading Printed Text with PDF Documents

Description

This pretrained pipeline extracts printed text from PDF documents. It uses Optical Character Recognition (OCR) to convert printed content into machine-readable text, and it is optimized for a range of document layouts and font types. It accepts both scanned PDFs and digitally generated ones.


How to use

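# Python (assumes an active Spark session `spark` with Visual NLP installed;
# the import path below is an assumption, adjust it to your installation)
from sparknlp.pretrained import PretrainedPipeline

pdf_pipeline = PretrainedPipeline('pdf_printed_transformer_extraction', 'en', 'clinical/ocr')

# Load every file under pdf_path as binary content
pdf_path = '/content/pdfs/'
pdf_example_df = spark.read.format("binaryFile").load(pdf_path).cache()

# Run the OCR pipeline on the loaded PDFs
result = pdf_pipeline.transform(pdf_example_df)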

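// Scala (assumes an active Spark session `spark` with Visual NLP installed;
// the import path below is an assumption, adjust it to your installation)
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

val pdf_pipeline = new PretrainedPipeline("pdf_printed_transformer_extraction", "en", "clinical/ocr")

// Load every file under pdf_path as binary content
val pdf_path = "/content/pdfs/"
val pdf_example_df = spark.read.format("binaryFile").load(pdf_path).cache()

// Run the OCR pipeline on the loaded PDFs
val result = pdf_pipeline.transform(pdf_example_df)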
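
Because the snippets above load every file under a directory, it can be useful to check beforehand which files are actually PDFs. The following is a minimal sketch in plain Python with no Spark dependency. The helper name `find_pdfs` is hypothetical and not part of Visual NLP; it assumes well-formed PDFs, which begin with the `%PDF-` header.

```python
from pathlib import Path

# Well-formed PDF files begin with this header (assumption: no leading junk bytes).
PDF_MAGIC = b"%PDF-"


def find_pdfs(directory):
    """Return sorted paths of files under `directory` that start with the PDF header.

    Hypothetical helper for illustration; not part of Visual NLP.
    """
    pdfs = []
    for path in sorted(Path(directory).rglob("*")):
        if not path.is_file():
            continue
        # Read only the first few bytes; the whole file is not needed.
        with path.open("rb") as handle:
            if handle.read(len(PDF_MAGIC)) == PDF_MAGIC:
                pdfs.append(str(path))
    return pdfs
```

Calling `find_pdfs('/content/pdfs/')` returns the files that look like PDFs, so anything else in the directory can be moved out before the pipeline runs.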

Example

Input

Screenshot

Output

STARBUCKS Store #19208
11902 Euclid Avenue
Cleveland, OH (216) 229-U749

CHK 664250
12/07/2014 06:43 PM
112003. Drawers 2. Reg: 2

¥t Pep Mocha 4.5
Sbux Card 495
AMXARKERARANG 228
Subtotal $4.95
Total $4.95
Change Cue BO LOO
- Check Closed ~

"49/07/2014 06:43 py

oBUX Card «3228 New Balance: 37.45
Card is registertd

Model Information

Model Name:	pdf_printed_transformer_extraction
Type:	pipeline
Compatibility:	Visual NLP 5.0.2+
License:	Licensed
Edition:	Official
Language:	en
