Pretrained Pipeline for Reading Printed Text from PDF Documents

Description

Pretrained pipeline that extracts printed text from PDF documents and returns it as digital text, for use in text recognition workflows.

How to use

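Python

from sparknlp.pretrained import PretrainedPipeline

# `spark` is an existing SparkSession with Visual NLP loaded
pdf_pipeline = PretrainedPipeline('pdf_printed_transformer_extraction', 'en', 'clinical/ocr')

# Read every PDF in the folder as binary content
pdf_path = '/content/pdfs/'
pdf_example_df = spark.read.format("binaryFile").load(pdf_path).cache()

result = pdf_pipeline.transform(pdf_example_df)

Scala

import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

// `spark` is an existing SparkSession with Visual NLP loaded
val pdf_pipeline = new PretrainedPipeline("pdf_printed_transformer_extraction", "en", "clinical/ocr")

// Read every PDF in the folder as binary content
val pdf_path = "/content/pdfs/"
val pdf_example_df = spark.read.format("binaryFile").load(pdf_path).cache()

val result = pdf_pipeline.transform(pdf_example_df)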
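
If the transformed output holds one row per page, a common next step is to stitch the page text back into one string per document. The sketch below is illustrative only. The column names `path`, `pagenum`, and `text` are assumptions that this page does not confirm (check `result.columns` first), and `merge_pages` is a hypothetical helper that works on plain dictionaries rather than on Spark objects.

```python
from collections import defaultdict


def merge_pages(rows, path_key="path", page_key="pagenum", text_key="text"):
    """Group page-level rows by source file and join their text in page order.

    The key names are assumptions; adjust them to the actual output columns.
    """
    pages_by_doc = defaultdict(list)
    for row in rows:
        # Treat a missing (None) page text as an empty string
        pages_by_doc[row[path_key]].append((row[page_key], row[text_key] or ""))
    return {
        path: "\n".join(text for _, text in sorted(pages, key=lambda p: p[0]))
        for path, pages in pages_by_doc.items()
    }


# Stand-in rows for illustration; these are not real pipeline output
sample_rows = [
    {"path": "a.pdf", "pagenum": 1, "text": "second page"},
    {"path": "a.pdf", "pagenum": 0, "text": "first page"},
    {"path": "b.pdf", "pagenum": 0, "text": None},
]
merged = merge_pages(sample_rows)
print(merged["a.pdf"])
```

With Spark, the input rows might be built as `[r.asDict() for r in result.select("path", "pagenum", "text").collect()]`, again assuming those columns exist. Collecting to the driver is only reasonable for small batches of documents.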

Example

Input

Screenshot

Output

STARBUCKS Store #19208
11902 Euclid Avenue
Cleveland, OH (216) 229-U749

CHK 664250
12/07/2014 06:43 PM
112003. Drawers 2. Reg: 2

¥t Pep Mocha 4.5
Sbux Card 495
AMXARKERARANG 228
Subtotal $4.95
Total $4.95
Change Cue BO LOO
- Check Closed ~

"49/07/2014 06:43 py

oBUX Card «3228 New Balance: 37.45
Card is registertd

Model Information

Model Name: pdf_printed_transformer_extraction
Type: pipeline
Compatibility: Visual NLP 5.0.2+
License: Licensed
Edition: Official
Language: en