Description
This pretrained pipeline extracts printed text from PDF documents. It uses Optical Character Recognition (OCR) to convert printed content into machine-readable text, and it is intended to handle a range of document layouts and fonts in both scanned and digitally generated PDFs.
Predicted Entities
How to use
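# Python (assumes an active Visual NLP session bound to `spark`)
from sparknlp.pretrained import PretrainedPipeline

pdf_pipeline = PretrainedPipeline('pdf_printed_transformer_extraction', 'en', 'clinical/ocr')
pdf_path = '/content/pdfs/'
# Load every file under pdf_path as one binary row (path, modificationTime, length, content)
pdf_example_df = spark.read.format("binaryFile").load(pdf_path).cache()
result = pdf_pipeline.transform(pdf_example_df)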
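// Scala (assumes an active Visual NLP session bound to `spark`)
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

val pdf_pipeline = new PretrainedPipeline("pdf_printed_transformer_extraction", "en", "clinical/ocr")
val pdf_path = "/content/pdfs/"
// Load every file under pdf_path as one binary row (path, modificationTime, length, content)
val pdf_example_df = spark.read.format("binaryFile").load(pdf_path).cache()
val result = pdf_pipeline.transform(pdf_example_df)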
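The `binaryFile` source loads every file found under `pdf_path`, so any non-PDF file in that folder would reach the pipeline as well. As a minimal sketch, assuming you want to screen rows by their `content` bytes, the check below looks for the `%PDF-` signature near the start of the file. The helper name `looks_like_pdf` and the `scan_window` parameter are hypothetical and not part of Visual NLP.

```python
def looks_like_pdf(content: bytes, scan_window: int = 1024) -> bool:
    """Return True if the PDF signature appears near the start of the bytes.

    Hypothetical helper, not part of Visual NLP. Some files carry a few
    stray bytes before the header, so a small window is scanned instead of
    requiring the signature at offset 0.
    """
    return b"%PDF-" in content[:scan_window]
```

In Spark this function could be wrapped in a UDF over the `content` column. Alternatively, the reader option `pathGlobFilter` (for example `"*.pdf"`) filters by file name at load time, which is cheaper but trusts the extension.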
Example
Input
Output
STARBUCKS Store #19208
11902 Euclid Avenue
Cleveland, OH (216) 229-U749
CHK 664250
12/07/2014 06:43 PM
112003. Drawers 2. Reg: 2
¥t Pep Mocha 4.5
Sbux Card 495
AMXARKERARANG 228
Subtotal $4.95
Total $4.95
Change Cue BO LOO
- Check Closed ~
"49/07/2014 06:43 py
oBUX Card «3228 New Balance: 37.45
Card is registertd
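The extracted text arrives as plain lines and, as the output above shows, can contain recognition errors, so downstream fields usually need a light and tolerant parsing step. The sketch below is an illustration only: the function `extract_amounts` and its regular expression are assumptions, not part of the pipeline. It collects `label $amount` pairs and skips any line that does not fit that shape.

```python
import re

# Matches lines such as "Subtotal $4.95": a label, a dollar sign, then an amount.
AMOUNT_LINE = re.compile(
    r"^\s*(?P<label>[A-Za-z][A-Za-z ]*?)\s*\$(?P<amount>\d+(?:\.\d{1,2})?)\s*$"
)


def extract_amounts(ocr_text: str) -> dict:
    """Map each label to its dollar amount; lines that do not match are ignored."""
    amounts = {}
    for line in ocr_text.splitlines():
        match = AMOUNT_LINE.match(line)
        if match:
            amounts[match.group("label")] = float(match.group("amount"))
    return amounts


# Four lines copied from the example output above.
sample = "Sbux Card 495\nSubtotal $4.95\nTotal $4.95\nChange Cue BO LOO"
receipt_amounts = extract_amounts(sample)
```

Only the `Subtotal` and `Total` lines contribute here; `Sbux Card 495` is skipped because the OCR dropped the dollar sign and decimal point, which is the kind of loss a stricter parser would need to flag rather than guess at.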
Model Information
| Model Name: | pdf_printed_transformer_extraction |
| Type: | pipeline |
| Compatibility: | Visual NLP 5.0.2+ |
| License: | Licensed |
| Edition: | Official |
| Language: | en |