Description
This is a pretrained pipeline designed for Table Extraction in mixed scanned and digital PDF documents. The model first detects tables within the input documents and then extracts the table structure, including rows, columns, and individual cells. It is optimized to work across various PDF formats, whether scanned or digitally generated, ensuring precise and efficient extraction of tabular data from diverse document types.
Predicted Entities
Live Demo Open in Colab Download –> Copy S3 URI
How to use
pipeline = PretrainedPipeline('digital_pdf_table_extractor', 'en', 'clinical/ocr')
pdf_path = '/content/pdfs/'
pdf_example_df = spark.read.format("binaryFile").load(pdf_path).cache()
result = pipeline.transform(pdf_example_df)
val pipeline = new PretrainedPipeline("digital_pdf_table_extractor", "en", "clinical/ocr")
val pdf_path = "/content/pdfs/"
val pdf_example_df = spark.read.format("binaryFile").load(pdf_path).cache()
val result = pipeline.transform(pdf_example_df)
Example
Model Information
Model Name: | digital_pdf_table_extractor |
Type: | ocr |
Compatibility: | Visual NLP 5.4.0+ |
License: | Licensed |
Edition: | Official |
Language: | en |