Pretrained Pipeline for Table Extraction

Description

This pretrained pipeline extracts tables from PDF documents, whether scanned or digitally generated, including collections that mix both kinds. It first detects the tables in each input document and then recovers the structure of each table: its rows, columns, and individual cells.
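
To make the idea of table structure concrete, the sketch below arranges a set of extracted cells into a row-by-column grid. The `Cell` record and the `cells_to_grid` helper are hypothetical names introduced only for illustration; they are not the pipeline's actual output schema, which should be checked on the returned DataFrame.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    """Hypothetical cell record: zero-based grid position plus its text."""
    row: int
    col: int
    text: str


def cells_to_grid(cells: List[Cell]) -> List[List[str]]:
    """Arrange cells into a rectangular list of rows; missing cells become ''."""
    if not cells:
        return []
    n_rows = max(c.row for c in cells) + 1
    n_cols = max(c.col for c in cells) + 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        grid[c.row][c.col] = c.text
    return grid


# Made-up sample data, not real pipeline output.
cells = [
    Cell(0, 0, "Drug"), Cell(0, 1, "Dose"),
    Cell(1, 0, "Aspirin"), Cell(1, 1, "100 mg"),
    Cell(2, 0, "Ibuprofen"),  # the dose cell for this row is missing
]
grid = cells_to_grid(cells)
for row in grid:
    print(" | ".join(row))
```

Filling missing positions with empty strings keeps every row the same length, which makes the grid straightforward to export to CSV or load into a DataFrame.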

How to use

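Python

# Assumes PretrainedPipeline is already imported and `spark` is an active
# session with Visual NLP loaded.

# Load the pretrained pipeline.
pipeline = PretrainedPipeline('digital_pdf_table_extractor', 'en', 'clinical/ocr')

# Read every file under the folder as binary content.
pdf_path = '/content/pdfs/'
pdf_example_df = spark.read.format("binaryFile").load(pdf_path).cache()

# Run table detection and structure extraction.
result = pipeline.transform(pdf_example_df)

Scala

// Assumes PretrainedPipeline is already imported and `spark` is an active
// session with Visual NLP loaded.

// Load the pretrained pipeline.
val pipeline = new PretrainedPipeline("digital_pdf_table_extractor", "en", "clinical/ocr")

// Read every file under the folder as binary content.
val pdf_path = "/content/pdfs/"
val pdf_example_df = spark.read.format("binaryFile").load(pdf_path).cache()

// Run table detection and structure extraction.
val result = pipeline.transform(pdf_example_df)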

Example

Input image: [screenshot]

Output image: [screenshot]

Model Information

Model Name: digital_pdf_table_extractor
Type: ocr
Compatibility: Visual NLP 5.4.0+
License: Licensed
Edition: Official
Language: en