Pretrained Pipeline for Table Structure Extraction

Description

This pretrained pipeline performs table structure extraction on PDF documents, whether they are scanned or digitally generated. It analyzes the layout of each page to detect tables, identify their rows, columns, and cell boundaries, and reconstruct each table’s structure in a usable format.
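
To make the idea of reconstructing a table from cell boundaries concrete, here is a minimal sketch in plain Python. It is not the pipeline's internal algorithm or its output schema: the `Cell` class, the `to_grid` helper, and the pixel tolerance are all hypothetical names and values chosen for illustration. The sketch assumes each detected cell carries its text and the coordinates of its top-left corner, and groups cells into rows and columns by clustering those coordinates.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical detected cell: text plus top-left corner in pixels."""
    text: str
    x: float
    y: float


def cluster(values, tolerance):
    """Return one anchor per group of coordinates that lie within
    `tolerance` of the group's first (smallest) value."""
    anchors = []
    for value in sorted(values):
        if not anchors or value - anchors[-1] > tolerance:
            anchors.append(value)
    return anchors


def nearest(anchors, value):
    """Index of the anchor closest to `value`."""
    return min(range(len(anchors)), key=lambda i: abs(anchors[i] - value))


def to_grid(cells, tolerance=5.0):
    """Place each cell into a row/column grid based on its coordinates."""
    row_anchors = cluster([c.y for c in cells], tolerance)
    col_anchors = cluster([c.x for c in cells], tolerance)
    grid = [["" for _ in col_anchors] for _ in row_anchors]
    for c in cells:
        grid[nearest(row_anchors, c.y)][nearest(col_anchors, c.x)] = c.text
    return grid


# Four cells with slightly jittered coordinates, as a scan might produce.
cells = [
    Cell("Drug", 10, 10),
    Cell("Dose", 120, 11),
    Cell("Aspirin", 11, 40),
    Cell("100 mg", 121, 41),
]
grid = to_grid(cells)
for row in grid:
    print(row)
```

The tolerance absorbs small misalignments between cells in the same row or column; a real extractor would also need to handle merged cells and skewed pages, which this sketch ignores.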

How to use

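# Python (assumes `spark` is an active Spark session with Visual NLP loaded)
from sparknlp.pretrained import PretrainedPipeline  # assumed import path

# Download the pretrained pipeline from the clinical/ocr repository
pipeline = PretrainedPipeline('basic_table_extractor', 'en', 'clinical/ocr')

# Load each file under pdf_path as one row of binary content
pdf_path = '/content/pdfs/'
pdf_example_df = spark.read.format("binaryFile").load(pdf_path).cache()

# Run table structure extraction on the loaded PDFs
result = pipeline.transform(pdf_example_df)

// Scala (same assumptions as the Python example)
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

val pipeline = new PretrainedPipeline("basic_table_extractor", "en", "clinical/ocr")

val pdf_path = "/content/pdfs/"
val pdf_example_df = spark.read.format("binaryFile").load(pdf_path).cache()

val result = pipeline.transform(pdf_example_df)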
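
Because the `binaryFile` reader loads every file under the given path, it can help to check which files are actually PDFs before running the pipeline. The following is a small sketch of such a pre-flight check, not part of the pipeline itself. The helper name `find_pdfs` is hypothetical, and the check assumes a PDF begins with the `%PDF-` header bytes, which holds for well-formed files.

```python
from pathlib import Path

# Well-formed PDF files start with this header.
PDF_MAGIC = b"%PDF-"


def find_pdfs(directory):
    """Return sorted paths of files under `directory` (recursively)
    whose first bytes match the PDF header."""
    pdfs = []
    for path in sorted(Path(directory).rglob("*")):
        if not path.is_file():
            continue
        with path.open("rb") as handle:
            header = handle.read(len(PDF_MAGIC))
        if header == PDF_MAGIC:
            pdfs.append(str(path))
    return pdfs
```

Calling `find_pdfs('/content/pdfs/')` on the driver lists the files that pass the check. When filtering by file name is enough, Spark's file sources also accept the `pathGlobFilter` option, for example `spark.read.format("binaryFile").option("pathGlobFilter", "*.pdf").load(pdf_path)`.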

Example

Input image

Screenshot

Output image

Screenshot

Model Information

Model Name: basic_table_extractor
Type: ocr
Compatibility: Visual NLP 5.4.0+
License: Licensed
Edition: Official
Language: en