Description
This pretrained pipeline performs table structure extraction on PDF documents, whether they are scanned, digitally generated, or a mix of both. It analyzes the layout of each page to detect tables, identify their rows, columns, and cell boundaries, and reconstruct each table’s structure in a usable format.
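To illustrate the last step described above, rebuilding a table from detected cells, here is a minimal sketch in plain Python. It is not the pipeline's implementation: the `Cell` record, the function names, and the 5-pixel tolerance are hypothetical and chosen only for this example. It assumes each detected cell arrives with a bounding box and its recognized text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    # Hypothetical cell record: top-left corner, size, and recognized text.
    x: float
    y: float
    width: float
    height: float
    text: str


def cluster_positions(values: List[float], tolerance: float) -> List[float]:
    """Collapse coordinates that lie within `tolerance` into one grid line."""
    lines: List[float] = []
    for value in sorted(values):
        if lines and abs(value - lines[-1]) <= tolerance:
            continue  # close to the previous grid line, so same row or column
        lines.append(value)
    return lines


def nearest_index(value: float, lines: List[float]) -> int:
    """Index of the grid line closest to `value`."""
    return min(range(len(lines)), key=lambda i: abs(value - lines[i]))


def cells_to_grid(cells: List[Cell], tolerance: float = 5.0) -> List[List[str]]:
    """Arrange cells into a row-major grid based on their top-left corners."""
    if not cells:
        return []
    row_lines = cluster_positions([c.y for c in cells], tolerance)
    col_lines = cluster_positions([c.x for c in cells], tolerance)
    grid = [["" for _ in col_lines] for _ in row_lines]
    for cell in cells:
        row = nearest_index(cell.y, row_lines)
        col = nearest_index(cell.x, col_lines)
        grid[row][col] = cell.text
    return grid


# Four cells with slightly jittered coordinates, as OCR output often has.
example_cells = [
    Cell(10, 10, 50, 20, "Drug"),
    Cell(62, 11, 50, 20, "Dose"),
    Cell(11, 32, 50, 20, "Aspirin"),
    Cell(60, 31, 50, 20, "100 mg"),
]
grid = cells_to_grid(example_cells)
for row in grid:
    print(" | ".join(row))
```

The sketch ignores merged cells and skewed scans. Both would need more than a fixed tolerance on the top-left corner.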
How to use
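# Python. Assumes an active SparkSession `spark` started with Visual NLP.
# Import path assumed from Spark NLP; adjust to match your installation.
from sparknlp.pretrained import PretrainedPipeline
pipeline = PretrainedPipeline('basic_table_extractor', 'en', 'clinical/ocr')
# Read every file under pdf_path as binary content, one row per file.
pdf_path = '/content/pdfs/'
pdf_example_df = spark.read.format("binaryFile").load(pdf_path).cache()
result = pipeline.transform(pdf_example_df)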
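// Scala. Assumes an active SparkSession `spark` started with Visual NLP.
// Import path assumed from Spark NLP; adjust to match your installation.
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
val pipeline = new PretrainedPipeline("basic_table_extractor", "en", "clinical/ocr")
// Read every file under pdf_path as binary content, one row per file.
val pdf_path = "/content/pdfs/"
val pdf_example_df = spark.read.format("binaryFile").load(pdf_path).cache()
val result = pipeline.transform(pdf_example_df)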
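Because the `binaryFile` reader loads every file found under the given path, it can help to check the input directory before running the pipeline. The following is a minimal sketch of such a pre-flight check in plain Python. `find_valid_pdfs` is a hypothetical helper, not part of Visual NLP, and it assumes the PDFs sit on a local filesystem. It applies a deliberately strict test: a file counts as a PDF only if its first five bytes are the `%PDF-` signature.

```python
from pathlib import Path
from typing import List

# Signature expected at the start of a PDF file.
PDF_MAGIC = b"%PDF-"


def find_valid_pdfs(directory: str) -> List[str]:
    """Return sorted paths of regular files in `directory` that start with %PDF-."""
    valid: List[str] = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue  # skip subdirectories and other non-file entries
        with path.open("rb") as handle:
            header = handle.read(len(PDF_MAGIC))
        if header == PDF_MAGIC:
            valid.append(str(path))
    return valid
```

The returned list can be passed to `load` in place of the directory path, since PySpark's `DataFrameReader.load` accepts a list of paths. If filtering by file name is enough, Spark's `pathGlobFilter` read option (for example `"*.pdf"`) is a lighter alternative.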
Model Information
| Model Name: | basic_table_extractor |
| Type: | ocr |
| Compatibility: | Visual NLP 5.4.0+ |
| License: | Licensed |
| Edition: | Official |
| Language: | en |