sparknlp_jsl.utils.visualner_annotations_parser
Module Contents
Functions
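convert_to_CONLL(data, doc_id) | Converts token data to CONLL format.
get_tokens(hocr) | Extracts tokens from HOCR data with their bounding box and label information.
pdfs_to_visualner_to_conll(spark, pdf_dir, pdf_visual_pipeline_model, output_file) | Processes PDF files, extracts annotations using VisualNER, and converts to CONLL format.
visualner_annotations_to_conll(visual_json_input_file, conll_output_file) | Converts VisualNER JSON annotations to CONLL format.
write_CONLL(conll_lines, filename) | Writes CONLL lines to a file.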
- convert_to_CONLL(data, doc_id)
Converts token data to CONLL format.
- Parameters:
data (list of dict) – List of token dictionaries with text and label information.
doc_id (int) – Document ID for the CONLL file.
- Returns:
List of CONLL formatted lines.
- Return type:
list of str
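As a rough illustration of this conversion, the sketch below turns a list of token dictionaries into CoNLL-style lines. It is not the library's implementation: the function name `tokens_to_conll_lines`, the dictionary keys `text` and `label`, the `-DOCSTART-` header layout, and the `-X-` placeholder columns are all assumptions made for the example.

```python
from typing import Dict, List, Optional


def tokens_to_conll_lines(data: List[Dict[str, Optional[str]]], doc_id: int) -> List[str]:
    """Hypothetical stand-in for convert_to_CONLL; keys and layout are assumed."""
    # Assumed document header: a -DOCSTART- line carrying the document ID,
    # followed by a blank separator line.
    lines = [f"-DOCSTART- -X- {doc_id} O", ""]
    for token in data:
        text = (token.get("text") or "").strip()
        if not text:
            continue  # skip empty tokens so no malformed rows are emitted
        label = token.get("label") or "O"  # unlabeled tokens default to "O"
        # Four space-separated columns: token, two placeholders, NER label.
        lines.append(f"{text} -X- -X- {label}")
    lines.append("")  # trailing blank line closes the document
    return lines


sample = [
    {"text": "John", "label": "B-PER"},
    {"text": "Smith", "label": "I-PER"},
    {"text": "visited", "label": None},
]
conll_lines = tokens_to_conll_lines(sample, doc_id=1)
print("\n".join(conll_lines))
```

Defaulting missing labels to `O` keeps every row at the same column count, which most CoNLL readers expect.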
- get_tokens(hocr)
Extracts tokens from HOCR data with their bounding box and label information.
- Parameters:
hocr (list of dict) – List of HOCR data dictionaries.
- Returns:
Extracted tokens with bounding box and label information.
- Return type:
list of dict
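The layout of the HOCR dictionaries that `get_tokens` expects is not documented here, so the following is only a sketch of the general idea using raw hOCR markup. It relies on the hOCR convention that each word is a `span` with class `ocrx_word` whose `title` attribute holds `bbox x0 y0 x1 y1`. The function name `extract_hocr_tokens`, the output keys, and the placeholder `O` label (hOCR itself carries no NER label) are assumptions.

```python
import re
from html.parser import HTMLParser
from typing import Dict, List, Optional

BBOX_PATTERN = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collects words and bounding boxes from hOCR markup (illustrative only)."""

    def __init__(self) -> None:
        super().__init__()
        self.tokens: List[Dict] = []
        self._bbox: Optional[List[int]] = None  # bbox of the word currently open

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if "ocrx_word" in (attr.get("class") or "").split():
            # hOCR keeps geometry in the title attribute: "bbox x0 y0 x1 y1; ..."
            match = BBOX_PATTERN.search(attr.get("title") or "")
            if match:
                self._bbox = [int(value) for value in match.groups()]

    def handle_data(self, data):
        # Only text found inside an open ocrx_word span becomes a token.
        if self._bbox is not None and data.strip():
            self.tokens.append(
                {"text": data.strip(), "bbox": self._bbox, "label": "O"}
            )

    def handle_endtag(self, tag):
        if tag == "span":
            self._bbox = None  # the word span (if any) is now closed


def extract_hocr_tokens(hocr: str) -> List[Dict]:
    """Hypothetical helper: returns one dict per recognized word."""
    parser = HocrWordParser()
    parser.feed(hocr)
    return parser.tokens


sample_hocr = (
    '<span class="ocr_line" title="bbox 10 20 200 40">'
    '<span class="ocrx_word" title="bbox 10 20 60 40; x_wconf 96">Patient</span> '
    '<span class="ocrx_word" title="bbox 70 20 120 40; x_wconf 93">Name</span>'
    "</span>"
)
tokens = extract_hocr_tokens(sample_hocr)
print(tokens)
```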
- pdfs_to_visualner_to_conll(spark, pdf_dir, pdf_visual_pipeline_model, output_file='Text_NER_Dataset.conll')
Processes PDF files, extracts annotations using VisualNER, and converts to CONLL format.
- Parameters:
spark (SparkSession) – Active Spark session.
pdf_dir (str) – Directory containing PDF files to process.
pdf_visual_pipeline_model (Model) – Pre-trained VisualNER pipeline model.
output_file (str, optional) – Path to the output CONLL file. Defaults to 'Text_NER_Dataset.conll'.
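A call might be wrapped as in the sketch below. The import path and argument order follow the signature documented above. The wrapper name `build_conll_dataset` and the `pdfs/` directory are hypothetical, and the caller is assumed to already hold a Spark session and a fitted VisualNER pipeline model.

```python
def build_conll_dataset(spark, pdf_visual_pipeline_model, pdf_dir="pdfs/"):
    """Hypothetical wrapper; needs sparknlp_jsl installed when it is called."""
    # Imported lazily so this snippet can be loaded without the library present.
    from sparknlp_jsl.utils.visualner_annotations_parser import (
        pdfs_to_visualner_to_conll,
    )

    output_file = "Text_NER_Dataset.conll"  # the documented default, made explicit
    pdfs_to_visualner_to_conll(
        spark,
        pdf_dir,
        pdf_visual_pipeline_model,
        output_file=output_file,
    )
    # No return value is documented for the function, so the wrapper
    # only reports where the output was requested to be written.
    return output_file
```

Passing `output_file` explicitly, even when it matches the default, makes the location of the generated dataset visible at the call site.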
- visualner_annotations_to_conll(visual_json_input_file, conll_output_file)
Converts VisualNER JSON annotations to CONLL format.
- Parameters:
visual_json_input_file (str) – Path to the input JSON file containing VisualNER annotations.
conll_output_file (str) – Path to the output file where the CONLL formatted data will be saved.
- write_CONLL(conll_lines, filename)
Writes CONLL lines to a file.
- Parameters:
conll_lines (list of str) – List of CONLL formatted lines.
filename (str) – Path to the output CONLL file.
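A minimal sketch of what writing such lines to disk could look like is given below. It is a stand-in under assumptions, not the library's code: the name `write_conll_lines`, UTF-8 encoding, and one newline-terminated row per list entry are all choices made for the example.

```python
import os
import tempfile
from typing import List


def write_conll_lines(conll_lines: List[str], filename: str) -> None:
    """Hypothetical stand-in for write_CONLL: one list entry per output line."""
    with open(filename, "w", encoding="utf-8") as handle:
        for line in conll_lines:
            # Empty strings become blank lines, which act as separators.
            handle.write(line + "\n")


lines = ["-DOCSTART- -X- 1 O", "", "John -X- -X- B-PER", "Smith -X- -X- I-PER", ""]
out_path = os.path.join(tempfile.mkdtemp(), "example.conll")
write_conll_lines(lines, out_path)

# Read the file back to confirm the round trip preserves every row.
with open(out_path, encoding="utf-8") as handle:
    written = handle.read().splitlines()
print(written == lines)
```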