`sparknlp_jsl.utils.visualner_annotations_parser`

Module Contents

`convert_to_CONLL`(data, doc_id)	Converts token data to CoNLL format.
`get_tokens`(hocr)	Extracts tokens from hOCR data, along with their bounding boxes and label information.
`pdfs_to_visualner_to_conll`(spark, pdf_dir, ...[, ...])	Processes PDF files, extracts annotations using VisualNER, and converts them to CoNLL format.
`visualner_annotations_to_conll`(visual_json_input_file, ...)	Converts VisualNER JSON annotations to CoNLL format.
`write_CONLL`(conll_lines, filename)	Writes CoNLL lines to a file.

convert_to_CONLL(data, doc_id)

Converts token data to CoNLL format.

Parameters:

data (list of dict) – List of token dictionaries with text and label information.
doc_id (int) – Document ID for the CoNLL file.

Returns:

List of CoNLL-formatted lines.

Return type:

list of str
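To make the input and output shapes concrete, here is a minimal sketch of this kind of conversion. It is not the library's implementation: the function name `tokens_to_conll_sketch`, the `"text"` and `"label"` dictionary keys, the placeholder `-X-` columns, and the placement of the document ID in the header line are all assumptions made for illustration.

```python
def tokens_to_conll_sketch(data, doc_id):
    """Hypothetical stand-in for a token-to-CoNLL conversion (not the library code)."""
    # Header line marking the start of a document. Embedding doc_id here
    # is an assumption; the real function may use it differently.
    lines = [f"-DOCSTART- -X- {doc_id} O", ""]
    for token in data:
        text = token["text"].strip()
        if not text:
            # Skip empty tokens so they do not produce malformed rows.
            continue
        # Tokens without a label are treated as outside any entity ("O").
        label = token.get("label", "O")
        # Four columns: token, placeholder POS, placeholder chunk, NER label.
        lines.append(f"{text} -X- -X- {label}")
    # A blank line closes the sentence/document block.
    lines.append("")
    return lines


example_tokens = [
    {"text": "John", "label": "B-PER"},
    {"text": "Smith", "label": "I-PER"},
    {"text": "visited", "label": "O"},
]
conll_lines = tokens_to_conll_sketch(example_tokens, doc_id=1)
print("\n".join(conll_lines))
```

The returned list of strings has the same shape as the documented return type, so it could be handed to a writer function that takes CoNLL lines and a filename.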

get_tokens(hocr)

Extracts tokens from hOCR data, along with their bounding boxes and label information.
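As an illustration of what token extraction from hOCR can look like, the sketch below pulls word text and bounding boxes out of standard hOCR markup using only the Python standard library. It is an assumption-laden example, not the library's code: the class name `HocrWordParser` and the output dictionary keys are hypothetical, and label handling is left out because the source does not say how labels are encoded in the hOCR input.

```python
import re
from html.parser import HTMLParser

# hOCR stores geometry in the title attribute, e.g. "bbox 10 20 90 40; x_wconf 96".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Hypothetical helper that collects ocrx_word spans as token dictionaries."""

    def __init__(self):
        super().__init__()
        self.tokens = []
        self._current = None  # token being built while inside a word span

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            bbox = tuple(int(v) for v in match.groups()) if match else None
            self._current = {"text": "", "bbox": bbox}

    def handle_data(self, data):
        # Text may arrive in several chunks, so accumulate it.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        # Assumes word spans do not contain nested spans.
        if tag == "span" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            if self._current["text"]:
                self.tokens.append(self._current)
            self._current = None


sample_hocr = """
<div class='ocr_page' title='bbox 0 0 800 600'>
  <span class='ocr_line' title='bbox 10 20 200 40'>
    <span class='ocrx_word' title='bbox 10 20 90 40; x_wconf 96'>Patient</span>
    <span class='ocrx_word' title='bbox 100 20 200 40; x_wconf 93'>Name</span>
  </span>
</div>
"""

parser = HocrWordParser()
parser.feed(sample_hocr)
tokens = parser.tokens
print(tokens)
```

Each resulting dictionary carries the word text and an `(x0, y0, x1, y1)` bounding box; a label field would need to be attached separately once the label source is known.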

pdfs_to_visualner_to_conll(spark, pdf_dir, pdf_visual_pipeline_model, output_file='Text_NER_Dataset.conll')

Processes PDF files, extracts annotations using VisualNER, and converts them to CoNLL format.

Parameters:

spark (SparkSession) – Active Spark session.
pdf_dir (str) – Directory containing PDF files to process.
pdf_visual_pipeline_model (Model) – Pre-trained VisualNER pipeline model.
output_file (str, optional) – Path to the output CoNLL file. Default is 'Text_NER_Dataset.conll'.

visualner_annotations_to_conll(visual_json_input_file, conll_output_file)

Converts VisualNER JSON annotations to CoNLL format.

Parameters:

visual_json_input_file (str) – Path to the input JSON file containing VisualNER annotations.
conll_output_file (str) – Path to the output file where the CoNLL-formatted data will be saved.

write_CONLL(conll_lines, filename)

Writes CoNLL lines to a file.

Parameters: