sparknlp_jsl.utils.visualner_annotations_parser#

Module Contents#

Functions#

convert_to_CONLL(data, doc_id)

Converts token data to CONLL format.

get_tokens(hocr)

Extracts tokens, together with their bounding boxes and labels, from HOCR data.

pdfs_to_visualner_to_conll(spark, pdf_dir, ...[, ...])

Processes PDF files, extracts annotations using VisualNER, and converts them to CONLL format.

visualner_annotations_to_conll(visual_json_input_file, ...)

Converts VisualNER JSON annotations to CONLL format.

write_CONLL(conll_lines, filename)

Writes CONLL lines to a file.

convert_to_CONLL(data, doc_id)#

Converts token data to CONLL format.

Parameters:
  • data (list of dict) – List of token dictionaries with text and label information.

  • doc_id (int) – Document ID for the CONLL file.

Returns:

List of CONLL-formatted lines.

Return type:

list of str
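
The conversion described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function name `sketch_convert_to_conll`, the dictionary keys `text` and `label`, the four-column `token -X- -X- label` layout, and the way `doc_id` is embedded in the `-DOCSTART-` header. The real function may format its output differently.

```python
from typing import Dict, List


def sketch_convert_to_conll(data: List[Dict[str, str]], doc_id: int) -> List[str]:
    """Illustrative stand-in only; not the library's implementation."""
    # Assumed header: a CoNLL-2003 style -DOCSTART- line carrying the document ID.
    lines = [f"-DOCSTART- -X- {doc_id} O", ""]
    for token in data:
        # Assumed keys: "text" for the token string, "label" for its NER tag.
        # A missing or empty label falls back to the outside tag "O".
        label = token.get("label") or "O"
        lines.append(f"{token['text']} -X- -X- {label}")
    # A trailing blank line closes the block.
    lines.append("")
    return lines


tokens = [
    {"text": "John", "label": "B-PER"},
    {"text": "Smith", "label": "I-PER"},
    {"text": "visited", "label": "O"},
]
conll_lines = sketch_convert_to_conll(tokens, doc_id=1)
```

The result is a plain list of strings, matching the documented return type, so it can be passed directly to a writer that takes a list of lines.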

get_tokens(hocr)#

Extracts tokens, together with their bounding boxes and labels, from HOCR data.

Parameters:

hocr (list of dict) – List of HOCR data dictionaries.

Returns:

Extracted tokens with bounding box and label information.

Return type:

list of dict
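
The schema of the HOCR dictionaries is not documented here, so the following is only a sketch under stated assumptions. It assumes each dictionary has hypothetical keys `text`, `title`, and `label`, and that `title` holds an hOCR-style `bbox x0 y0 x1 y1` property. The name `sketch_get_tokens` is likewise illustrative, and the real input structure may differ.

```python
import re
from typing import Dict, List

# hOCR records coordinates in the title attribute as "bbox x0 y0 x1 y1".
BBOX_PATTERN = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def sketch_get_tokens(hocr: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Illustrative stand-in only; the real input schema may differ."""
    tokens = []
    for entry in hocr:
        match = BBOX_PATTERN.search(entry.get("title", ""))
        if match is None:
            # Skip entries that carry no usable bounding box.
            continue
        tokens.append({
            "text": entry["text"],
            "bbox": tuple(int(value) for value in match.groups()),
            # Assumed default: untagged tokens get the outside tag "O".
            "label": entry.get("label", "O"),
        })
    return tokens


sample = [
    {"text": "Aspirin", "title": "bbox 10 20 90 40; x_wconf 96", "label": "B-DRUG"},
    {"text": "", "title": "", "label": "O"},
]
extracted = sketch_get_tokens(sample)
```

Each returned dictionary keeps the token text next to its coordinates and label, which is the shape the documented return type (list of dict) suggests.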

pdfs_to_visualner_to_conll(spark, pdf_dir, pdf_visual_pipeline_model, output_file='Text_NER_Dataset.conll')#

Processes PDF files, extracts annotations using VisualNER, and converts them to CONLL format.

Parameters:
  • spark (SparkSession) – Active Spark session.

  • pdf_dir (str) – Directory containing PDF files to process.

  • pdf_visual_pipeline_model (Model) – Pre-trained VisualNER pipeline model.

  • output_file (str, optional) – Path to the output CONLL file. Default is 'Text_NER_Dataset.conll'.
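
A hedged usage sketch follows. The import path and argument order are taken from the module name and signature documented above. The wrapper name `run_pdf_to_conll` is hypothetical, and creating the Spark session and loading the VisualNER pipeline model are left to the caller.

```python
def run_pdf_to_conll(spark, pdf_dir, pipeline_model,
                     output_file="Text_NER_Dataset.conll"):
    """Thin wrapper; `run_pdf_to_conll` is a hypothetical name for illustration."""
    # Imported lazily so this snippet loads even where sparknlp_jsl is not installed.
    from sparknlp_jsl.utils.visualner_annotations_parser import (
        pdfs_to_visualner_to_conll,
    )

    # Arguments mirror the documented signature; output_file keeps its documented default.
    pdfs_to_visualner_to_conll(spark, pdf_dir, pipeline_model, output_file=output_file)
    # Returning the path is a convenience of this sketch, not documented behavior.
    return output_file
```

With an active session and a fitted pipeline model in hand, a call such as `run_pdf_to_conll(spark, "pdfs/", model)` would be expected to leave the CONLL data in `Text_NER_Dataset.conll`; the directory name here is only an example.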

visualner_annotations_to_conll(visual_json_input_file, conll_output_file)#

Converts VisualNER JSON annotations to CONLL format.

Parameters:
  • visual_json_input_file (str) – Path to the input JSON file containing VisualNER annotations.

  • conll_output_file (str) – Path to the output file where the CONLL-formatted data will be saved.

write_CONLL(conll_lines, filename)#

Writes CONLL lines to a file.

Parameters:
  • conll_lines (list of str) – List of CONLL-formatted lines.

  • filename (str) – Path to the output CONLL file.
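
As a rough illustration of the write step, the sketch below writes one list entry per file line. The function name `sketch_write_conll`, the UTF-8 encoding, and the newline handling are assumptions; the library's own behavior may differ.

```python
import os
import tempfile
from typing import List


def sketch_write_conll(conll_lines: List[str], filename: str) -> None:
    """Illustrative stand-in only; not the library's implementation."""
    # Assumption: one CoNLL line per file line, UTF-8 encoded.
    with open(filename, "w", encoding="utf-8") as handle:
        for line in conll_lines:
            handle.write(line + "\n")


# Write a small sample to a temporary directory and read it back.
out_path = os.path.join(tempfile.mkdtemp(), "sample.conll")
sketch_write_conll(["-DOCSTART- -X- -X- O", "", "John -X- -X- B-PER", ""], out_path)

with open(out_path, encoding="utf-8") as handle:
    written = handle.read().splitlines()
```

Empty strings in the list become blank lines in the file, which is how CoNLL-style files conventionally separate sentences and documents.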