sparknlp_jsl.utils.ocr_nlp_processor#

Module Contents#

Functions#

ocr_entity_processor(spark, file_path, ner_pipeline, ...)

Generates an annotated PDF file using input PDF files

Attributes#

colors

ocr_entity_processor(spark: pyspark.sql.SparkSession, file_path: str, ner_pipeline: pyspark.ml.PipelineModel, style: str = 'bounding_box', save_dir: str = 'save_folder', label: bool = False, label_color: str = 'red', box_color: tuple = (0, 0, 0), color_chart_path: str = 'color_chart.png', chunk_col: str = 'ner_chunk', black_list: List[str] | None = [], display_result: bool = False, resolution: int = 200, confidenceThreshold: int = 70, pageIteratorLevel=PageIteratorLevel.SYMBOL, pageSegMode=PageSegmentationMode.SPARSE_TEXT_OSD, outline_color=None, outline_width=2, text_band=None, text_type='printed') -> IO#

Generates an annotated PDF file using input PDF files.

Parameters:
  • spark (SparkSession) – Spark session with spark-nlp-jsl and spark-ocr jar

  • file_path (str) – Path to PDF files

  • ner_pipeline (PipelineModel) – Fitted NER pipeline

  • chunk_col (str) – Name of the output column in the NER pipeline that holds the chunks to annotate, defaults to ‘ner_chunk’

  • black_list (list) – List of NER labels that will be painted over in ‘highlight’ and ‘bounding_box’ styles

  • style (str) – PDF processing style, one of three options:

      ‘colored_box’: Draws bands with a single color over the chunks detected by NER pipeline (default is black).
      ‘bounding_box’: Colorful bounding boxes around the chunks detected by NER pipeline. Each color represents a different NER label.
      ‘highlight’: Colorful highlights over the chunks detected by NER pipeline. Each color represents a different NER label.

  • save_dir (str) – Path for saving folder of processed PDF files, defaults to ‘save_folder’

  • label (bool) – Set to True to write NER labels over the chunks, defaults to False

  • label_color (str) – Color of NER labels if ‘label=True’ , defaults to “red”

  • box_color (tuple) – RGB code for colored box if ‘style=colored_box’, defaults to (0, 0, 0) (black)

  • color_chart_path (str) – File name of the color chart in PNG format that shows the colors of NER labels in the processed file, defaults to “color_chart.png”

  • display_result (bool) – Set to True to see the output of processed file, defaults to False

  • outline_color (tuple) – Color of the outline (for all detected entities) if ‘style=bounding_box’, defaults to None (random color for each specific entity)

  • outline_width (int) – Width of the outline if ‘style=bounding_box’, defaults to 2

  • text_band (str) – Text to be written on the colored box (band) if ‘style=colored_box’, defaults to None (no text)
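
Several of the parameters above only apply to one style. The sketch below groups them by style, following the parameter descriptions. The helper `style_kwargs` and the constant `VALID_STYLES` are hypothetical and not part of sparknlp_jsl; the function itself may well accept every keyword regardless of style, so treat this as a reading aid rather than the library's own validation.

```python
# Hypothetical helper, not part of sparknlp_jsl: collects the keyword
# arguments that the parameter descriptions tie to each style.
VALID_STYLES = ("colored_box", "bounding_box", "highlight")


def style_kwargs(style="bounding_box", box_color=(0, 0, 0), text_band=None,
                 outline_color=None, outline_width=2, black_list=None):
    """Return only the style-related keywords documented for `style`."""
    if style not in VALID_STYLES:
        raise ValueError(f"style must be one of {VALID_STYLES}, got {style!r}")

    kwargs = {"style": style}
    if style == "colored_box":
        # Single-color bands, optionally with text written on them.
        kwargs["box_color"] = box_color
        kwargs["text_band"] = text_band
    else:
        # black_list is documented for 'highlight' and 'bounding_box'.
        kwargs["black_list"] = list(black_list or [])
        if style == "bounding_box":
            # Outline settings are documented for 'bounding_box' only.
            kwargs["outline_color"] = outline_color
            kwargs["outline_width"] = outline_width
    return kwargs
```

The resulting dictionary could then be unpacked into a call, for example `ocr_entity_processor(spark=..., file_path=..., ner_pipeline=..., **style_kwargs("highlight"))`.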

Returns:

PDF file

Return type:

IO
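
A minimal usage sketch, assuming a Spark session already started with the spark-nlp-jsl and spark-ocr jars and a fitted NER pipeline whose chunk column is named ‘ner_chunk’. The wrapper name `annotate_pdfs` and its arguments are illustrative only; the import path follows this module's name, and every keyword passed comes from the signature documented here.

```python
def annotate_pdfs(spark, ner_pipeline, pdf_path, out_dir="save_folder"):
    """Highlight NER chunks in the PDFs at `pdf_path` (illustrative wrapper)."""
    # Imported lazily so this file still loads where the licensed
    # sparknlp_jsl package is not installed.
    from sparknlp_jsl.utils.ocr_nlp_processor import ocr_entity_processor

    return ocr_entity_processor(
        spark=spark,                          # session with both jars loaded
        file_path=pdf_path,                   # path to the input PDF files
        ner_pipeline=ner_pipeline,            # fitted PipelineModel
        chunk_col="ner_chunk",                # chunk output column to annotate
        style="highlight",                    # one color per NER label
        save_dir=out_dir,                     # folder for the processed PDFs
        label=True,                           # write NER labels over the chunks
        label_color="red",
        color_chart_path="color_chart.png",   # legend of label colors
        display_result=False,
    )
```

Passing `style="colored_box"` instead would draw single-color bands over the detected chunks, with `box_color` and `text_band` controlling their appearance.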

colors = ['aqua', 'aquamarine', 'black', 'blanchedalmond', 'blue', 'blueviolet', 'brown', 'burlywood',...#