sparknlp_jsl.utils.ocr_nlp_processor
Module Contents#
Functions#
ocr_entity_processor | Generates an annotated PDF file using input PDF files
Attributes#
- ocr_entity_processor(spark: pyspark.sql.SparkSession, file_path: str, ner_pipeline: pyspark.ml.PipelineModel, style: str = 'bounding_box', save_dir: str = 'save_folder', label: bool = False, label_color: str = 'red', box_color: tuple = (0, 0, 0), color_chart_path: str = 'color_chart.png', chunk_col: str = 'ner_chunk', black_list: List[str] | None = [], display_result: bool = False, resolution: int = 200, confidenceThreshold: int = 70, pageIteratorLevel=PageIteratorLevel.SYMBOL, pageSegMode=PageSegmentationMode.SPARSE_TEXT_OSD, outline_color=None, outline_width=2, text_band=None, text_type='printed') IO #
Generates an annotated PDF file using input PDF files.
- Parameters:
spark (SparkSession) – Spark session with the spark-nlp-jsl and spark-ocr jars
file_path (str) – Path to PDF files
ner_pipeline (PipelineModel) – Fitted NER pipeline
chunk_col (str) – Name of the NER pipeline output column that holds the chunks to annotate, defaults to ‘ner_chunk’
black_list (list) – List of NER labels that will be painted over in ‘highlight’ and ‘bounding_box’ styles
style (str) – PDF processing style, one of three options, defaults to ‘bounding_box’. ‘colored_box’: draws single-color bands over the chunks detected by the NER pipeline (black by default). ‘bounding_box’: draws colored bounding boxes around the chunks detected by the NER pipeline; each color represents a different NER label. ‘highlight’: draws colored highlights over the chunks detected by the NER pipeline; each color represents a different NER label.
save_dir (str) – Path for saving folder of processed PDF files, defaults to ‘save_folder’
label (bool) – Set to True to write NER labels over the chunks, defaults to False
label_color (str) – Color of the NER labels if ‘label=True’, defaults to “red”
box_color (tuple) – RGB code for the colored box if ‘style=colored_box’, defaults to (0, 0, 0) (black)
color_chart_path (str) – File name of the color chart in PNG format that shows the colors of the NER labels in the processed file, defaults to “color_chart.png”
display_result (bool) – Set to True to see the output of processed file, defaults to False
outline_color (tuple) – Color of the outline (applied to all detected entities) if ‘style=bounding_box’, defaults to None (a random color for each entity)
outline_width (int) – Width of the outline if ‘style=bounding_box’, defaults to 2
text_band (str) – Text to be written on the colored box (band) if ‘style=colored_box’, defaults to None (no text)
- Returns:
PDF file
- Return type:
IO
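The parameters above can be combined as in the following minimal sketch. It is an illustration only: `build_processor_kwargs` and `annotate` are hypothetical helpers that are not part of the library, and the sketch assumes that `spark` and `ner_pipeline` have already been created with the licensed spark-nlp-jsl and spark-ocr libraries. Only parameter names that appear in the signature above are used.

```python
from typing import Any, Dict, List, Optional, Tuple

# The three styles documented in the parameter list above.
VALID_STYLES = ("colored_box", "bounding_box", "highlight")


def build_processor_kwargs(
    file_path: str,
    style: str = "bounding_box",
    save_dir: str = "save_folder",
    black_list: Optional[List[str]] = None,
    box_color: Tuple[int, int, int] = (0, 0, 0),
    text_band: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: check the style and collect keyword arguments."""
    if style not in VALID_STYLES:
        raise ValueError(f"style must be one of {VALID_STYLES}, got {style!r}")
    kwargs: Dict[str, Any] = {
        "file_path": file_path,
        "style": style,
        "save_dir": save_dir,
        # Copy the list so the caller's list is never shared or mutated.
        "black_list": list(black_list or []),
    }
    # box_color and text_band are documented as applying to 'colored_box' only.
    if style == "colored_box":
        kwargs["box_color"] = box_color
        kwargs["text_band"] = text_band
    return kwargs


def annotate(spark, ner_pipeline, **kwargs):
    """Hypothetical wrapper: forward the collected arguments to the library."""
    # Imported lazily so the helper above can be used without the licensed jars.
    from sparknlp_jsl.utils.ocr_nlp_processor import ocr_entity_processor

    return ocr_entity_processor(spark=spark, ner_pipeline=ner_pipeline, **kwargs)
```

With a live session, a call might look like `annotate(spark, ner_pipeline, **build_processor_kwargs("report.pdf", style="colored_box", text_band="REDACTED"))`, where the file name and band text are placeholder values.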
- colors = ['aqua', 'aquamarine', 'black', 'blanchedalmond', 'blue', 'blueviolet', 'brown', 'burlywood',...#