sparknlp_jsl.pipeline_output_parser#

Module Contents#

Classes#

PipelineOutputParser

This class is used to parse the output of a Spark NLP pipeline.

class PipelineOutputParser(columns_maps=None)#

This class is used to parse the output of a Spark NLP pipeline.

It provides methods for extracting named entities, assertions, code mapping, relations, summaries, and deidentification from the output.

Examples

>>> import sparknlp
>>> import sparknlp_jsl
>>> from sparknlp.pretrained import PretrainedPipeline
>>> from sparknlp_jsl.pipeline_output_parser import PipelineOutputParser

Load the pipeline and the text

>>> text = [
...     "Immunohistochemistry was negative for thyroid transcription factor-1 and napsin A. The test was positive for ER and PR, and negative for HER2.",
...     "The patient had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. " +
...     "The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to the residual breast. The cancer recurred as a right lung metastasis 13 years later. " +
...     "He underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy."
... ]
>>> oncology_pipeline = PretrainedPipeline("oncology_biomarker_pipeline", "en", "clinical/models")
>>> annotations = oncology_pipeline.fullAnnotate(text)

Define the column_maps dictionary

>>> column_maps = {
...     'document_identifier': 'XYZ_123',
...     'document_text': 'document',
...     'entities': ["merged_chunk"],
...     'assertions': ["assertion_chunk"],
...     'resolver': ["icd10_code"],
...     'relations': [],
...     'summaries': [],
...     'deidentifications': [],
...     'classifications': []
... }

Initialize the parser from the dictionary

>>> pipeline_parser = PipelineOutputParser(column_maps)

Run the parser on the output of a Spark NLP pipeline

>>> parsed_result = pipeline_parser.run(annotations)
>>> print(parsed_result)
check_column_mapping_dict(result: dict)#

This function is used to check whether the column mapping dictionary is valid. It verifies that the keys document_text, entities, relations, and summary are present in the dictionary, that the values of the entities, relations, and summary keys are themselves dictionaries, and that the keys ner_chunk_column_name, resolver_column_name, and assertion_column_name are present in those dictionaries.

Parameters:

result (dict) – The column mapping dictionary to validate.

Raises:

KeyError – If a key is missing from the column mapping dictionary.
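The validation described above can be sketched in plain Python. This is a hypothetical illustration, not the library's implementation: the required key names follow the column_maps example on this page, and the real checks may cover additional keys and shapes.

```python
# Hypothetical sketch of the column-mapping validation described above.
# Key names follow the column_maps example on this page; the real
# implementation may check different keys and value shapes.
REQUIRED_KEYS = ("document_text", "entities", "relations", "summaries")

def check_column_mapping(column_maps):
    """Raise KeyError if a required key is missing; return True otherwise."""
    for key in REQUIRED_KEYS:
        if key not in column_maps:
            raise KeyError(f"Missing key in column mapping dictionary: {key!r}")
    return True
```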

check_unique_id(entities, ner_chunk_result, ner_chunk_begin, ner_chunk_end)#
get_assertions(result)#

Extracts assertions from the pipeline output.

Parameters:

result (list) – The output of the pipeline containing the assertion column.

Returns:

  • chunk_id (str): The unique identifier for the chunk of text that the assertion applies to.

  • chunk (str): The text of the chunk.

  • assertion (str): The assertion status.

  • assertion_confidence (str): The confidence score of the assertion model in identifying assertion status.

  • assertion_source (str): The source of the assertion information.

Return type:

A list of dictionaries, where each dictionary represents an assertion, containing the following keys
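The returned dictionaries can be post-processed with ordinary list operations. The sample entries below are invented to illustrate the schema above; the helper filters assertions by status.

```python
# Illustrative only: assertion entries shaped like the return value above.
# The chunk_id, confidence, and source values are invented.
assertions = [
    {"chunk_id": "a1b2c3d4", "chunk": "thyroid transcription factor-1",
     "assertion": "Absent", "assertion_confidence": "0.99",
     "assertion_source": "assertion_chunk"},
    {"chunk_id": "e5f6a7b8", "chunk": "ER",
     "assertion": "Present", "assertion_confidence": "0.97",
     "assertion_source": "assertion_chunk"},
]

def filter_by_status(assertions, status):
    """Keep only the assertions whose status matches exactly."""
    return [a for a in assertions if a["assertion"] == status]
```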

get_classification(result)#

This function retrieves classification results from the pipeline output.

Parameters:

result (list) – The output of the pipeline from fullAnnotate. This list contains annotations for each document.

Returns:

  • category (str): The category of the classification.

  • sentence (str): The sentence from which the classification was made.

  • sentence_id (int): The ID of the sentence.

Return type:

A list of dictionaries, where each dictionary represents a classification result, containing the following keys
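A common follow-up is tallying results per category. The entries below are invented examples of the classification schema described above.

```python
from collections import Counter

# Illustrative only: classification entries shaped like the return value
# above, with invented categories and sentences.
classifications = [
    {"category": "Treatment", "sentence": "Radiotherapy was administered.", "sentence_id": 2},
    {"category": "Other", "sentence": "The cancer recurred 13 years later.", "sentence_id": 3},
]

def category_counts(classifications):
    """Count how many sentences fall into each category."""
    return Counter(c["category"] for c in classifications)
```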

get_deidentification(result)#

Extract deidentification text from the pipeline output.

Parameters:

result (list) – The output of the pipeline.

Returns:

  • original (list): A list of original text.

  • obfuscated (list): A list of deidentified text.

  • masked (list): A list of masked text.

Return type:

A list of dictionaries, where each dictionary represents a deidentification containing the following keys
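Because the three lists are parallel, pairing them up sentence by sentence is a one-liner. The sample values below are invented to illustrate the schema above.

```python
# Illustrative only: a deidentification entry shaped like the return
# value above; the original/masked/obfuscated texts are invented.
deid = {
    "original": ["John Smith was admitted on 2020-01-01."],
    "masked": ["<PATIENT> was admitted on <DATE>."],
    "obfuscated": ["Mark Green was admitted on 2020-02-11."],
}

def pairs(deid):
    """Zip each original sentence with its masked and obfuscated forms."""
    return list(zip(deid["original"], deid["masked"], deid["obfuscated"]))
```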

get_documents(result)#

Retrieves the document text from the pipeline output.

Parameters:

result (list) – The output of the pipeline from fullAnnotate. This list contains annotations for each document.

Returns:

A list of strings, where each string represents the text of a document.

Return type:

list

get_entities(result)#

Extract named entities from the pipeline output.

Parameters:

result (list) – The output of the pipeline.

Returns:

A list of dictionaries, where each dictionary represents a named entity. Each dictionary contains the following keys:

  • chunk_id (str): The unique identifier for the entity.

  • chunk (str): The text of the entity.

  • begin (int): The index of the beginning of the entity in the text.

  • end (int): The index of the end of the entity in the text.

  • ner_label (str): The label of the entity, as determined by the NER model.

  • ner_source (str): The source of the NER model that generated this entity.

  • ner_confidence (float): The confidence score of the NER model in identifying this entity.

Return type:

entity_results (list)
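The entity dictionaries group naturally by their NER label. The entries below are invented examples of the schema above, not real pipeline output.

```python
from collections import defaultdict

# Illustrative only: entity entries shaped like the return value above,
# with invented offsets, ids, and confidences.
entities = [
    {"chunk_id": "00a1", "chunk": "HER2", "begin": 138, "end": 141,
     "ner_label": "Biomarker", "ner_source": "merged_chunk", "ner_confidence": 0.98},
    {"chunk_id": "00b2", "chunk": "negative", "begin": 125, "end": 132,
     "ner_label": "Biomarker_Result", "ner_source": "merged_chunk", "ner_confidence": 0.95},
]

def group_by_label(entities):
    """Map each NER label to the list of chunks carrying it."""
    grouped = defaultdict(list)
    for e in entities:
        grouped[e["ner_label"]].append(e["chunk"])
    return dict(grouped)
```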

get_relations(result, return_relation_entities=False)#

Extract relations from the pipeline output.

Parameters:

result (list) – The output of the pipeline.

Returns:

  • relation (str): The type of relation.

  • entity1 (str): The first entity involved in the relation.

  • chunk1_id (str): The unique identifier for the first chunk.

  • entity1_begin (int): The index of the beginning of the first chunk.

  • entity1_end (int): The index of the end of the first chunk.

  • chunk1 (str): The text of the first chunk.

  • entity2 (str): The second entity involved in the relation.

  • chunk2_id (str): The unique identifier for the second chunk.

  • entity2_begin (int): The index of the beginning of the second chunk.

  • entity2_end (int): The index of the end of the second chunk.

  • chunk2 (str): The text of the second chunk.

  • confidence (float): The confidence score of the relation.

  • direction (str): The direction of the relation.

Return type:

A list of dictionaries, where each dictionary represents a relation, containing the following keys
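Relation entries are often thresholded on their confidence before further use. The entries below are invented examples of the schema above; only the keys needed for filtering are shown.

```python
# Illustrative only: relation entries shaped like the return value above,
# with invented labels and confidences (a subset of the documented keys).
relations = [
    {"relation": "is_finding_of", "entity1": "Biomarker", "chunk1": "ER",
     "entity2": "Biomarker_Result", "chunk2": "positive",
     "confidence": 0.92, "direction": "both"},
    {"relation": "O", "entity1": "Biomarker", "chunk1": "PR",
     "entity2": "Biomarker_Result", "chunk2": "negative",
     "confidence": 0.41, "direction": "both"},
]

def confident_relations(relations, threshold=0.5):
    """Keep only relations at or above the confidence threshold."""
    return [r for r in relations if r["confidence"] >= threshold]
```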

get_resolutions(result, entities)#

Extracts terminology codes from the pipeline output.

Returns:

  • vocab (str): The vocabulary information about terminology codes.

  • chunk_id (str): The unique identifier for the entity.

  • chunk (str): The text of the chunk.

  • term_code (str): The code of the term.

  • resolutions (str): The resolved text of the term.

  • all_k_codes (list): A list of all k candidate codes.

  • all_k_resolutions (list): A list of all k resolutions.

  • all_k_aux_labels (list): A list of all k aux labels.

  • all_k_distances (list): A list of all k distances.

  • confidence (float): The confidence score of the resolutions.

Return type:

A list of dictionaries, where each dictionary represents a named entity, containing the following keys
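The parallel all_k_* lists make it easy to re-rank candidates, e.g. by picking the code with the smallest distance. The entry below is invented to illustrate the schema above.

```python
# Illustrative only: a resolution entry shaped like the return value
# above, with invented codes, resolutions, and distances.
resolution = {
    "chunk": "breast cancer",
    "all_k_codes": ["C50.9", "C50.1", "D05.9"],
    "all_k_resolutions": [
        "malignant neoplasm of breast, unspecified",
        "malignant neoplasm of central portion of breast",
        "carcinoma in situ of breast, unspecified",
    ],
    "all_k_distances": [0.02, 0.11, 0.25],
}

def best_candidate(res):
    """Return the (code, resolution) pair with the smallest distance."""
    i = res["all_k_distances"].index(min(res["all_k_distances"]))
    return res["all_k_codes"][i], res["all_k_resolutions"][i]
```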

get_summary(result)#

Generates a summary from the provided text.

Returns:

Summarized text.

get_unique_id(ner_chunk_result, ner_chunk_begin, ner_chunk_end, ner_label)#

Generate a unique identifier for a named entity chunk.

Returns: str: A unique identifier for the named entity chunk, encoded in base64. The identifier is truncated to the first 8 characters.

Note: This method uses the uuid3 function from the uuid module to generate a unique identifier based on the provided parameters. The identifier is then encoded in base64 and truncated to the first 8 characters.
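The scheme described in the note can be sketched as follows. This is an assumption-laden illustration: the exact UUID namespace and the order in which the fields are combined are guesses, so the ids it produces will not match the library's, but the uuid3 + base64 + truncation shape is the one documented above.

```python
import base64
import uuid

def make_chunk_id(chunk, begin, end, label):
    """Sketch of the documented scheme: uuid3 over the chunk fields,
    base64-encoded and truncated to the first 8 characters.
    The namespace (NAMESPACE_DNS) and field order are assumptions."""
    name = f"{chunk}:{begin}:{end}:{label}"
    uid = uuid.uuid3(uuid.NAMESPACE_DNS, name)
    return base64.b64encode(uid.bytes).decode("ascii")[:8]
```

Because uuid3 is deterministic, the same chunk always maps to the same id, which is what lets the parser cross-reference entities, assertions, and resolutions by chunk_id.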

run(results, return_relation_entities=False)#

Parse the pipeline output.

Parameters:

results (list) – The output of the pipeline from fullAnnotate.

Returns:

  • document_identifier (str): The identifier of the document.

  • document_id (str): The ID of the document.

  • document_text (str): The text of the document.

  • entities (list): A list of dictionaries, where each dictionary represents a named entity, containing the following keys:
    • chunk_id (str): The unique identifier for the chunk.

    • chunk (str): The text of the entity.

    • begin (int): The index of the beginning of the entity in the text.

    • end (int): The index of the end of the entity in the text.

    • ner_label (str): The label of the entity, as determined by the NER model.

    • ner_source (str): The source of the NER model that generated this entity.

    • ner_confidence (float): The confidence score of the NER model in identifying this entity.

  • assertions (list): A list of dictionaries, where each dictionary represents an assertion, containing the following keys:
    • chunk_id (str): The unique identifier for the chunk.

    • chunk (str): The text of the chunk.

    • assertion (str): The assertion status.

    • assertion_confidence (str): The confidence score of the assertion model in identifying assertion status.

    • assertion_source (str): The source of the assertion information.

  • resolutions (list): A list of dictionaries, where each dictionary represents a resolution, containing the following keys:
    • vocab (str): The vocabulary information about terminology codes.

    • chunk_id (str): The unique identifier for the chunk.

    • chunk (str): The text of the chunk.

    • term_code (str): The code of the term.

    • resolutions (str): The resolved text of the term.

    • all_k_codes (list): A list of all k candidate codes.

    • all_k_resolutions (list): A list of all k resolutions.

    • all_k_aux_labels (list): A list of all k aux labels.

    • all_k_distances (list): A list of all k distances.

    • confidence (float): The confidence score of the resolutions.

  • relations (list): A list of dictionaries, where each dictionary represents a relation, containing the following keys:
    • relation (str): The type of relation.

    • entity1 (str): The first entity involved in the relation.

    • chunk1_id (str): The unique identifier for the first chunk.

    • entity1_begin (int): The index of the beginning of the first chunk.

    • entity1_end (int): The index of the end of the first chunk.

    • chunk1 (str): The text of the first chunk.

    • entity2 (str): The second entity involved in the relation.

    • chunk2_id (str): The unique identifier for the second chunk.

    • entity2_begin (int): The index of the beginning of the second chunk.

    • entity2_end (int): The index of the end of the second chunk.

    • chunk2 (str): The text of the second chunk.

    • confidence (float): The confidence score of the relation.

    • direction (str): The direction of the relation.

  • deidentifications: (list) A list of dictionaries, where each dictionary represents a deidentification, containing the following keys:
    • original (list): A list of original text.

    • obfuscated (list): A list of deidentified text.

    • masked (list): A list of masked text.

  • classifications (list): A list of dictionaries, where each dictionary represents a classification result, containing the following keys:
    • category (str): The category of the classification.

    • sentence (str): The sentence from which the classification was made.

    • sentence_id (int): The ID of the sentence.

Return type:

A dictionary containing the parsed results, where the key is “result” and the value is a list of dictionaries, where each dictionary represents a document, containing the following keys
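The nesting described above (a top-level "result" list of per-document dictionaries) can be walked with plain comprehensions. The sample below is invented to illustrate that shape, showing only a few of the documented keys.

```python
# Illustrative only: a minimal parsed result following the documented
# "result" -> per-document -> entities nesting; all values are invented.
parsed = {
    "result": [
        {
            "document_identifier": "XYZ_123",
            "document_id": "0",
            "document_text": "The test was positive for ER.",
            "entities": [{"chunk": "ER", "ner_label": "Biomarker"}],
            "assertions": [{"chunk": "ER", "assertion": "Present"}],
        }
    ]
}

def entity_chunks(parsed):
    """Collect every entity chunk across all parsed documents."""
    return [e["chunk"] for doc in parsed["result"] for e in doc["entities"]]
```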

update_columns_maps(columns_maps=None)#

Update the column mapping dictionary.

Parameters:

columns_maps (dict) – A dictionary containing the column mappings.