sparknlp_jsl.pipeline_output_parser#

Module Contents#

Classes#

PipelineOutputParser

This class is used to parse the output of a Spark NLP pipeline.

class PipelineOutputParser(columns_maps=None)#

This class is used to parse the output of a Spark NLP pipeline.

It provides methods for extracting named entities, assertions, code mapping, relations, summaries, and deidentification from the output.

Examples

>>> import sparknlp
>>> import sparknlp_jsl
>>> from sparknlp.pretrained import PretrainedPipeline
>>> from sparknlp_jsl.pipeline_output_parser import PipelineOutputParser

Load the pipeline and the text

>>> text = [
...     "Immunohistochemistry was negative for thyroid transcription factor-1 and napsin A. The test was positive for ER and PR, and negative for HER2.",
...     "The patient had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. " +
...     "The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to the residual breast. The cancer recurred as a right lung metastasis 13 years later. " +
...     "He underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy."
... ]
>>> oncology_pipeline = PretrainedPipeline("oncology_biomarker_pipeline", "en", "clinical/models")
>>> annotations = oncology_pipeline.fullAnnotate(text)

Define the column_maps dictionary

>>> column_maps = {
...     'document_identifier': 'XYZ_123',
...     'document_text': 'document',
...     'entities': ["merged_chunk"],
...     'assertions': ["assertion_chunk"],
...     'resolver': ["icd10_code"],
...     'relations': [],
...     'summaries': [],
...     'deidentifications': [],
...     'classifications': []
... }

Initialize the parser from the dictionary

>>> pipeline_parser = PipelineOutputParser(column_maps)

Run the parser on the output of a Spark NLP pipeline

>>> parsed_result = pipeline_parser.run(annotations)
>>> print(parsed_result)
check_column_mapping_dict(result: dict)#

This function is used to check whether the column mapping dictionary is valid. It verifies that the keys document_text, entities, relations, and summary are present in the dictionary, that the values of the entities, relations, and summary keys are themselves dictionaries, and that the keys ner_chunk_column_name, resolver_column_name, and assertion_column_name are present in those dictionaries.

Parameters:

result (dict) – The column mapping dictionary to validate.

Raises:

KeyError – If a key is missing from the column mapping dictionary.
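The validation described above can be sketched in plain Python. This is a hypothetical illustration, not the library's implementation: the required key names follow the column_maps example on this page, and the real checks may cover additional keys and shapes.

```python
# Hypothetical sketch of the column-mapping validation described above.
# Key names follow the column_maps example on this page; the real
# implementation may check different keys and value shapes.
REQUIRED_KEYS = ("document_text", "entities", "relations", "summaries")

def check_column_mapping(column_maps):
    """Raise KeyError if a required key is missing; return True otherwise."""
    for key in REQUIRED_KEYS:
        if key not in column_maps:
            raise KeyError(f"Missing key in column mapping dictionary: {key!r}")
    return True
```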

check_unique_id(entities, ner_chunk_result, ner_chunk_begin, ner_chunk_end)#
get_assertions(result)#

Extracts assertions from the pipeline output.

Parameters:

result (list) – The output of the pipeline containing the assertion column.

Returns:

  • chunk_id (str): The unique identifier for the chunk of text that the assertion applies to.

  • chunk (str): The text of the chunk.

  • assertion (str): The assertion status.

  • assertion_confidence (str): The confidence score of the assertion model in identifying assertion status.

  • assertion_source (str): The source of the assertion information.

Return type:

A list of dictionaries, where each dictionary represents an assertion, containing the following keys
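The returned dictionaries can be post-processed with ordinary list operations. The sample entries below are invented to illustrate the schema above; the helper filters assertions by status.

```python
# Illustrative only: assertion entries shaped like the return value above.
# The chunk_id, confidence, and source values are invented.
assertions = [
    {"chunk_id": "a1b2c3d4", "chunk": "thyroid transcription factor-1",
     "assertion": "Absent", "assertion_confidence": "0.99",
     "assertion_source": "assertion_chunk"},
    {"chunk_id": "e5f6a7b8", "chunk": "ER",
     "assertion": "Present", "assertion_confidence": "0.97",
     "assertion_source": "assertion_chunk"},
]

def filter_by_status(assertions, status):
    """Keep only the assertions whose status matches exactly."""
    return [a for a in assertions if a["assertion"] == status]
```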

get_classification(result)#

This function retrieves classification results from the pipeline output.

Parameters:

result (list) – The output of the pipeline from fullAnnotate. This list contains annotations for each document.

Returns:

  • category (str): The category of the classification.

  • sentence (str): The sentence from which the classification was made.

  • sentence_id (int): The ID of the sentence.

Return type:

A list of dictionaries, where each dictionary represents a classification result, containing the following keys
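A common follow-up is tallying results per category. The entries below are invented examples of the classification schema described above.

```python
from collections import Counter

# Illustrative only: classification entries shaped like the return value
# above, with invented categories and sentences.
classifications = [
    {"category": "Treatment", "sentence": "Radiotherapy was administered.", "sentence_id": 2},
    {"category": "Other", "sentence": "The cancer recurred 13 years later.", "sentence_id": 3},
]

def category_counts(classifications):
    """Count how many sentences fall into each category."""
    return Counter(c["category"] for c in classifications)
```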

get_deidentification(result)#

Extract deidentification text from the pipeline output.

Parameters:

result (list) – The output of the pipeline.

Returns:

  • original (list): A list of original text.

  • obfuscated (list): A list of deidentified text.

  • masked (list): A list of masked text.

Return type:

A list of dictionaries, where each dictionary represents a deidentification containing the following keys
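Because the three lists are parallel, pairing them up sentence by sentence is a one-liner. The sample values below are invented to illustrate the schema above.

```python
# Illustrative only: a deidentification entry shaped like the return
# value above; the original/masked/obfuscated texts are invented.
deid = {
    "original": ["John Smith was admitted on 2020-01-01."],
    "masked": ["<PATIENT> was admitted on <DATE>."],
    "obfuscated": ["Mark Green was admitted on 2020-02-11."],
}

def pairs(deid):
    """Zip each original sentence with its masked and obfuscated forms."""
    return list(zip(deid["original"], deid["masked"], deid["obfuscated"]))
```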

get_documents(result)#

Retrieves the document text from the pipeline output.

Parameters:

result (list) – The output of the pipeline from fullAnnotate. This list contains annotations for each document.

Returns:

A list of strings, where each string represents the text of a document.

Return type:

list

get_entities(result)#

Extract named entities from the pipeline output.

Parameters:

result (list) – The output of the pipeline.

Returns:

A list of dictionaries, where each dictionary represents a named entity. Each dictionary contains the following keys:

  • chunk_id (str): The unique identifier for the entity.

  • chunk (str): The text of the entity.

  • begin (int): The index of the beginning of the entity in the text.

  • end (int): The index of the end of the entity in the text.

  • ner_label (str): The label of the entity, as determined by the NER model.

  • ner_source (str): The source of the NER model that generated this entity.

  • ner_confidence (float): The confidence score of the NER model in identifying this entity.

Return type:

entity_results (list)
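The entity dictionaries group naturally by their NER label. The entries below are invented examples of the schema above, not real pipeline output.

```python
from collections import defaultdict

# Illustrative only: entity entries shaped like the return value above,
# with invented offsets, ids, and confidences.
entities = [
    {"chunk_id": "00a1", "chunk": "HER2", "begin": 138, "end": 141,
     "ner_label": "Biomarker", "ner_source": "merged_chunk", "ner_confidence": 0.98},
    {"chunk_id": "00b2", "chunk": "negative", "begin": 125, "end": 132,
     "ner_label": "Biomarker_Result", "ner_source": "merged_chunk", "ner_confidence": 0.95},
]

def group_by_label(entities):
    """Map each NER label to the list of chunks carrying it."""
    grouped = defaultdict(list)
    for e in entities:
        grouped[e["ner_label"]].append(e["chunk"])
    return dict(grouped)
```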

get_relations(result, return_relation_entities=False)#

Extract relations from the pipeline output.

Parameters:

result (list) – The output of the pipeline.

Returns:

  • relation (str): The type of relation.

  • entity1 (str): The first entity involved in the relation.

  • chunk1_id (str): The unique identifier for the first chunk.

  • entity1_begin (int): The index of the beginning of the first chunk.

  • entity1_end (int): The index of the end of the first chunk.

  • chunk1 (str): The text of the first chunk.

  • entity2 (str): The second entity involved in the relation.

  • chunk2_id (str): The unique identifier for the second chunk.

  • entity2_begin (int): The index of the beginning of the second chunk.

  • entity2_end (int): The index of the end of the second chunk.

  • chunk2 (str): The text of the second chunk.

  • confidence (float): The confidence score of the relation.

  • direction (str): The direction of the relation.

Return type:

A list of dictionaries, where each dictionary represents a relation, containing the following keys
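Relation entries are often thresholded on their confidence before further use. The entries below are invented examples of the schema above; only the keys needed for filtering are shown.

```python
# Illustrative only: relation entries shaped like the return value above,
# with invented labels and confidences (a subset of the documented keys).
relations = [
    {"relation": "is_finding_of", "entity1": "Biomarker", "chunk1": "ER",
     "entity2": "Biomarker_Result", "chunk2": "positive",
     "confidence": 0.92, "direction": "both"},
    {"relation": "O", "entity1": "Biomarker", "chunk1": "PR",
     "entity2": "Biomarker_Result", "chunk2": "negative",
     "confidence": 0.41, "direction": "both"},
]

def confident_relations(relations, threshold=0.5):
    """Keep only relations at or above the confidence threshold."""
    return [r for r in relations if r["confidence"] >= threshold]
```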

get_resolutions(result, entities)#

Extracts terminology codes from the pipeline output.

Returns:

  • vocab (str): The vocabulary information about terminology codes.

  • chunk_id (str): The unique identifier for the entity.

  • chunk (str): The text of the chunk.

  • term_code (str): The code of the term.

  • resolutions (str): The resolved text of the term.

  • all_k_codes (list): A list of all k candidate codes.

  • all_k_resolutions (list): A list of all k resolutions.

  • all_k_aux_labels (list): A list of all k aux labels.

  • all_k_distances (list): A list of all k distances.

  • confidence (float): The confidence score of the resolutions.

Return type:

A list of dictionaries, where each dictionary represents a named entity, containing the following keys
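The parallel all_k_* lists make it easy to re-rank candidates, e.g. by picking the code with the smallest distance. The entry below is invented to illustrate the schema above.

```python
# Illustrative only: a resolution entry shaped like the return value
# above, with invented codes, resolutions, and distances.
resolution = {
    "chunk": "breast cancer",
    "all_k_codes": ["C50.9", "C50.1", "D05.9"],
    "all_k_resolutions": [
        "malignant neoplasm of breast, unspecified",
        "malignant neoplasm of central portion of breast",
        "carcinoma in situ of breast, unspecified",
    ],
    "all_k_distances": [0.02, 0.11, 0.25],
}

def best_candidate(res):
    """Return the (code, resolution) pair with the smallest distance."""
    i = res["all_k_distances"].index(min(res["all_k_distances"]))
    return res["all_k_codes"][i], res["all_k_resolutions"][i]
```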

get_summary(result)#

Generates a summary from the provided text.

Returns:

Summarized text.

get_unique_id(ner_chunk_result, ner_chunk_begin, ner_chunk_end, ner_label)#

Generate a unique identifier for a named entity chunk.

Returns: str: A unique identifier for the named entity chunk, encoded in base64. The identifier is truncated to the first 8 characters.

Note: This method uses the uuid3 function from the uuid module to generate a unique identifier based on the provided parameters. The identifier is then encoded in base64 and truncated to the first 8 characters.
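The scheme described in the note can be sketched as follows. This is an assumption-laden illustration: the exact UUID namespace and the order in which the fields are combined are guesses, so the ids it produces will not match the library's, but the uuid3 + base64 + truncation shape is the one documented above.

```python
import base64
import uuid

def make_chunk_id(chunk, begin, end, label):
    """Sketch of the documented scheme: uuid3 over the chunk fields,
    base64-encoded and truncated to the first 8 characters.
    The namespace (NAMESPACE_DNS) and field order are assumptions."""
    name = f"{chunk}:{begin}:{end}:{label}"
    uid = uuid.uuid3(uuid.NAMESPACE_DNS, name)
    return base64.b64encode(uid.bytes).decode("ascii")[:8]
```

Because uuid3 is deterministic, the same chunk always maps to the same id, which is what lets the parser cross-reference entities, assertions, and resolutions by chunk_id.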

run(results, return_relation_entities=False)#

Parse the pipeline output.

Parameters:

results (list) – The output of the pipeline from fullAnnotate.

Returns:

  • document_identifier (str): The identifier of the document.

  • document_id (str): The ID of the document.

  • document_text (str): The text of the document.

  • entities (list): A list of dictionaries, where each dictionary represents a named entity, containing the following keys:
    • chunk_id (str): The unique identifier for the chunk.

    • chunk (str): The text of the entity.

    • begin (int): The index of the beginning of the entity in the text.

    • end (int): The index of the end of the entity in the text.

    • ner_label (str): The label of the entity, as determined by the NER model.

    • ner_source (str): The source of the NER model that generated this entity.

    • ner_confidence (float): The confidence score of the NER model in identifying this entity.

  • assertions (list): A list of dictionaries, where each dictionary represents an assertion, containing the following keys:
    • chunk_id (str): The unique identifier for the chunk.

    • chunk (str): The text of the chunk.

    • assertion (str): The assertion status.

    • assertion_confidence (str): The confidence score of the assertion model in identifying assertion status.

    • assertion_source (str): The source of the assertion information.

  • resolutions (list): A list of dictionaries, where each dictionary represents a resolution, containing the following keys:
    • vocab (str): The vocabulary information about terminology codes.

    • chunk_id (str): The unique identifier for the chunk.

    • chunk (str): The text of the chunk.

    • term_code (str): The code of the term.

    • resolutions (str): The resolved text of the term.

    • all_k_codes (list): A list of all k candidate codes.

    • all_k_resolutions (list): A list of all k resolutions.

    • all_k_aux_labels (list): A list of all k aux labels.

    • all_k_distances (list): A list of all k distances.

    • confidence (float): The confidence score of the resolutions.

  • relations (list): A list of dictionaries, where each dictionary represents a relation, containing the following keys:
    • relation (str): The type of relation.

    • entity1 (str): The first entity involved in the relation.

    • chunk1_id (str): The unique identifier for the first chunk.

    • entity1_begin (int): The index of the beginning of the first chunk.

    • entity1_end (int): The index of the end of the first chunk.

    • chunk1 (str): The text of the first chunk.

    • entity2 (str): The second entity involved in the relation.

    • chunk2_id (str): The unique identifier for the second chunk.

    • entity2_begin (int): The index of the beginning of the second chunk.

    • entity2_end (int): The index of the end of the second chunk.

    • chunk2 (str): The text of the second chunk.

    • confidence (float): The confidence score of the relation.

    • direction (str): The direction of the relation.

  • deidentifications: (list) A list of dictionaries, where each dictionary represents a deidentification, containing the following keys:
    • original (list): A list of original text.

    • obfuscated (list): A list of deidentified text.

    • masked (list): A list of masked text.

  • classifications (list): A list of dictionaries, where each dictionary represents a classification result, containing the following keys:
    • category (str): The category of the classification.

    • sentence (str): The sentence from which the classification was made.

    • sentence_id (int): The ID of the sentence.

Return type:

A dictionary containing the parsed results, where the key is “result” and the value is a list of dictionaries, where each dictionary represents a document, containing the following keys
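The nesting described above (a top-level "result" list of per-document dictionaries) can be walked with plain comprehensions. The sample below is invented to illustrate that shape, showing only a few of the documented keys.

```python
# Illustrative only: a minimal parsed result following the documented
# "result" -> per-document -> entities nesting; all values are invented.
parsed = {
    "result": [
        {
            "document_identifier": "XYZ_123",
            "document_id": "0",
            "document_text": "The test was positive for ER.",
            "entities": [{"chunk": "ER", "ner_label": "Biomarker"}],
            "assertions": [{"chunk": "ER", "assertion": "Present"}],
        }
    ]
}

def entity_chunks(parsed):
    """Collect every entity chunk across all parsed documents."""
    return [e["chunk"] for doc in parsed["result"] for e in doc["entities"]]
```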

update_columns_maps(columns_maps=None)#

Update the column mapping dictionary.

Parameters:

columns_maps (dict) – A dictionary containing the column mappings.