sparknlp_jsl.pipeline_output_parser#
Module Contents#
Classes#
PipelineOutputParser: This class is used to parse the output of a Spark NLP pipeline.
- class PipelineOutputParser(columns_maps=None)#
This class is used to parse the output of a Spark NLP pipeline.
It provides methods for extracting named entities, assertions, code resolutions, relations, summaries, deidentification results, and classifications from the output.
Examples
>>> import sparknlp
>>> import sparknlp_jsl
>>> from sparknlp.pretrained import PretrainedPipeline
>>> from sparknlp_jsl.pipeline_output_parser import PipelineOutputParser
Load the pipeline and the text
>>> text = [
...     "Immunohistochemistry was negative for thyroid transcription factor-1 and napsin A. The test was positive for ER and PR, and negative for HER2.",
...     "The patient had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago. " +
...     "The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to the residual breast. The cancer recurred as a right lung metastasis 13 years later. " +
...     "He underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy."
... ]
>>> oncology_pipeline = PretrainedPipeline("oncology_biomarker_pipeline", "en", "clinical/models")
>>> annotations = oncology_pipeline.fullAnnotate(text)
Define the column_maps dictionary
>>> column_maps = {
...     'document_identifier': 'XYZ_123',
...     'document_text': 'document',
...     'entities': ["merged_chunk"],
...     'assertions': ["assertion_chunk"],
...     'resolver': ["icd10_code"],
...     'relations': [],
...     'summaries': [],
...     "deidentifications": [],
...     "classifications": []
... }
Initialize the parser from the dictionary
>>> pipeline_parser = PipelineOutputParser(column_maps)
Run the parser on the output of a Spark NLP pipeline
>>> parsed_result = pipeline_parser.run(annotations)
>>> print(parsed_result)
- check_column_mapping_dict(result: dict)#
Checks whether the column mapping dictionary is valid. It checks that the following keys are present in the dictionary: document_text, entities, relations, and summary. It also checks that the values of the entities, relations, and summary keys are dictionaries, and that the keys ner_chunk_column_name, resolver_column_name, and assertion_column_name are present in these dictionaries.
- Parameters:
result (list) – The output of the pipeline.
- Raises:
KeyError – If a key is missing from the column mapping dictionary.
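The key check described above can be illustrated with a minimal sketch. The helper name `validate_column_maps` and the tuple of required keys are assumptions made for illustration; they are not the library's implementation. Only the behavior of raising KeyError for a missing key comes from the description.

```python
# Hypothetical helper, not part of sparknlp_jsl.
# The required keys are an assumption taken from the prose description above.
REQUIRED_KEYS = ("document_text", "entities", "relations", "summary")


def validate_column_maps(column_maps, required_keys=REQUIRED_KEYS):
    """Raise KeyError if any required key is missing from column_maps."""
    missing = [key for key in required_keys if key not in column_maps]
    if missing:
        raise KeyError(f"Missing keys in column mapping: {missing}")
    return True


# A mapping that contains every assumed key passes the check.
valid_maps = {
    "document_text": "document",
    "entities": ["merged_chunk"],
    "relations": [],
    "summary": [],
}
validate_column_maps(valid_maps)

# A mapping with missing keys is rejected with KeyError.
try:
    validate_column_maps({"document_text": "document"})
except KeyError as err:
    print("Invalid mapping:", err)
```

Collecting all missing keys before raising, rather than failing on the first one, makes the error message more useful when several keys are absent.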
- check_unique_id(entities, ner_chunk_result, ner_chunk_begin, ner_chunk_end)#
- get_assertions(result)#
Extracts assertions from the pipeline output.
- Parameters:
result (list) – The output of the pipeline, from which the assertion column is read.
- Returns:
A list of dictionaries, where each dictionary represents an assertion, containing the following keys:
chunk_id (str): The unique identifier for the chunk of text that the assertion applies to.
chunk (str): The text of the chunk.
assertion (str): The assertion status.
assertion_confidence (str): The confidence score of the assertion model in identifying assertion status.
assertion_source (str): The source of the assertion information.
- Return type:
list
- get_classification(result)#
This function retrieves classification results from the pipeline output.
- Parameters:
result (list) – The output of the pipeline from fullAnnotate. This list contains annotations for each document.
- Returns:
A list of dictionaries, where each dictionary represents a classification result. Each dictionary contains the following keys:
category (str): The category of the classification.
sentence (str): The sentence from which the classification was made.
sentence_id (int): The ID of the sentence.
- Return type:
list
- get_deidentification(result)#
Extract deidentification text from the pipeline output.
- Parameters:
result (list) – The output of the pipeline.
- Returns:
A list of dictionaries, where each dictionary represents a deidentification, containing the following keys:
original (list): A list of original text.
obfuscated (list): A list of deidentified text.
masked (list): A list of masked text.
- Return type:
list
- get_documents(result)#
Retrieves the document text from the pipeline output.
- Parameters:
result (list) – The output of the pipeline from fullAnnotate. This list contains annotations for each document.
- Returns:
A list of strings, where each string represents the text of a document.
- Return type:
list
- get_entities(result)#
Extract named entities from the pipeline output.
- Parameters:
result (list) – The output of the pipeline.
- Returns:
A list of dictionaries, where each dictionary represents a named entity. Each dictionary contains the following keys:
chunk_id (str): The unique identifier for the entity.
chunk (str): The text of the entity.
begin (int): The index of the beginning of the entity in the text.
end (int): The index of the end of the entity in the text.
ner_label (str): The label of the entity, as determined by the NER model.
ner_source (str): The source of the NER model that generated this entity.
ner_confidence (float): The confidence score of the NER model in identifying this entity.
- Return type:
entity_results (list)
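As an illustration of working with the entity dictionaries described above, the following sketch filters a list of entities by label and confidence. The helper `filter_entities` is hypothetical, and the sample data is invented to match the documented keys; it is not real pipeline output.

```python
# Mock data shaped like the documented entity dictionaries.
# All values are invented for illustration.
entities = [
    {"chunk_id": "a1b2c3d4", "chunk": "negative", "begin": 25, "end": 32,
     "ner_label": "Biomarker_Result", "ner_source": "merged_chunk",
     "ner_confidence": 0.98},
    {"chunk_id": "e5f6a7b8", "chunk": "thyroid transcription factor-1",
     "begin": 38, "end": 67, "ner_label": "Biomarker",
     "ner_source": "merged_chunk", "ner_confidence": 0.91},
    {"chunk_id": "c9d0e1f2", "chunk": "napsin A", "begin": 73, "end": 80,
     "ner_label": "Biomarker", "ner_source": "merged_chunk",
     "ner_confidence": 0.62},
]


def filter_entities(entities, label=None, min_confidence=0.0):
    """Keep entities matching an optional label and a minimum confidence."""
    selected = []
    for entity in entities:
        if label is not None and entity["ner_label"] != label:
            continue
        # float() tolerates confidences that arrive as strings.
        if float(entity["ner_confidence"]) < min_confidence:
            continue
        selected.append(entity)
    return selected


confident_biomarkers = filter_entities(entities, label="Biomarker", min_confidence=0.9)
print([entity["chunk"] for entity in confident_biomarkers])
```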
- get_relations(result, return_relation_entities=False)#
Extract relations from the pipeline output.
- Parameters:
result (list) – The output of the pipeline.
- Returns:
A list of dictionaries, where each dictionary represents a relation, containing the following keys:
relation (str): The type of relation.
entity1 (str): The first entity involved in the relation.
chunk1_id (str): The unique identifier for the first chunk.
entity1_begin (int): The index of the beginning of the first chunk.
entity1_end (int): The index of the end of the first chunk.
chunk1 (str): The text of the first chunk.
entity2 (str): The second entity involved in the relation.
chunk2_id (str): The unique identifier for the second chunk.
entity2_begin (int): The index of the beginning of the second chunk.
entity2_end (int): The index of the end of the second chunk.
chunk2 (str): The text of the second chunk.
confidence (float): The confidence score of the relation.
direction (str): The direction of the relation.
- Return type:
list
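The relation dictionaries can be regrouped by relation type for quick inspection. The sketch below assumes the documented keys `relation`, `chunk1`, `chunk2`, and `confidence`; the helper `group_by_relation`, the relation labels, and the sample values are invented for illustration.

```python
from collections import defaultdict

# Mock data shaped like the documented relation dictionaries (invented values).
relations = [
    {"relation": "is_finding_of", "entity1": "Biomarker_Result", "chunk1": "positive",
     "entity2": "Biomarker", "chunk2": "ER", "confidence": 0.97},
    {"relation": "is_finding_of", "entity1": "Biomarker_Result", "chunk1": "negative",
     "entity2": "Biomarker", "chunk2": "HER2", "confidence": 0.95},
    {"relation": "O", "entity1": "Biomarker_Result", "chunk1": "positive",
     "entity2": "Biomarker", "chunk2": "HER2", "confidence": 0.55},
]


def group_by_relation(relations, min_confidence=0.0):
    """Map each relation type to a list of (chunk1, chunk2) pairs."""
    grouped = defaultdict(list)
    for rel in relations:
        if float(rel["confidence"]) >= min_confidence:
            grouped[rel["relation"]].append((rel["chunk1"], rel["chunk2"]))
    return dict(grouped)


pairs = group_by_relation(relations, min_confidence=0.9)
print(pairs)
```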
- get_resolutions(result, entities)#
Extracts terminology codes from the pipeline output.
- Returns:
A list of dictionaries, where each dictionary represents a resolution for a named entity, containing the following keys:
vocab (str): The vocabulary information about terminology codes.
chunk_id (str): The unique identifier for the entity.
chunk (str): The text of the chunk.
term_code (str): The code of the term.
resolutions (str): The resolved text of the term.
all_k_codes (list): A list of all k results.
all_k_resolutions (list): A list of all k resolutions.
all_k_aux_labels (list): A list of all k aux labels.
all_k_distances (list): A list of all k distances.
confidence (float): The confidence score of the resolutions.
- Return type:
list
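The `all_k_*` lists can be combined to inspect the candidate codes for a single entity. The sketch below assumes the lists are parallel (the same index refers to the same candidate) and that a smaller distance means a closer match; both are assumptions, not documented behavior. The helper `top_k_candidates` and the sample values are invented for illustration.

```python
# Mock resolution dictionary with the documented keys (invented values).
resolution = {
    "vocab": "icd10_code",
    "chunk_id": "a1b2c3d4",
    "chunk": "left breast cancer",
    "term_code": "CODE_A",
    "resolutions": "candidate term A",
    "all_k_codes": ["CODE_A", "CODE_B", "CODE_C"],
    "all_k_resolutions": ["candidate term A", "candidate term B", "candidate term C"],
    "all_k_distances": [0.12, 0.47, 0.31],
}


def top_k_candidates(resolution, k=2):
    """Return the k closest (code, text, distance) candidates.

    Assumes the all_k_* lists are index-aligned and that a lower
    distance indicates a better match.
    """
    rows = zip(
        resolution["all_k_codes"],
        resolution["all_k_resolutions"],
        (float(d) for d in resolution["all_k_distances"]),
    )
    ranked = sorted(rows, key=lambda row: row[2])
    return ranked[:k]


for code, text, distance in top_k_candidates(resolution):
    print(code, text, distance)
```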
- get_summary(result)#
Generates a summary from the provided text.
- Returns:
Summarized text
- get_unique_id(ner_chunk_result, ner_chunk_begin, ner_chunk_end, ner_label)#
Generate a unique identifier for a named entity chunk.
- Returns:
A unique identifier for the named entity chunk, encoded in base64. The identifier is truncated to the first 8 characters.
- Return type:
str
Note: This method uses the uuid3 function from the uuid module to generate a unique identifier based on the provided parameters. The identifier is then encoded in base64 and truncated to the first 8 characters.
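The scheme in the note can be sketched with the standard library. How the four parameters are combined into a name, and which UUID namespace is used, are not stated in the documentation; the choices below (a `|`-joined string and `uuid.NAMESPACE_DNS`) are assumptions, so the identifiers produced here will not necessarily match those of the library. The function name `sketch_unique_id` is hypothetical.

```python
import base64
import uuid


def sketch_unique_id(ner_chunk_result, ner_chunk_begin, ner_chunk_end, ner_label):
    """Build a short deterministic identifier for a named entity chunk.

    Assumptions: the name is the four fields joined with "|" and the
    namespace is uuid.NAMESPACE_DNS. Neither is confirmed by the docs.
    """
    name = f"{ner_chunk_result}|{ner_chunk_begin}|{ner_chunk_end}|{ner_label}"
    # uuid3 derives a UUID from an MD5 hash of the namespace and the name.
    identifier = uuid.uuid3(uuid.NAMESPACE_DNS, name)
    # Encode the 16 raw bytes in base64 and keep the first 8 characters.
    encoded = base64.b64encode(identifier.bytes).decode("ascii")
    return encoded[:8]


first = sketch_unique_id("HER2", 130, 133, "Biomarker")
second = sketch_unique_id("HER2", 130, 133, "Biomarker")
print(first, first == second)
```

Because uuid3 is hash-based, the same chunk text, offsets, and label always produce the same identifier, which is what allows entities, assertions, and relations to refer to the same chunk.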
- run(results, return_relation_entities=False)#
Parse the pipeline output.
- Parameters:
results (list) – The output of the pipeline from fullAnnotate.
- Returns:
A dictionary containing the parsed results, where the key is “result” and the value is a list of dictionaries, where each dictionary represents a document, containing the following keys:
document_identifier (str): Information about the document.
document_id (str): The ID of the document.
document_text (str): The text of the document.
- entities (list): A list of dictionaries, where each dictionary represents a named entity, containing the following keys:
chunk_id (str): The unique identifier for the chunk.
chunk (str): The text of the entity.
begin (int): The index of the beginning of the entity in the text.
end (int): The index of the end of the entity in the text.
ner_label (str): The label of the entity, as determined by the NER model.
ner_source (str): The source of the NER model that generated this entity.
ner_confidence (float): The confidence score of the NER model in identifying this entity.
- assertions (list): A list of dictionaries, where each dictionary represents an assertion, containing the following keys:
chunk_id (str): The unique identifier for the chunk.
chunk (str): The text of the chunk.
assertion (str): The assertion status.
assertion_confidence (str): The confidence score of the assertion model in identifying assertion status.
assertion_source (str): The source of the assertion information.
- resolutions (list): A list of dictionaries, where each dictionary represents a resolution, containing the following keys:
vocab (str): The vocabulary information about terminology codes.
chunk_id (str): The unique identifier for the chunk.
chunk (str): The text of the chunk.
term_code (str): The code of the term.
resolutions (str): The resolved text of the term.
all_k_codes (list): A list of all k results.
all_k_resolutions (list): A list of all k resolutions.
all_k_aux_labels (list): A list of all k aux labels.
all_k_distances (list): A list of all k distances.
confidence (float): The confidence score of the resolutions.
- relations (list): A list of dictionaries, where each dictionary represents a relation, containing the following keys:
relation (str): The type of relation.
entity1 (str): The first entity involved in the relation.
chunk1_id (str): The unique identifier for the first chunk.
entity1_begin (int): The index of the beginning of the first chunk.
entity1_end (int): The index of the end of the first chunk.
chunk1 (str): The text of the first chunk.
entity2 (str): The second entity involved in the relation.
chunk2_id (str): The unique identifier for the second chunk.
entity2_begin (int): The index of the beginning of the second chunk.
entity2_end (int): The index of the end of the second chunk.
chunk2 (str): The text of the second chunk.
confidence (float): The confidence score of the relation.
direction (str): The direction of the relation.
- deidentifications (list): A list of dictionaries, where each dictionary represents a deidentification, containing the following keys:
original (list): A list of original text.
obfuscated (list): A list of deidentified text.
masked (list): A list of masked text.
- classifications (list): A list of dictionaries, where each dictionary represents a classification result, containing the following keys:
category (str): The category of the classification.
sentence (str): The sentence from which the classification was made.
sentence_id (int): The ID of the sentence.
- Return type:
dict
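Since entities and assertions both carry a `chunk_id`, the parsed output can be flattened into one row per entity with its assertion status attached. The sketch below runs on a hand-written dictionary that mimics the documented structure; the values, the assertion label, and the helper `entities_with_assertions` are invented for illustration and are not real pipeline output.

```python
# Mock output shaped like the documented return value of run() (invented values).
parsed_result = {
    "result": [
        {
            "document_identifier": "XYZ_123",
            "document_text": "The test was positive for ER and negative for HER2.",
            "entities": [
                {"chunk_id": "id_er", "chunk": "ER", "ner_label": "Biomarker"},
                {"chunk_id": "id_her2", "chunk": "HER2", "ner_label": "Biomarker"},
            ],
            "assertions": [
                {"chunk_id": "id_er", "chunk": "ER", "assertion": "Present"},
            ],
        }
    ]
}


def entities_with_assertions(parsed):
    """Return one row per entity, joined to its assertion by chunk_id."""
    rows = []
    for document in parsed["result"]:
        # Index assertions by chunk_id for constant-time lookup.
        status_by_chunk = {
            item["chunk_id"]: item["assertion"]
            for item in document.get("assertions", [])
        }
        for entity in document.get("entities", []):
            rows.append({
                "document_identifier": document.get("document_identifier"),
                "chunk": entity["chunk"],
                "ner_label": entity["ner_label"],
                # None when no assertion exists for this chunk.
                "assertion": status_by_chunk.get(entity["chunk_id"]),
            })
    return rows


rows = entities_with_assertions(parsed_result)
for row in rows:
    print(row)
```

Using `dict.get` with defaults keeps the sketch tolerant of documents where a section such as assertions is empty or absent.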
- update_columns_maps(columns_maps=None)#
Update the column mapping dictionary.
- Parameters:
columns_maps (dict) – A dictionary containing the column mappings.