Spark NLP for Healthcare Release Notes 4.2.3

 

4.2.3

Highlights

  • 3 new chunk mapper models to mapping Drugs and Diseases from the KEGG Database as well as mapping abbreviations to their categories
  • New utility & helper Relation Extraction modules to handle preprocess
  • New utility & helper OCR modules to handle annotate
  • New utility & helper NER log parser
  • Adding flexibility chunk merger prioritization
  • Core improvements and bug fixes
  • New and updated notebooks
  • 3 new clinical models and pipelines added & updated in total

3 New Hhunk Mapper Models to Mapping Drugs and Diseases from the KEGG Database as well as Mapping Abbreviations to Their Categories

  • kegg_disease_mapper: This pretrained model maps diseases with their corresponding category, description, icd10_code, icd11_code, mesh_code, and hierarchical brite_code. This model was trained with the data from the KEGG database.

Example:

chunkerMapper = ChunkMapperModel.pretrained("kegg_disease_mapper", "en", "clinical/models")\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("mappings")\
    .setRels(["description", "category", "icd10_code", "icd11_code", "mesh_code", "brite_code"])

text= "A 55-year-old female with a history of myopia, kniest dysplasia and prostate cancer. She was on glipizide , and dapagliflozin for congenital nephrogenic diabetes insipidus."

Result:

+-----------------------------------------+--------------------------------------------------+-----------------------+----------+----------+---------+-----------------------+
|                                ner_chunk|                                       description|               category|icd10_code|icd11_code|mesh_code|             brite_code|
+-----------------------------------------+--------------------------------------------------+-----------------------+----------+----------+---------+-----------------------+
|                                   myopia|Myopia is the most common ocular disorder world...| Nervous system disease|     H52.1|    9D00.0|  D009216|            08402,08403|
|                         kniest dysplasia|Kniest dysplasia is an autosomal dominant chond...|Congenital malformation|     Q77.7|    LD24.3|  C537207|            08402,08403|
|                          prostate cancer|Prostate cancer constitutes a major health prob...|                 Cancer|       C61|      2C82|     NONE|08402,08403,08442,08441|
|congenital nephrogenic diabetes insipidus|Nephrogenic diabetes insipidus (NDI) is charact...| Urinary system disease|     N25.1|   GB90.4A|  D018500|            08402,08403|
+-----------------------------------------+--------------------------------------------------+-----------------------+----------+----------+---------+-----------------------+
  • kegg_drug_mapper: This pretrained model maps drugs with their corresponding efficacy, molecular_weight as well as CAS, PubChem, ChEBI, LigandBox, NIKKAJI, PDB-CCD codes. This model was trained with the data from the KEGG database.

Example:

chunkerMapper = ChunkMapperModel.pretrained("kegg_drug_mapper", "en", "clinical/models")\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("mappings")\
    .setRels(["efficacy", "molecular_weight", "CAS", "PubChem", "ChEBI", "LigandBox", "NIKKAJI", "PDB-CCD"])

text= "She is given OxyContin, folic acid, levothyroxine, Norvasc, aspirin, Neurontin"

Result:

+-------------+--------------------------------------------------+----------------+----------+-----------+-------+---------+---------+-------+
|    ner_chunk|                                          efficacy|molecular_weight|       CAS|    PubChem|  ChEBI|LigandBox|  NIKKAJI|PDB-CCD|
+-------------+--------------------------------------------------+----------------+----------+-----------+-------+---------+---------+-------+
|    OxyContin|     Analgesic (narcotic), Opioid receptor agonist|        351.8246|  124-90-3|  7847912.0| 7859.0|   D00847|J281.239H|   NONE|
|   folic acid|Anti-anemic, Hematopoietic, Supplement (folic a...|        441.3975|   59-30-3|  7847138.0|27470.0|   D00070|  J1.392G|    FOL|
|levothyroxine|                     Replenisher (thyroid hormone)|          776.87|   51-48-9|9.6024815E7|18332.0|   D08125|  J4.118A|    T44|
|      Norvasc|Antihypertensive, Vasodilator, Calcium channel ...|        408.8759|88150-42-9|5.1091781E7| 2668.0|   D07450| J33.383B|   NONE|
|      aspirin|Analgesic, Anti-inflammatory, Antipyretic, Anti...|        180.1574|   50-78-2|  7847177.0|15365.0|   D00109|  J2.300K|    AIN|
|    Neurontin|                     Anticonvulsant, Antiepileptic|        171.2368|60142-96-3|  7847398.0|42797.0|   D00332| J39.388F|    GBN|
+-------------+--------------------------------------------------+----------------+----------+-----------+-------+---------+---------+-------+
  • abbreviation_category_mapper: This pretrained model maps abbreviations and acronyms of medical regulatory activities with their definitions and categories. Predicted categories: general, problem, test, treatment, medical_condition, clinical_dept, drug, nursing, internal_organ_or_component, hospital_unit, drug_frequency, employment, procedure.

Example:

chunkerMapper = ChunkMapperModel.pretrained("abbreviation_category_mapper", "en", "clinical/models")\
     .setInputCols(["abbr_ner_chunk"])\
     .setOutputCol("mappings")\
     .setRels(["definition", "category"])\

text = ["""Gravid with estimated fetal weight of 6-6/12 pounds.
         LABORATORY DATA: Laboratory tests include a CBC which is normal.
         VDRL: Nonreactive
         HIV: Negative. One-Hour Glucose: 117. Group B strep has not been done as yet."""]

Result:

| chunk   | category          | definition                             |
|:--------|:------------------|:---------------------------------------|
| CBC     | general           | complete blood count                   |
| VDRL    | clinical_dept     | Venereal Disease Research Laboratories |
| HIV     | medical_condition | Human immunodeficiency virus           |

New Utility & Helper Relation Extraction Modules to Handle Preprocess

This process is standard and training column should be same in all RE trainings. We can simplify this process with helper class. With proposed changes it can be done as follows:

Example:

from sparknlp_jsl.training import REDatasetHelper

# map entity columns to dataset columns
column_map = {
    "begin1": "firstCharEnt1",
    "end1": "lastCharEnt1",
    "begin2": "firstCharEnt2",
    "end2": "lastCharEnt2",
    "chunk1": "chunk1",
    "chunk2": "chunk2",
    "label1": "label1",
    "label2": "label2"
}

# apply preprocess function to dataframe
data = REDatasetHelper(data).create_annotation_column(
    column_map,
    ner_column_name="train_ner_chunks" # optional, default train_ner_chunks
)

New Utility & Helper OCR Modules to Handle Annotations

This modeule can generates an annotated PDF file using input PDF files. style: PDF file proccess style that has 3 options;

  • black_band: Black bands over the chunks detected by NER pipeline.
  • bounding_box: Colorful bounding boxes around the chunks detected by NER pipeline. Each color represents a different NER label.
  • highlight: Colorful highlights over the chunks detected by NER pipeline. Each color represents a different NER label.
  • You can check Spark OCR Utility Module notebook for more examples.

Example:

from sparknlp_jsl.utils.ocr_nlp_processor import  ocr_entity_processor

path='/*.pdf'
        
box = "bounding_box"
ocr_entity_processor(spark=spark,file_path=path,ner_pipeline = nlp_model,chunk_col = "merged_chunk", black_list = ["AGE", "DATA", "PATIENT"],
                    style = box, save_dir = "colored_box",label= True, label_color = "red",color_chart_path = "label_colors.png", display_result=True)

box = "highlight"
ocr_entity_processor(spark=spark,file_path=path, ner_pipeline = nlp_model, chunk_col = "merged_chunk", black_list = ["AGE", "DATE", "PATIENT"],
                    style = box, save_dir = "colored_box", label= True, label_color = "red", color_chart_path = "label_colors.png", display_result=True)

box = "black_band"
ocr_entity_processor(spark=spark,file_path=path, ner_pipeline = nlp_modelchunk_col = "merged_chunk", 
                     style = box, save_dir = "black_band",label= True, label_color = "red", display_result = True)

Results:

  • Bounding box with labels and black list

image

  • Highlight with labels and black_list

image

  • black_band with labels

image

New Utility & Helper NER Log Parser

ner_utils: This new module is used after NER training to calculate mertic chunkbase and plot training logs.

Example:

nerTagger = NerDLApproach()\
              .setInputCols(["sentence", "token", "embeddings"])\
              .setLabelColumn("label")\
              .setOutputCol("ner")\
              ...  
              .setOutputLogsPath('ner_logs')
    
ner_pipeline = Pipeline(stages=[glove_embeddings,
                                graph_builder,
                                nerTagger])

ner_model = ner_pipeline.fit(training_data)

  • evaluate: if verbose, returns overall performance, as well as performance per chunk type; otherwise, simply returns overall precision, recall, f1 scores

Example:

from sparknlp_jsl.utils.ner_utils import evaluate

metrics = evaluate(preds_df['ground_truth'].values, preds_df['prediction'].values)

Result:

processed 14133 tokens with 1758 phrases; found: 1779 phrases; correct: 1475.
accuracy:  83.45%; (non-O)
accuracy:  96.67%; precision:  82.91%; recall:  83.90%; FB1:  83.40
              LOC: precision:  91.41%; recall:  85.69%; FB1:  88.46  524
             MISC: precision:  78.15%; recall:  62.11%; FB1:  69.21  151
              ORG: precision:  61.86%; recall:  74.93%; FB1:  67.77  430
              PER: precision:  90.80%; recall:  93.58%; FB1:  92.17  674
  • loss_plot: Plots the figure of loss vs epochs

Example:

from sparknlp_jsl.utils.ner_utils import loss_plot

loss_plot('./ner_logs/'+log_files[0])

Results:

image

  • get_charts : Plots the figures of metrics ( precision, recall, f1) vs epochs

Example:

from sparknlp_jsl.utils.ner_utils import get_charts

get_charts('./ner_logs/'+log_files[0])

Results:

image

image

Adding Flexibility Chunk Merger Prioritization

orderingFeatures: Array of strings specifying the ordering features to use for overlapping entities. Possible values are ChunkBegin, ChunkLength, ChunkPrecedence, ChunkConfidence

selectionStrategy: Whether to select annotations sequentially based on annotation order Sequential or using any other available strategy, currently only DiverseLonger are available.

defaultConfidence: When ChunkConfidence ordering feature is included and a given annotation does not have any confidence the value of this param will be used.

chunkPrecedence: When ChunkPrecedence ordering feature is used this param contains the comma separated fields in metadata that drive prioritization of overlapping annotations. When used by itself (empty chunkPrecedenceValuePrioritization) annotations will be prioritized based on number of metadata fields present. When used together with chunkPrecedenceValuePrioritization param it will prioritize based on the order of its values.

chunkPrecedenceValuePrioritization: When ChunkPrecedence ordering feature is used this param contains an Array of comma separated values representing the desired order of prioritization for the VALUES in the metadata fields included from chunkPrecedence.

Example:

text = """A 63 years old man presents to the hospital with a history of recurrent infections 
that include cellulitis, pneumonias, and upper respiratory tract infections..."""

+-------------------------------------------------------------------------------------+
|ner_deid_chunk                                                                       |
+-------------------------------------------------------------------------------------+
|[{chunk, 2, 3, 63, {entity -> AGE, sentence -> 0, chunk -> 0, confidence -> 0.9997}}]|
+-------------------------------------------------------------------------------------+

+----------------------------------------------------------------------------------------------------+
|jsl_ner_chunk                                                                                       |     
+----------------------------------------------------------------------------------------------------+
|[{chunk, 2, 13, 63 years old, {entity -> Age, sentence -> 0, chunk -> 0, confidence -> 0.85873336}}]|
+----------------------------------------------------------------------------------------------------+

  • Merging overlapped chunks by considering their lenght
    If we set setOrderingFeatures(["ChunkLength"]) and setSelectionStrategy("DiverseLonger") parameters, the longest chunk will be prioritized in case of overlapping.

Example:

chunk_merger = ChunkMergeApproach()\
    .setInputCols('ner_deid_chunk', "jsl_ner_chunk")\
    .setOutputCol('merged_ner_chunk')\
    .setOrderingFeatures(["ChunkLength"])\
    .setSelectionStrategy("DiverseLonger")

Results:

|begin|end|        chunk|         entity|
+-----+---+-------------+---------------+
|    2| 13| 63 years old|            Age|
|   15| 17|          man|         Gender|
|   35| 42|     hospital|  Clinical_Dept|

  • Merging overlapped chunks by considering custom values that we set
    setChunkPrecedence() parameter contains an Array of comma separated values representing the desired order of prioritization for the VALUES in the metadata fields included from setOrderingFeatures(["chunkPrecedence"]).

Example:

chunk_merger = ChunkMergeApproach()\
    .setInputCols('ner_deid_chunk', "jsl_ner_chunk")\
    .setOutputCol('merged_ner_chunk')\
    .setMergeOverlapping(True) \
    .setOrderingFeatures(["ChunkPrecedence"]) \
    .setChunkPrecedence('ner_deid_chunk,AGE') \
#    .setChunkPrecedenceValuePrioritization(["ner_deid_chunk,AGE", "jsl_ner_chunk,Age"]) 

Results:

|begin|end|        chunk|         entity|
+-----+---+-------------+---------------+
|    2|  3|           63|            AGE|
|   15| 17|          man|         Gender|
|   35| 42|     hospital|  Clinical_Dept|

You can check NER Chunk Merger notebook for more examples.

Core improvements and bug fixes

  • AssertionDL IncludeConfidence() parameters default value set by True
  • Fixed NaN outputs in RelationExtraction
  • Fixed loadSavedModel method that we use for importing transformers into Spark NLP
  • Fixed replacer with setUseReplacement(True) parameter
  • Added overall confidence score to MedicalNerModel when setIncludeAllConfidenceScore is True
  • Fixed in InternalResourceDownloader showAvailableAnnotators

New and Updated Notebooks

3 New Clinical Models and Pipelines Added & Updated in Total

  • kegg_disease_mapper
  • kegg_drug_mapper
  • abbreviation_category_mapper

For all Spark NLP for healthcare models, please check: Models Hub Page

Versions

Last updated