Healthcare NLP v5.1.2 Release Notes

 

5.1.2

Highlights

We are delighted to announce remarkable enhancements and updates in our latest release of Spark NLP for Healthcare. This release comes with a new parameter to AssertionDLModel for renaming assertion labels, a new parameter to DeIdentification for obfuscating ages based on the HIPAA Privacy Rule, enhanced cloud support for ContextualParser, as well as 22 new clinical pretrained models and pipelines.

  • 13 new augmented NER models by leveraging the capabilities of the LangTest library to boost their robustness significantly
  • 3 new clinical NER models for extracting clinical entities in the Arabic, Finnish, and Bulgarian languages
  • 3 new multi-label text classification models for respiratory disease, heart disease, and mental disorder
  • 3 new ChunkMapper models to map UMLS codes to their MeSH, SNOMED, and ICD-10-CM codes
  • A new parameter to AssertionDLModel to rename the assertion labels
  • A new parameter to DeIdentification to obfuscate ages based on HIPAA (Health Insurance Portability and Accountability Act) Privacy Rule
  • Enhanced cloud support for ContextualParser with setJsonPath and setDictionary parameters
  • New features to the SparkNLP Healthcare CMS-HCC Risk-Adjustment module
  • New functionalities for the OCR Utility Module
  • Various core improvements; bug fixes, enhanced overall robustness and reliability of SparkNLP for Healthcare
    • The issue with pretrained models that included the ChunkConverter() when loading them locally using from_disk() has been resolved.
    • The incorrect exception message in ocr_entity_processor() has been corrected
    • Resolved the problem with day-shifting in DeIdentification, specifically related to masking <AGE> when using the setAgeRange feature.
  • New and updated demos
  • The addition and update of numerous new clinical models and pipelines continue to reinforce our offering in the healthcare domain

These enhancements will elevate your experience with Spark NLP for Healthcare, enabling more efficient, accurate, and streamlined healthcare-related natural language data analysis.

13 New Augmented NER Models by Leveraging the Capabilities of the LangTest Library to Boost Their Robustness Significantly

Model Name | Predicted Entities
ner_biomarker_langtest | Oncogenes, Tumor_Finding, ResponseToTreatment, Biomarker, HormonalTherapy, Staging, Drug, CancerDx, Radiotherapy, CancerSurgery, TargetedTherapy, CancerModifier, Biomarker_Measurement, Metastasis, Chemotherapy, Test, Dosage, Test_Result, Immunotherapy
ner_bionlp_langtest | Amino_acid, Anatomical_system, Cancer, Cell, Cellular_component, Developing_anatomical_Structure, Gene_or_gene_product, Immaterial_anatomical_entity, Multi-tissue_structure, Organ, Organism, Organism_subdivision, Simple_chemical, Tissue
ner_clinical_large_langtest | PROBLEM, TEST, TREATMENT
ner_living_species_langtest | HUMAN, SPECIES
ner_vop_langtest | Gender, Employment, Age, Substance, Form, PsychologicalCondition, Vaccine, Drug, DateTime, ClinicalDept, Test, AdmissionDischarge, Disease, Dosage, Duration, RelationshipStatus, Symptom, Procedure, HealthStatus, InjuryOrPoisoning, Modifier, Treatment
ner_chemprot_clinical_langtest | CHEMICAL, GENE-Y, GENE-N
ner_bacterial_species_langtest | SPECIES
ner_cellular_langtest | DNA, Cell_type, Cell_line, RNA, Protein
ner_deid_enriched_langtest | AGE, CITY, COUNTRY, DATE, DOCTOR, HOSPITAL, IDNUM, MEDICALRECORD, ORGANIZATION, PATIENT, PHONE, PROFESSION, STATE, STREET, USERNAME, ZIP
ner_deid_large_langtest | AGE, CONTACT, DATE, ID, LOCATION, NAME, PROFESSION
ner_diseases_langtest | Disease
ner_oncology_langtest | Staging, Cancer_Score, Tumor_Finding, Site_Lymph_Node, Response_To_Treatment, Smoking_Status, Tumor_Size, Cycle_Count, Adenopathy, Biomarker_Result, Chemotherapy, Cancer_Surgery, Line_Of_Therapy, Pathology_Result, Hormonal_Therapy, Biomarker, Immunotherapy, Metastasis, Cancer_Dx, Grade
ner_deid_generic_augmented_allUpperCased_langtest | DATE, NAME, LOCATION, PROFESSION, CONTACT, AGE, ID
  • The table below shows the overall robustness test results for the 13 models, before and after augmentation.
Model Name | Original Robustness | New Robustness
ner_biomarker_langtest | 45.49% | 78.84%
ner_bionlp_langtest | 49.56% | 76.70%
ner_clinical_large_langtest | 50.66% | 77.64%
ner_living_species_langtest | 68.42% | 90.86%
ner_vop_langtest | 50.74% | 78.21%
ner_chemprot_clinical_langtest | 49.53% | 81.74%
ner_bacterial_species_langtest | 76.04% | 90.65%
ner_cellular_langtest | 33.74% | 77.52%
ner_deid_enriched_langtest | 94.78% | 97.48%
ner_deid_large_langtest | 86.98% | 95.22%
ner_diseases_langtest | 60.01% | 86.03%
ner_oncology_langtest | 52.13% | 79.73%
ner_deid_generic_augmented_allUpperCased_langtest | 94.73% | 97.60%
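
To make the size of each improvement easier to compare, the figures in the table above can be turned into percentage-point gains. The snippet below is only an illustrative calculation on the published numbers; the names ROBUSTNESS and robustness_gains are hypothetical and not part of any library.

```python
# Robustness scores (percent) copied from the table above: (original, new).
ROBUSTNESS = {
    "ner_biomarker_langtest": (45.49, 78.84),
    "ner_bionlp_langtest": (49.56, 76.70),
    "ner_clinical_large_langtest": (50.66, 77.64),
    "ner_living_species_langtest": (68.42, 90.86),
    "ner_vop_langtest": (50.74, 78.21),
    "ner_chemprot_clinical_langtest": (49.53, 81.74),
    "ner_bacterial_species_langtest": (76.04, 90.65),
    "ner_cellular_langtest": (33.74, 77.52),
    "ner_deid_enriched_langtest": (94.78, 97.48),
    "ner_deid_large_langtest": (86.98, 95.22),
    "ner_diseases_langtest": (60.01, 86.03),
    "ner_oncology_langtest": (52.13, 79.73),
    "ner_deid_generic_augmented_allUpperCased_langtest": (94.73, 97.60),
}


def robustness_gains(scores):
    """Return {model: gain in percentage points}, largest gain first."""
    gains = {name: round(new - old, 2) for name, (old, new) in scores.items()}
    return dict(sorted(gains.items(), key=lambda item: item[1], reverse=True))


gains = robustness_gains(ROBUSTNESS)
for name, gain in gains.items():
    print(f"{name:<52} +{gain:.2f} pp")
```

On these numbers the largest gain belongs to ner_cellular_langtest (33.74% to 77.52%), while the de-identification models, which already scored above 86%, move the least.
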

3 New Clinical NER Models for Extracting Clinical Entities in the Arabic, Finnish, and Bulgarian Languages

3 new Clinical NER models provide valuable tools for processing and analyzing multi-language clinical texts. They assist in automating the extraction of important clinical information, facilitating research, medical documentation, and other applications within the multi-language healthcare domain.

Model Name | Predicted Entities | Language
ner_clinical | PROBLEM, TEST, TREATMENT | ar
ner_clinical | PROBLEM, TEST, TREATMENT | bg
ner_clinical | PROBLEM, TEST, TREATMENT | fi

Example:

embeddings = WordEmbeddingsModel.pretrained("arabic_w2v_cc_300d", "ar")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model = MedicalNerModel.pretrained("ner_clinical", "ar", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

sample_text = """تاريخ الزيارة: 25 سبتمبر 2023 - المريضة: فاطمة علي - العمر: 48 سنة - الجنس: أنثى - المشاكل: 1. مشكل ارتفاع في مستويات في الدم 2. ارتفاع مستوى الكوليستيرول في الدم 3. اضطراب في وظائف الغدة الدرقية - الفحوصات: 1. قياس مستوى السكر في الدم 2. تحليل مستوى الكوليستيرول في الدم 3. اختبار وظائف الغدة الدرقية - العلاجات: 1. وصف دواء لمراقبة وتنظيم مستوى السكر في الدم (ميتفورمين 500 ملغ يوميا) 2. وصف دواء لتخفيض مستوى الكوليستيرول (ستاتين 20 ملغ يوميا) 3. وصف العلاج اللازم لتحسين وظائف الغدة الدرقية (ليفوتيروكسين 50 ميكروغرام يوميا)، بالإضافة إلى توجيهات بشأن نمط حياة صحي تتضمن اتباع نظام غذائي مناسب وممارسة الرياضة بانتظام."""

Result:

chunk | ner_label
مشكل ارتفاع في مستويات في الدم | PROBLEM
ارتفاع مستوى الكوليستيرول في الدم | PROBLEM
اضطراب في وظائف الغدة الدرقية | PROBLEM
قياس مستوى السكر في الدم | TEST
تحليل مستوى الكوليستيرول في الدم | TEST
اختبار وظائف الغدة الدرقية | TEST
ميتفورمين | TREATMENT
ليفوتيروكسين | TREATMENT
نظام غذائي مناسب | TREATMENT

3 New Multi-Label Text Classification Models for Respiratory Disease, Heart Disease, and Mental Disorder

The PHS-BERT Respiratory Disease Classifier Model is a specialized text classification system, engineered to accurately identify and categorize textual mentions of four prominent respiratory diseases: Asthma, Chronic Obstructive Pulmonary Disease (COPD), Emphysema, and Chronic bronchitis.

Model | Description | Predicted Labels
multiclassifierdl_respiratory_disease | This model identifies and categorizes textual mentions of four prominent respiratory diseases: Asthma, Chronic Obstructive Pulmonary Disease (COPD), Emphysema, and Chronic bronchitis. | Asthma, COPD, Emphysema, Chronic bronchitis, Other/Unknown, No
multiclassifierdl_heart_disease | This model identifies and categorizes textual mentions of three prominent cardiovascular diseases: Hypertension, Coronary Artery Disease, and Myocardial Infarction. | Hypertension, MI, CAD, Other/Unknown, No
multiclassifierdl_mental_disorder | This model classifies text based on the following mental disorders: Schizophrenia, Depression, Bipolar disorder, and Anxiety disorder. | Anxiety Disorder, No, Schizophrenia, Depression, Other/Unknown

Example:

multiclassifierdl = MultiClassifierDLModel.pretrained("multiclassifierdl_respiratory_disease", "en", "clinical/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("predicted_class")

text = [
    ["""The patient takes inhalers for COPD management, weight loss medications, and disease-modifying antirheumatic drugs (DMARDs) for rheumatoid arthritis."""],
    ["""The patient was on Metformin for DM2, mood stabilizers for Bipolar II Disorder, and inhaled corticosteroids for Asthma."""],
    ["""The patient was diagnosed with Chronic Bronchitis after a series of pulmonary function tests."""],
    ["""Chest CT imaging revealed significant bullae and airspace enlargement, consistent with a diagnosis of emphysema."""],
]

Result:

+----------------------------------------------------------------------------------------------------+--------------------+
|                                                                                                text|              result|
+----------------------------------------------------------------------------------------------------+--------------------+
|The patient takes inhalers for COPD management, weight loss medications, and disease-modifying an...|              [COPD]|
|The patient was on Metformin for DM2, mood stabilizers for Bipolar II Disorder, and inhaled corti...|            [Asthma]|
|       The patient was diagnosed with Chronic Bronchitis after a series of pulmonary function tests.|[Chronic bronchitis]|
|Chest CT imaging revealed significant bullae and airspace enlargement, consistent with a diagnosi...|         [Emphysema]|
+----------------------------------------------------------------------------------------------------+--------------------+

Please see the Respiratory Disease, Heart Disease, and Mental Disorder demos.

3 New ChunkMapper Models to Map UMLS Codes to Their MeSH, SNOMED, and ICD-10-CM Codes

We have introduced 3 new ChunkMapper models to map UMLS codes to their MeSH, SNOMED, and ICD-10-CM codes.

Model Name | Relations | Description
umls_mesh_mapper | mesh_code | Maps UMLS codes to MeSH codes
umls_snomed_mapper | snomed_code | Maps UMLS codes to SNOMED codes
umls_icd10cm_mapper | icd10cm_code | Maps UMLS codes to ICD-10-CM codes

Example:

chunkerMapper = DocMapperModel.pretrained("umls_icd10cm_mapper", "en", "clinical/models")\
      .setInputCols(["document"])\
      .setOutputCol("mappings")\
      .setRels(["icd10cm_code"])

text= """["C0000744"], ["C2875181"]"""

Result:

+---------+------------+------------+
|umls_code|icd10cm_code|relation    |
+---------+------------+------------+
|C0000744 |E786        |icd10cm_code|
|C2875181 |G4381       |icd10cm_code|
+---------+------------+------------+

New Parameter to AssertionDLModel to Rename the Assertion Labels

We have introduced a new parameter, ReplaceLabels, to AssertionDLModel. It is set through setReplaceLabels(dict[str, str]) and lets users rename the assertion labels returned by the model.

Example:

clinical_assertion = AssertionDLModel.pretrained("assertion_dl_large", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion") \
    .setReplaceLabels({"PRESENT": "available", "absent": "none", "Conditional": "Optional"})

text= """Patient with severe fever and sore throat. He shows no stomach pain and he maintained on an epidural and PCA
for pain control. He also became short of breath with climbing a flight of stairs."""

Result:

chunks | entities | assertion
severe fever | PROBLEM | available
sore throat | PROBLEM | available
stomach pain | PROBLEM | none
an epidural | TREATMENT | available
PCA | TREATMENT | available
pain control | PROBLEM | hypothetical
short of breath | PROBLEM | Optional
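
The effect of the mapping on the result above can be pictured with a small plain-Python sketch. It is not the library's implementation: the rename_labels helper is hypothetical, and the case-insensitive key matching is an assumption inferred from the example, where the keys "PRESENT", "absent", and "Conditional" are written in different cases yet all three are applied.

```python
def rename_labels(labels, replacements):
    """Rename assertion labels; labels missing from the mapping are kept.

    Assumption: keys are matched case-insensitively.
    """
    lookup = {key.lower(): value for key, value in replacements.items()}
    return [lookup.get(label.lower(), label) for label in labels]


replacements = {"PRESENT": "available", "absent": "none", "Conditional": "Optional"}

# Hypothetical raw model labels, written in lowercase for illustration.
raw_labels = ["present", "absent", "hypothetical", "conditional"]

renamed = rename_labels(raw_labels, replacements)
print(renamed)
```

Labels without an entry in the dictionary, such as hypothetical in the result above, pass through unchanged.
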

Please see the Clinical Assertion Notebook for more information.

A New Parameter to DeIdentification to Obfuscate Ages Based on HIPAA (Health Insurance Portability and Accountability Act) Privacy Rule

We have introduced a new parameter, AgeRangesByHipaa, which determines whether ages are obfuscated in compliance with the HIPAA (Health Insurance Portability and Accountability Act) Privacy Rule.

The HIPAA Privacy Rule mandates that the ages of patients older than 90 years must be obfuscated, while the ages of patients 90 years or younger can remain unchanged. If the parameter is set to True, age entities greater than 90 are obfuscated as per the HIPAA Privacy Rule, and all others remain unchanged. If it is set to False, the AgeRanges parameter is used for obfuscation. The default value of AgeRangesByHipaa is False.

Example:

obfuscation = DeIdentification()\
    .setInputCols(["sentence", "token", "age_chunk"]) \
    .setOutputCol("obfuscation") \
    .setMode("obfuscate")\
    .setObfuscateDate(True)\
    .setObfuscateRefSource("faker") \
    .setAgeRangesByHipaa(True)

dates = [
  '1 year old baby',
  '4 year old kids',
  'Record date: 2093-01-13, Age: 25',
  'A 92 year old female with',
  'Patient is 108 years-old',
  'He is 120 years-old male',
]

Result:

text | age_chunk | obfuscation
1 year old baby | 1 | 1 year old baby
4 year old kids | 4 | 4 year old kids
Record date: 2093-01-13, Age: 25 | 25 | Record date: 2093-03-10, Age: 25
A 92 year old female with | 92 | A 99 year old female with
Patient is 108 years-old | 108 | Patient is 119 years-old
He is 120 years-old male | 120 | He is 140 years-old male
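
The rows above are consistent with the rule described earlier in this section: with AgeRangesByHipaa enabled, only ages above 90 are replaced. The check can be sketched in plain Python as follows. Note that should_obfuscate_age is a hypothetical helper for illustration, not a library function, and it does not model how the replacement age is chosen.

```python
HIPAA_AGE_THRESHOLD = 90  # threshold as stated in these release notes


def should_obfuscate_age(age, age_ranges_by_hipaa=True):
    """Return True if the age would be replaced under the HIPAA-based option.

    When the option is off, the decision is left to the AgeRanges setting,
    which this sketch does not model, so None is returned.
    """
    if not age_ranges_by_hipaa:
        return None
    return age > HIPAA_AGE_THRESHOLD


ages = [1, 4, 25, 92, 108, 120]  # ages from the example above
decisions = {age: should_obfuscate_age(age) for age in ages}
print(decisions)
```
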

Please see the Clinical DeIdentification Notebook for more information.

Enhanced Cloud Support for ContextualParser with setJsonPath and setDictionary

ContextualParserApproach now offers extended cloud support for path configuration: setJsonPath and setDictionary accept cloud storage paths, so rule definitions and dictionaries can be loaded directly from cloud-based resources. The example below reads both from an S3 bucket.

Example:

contextual_parser = ContextualParserApproach() \
    .setInputCols(["sentence", "token"])\
    .setOutputCol("entity")\
    .setCaseSensitive(True)\
    .setJsonPath("s3a://YOUR_S3_BUCKET/data/cities.json")\
    .setDictionary("s3a://YOUR_S3_BUCKET/data/cities.tsv", 
                   read_as = "SPARK",
                   options={"orientation": "vertical",
                            "format": "text"})

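The example above only references the two files. As a rough sketch of what they might contain, the snippet below writes a small rule file and a one-term-per-line dictionary to a local temporary directory, ready to be uploaded to a bucket. The rule field values, the city names, and the local paths are illustrative assumptions; check the ContextualParser documentation for the full set of supported rule fields.

```python
import json
import tempfile
from pathlib import Path

# Assumed rule definition for a "City" entity (values are illustrative).
rule = {
    "entity": "City",
    "ruleScope": "document",
    "matchScope": "sub-token",
    "completeMatchRegex": "false",
}

# With a "vertical" orientation, each line holds a single dictionary term.
cities = ["Istanbul", "London", "Chicago"]

out_dir = Path(tempfile.mkdtemp())
json_path = out_dir / "cities.json"
tsv_path = out_dir / "cities.tsv"

json_path.write_text(json.dumps(rule, indent=2), encoding="utf-8")
tsv_path.write_text("\n".join(cities) + "\n", encoding="utf-8")

print(json_path.read_text(encoding="utf-8"))
print(tsv_path.read_text(encoding="utf-8"))
```
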
New Features to the SparkNLP Healthcare CMS-HCC Risk-Adjustment Module

We are introducing 3 new functions to the SparkNLP Healthcare CMS-HCC Risk-Adjustment Module.

  • HCC_from_ICD: Maps ICD-10 codes to their corresponding RXHCC or HCC codes. Given a list of International Classification of Diseases, 10th Revision (ICD-10) codes and a measurement year, it retrieves the associated Risk Adjustment Hierarchical Condition Categories (RXHCC) or Hierarchical Condition Categories (HCC).

Example:

from sparknlp_jsl.utils.risk_adjustment_utils import RiskAdjustmentUtil

RiskAdjustmentUtil.HCC_from_ICD("hcc", "ESRDv21", 2019, ["A021", "I209", "E103559"])

OUTPUT: {'A021': ['HCC2'], 'I209': ['HCC88'], 'E103559': ['HCC18', 'HCC122']}
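
Because a single ICD-10 code can map to more than one category (E103559 in the output above), it can be handy to collect the distinct categories across all codes. A minimal sketch of that post-processing step, using the output shown above as input; unique_hccs is a hypothetical helper, not part of RiskAdjustmentUtil.

```python
def unique_hccs(mapping):
    """Collect the distinct HCC codes from an {icd_code: [hcc, ...]} mapping."""
    seen = []
    for hccs in mapping.values():
        for hcc in hccs:
            if hcc not in seen:
                seen.append(hcc)  # keep first-seen order
    return seen


# Output copied from the example above.
icd_to_hcc = {"A021": ["HCC2"], "I209": ["HCC88"], "E103559": ["HCC18", "HCC122"]}

hccs = unique_hccs(icd_to_hcc)
print(hccs)
```
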
  • HCC_labels: Maps RXHCCs and HCCs to their respective medical descriptions. Given a list of Risk Adjustment Hierarchical Condition Categories (RXHCCs) and Hierarchical Condition Categories (HCCs) and a measurement year, it retrieves the corresponding medical descriptions.

Example:

from sparknlp_jsl.utils.risk_adjustment_utils import RiskAdjustmentUtil

RiskAdjustmentUtil.HCC_labels("hcc", "24", 2021, ["HCC1", "HCC37", "HCC321"])

OUTPUT: {'HCC1': 'HIV/AIDS'}
  • diff_between_HCCs: Calculates the difference between two lists of Hierarchical Condition Categories (HCCs) or Risk Adjustment Hierarchical Condition Categories (RXHCCs) for a specific measurement year. It identifies the HCCs that were added, deleted, or upgraded between the “before_HCC_list” and “after_HCC_list” states. HCCs and RXHCCs evolve over time: new conditions emerge with age, causing CCs to be added and removed, and some CCs escalate to a higher level of severity, which counts as an upgrade.

Example:

from sparknlp_jsl.utils.risk_adjustment_utils import RiskAdjustmentUtil

RiskAdjustmentUtil.diff_between_HCCs("rxhcc", "08", 2023, ["RXHCC77", "RXHCC262"], ["RXHCC1", "RXHCC78", "RXHCC261"])

OUTPUT: {'added_list': ['RXHCC1', 'RXHCC78', 'RXHCC261'], 'deleted_list': ['RXHCC77']}

Please see Calculate Medicare Risk Adjustment Score Notebook for more information

New Functionalities for the OCR Utility Module

The OCR Utility Module now lets you insert custom text onto colored bands using the text_band parameter, set a specific color with an RGB tuple through the outline_color parameter, and adjust the outline width of bounding boxes using the outline_width parameter.

  • Introduced the option to configure a desired specific color using RGB tuples with the outline_color parameter for the bounding box style.

Example:

path='content/*.pdf'
box = "bounding_box"

ocr_entity_processor(spark=spark,
                    file_path = path,
                    ner_pipeline = nlp_model,
                    chunk_col = "merged_chunk",
                    ...
                    label_color = "blue",
                    display_result = True,
                    outline_color = (155,0,0)) ##Takes Tuple
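
Colors are often kept as hex strings in style configurations, while outline_color takes an RGB tuple. The helper below is a hypothetical convenience function (not part of the OCR utility module) that converts one form to the other.

```python
def hex_to_rgb(hex_color):
    """Convert a '#RRGGBB' string into an (R, G, B) tuple of ints."""
    value = hex_color.lstrip("#")
    if len(value) != 6:
        raise ValueError(f"expected a 6-digit hex color, got {hex_color!r}")
    # Parse each pair of hex digits as one color channel.
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


outline_color = hex_to_rgb("#9B0000")
print(outline_color)  # (155, 0, 0), the tuple used in the example above
```
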
  • Users can now adjust the outline width of the bounding box style using the outline_width parameter.

Example:

path='content/*.pdf'
box = "highlight"
ocr_entity_processor(spark=spark,
                    file_path = path,
                    ner_pipeline = nlp_model,
                    style = box,
                    ...
                    label_color = "red",
                    outline_width = 6 )  ## Width of the outline

Please see Spark OCR Utility Module Notebook for more information

Various Core Improvements; Bug Fixes, Enhanced Overall Robustness and Reliability of SparkNLP for Healthcare

  • The issue with pretrained models that included the ChunkConverter() when loading them locally using from_disk() has been resolved.
  • The incorrect exception message in ocr_entity_processor() has been corrected
  • Resolved the problem with day-shifting in DeIdentification, specifically related to masking <AGE> when using the setAgeRange feature.

Updated Notebooks and Demonstrations for Making Spark NLP for Healthcare Easier to Navigate and Understand

We Have Added and Updated a Substantial Number of New Clinical Models and Pipelines, Further Solidifying Our Offering in the Healthcare Domain.

  • ner_clinical -> ar
  • ner_clinical -> bg
  • ner_clinical -> fi
  • multiclassifierdl_respiratory_disease
  • multiclassifierdl_mental_disorder
  • multiclassifierdl_heart_disease
  • ner_biomarker_langtest
  • ner_bionlp_langtest
  • ner_clinical_large_langtest
  • ner_living_species_langtest
  • ner_cellular_langtest
  • ner_vop_langtest
  • ner_chemprot_clinical_langtest
  • ner_bacterial_species_langtest
  • ner_deid_enriched_langtest
  • ner_deid_large_langtest
  • ner_diseases_langtest
  • ner_oncology_langtest
  • ner_deid_generic_augmented_allUpperCased_langtest
  • umls_mesh_mapper
  • umls_snomed_mapper
  • umls_icd10cm_mapper

For all Spark NLP for Healthcare models, please check: Models Hub Page
