5.1.2
Highlights
We are delighted to announce remarkable enhancements and updates in our latest release of Spark NLP for Healthcare. This release comes with a new parameter to AssertionDLModel for renaming assertion labels, a new parameter to DeIdentification for obfuscating ages based on the HIPAA Privacy Rule, enhanced cloud support for ContextualParser, as well as 22 new clinical pretrained models and pipelines.
- 13 new augmented NER models by leveraging the capabilities of the `LangTest` library to boost their robustness significantly
- 3 new clinical NER models for extracting clinical entities in the Arabic, Finnish, and Bulgarian languages
- 3 new multi-label text classification models for respiratory disease, heart disease, and mental disorder
- 3 new `ChunkMapper` models to map UMLS codes to their MeSH, SNOMED, and ICD-10-CM codes
- A new parameter to `AssertionDLModel` to rename the assertion labels
- A new parameter to `DeIdentification` to obfuscate ages based on the HIPAA (Health Insurance Portability and Accountability Act) Privacy Rule
- Enhanced cloud support for `ContextualParser` with the `setJsonPath` and `setDictionary` parameters
- New features to the SparkNLP Healthcare CMS-HCC Risk-Adjustment module
- New functionalities for the OCR Utility Module
- Various core improvements; bug fixes, enhanced overall robustness and reliability of SparkNLP for Healthcare
    - The issue with pretrained models that included the `ChunkConverter()` when loading them locally using `from_disk()` has been resolved.
    - The incorrect exception message in `ocr_entity_processor()` has been corrected.
    - Resolved the problem with day-shifting in `DeIdentification`, specifically related to masking <AGE> when using the setAgeRange feature.
- New and updated demos
    - New LANGTEST NER for the most popular clinical NER models
    - New Date Shifting and Date Normalization Demo demonstrates the most popular deidentification date operation
    - New Respiratory Disease Demo with new `multiclassifierdl_respiratory_disease` model
    - New Mental Disorder demo with new `multiclassifierdl_mental_disorder` model
    - New Heart Disease demo with new `multiclassifierdl_heart_disease` model
- The addition and update of numerous new clinical models and pipelines continue to reinforce our offering in the healthcare domain
These enhancements will elevate your experience with Spark NLP for Healthcare, enabling more efficient, accurate, and streamlined healthcare-related natural language data analysis.
13 New Augmented NER Models by Leveraging the Capabilities of the LangTest Library to Boost Their Robustness Significantly
Model Name | Predicted Entities |
---|---|
ner_biomarker_langtest | Oncogenes, Tumor_Finding, ResponseToTreatment, Biomarker, HormonalTherapy, Staging, Drug, CancerDx, Radiotherapy, CancerSurgery, TargetedTherapy, CancerModifier, Biomarker_Measurement, Metastasis, Chemotherapy, Test, Dosage, Test_Result, Immunotherapy … |
ner_bionlp_langtest | Amino_acid, Anatomical_system, Cancer, Cell, Cellular_component, Developing_anatomical_Structure, Gene_or_gene_product, Immaterial_anatomical_entity, Multi-tissue_structure, Organ, Organism, Organism_subdivision, Simple_chemical, Tissue |
ner_clinical_large_langtest | PROBLEM, TEST, TREATMENT |
ner_living_species_langtest | HUMAN, SPECIES |
ner_vop_langtest | Gender, Employment, Age, Substance, Form, PsychologicalCondition, Vaccine, Drug, DateTime, ClinicalDept, Test, AdmissionDischarge, Disease, Dosage, Duration, RelationshipStatus, Symptom, Procedure, HealthStatus, InjuryOrPoisoning, Modifier, Treatment |
ner_chemprot_clinical_langtest | CHEMICAL, GENE-Y, GENE-N |
ner_bacterial_species_langtest | SPECIES |
ner_cellular_langtest | DNA, Cell_type, Cell_line, RNA, Protein |
ner_deid_enriched_langtest | AGE, CITY, COUNTRY, DATE, DOCTOR, HOSPITAL, IDNUM, MEDICALRECORD, ORGANIZATION, PATIENT, PHONE, PROFESSION, STATE, STREET, USERNAME, ZIP |
ner_deid_large_langtest | AGE, CONTACT, DATE, ID, LOCATION, NAME, PROFESSION |
ner_diseases_langtest | Disease |
ner_oncology_langtest | Staging, Cancer_Score, Tumor_Finding, Site_Lymph_Node, Response_To_Treatment, Smoking_Status, Tumor_Size, Cycle_Count, Adenopathy, Biomarker_Result, Chemotherapy, Cancer_Surgery, Line_Of_Therapy, Pathology_Result, Hormonal_Therapy, Biomarker, Immunotherapy, Metastasis, Cancer_Dx, Grade … |
ner_deid_generic_augmented_allUpperCased_langtest | DATE, NAME, LOCATION, PROFESSION, CONTACT, AGE, ID |
- The table below shows the overall robustness test results for the 13 models, before and after augmentation.
Model Name | Original Robustness | New Robustness |
---|---|---|
ner_biomarker_langtest | 45.49% | 78.84% |
ner_bionlp_langtest | 49.56% | 76.70% |
ner_clinical_large_langtest | 50.66% | 77.64% |
ner_living_species_langtest | 68.42% | 90.86% |
ner_vop_langtest | 50.74% | 78.21% |
ner_chemprot_clinical_langtest | 49.53% | 81.74% |
ner_bacterial_species_langtest | 76.04% | 90.65% |
ner_cellular_langtest | 33.74% | 77.52% |
ner_deid_enriched_langtest | 94.78% | 97.48% |
ner_deid_large_langtest | 86.98% | 95.22% |
ner_diseases_langtest | 60.01% | 86.03% |
ner_oncology_langtest | 52.13% | 79.73% |
ner_deid_generic_augmented_allUpperCased_langtest | 94.73% | 97.60% |
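To make the gains in the table easier to compare, the following minimal sketch computes the absolute improvement in percentage points for each model. The scores are copied from the table above; the `ROBUSTNESS` dictionary and the `robustness_gains` helper are illustrative names only and are not part of the library.

```python
# Robustness scores in percent, copied from the table above: (original, new).
ROBUSTNESS = {
    "ner_biomarker_langtest": (45.49, 78.84),
    "ner_bionlp_langtest": (49.56, 76.70),
    "ner_clinical_large_langtest": (50.66, 77.64),
    "ner_living_species_langtest": (68.42, 90.86),
    "ner_vop_langtest": (50.74, 78.21),
    "ner_chemprot_clinical_langtest": (49.53, 81.74),
    "ner_bacterial_species_langtest": (76.04, 90.65),
    "ner_cellular_langtest": (33.74, 77.52),
    "ner_deid_enriched_langtest": (94.78, 97.48),
    "ner_deid_large_langtest": (86.98, 95.22),
    "ner_diseases_langtest": (60.01, 86.03),
    "ner_oncology_langtest": (52.13, 79.73),
    "ner_deid_generic_augmented_allUpperCased_langtest": (94.73, 97.60),
}


def robustness_gains(scores):
    """Return {model: gain in percentage points}, rounded to two decimals."""
    return {name: round(new - old, 2) for name, (old, new) in scores.items()}


gains = robustness_gains(ROBUSTNESS)
best = max(gains, key=gains.get)    # model with the largest improvement
worst = min(gains, key=gains.get)   # model with the smallest improvement
print(f"largest gain:  {best} (+{gains[best]:.2f} points)")
print(f"smallest gain: {worst} (+{gains[worst]:.2f} points)")
```

The models that started with the lowest robustness gain the most, while the de-identification models, which were already above 86%, have less room to improve.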
3 New Clinical NER Models for Extracting Clinical Entities in the Arabic, Finnish, and Bulgarian Languages
3 new Clinical NER models provide valuable tools for processing and analyzing multi-language clinical texts. They assist in automating the extraction of important clinical information, facilitating research, medical documentation, and other applications within the multi-language healthcare domain.
Model Name | Predicted Entities | Language |
---|---|---|
ner_clinical | PROBLEM, TEST, TREATMENT | ar |
ner_clinical | PROBLEM, TEST, TREATMENT | bg |
ner_clinical | PROBLEM, TEST, TREATMENT | fi |
Example:
embeddings = WordEmbeddingsModel.pretrained("arabic_w2v_cc_300d", "ar")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
ner_model = MedicalNerModel.pretrained("ner_clinical", "ar", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
sample_text = """تاريخ الزيارة: 25 سبتمبر 2023 - المريضة: فاطمة علي - العمر: 48 سنة - الجنس: أنثى - المشاكل: 1. مشكل ارتفاع في مستويات في الدم 2. ارتفاع مستوى الكوليستيرول في الدم 3. اضطراب في وظائف الغدة الدرقية - الفحوصات: 1. قياس مستوى السكر في الدم 2. تحليل مستوى الكوليستيرول في الدم 3. اختبار وظائف الغدة الدرقية - العلاجات: 1. وصف دواء لمراقبة وتنظيم مستوى السكر في الدم (ميتفورمين 500 ملغ يوميا) 2. وصف دواء لتخفيض مستوى الكوليستيرول (ستاتين 20 ملغ يوميا) 3. وصف العلاج اللازم لتحسين وظائف الغدة الدرقية (ليفوتيروكسين 50 ميكروغرام يوميا)، بالإضافة إلى توجيهات بشأن نمط حياة صحي تتضمن اتباع نظام غذائي مناسب وممارسة الرياضة بانتظام."""
Result:
chunk | ner_label |
---|---|
مشكل ارتفاع في مستويات في الدم | PROBLEM |
ارتفاع مستوى الكوليستيرول في الدم | PROBLEM |
اضطراب في وظائف الغدة الدرقية | PROBLEM |
قياس مستوى السكر في الدم | TEST |
تحليل مستوى الكوليستيرول في الدم | TEST |
اختبار وظائف الغدة الدرقية | TEST |
ميتفورمين | TREATMENT |
ليفوتيروكسين | TREATMENT |
نظام غذائي مناسب | TREATMENT |
3 New Multi-Label Text Classification Models for Respiratory Disease, Heart Disease, and Mental Disorder
The PHS-BERT Respiratory Disease Classifier Model is a specialized text classification system, engineered to accurately identify and categorize textual mentions of four prominent respiratory diseases: Asthma, Chronic Obstructive Pulmonary Disease (COPD), Emphysema, and Chronic bronchitis. The heart disease and mental disorder models follow the same approach for their respective conditions.
Model | Description | Predicted Labels |
---|---|---|
multiclassifierdl_respiratory_disease | This model identifies and categorizes textual mentions of four prominent respiratory diseases: Asthma, Chronic Obstructive Pulmonary Disease (COPD), Emphysema, and Chronic bronchitis. | Asthma, COPD, Emphysema, Chronic bronchitis, Other/Unknown, No |
multiclassifierdl_heart_disease | This model identifies and categorizes textual mentions of three prominent cardiovascular diseases: Hypertension, Coronary Artery Disease, and Myocardial Infarction. | Hypertension, MI, CAD, Other/Unknown, No |
multiclassifierdl_mental_disorder | This model classifies text based on the following mental disorders: Schizophrenia, Depression, Bipolar disorder, and Anxiety disorder. | Anxiety Disorder, No, Schizophrenia, Depression, Other/Unknown |
Example:
multiclassifierdl = MultiClassifierDLModel.pretrained("multiclassifierdl_respiratory_disease", "en", "clinical/models")\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("predicted_class")
text = [
["""The patient takes inhalers for COPD management, weight loss medications, and disease-modifying antirheumatic drugs (DMARDs) for rheumatoid arthritis."""],
["""The patient was on Metformin for DM2, mood stabilizers for Bipolar II Disorder, and inhaled corticosteroids for Asthma."""],
["""The patient was diagnosed with Chronic Bronchitis after a series of pulmonary function tests."""],
["""Chest CT imaging revealed significant bullae and airspace enlargement, consistent with a diagnosis of emphysema."""],
]
Result:
+----------------------------------------------------------------------------------------------------+--------------------+
| text| result|
+----------------------------------------------------------------------------------------------------+--------------------+
|The patient takes inhalers for COPD management, weight loss medications, and disease-modifying an...| [COPD]|
|The patient was on Metformin for DM2, mood stabilizers for Bipolar II Disorder, and inhaled corti...| [Asthma]|
| The patient was diagnosed with Chronic Bronchitis after a series of pulmonary function tests.|[Chronic bronchitis]|
|Chest CT imaging revealed significant bullae and airspace enlargement, consistent with a diagnosi...| [Emphysema]|
+----------------------------------------------------------------------------------------------------+--------------------+
Please see the Respiratory Disease, Heart Disease, and Mental Disorder demos
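Because these are multi-label classifiers, a single text can in principle receive more than one disease label. The sketch below illustrates the general idea of turning per-label scores into a label set with a cutoff. The scores, the `0.5` threshold, and the `select_labels` helper are hypothetical values chosen for illustration; they are not output from the models above.

```python
def select_labels(scores, threshold=0.5):
    """Keep every label whose score meets the threshold, highest score first."""
    kept = [(label, score) for label, score in scores.items() if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [label for label, _ in kept]


# Hypothetical scores for a note that mentions both COPD and asthma.
example_scores = {
    "Asthma": 0.81,
    "COPD": 0.93,
    "Emphysema": 0.07,
    "Chronic bronchitis": 0.12,
}
print(select_labels(example_scores))
```

Unlike a single-label classifier, nothing forces exactly one label to be chosen: lowering the threshold returns more labels, raising it returns fewer or none.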
3 New ChunkMapper Models to Map UMLS Codes to Their MeSH, SNOMED, and ICD-10-CM Codes
We have introduced 3 new `ChunkMapper` models to map UMLS codes to their MeSH, SNOMED, and ICD-10-CM codes.
Model Name | Relations | Description |
---|---|---|
umls_mesh_mapper | mesh_code | Maps UMLS codes to MeSH codes |
umls_snomed_mapper | snomed_code | Maps UMLS codes to SNOMED codes |
umls_icd10cm_mapper | icd10cm_code | Maps UMLS codes to ICD-10-CM codes |
Example:
chunkerMapper = DocMapperModel.pretrained("umls_icd10cm_mapper", "en", "clinical/models")\
.setInputCols(["document"])\
.setOutputCol("mappings")\
.setRels(["icd10cm_code"])
text= """["C0000744"], ["C2875181"]"""
Result:
+---------+------------+------------+
|umls_code|icd10cm_code|relation |
+---------+------------+------------+
|C0000744 |E786 |icd10cm_code|
|C2875181 |G4381 |icd10cm_code|
+---------+------------+------------+
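The ICD-10-CM codes in the result above are shown without the decimal point. If a dotted display form is needed downstream, a small helper along these lines can be applied to the mapper output. `format_icd10cm` is a hypothetical name used for this sketch and is not part of the library.

```python
def format_icd10cm(code):
    """Insert the decimal point after the third character, e.g. E786 -> E78.6.

    Codes that are already dotted, or that have only three characters,
    are returned unchanged.
    """
    code = code.strip().upper()
    if "." in code or len(code) <= 3:
        return code
    return f"{code[:3]}.{code[3:]}"


# Codes taken from the result table above.
for raw in ["E786", "G4381"]:
    print(raw, "->", format_icd10cm(raw))
```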
New Parameter to AssertionDLModel to Rename the Assertion Labels
We have introduced a new parameter called `ReplaceLabels(dict[str, str])` to `AssertionDLModel`.
This parameter enables users to customize the assertion labels.
Example:
clinical_assertion = AssertionDLModel.pretrained("assertion_dl_large", "en", "clinical/models") \
.setInputCols(["sentence", "ner_chunk", "embeddings"]) \
.setOutputCol("assertion") \
.setReplaceLabels({"PRESENT": "available", "absent": "none", "Conditional": "Optional"})
text= """Patient with severe fever and sore throat. He shows no stomach pain and he maintained on an epidural and PCA
for pain control. He also became short of breath with climbing a flight of stairs."""
Result:
chunks | entities | assertion |
---|---|---|
severe fever | PROBLEM | available |
sore throat | PROBLEM | available |
stomach pain | PROBLEM | none |
an epidural | TREATMENT | available |
PCA | TREATMENT | available |
pain control | PROBLEM | hypothetical |
short of breath | PROBLEM | Optional |
Please see Clinical Assertion Notebook for more information
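The renaming in the result above can be pictured as a dictionary lookup over the predicted labels. The sketch below is plain Python and not the library's implementation. It assumes case-insensitive key matching, which is only inferred from the example (the keys `"PRESENT"`, `"absent"`, and `"Conditional"` are written in mixed case yet all take effect); `rename_labels` is a hypothetical helper name.

```python
def rename_labels(labels, replacements):
    """Replace each label found in `replacements`; leave all others untouched.

    Matching is case-insensitive here, which is an assumption based on the
    mixed-case keys in the example above.
    """
    lowered = {key.lower(): value for key, value in replacements.items()}
    return [lowered.get(label.lower(), label) for label in labels]


replacements = {"PRESENT": "available", "absent": "none", "Conditional": "Optional"}
predicted = ["present", "present", "absent", "present", "present", "hypothetical", "conditional"]
print(rename_labels(predicted, replacements))
```

Labels with no entry in the dictionary, such as `hypothetical`, pass through unchanged, which matches the `pain control` row in the result table.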
A New Parameter to DeIdentification to Obfuscate Ages Based on HIPAA (Health Insurance Portability and Accountability Act) Privacy Rule
We have introduced a new parameter called `AgeRangesByHipaa()`, which determines whether to obfuscate ages in compliance with the HIPAA (Health Insurance Portability and Accountability Act) Privacy Rule.
The HIPAA Privacy Rule mandates that ages of patients older than 90 years must be obfuscated, while the ages of patients 90 years or younger can remain unchanged.
If the parameter is set to `True`, age entities greater than 90 will be obfuscated as per the HIPAA Privacy Rule, and the others will remain unchanged.
If the parameter is set to `False`, the `AgeRanges` parameter is considered for obfuscation. The default value of `AgeRangesByHipaa` is `False`.
Example:
obfuscation = DeIdentification()\
.setInputCols(["sentence", "token", "age_chunk"]) \
.setOutputCol("obfuscation") \
.setMode("obfuscate")\
.setObfuscateDate(True)\
.setObfuscateRefSource("faker") \
.setAgeRangesByHipaa(True)
dates = [
'1 year old baby',
'4 year old kids',
'Record date: 2093-01-13, Age: 25',
'A 92 year old female with',
'Patient is 108 years-old',
'He is 120 years-old male',
]
Result:
text | age_chunk | obfuscation |
---|---|---|
1 year old baby | 1 | 1 year old baby |
4 year old kids | 4 | 4 year old kids |
Record date: 2093-01-13, Age: 25 | 25 | Record date: 2093-03-10, Age: 25 |
A 92 year old female with | 92 | A 99 year old female with |
Patient is 108 years-old | 108 | Patient is 119 years-old |
He is 120 years-old male | 120 | He is 140 years-old male |
Please see Clinical DeIdentification Notebook for more information
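The selection rule described in this section can be sketched in a few lines of plain Python. This is only an illustration of which ages are affected when the parameter is `True`, using the threshold of 90 as stated in these notes; it does not reproduce how the library picks the replacement value, and `ages_to_obfuscate` is a hypothetical helper name.

```python
HIPAA_AGE_THRESHOLD = 90  # threshold as stated in the text above


def ages_to_obfuscate(ages, threshold=HIPAA_AGE_THRESHOLD):
    """Return the ages that exceed the threshold and would therefore be obfuscated."""
    return [age for age in ages if age > threshold]


# Ages taken from the age_chunk column of the result table above.
ages = [1, 4, 25, 92, 108, 120]
print(ages_to_obfuscate(ages))
```

This matches the result table: the ages 1, 4, and 25 are left as they are, while 92, 108, and 120 are replaced.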
Enhanced Cloud Support for ContextualParser with setJsonPath and setDictionary
Explore the new capabilities of the ContextualParserApproach, featuring extended cloud support for path configuration using setJsonPath and setDictionary. This example demonstrates how to leverage these enhancements to improve entity recognition in your NLP projects, providing flexibility and scalability with cloud-based resources.
Example:
contextual_parser = ContextualParserApproach() \
.setInputCols(["sentence", "token"])\
.setOutputCol("entity")\
.setCaseSensitive(True)\
.setJsonPath("s3a://YOUR_S3_BUCKET/data/cities.json")\
.setDictionary("s3a://YOUR_S3_BUCKET/data/cities.tsv",
read_as = "SPARK",
options={"orientation": "vertical",
"format": "text"})
New Features to the SparkNLP Healthcare CMS-HCC Risk-Adjustment Module
We are introducing 3 new functions to the SparkNLP Healthcare CMS-HCC Risk-Adjustment Module.
`HCC_from_ICD`: A mapping of ICD-10 codes to their corresponding RXHCC or HCC codes. Retrieves the mapping of Risk Adjustment Hierarchical Condition Categories (RXHCC) or Hierarchical Condition Categories (HCC) based on a list of International Classification of Diseases, 10th Revision (ICD-10) codes for a specific measurement year.
Example:
from sparknlp_jsl.utils.risk_adjustment_utils import RiskAdjustmentUtil
RiskAdjustmentUtil.HCC_from_ICD("hcc", "ESRDv21", 2019, ["A021", "I209", "E103559"])
OUTPUT:{'A021': ['HCC2'], 'I209': ['HCC88'], 'E103559': ['HCC18', 'HCC122']}
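A common follow-up step is to look at the result from the other direction, that is, which ICD-10 codes contributed to each HCC. The sketch below inverts the dictionary shown in the output above using only the standard library; `group_icd_by_hcc` is a hypothetical helper, not a function of the Risk-Adjustment module.

```python
from collections import defaultdict


def group_icd_by_hcc(icd_to_hcc):
    """Invert {icd_code: [hcc, ...]} into {hcc: [icd_code, ...]}."""
    grouped = defaultdict(list)
    for icd_code, hcc_codes in icd_to_hcc.items():
        for hcc in hcc_codes:
            grouped[hcc].append(icd_code)
    return dict(grouped)


# Dictionary copied from the OUTPUT line above.
icd_to_hcc = {"A021": ["HCC2"], "I209": ["HCC88"], "E103559": ["HCC18", "HCC122"]}
print(group_icd_by_hcc(icd_to_hcc))
```

An ICD-10 code that maps to several HCCs, such as `E103559`, appears under each of them.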
`HCC_labels`: A mapping of RXHCCs and HCCs to their respective medical descriptions. Retrieves the medical descriptions associated with a given list of Risk Adjustment Hierarchical Condition Categories (RXHCCs) and Hierarchical Condition Categories (HCCs) for a specified measurement year.
Example:
from sparknlp_jsl.utils.risk_adjustment_utils import RiskAdjustmentUtil
RiskAdjustmentUtil.HCC_labels("hcc", "24", 2021, ["HCC1", "HCC37", "HCC321"])
OUTPUT:{'HCC1': 'HIV/AIDS'}
`diff_between_HCCs`: Calculates the difference between two lists of Hierarchical Condition Categories (HCCs) or Risk Adjustment Hierarchical Condition Categories (RXHCCs) for a specific measurement year. This method identifies and categorizes the added, deleted, and upgraded HCCs between the “before_HCC_list” and “after_HCC_list” states. HCCs and RXHCCs evolve over time. New conditions emerge with age, causing additions and removals of CCs. Some CCs might escalate to higher levels of severity, representing an upgrade.
Example:
from sparknlp_jsl.utils.risk_adjustment_utils import RiskAdjustmentUtil
RiskAdjustmentUtil.diff_between_HCCs("rxhcc", "08", 2023, ["RXHCC77", "RXHCC262"], ["RXHCC1", "RXHCC78", "RXHCC261"])
OUTPUT :{'added_list': ['RXHCC1', 'RXHCC78', 'RXHCC261'], 'deleted_list': ['RXHCC77']}
Please see Calculate Medicare Risk Adjustment Score Notebook for more information
New Functionalities for the OCR Utility Module
Explore the latest enhancements in the OCR Utility Module, which now allow you to insert custom text onto colored bands using the `text_band` parameter. Customize your document annotations even further by configuring specific colors with RGB tuples through the `outline_color` parameter. Additionally, you can adjust the outline width for bounding boxes using the `outline_width` parameter. Elevate your OCR processing capabilities with these new functionalities.
- Introduced the option to configure a desired specific color using RGB tuples with the `outline_color` parameter for the bounding box style.
Example:
path='content/*.pdf'
box = "bounding_box"
ocr_entity_processor(spark=spark,
file_path = path,
ner_pipeline = nlp_model,
chunk_col = "merged_chunk",
...
label_color = "blue",
display_result = True,
outline_color = (155,0,0)) ##Takes Tuple
- Users can now adjust the outline width of the bounding box style using the `outline_width` parameter.
Example:
path='content/*.pdf'
box = "highlight"
ocr_entity_processor(spark=spark,
file_path = path,
ner_pipeline = nlp_model,
style = box,
...
label_color = "red",
outline_width = 6 ) ## Width of the outline
Please see Spark OCR Utility Module Notebook for more information
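Since `outline_color` takes an RGB tuple, colors that are stored as hex strings (for example in a style guide) need to be converted first. The following sketch shows one way to do that in plain Python; `hex_to_rgb` is a hypothetical helper and is not part of the OCR Utility Module.

```python
def hex_to_rgb(hex_color):
    """Convert a '#RRGGBB' (or 'RRGGBB') string into an (R, G, B) tuple of ints."""
    value = hex_color.lstrip("#")
    if len(value) != 6:
        raise ValueError(f"expected 6 hex digits, got {hex_color!r}")
    # Parse the red, green, and blue channels, two hex digits each.
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


# '#9B0000' corresponds to the (155, 0, 0) tuple used in the example above.
print(hex_to_rgb("#9B0000"))
```

The resulting tuple can then be passed as the `outline_color` argument in the same way as the literal `(155,0,0)` in the example.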
Various Core Improvements; Bug Fixes, Enhanced Overall Robustness and Reliability of SparkNLP for Healthcare
- The issue with pretrained models that included the `ChunkConverter()` when loading them locally using `from_disk()` has been resolved.
- The incorrect exception message in `ocr_entity_processor()` has been corrected.
- Resolved the problem with day-shifting in `DeIdentification`, specifically related to masking <AGE> when using the setAgeRange feature.
Updated Notebooks And Demonstrations For Making Spark NLP For Healthcare Easier To Navigate And Understand
- New LANGTEST NER for the most popular clinical NER models
- New Date Shifting and Date Normalization Demo demonstrates the most popular deidentification date operation
- New Respiratory Disease Demo with new `multiclassifierdl_respiratory_disease` model
- New Mental Disorder demo with new `multiclassifierdl_mental_disorder` model
- New Heart Disease demo with new `multiclassifierdl_heart_disease` model
- Updated Spark OCR Utility Module Notebook with latest improvements
- Clinical DeIdentification Notebook with new `setAgeRangesByHipaa` examples
- Calculate Medicare Risk Adjustment Score Notebook with new Risk-Adjustment feature examples
We Have Added And Updated A Substantial Number Of New Clinical Models And Pipelines, Further Solidifying Our Offering In The Healthcare Domain.
- `ner_clinical` -> `ar`
- `ner_clinical` -> `bg`
- `ner_clinical` -> `fi`
- `multiclassifierdl_respiratory_disease`
- `multiclassifierdl_mental_disorder`
- `multiclassifierdl_heart_disease`
- `ner_biomarker_langtest`
- `ner_bionlp_langtest`
- `ner_clinical_large_langtest`
- `ner_living_species_langtest`
- `ner_cellular_langtest`
- `ner_vop_langtest`
- `ner_chemprot_clinical_langtest`
- `ner_bacterial_species_langtest`
- `ner_deid_enriched_langtest`
- `ner_deid_large_langtest`
- `ner_diseases_langtest`
- `ner_oncology_langtest`
- `ner_deid_generic_augmented_allUpperCased_langtest`
- `umls_mesh_mapper`
- `umls_snomed_mapper`
- `umls_icd10cm_mapper`
For all Spark NLP for Healthcare models, please check: Models Hub Page