5.0.1
Highlights
We are delighted to announce a suite of enhancements and updates in the latest release of Spark NLP for Healthcare. This release includes the first NER models augmented with the LangTest library for robustness and bias, as well as support for RxHCC risk score calculation in the latest versions.
- Integrated the Risk Adjustment for Prescription Drug Hierarchical Condition Categories (RxHCC) model into our risk adjustment score calculation engine
- Advanced entity detection for Section Headers and Diagnoses entities in clinical notes
- Augmented NER models by leveraging the capabilities of the LangTest library
- Enhanced Sentence Entity Resolver Models for associating clinical entities with LOINC
- Strengthened assertion status detection by reinforcing it with entity type constraints
- Entity blacklisting in `AssertionFilterer` to manage assertion status effectively
- Enhanced `ChunkMergeApproach` and `ChunkFilterer` with case sensitivity settings
- New feature in `ChunkMergeApproach` to filter chunks according to confidence thresholds
- Included sentence ID information in `RelationExtractionModel` metadata
- Various core improvements: bug fixes and enhanced overall robustness and reliability of Spark NLP for Healthcare
- Improved deidentification regex pattern for the Romanian language
- Fixed exploded sentences issue in `RelationExtractionDLModel`
- Updated notebooks and demonstrations to make Spark NLP for Healthcare easier to navigate and understand
- The addition and update of numerous new clinical models and pipelines continue to reinforce our offering in the healthcare domain
We believe that these enhancements will elevate your experience with Spark NLP for Healthcare, enabling more efficient, accurate, and streamlined analysis of healthcare-related natural language data.
Integrated The Risk Adjustment For Prescription Drug Hierarchical Condition Categories (RxHCC) Model Into Our Risk Adjustment Score Calculation Engine
We have integrated the RxHCC model into our existing risk adjustment score calculation module. This means more accurate and comprehensive risk adjustment scores, especially for patients whose healthcare costs are significantly influenced by prescription drug usage. This enhancement brings a holistic view of a patient’s healthcare needs, further improving the precision of risk assessment.
We are pleased to introduce support for RxHCC risk score calculation in two new versions: v05 (applicable for 2020, 2021, 2022, and 2023) and v08 (applicable for 2022 and 2023). To utilize these versions with specific years, simply use the following formats: `profileRxHCCV05YXX` for v05 and `profileRxHCCV08YXX` for v08, where `XX` is the last two digits of the year.
Example:
Input Data Frame:
filename | Age | icd10_code | Extracted_Entities_vs_ICD_Codes | Gender | eligibility | orec | esrd |
---|---|---|---|---|---|---|---|
patient_01.txt | 66 | C49.9, J18.9, C49.9, D61.81, I26, M06.9 | {leiomyosarcoma, C49.9}, {pneumonia, J18.9}, … | F | CE_NoLowAged | 1 | false |
patient_02.txt | 59 | C50.92, P61.4, C80.1 | {breast cancer, C50.92}, {dysplasia, P61.4}, … | F | CE_NoLowNoAged | 0 | true |
# v08 year 2023
from pyspark.sql import functions as F
from sparknlp_jsl.functions import profileRxHCCV08Y23

# compute the RxHCC profile as a JSON string
df = df.withColumn("rxhcc_profile", profileRxHCCV08Y23(df.icd10_code, df.Age, df.Gender, df.eligibility, df.orec, df.esrd))

# `schema` is the struct schema of the returned profile (risk_score, parameters, details)
df = df.withColumn("rxhcc_profile", F.from_json(F.col("rxhcc_profile"), schema))
df = df.withColumn("risk_score", df.rxhcc_profile.getItem("risk_score"))\
       .withColumn("parameters", df.rxhcc_profile.getItem("parameters"))\
       .withColumn("details", df.rxhcc_profile.getItem("details"))
Results (V08-Y23):
filename | Age | icd10_code | Extracted_Entities_vs_ICD_Codes | Gender | eligibility | orec | esrd | rxhcc_profile | risk_score | parameters | details |
---|---|---|---|---|---|---|---|---|---|---|---|
patient_01.txt | 66 | C49.9, J18.9, C49.9, D61.81, I26, M06.9 | {leiomyosarcoma, C49.9}, {pneumonia, J18.9}, … | F | CE_NoLowAged | 1 | false | {0.575, null, {“elig”:”CE_NoLowAged”,”age”:66, … | 0.575 | {“elig”:”CE_NoLowAged”,”age”: … | {“Rx_CE_NoLowAged_F65_69”… |
patient_02.txt | 59 | C50.92, P61.4, C80.1 | {breast cancer, C50.92}, {dysplasia, P61.4}, … | F | CE_NoLowNoAged | 0 | true | {0.367, null, {“elig”:”CE_NoLowNoAged”,”age”:59… | 0.367 | {“elig”:”CE_NoLowNoAged”,”age”… | { Rx_CE_NoLowNoAged_F55_5… |
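The v05 profiles follow the same pattern. For example, computing the 2022 v05 profile would look roughly like the sketch below (the `profileRxHCCV05Y22` name is an assumption derived from the `profileRxHCCV05YXX` format described above):

# v05 year 2022 (function name assumed from the profileRxHCCV05YXX naming scheme)
from sparknlp_jsl.functions import profileRxHCCV05Y22

df = df.withColumn("rxhcc_profile", profileRxHCCV05Y22(df.icd10_code, df.Age, df.Gender, df.eligibility, df.orec, df.esrd))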
Advanced Entity Detection For Section Headers And Diagnoses Entities In Clinical Notes
We are introducing a new state-of-the-art NER model specifically designed to extract vital data from clinical documents, focusing on two key aspects: Section Headers and Diagnoses. By accurately identifying and labeling various medical conditions such as heart disease, diabetes, and Alzheimer’s disease, this model provides valuable insights into diagnosis and treatment trends.
Example:
clinical_ner = MedicalNerModel.pretrained("ner_section_header_diagnosis", "en","clinical/models")\
.setInputCols(["sentence","token","embeddings"])\
.setOutputCol("ner")\
.setLabelCasing("upper")
text = """
Medical History:
Patient has a history of Chronic respiratory disease.
Clinical History:
Patient presented with shortness of breath and chest pain.
Chief Complaint:
Patient complained of chest pain and difficulty breathing.
History of Present Illness:
Patient has been experiencing chest pain and shortness of breath for the past week. Symptoms were relieved by medication at first but became worse over time.
Past Medical History:
Patient has a history of Asthma and was previously diagnosed with Bronchitis.
Medications:
Patient is currently taking Albuterol, Singulair, and Advair for respiratory issues.
Allergies:
Patient has a documented allergy to Penicillin.
"""
Result:
chunks | entities | confidence |
---|---|---|
Medical History | MEDICAL_HISTORY_HEADER | 0.81 |
Chronic respiratory disease | RESPIRATORY_DISEASE | 0.74 |
Clinical History | CLINICAL_HISTORY_HEADER | 0.77 |
Chief Complaint | CHIEF_COMPLAINT_HEADER | 0.85 |
History of Present Illness | HISTORY_PRES_ILNESS_HEADER | 0.99 |
Past Medical History | MEDICAL_HISTORY_HEADER | 0.71 |
Asthma | RESPIRATORY_DISEASE | 0.99 |
Bronchitis | RESPIRATORY_DISEASE | 0.84 |
Medications | MEDICATIONS_HEADER | 0.99 |
Allergies | ALLERGIES_HEADER | 0.99 |
Please check: ner_section_header_diagnosis model card for more information.
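Note that the snippet above shows only the NER stage. A minimal sketch of the upstream stages it is typically paired with (the `sentence_detector_dl_healthcare` and `embeddings_clinical` models and the column names below are common defaults, not mandated by this release) could look like this:

from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, WordEmbeddingsModel
from sparknlp_jsl.annotator import NerConverterInternal

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_converter = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter])

data = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(data).transform(data)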
Augmented NER Models Leveraging LangTest Library Capabilities
Newly introduced augmented NER models, namely `ner_posology_langtest`, `ner_jsl_langtest`, `ner_ade_clinical_langtest`, and `ner_sdoh_langtest`, are powered by the innovative `LangTest` library. This cutting-edge NLP toolkit incorporates state-of-the-art techniques and algorithms to significantly enhance the robustness and fairness of our models.
Example:
clinical_ner = MedicalNerModel.pretrained("ner_sdoh_langtest", "en", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
text = """Smith is 55 years old, living in New York, a divorced Mexcian American woman with financial problems. She speaks Spanish and Portuguese. She lives in an apartment. She has been struggling with diabetes for the past 10 years and has recently been experiencing frequent hospitalizations due to uncontrolled blood sugar levels. Smith works as a cleaning assistant and cannot access health insurance or paid sick leave."""
Result:
chunk | begin | end | ner_label |
---|---|---|---|
55 years old | 9 | 20 | Age |
New York | 33 | 40 | Geographic_Entity |
divorced | 45 | 52 | Marital_Status |
Mexcian American | 54 | 69 | Race_Ethnicity |
woman | 71 | 75 | Gender |
financial problems | 82 | 99 | Financial_Status |
She | 102 | 104 | Gender |
Spanish | 113 | 119 | Language |
Portuguese | 125 | 134 | Language |
She | 137 | 139 | Gender |
apartment | 153 | 161 | Housing |
She | 164 | 166 | Gender |
diabetes | 193 | 200 | Other_Disease |
hospitalizations | 268 | 283 | Other_SDoH_Keywords |
cleaning assistant | 342 | 359 | Employment |
access health ins… | 372 | 394 | Insurance_Status |
Enhanced Sentence Entity Resolver Models For Associating Clinical Entities With LOINC
Introducing the new `sbiobertresolve_loinc_numeric` model and an improved `sbiobertresolve_loinc_augmented` model, both offering enhanced accuracy for mapping medical laboratory observations and clinical measurements to their corresponding Logical Observation Identifiers Names and Codes (LOINC). The `sbiobertresolve_loinc_numeric` model is specialized in numeric LOINC codes, as it was trained without the LOINC “Document Ontology” codes starting with the letter “L”. The `sbiobertresolve_loinc_augmented` model, on the other hand, offers broader functionality and can return both numeric and document ontology codes.
Example:
resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_loinc_numeric","en", "clinical/models") \
.setInputCols(["sbert_embeddings"]) \
.setOutputCol("loinc_code")\
.setDistanceFunction("EUCLIDEAN")
sample_text = "The patient is a 22-year-old female with a history of obesity. She has a Body mass index (BMI) of 33.5 kg/m2, aspartate aminotransferase 64, and alanine aminotransferase 126."
Results:
chunk | entity | loinc_code | all_codes | resolutions |
---|---|---|---|---|
BMI | Test | 39156-5 | 39156-5, 89270-3, 100847-3… | [BMI [Body mass index], BMI Est [Body mass index], BldA [Gas & ammonia panel], … |
aspartate aminotransferase | Test | 14409-7 | 14409-7, 1916-6, 16324-6, … | [Aspartate aminotransferase [Aspartate aminotransferase], Aspartate aminotransf… |
alanine aminotransferase | Test | 16324-6 | 16324-6, 16325-3, 1916-6, … | [Alanine aminotransferase [Alanine aminotransferase], Alanine aminotransferase/… |
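The resolver above consumes sentence embeddings of the extracted chunks. A minimal sketch of the stages usually placed right before it (the `Chunk2Doc` step and the `sbiobert_base_cased_mli` sentence embeddings are assumptions based on the typical resolver setup; `ner_chunk` is assumed to come from a clinical NER model plus `NerConverterInternal`) could look like this:

from sparknlp.base import Chunk2Doc
from sparknlp.annotator import BertSentenceEmbeddings

# turn the extracted NER chunks (e.g. lab tests) into documents so they can be embedded
chunk2doc = Chunk2Doc()\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("ner_chunk_doc")

# sentence embeddings these LOINC resolvers are typically paired with (assumed here)
sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setInputCols(["ner_chunk_doc"])\
    .setOutputCol("sbert_embeddings")\
    .setCaseSensitive(False)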
Strengthened The Performance Of Assertion Status Detection By Reinforcing It With Entity Type Constraints
Introducing the latest enhancements to our `AssertionDLModel`: the `setEntityAssertion` and `setEntityAssertionCaseSensitive` parameters. You can now constrain assertions based on specific entity types using a convenient dictionary format: `{"entity": [assertion_label1, assertion_label2, ..., assertion_labelN]}`. When an entity is not found in the dictionary, no constraints are applied, ensuring flexibility in your data processing. With the `setEntityAssertionCaseSensitive` parameter, you can control the case sensitivity of both entities and assertion labels.
Example:
clinical_assertion = AssertionDLModel.pretrained("assertion_jsl_augmented", "en", "clinical/models") \
.setInputCols(["sentence", "ner_chunk", "embeddings"]) \
.setOutputCol("assertion")\
.setEntityAssertionCaseSensitive(False)\
.setEntityAssertion({
"PROBLEM": ["hypothetical", "absent"],
"treAtment": ["present"],
"TEST": ["POssible"],
})
text = '''
A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, and associated with an acute hepatitis, presented with a one-week history of polyuria, poor appetite, and vomiting.
She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months at the time of presentation.
Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness, guarding, or rigidity. Pertinent laboratory findings on admission were: serum glucose 111 mg/dl, creatinine 0.4 mg/dL, triglycerides 508 mg/dL, total cholesterol 122 mg/dL, and venous pH 7.27.
'''
Result:
idx | chunks | entities | assertion | confidence |
---|---|---|---|---|
0 | metformin | TREATMENT | Present | 0.54 |
1 | glipizide | TREATMENT | Present | 0.99 |
2 | dapagliflozin | TREATMENT | Present | 1.0 |
3 | HTG | PROBLEM | Hypothetical | 1.0 |
4 | Physical examination | TEST | Possible | 0.94 |
5 | tenderness | PROBLEM | Absent | 1.0 |
6 | guarding | PROBLEM | Absent | 1.0 |
7 | rigidity | PROBLEM | Hypothetical | 0.99 |
Entity Blacklisting In AssertionFilterer For Effective Assertion Status Management
With the `setBlackList` option in the `AssertionFilterer` annotator, you can now blacklist specific entities based on their assertion labels.
Example:
clinical_assertion = AssertionDLModel.pretrained("assertion_jsl_augmented", "en", "clinical/models") \
.setInputCols(["sentence", "ner_chunk", "embeddings"]) \
.setOutputCol("assertion")
assertion_filterer = AssertionFilterer()\
.setInputCols("sentence","ner_chunk","assertion")\
.setOutputCol("assertion_filtered")\
.setBlackList(["Hypothetical"])\
text = """Patient has a headache for the last 2 weeks, needs to get a head CT, and appears anxious when she walks fast. No alopecia and pain noted"""
Without Filtering Results:
idx | chunks | entities | assertion | confidence |
---|---|---|---|---|
0 | a headache | PROBLEM | Present | 1 |
1 | a head CT | TEST | Hypothetical | 1 |
2 | anxious | PROBLEM | SomeoneElse | 0.77 |
3 | alopecia | PROBLEM | Hypothetical | 0.97 |
4 | pain | PROBLEM | Hypothetical | 0.99 |
Filtered Results:
idx | chunks | entities | assertion | confidence |
---|---|---|---|---|
0 | a headache | PROBLEM | Present | 0.97 |
1 | anxious | PROBLEM | SomeoneElse | 0.99 |
Enhanced ChunkMergeApproach And ChunkFilterer With Case Sensitivity Settings
The `setCaseSensitive` parameter now applies to the whitelist and blacklist functionalities. As part of this enhancement, the parameter has been added to the filtering feature that serves as a superclass for `ChunkFilterer` and `ChunkMergeApproach`. With this update, the `caseSensitive` setting can be conveniently used across these classes, offering improved control and consistency in the filtering process.
Example:
posology_ner = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverterInternal()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("ner_chunk")
chunk_filterer = ChunkFilterer()\
.setInputCols("sentence","ner_chunk")\
.setOutputCol("chunk_filtered")\
.setCriteria("isin")\
.setWhiteList(['ADVIL','Metformin', 'Insulin Lispro'])\
.setCaseSensitive(False)
text ="""The patient was prescribed 1 capsule of Advil for 5 days . She was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , metformin 1000 mg two times a day."""
Result:
# detected ner chunks
['1', 'capsule', 'Advil', 'for 5 days', '40 units', 'insulin glargine', 'at night', '12 units', 'insulin lispro', 'with meals', 'metformin', '1000 mg', 'two times a day']
# filtered ner chunks
['Advil', 'insulin lispro', 'metformin']
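The same `setCaseSensitive` setting can be applied on `ChunkMergeApproach`; a minimal sketch (the input chunk columns and the blacklist entry below are illustrative) might look like this:

# merge chunks coming from two NER outputs, applying the blacklist case-insensitively
chunk_merger = ChunkMergeApproach()\
    .setInputCols(["ner_chunk", "posology_ner_chunk"])\
    .setOutputCol("merged_chunk")\
    .setBlackList(["advil"])\
    .setCaseSensitive(False)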
New Feature For ChunkMergeApproach To Enable Filtering Chunks According To Confidence Thresholds
We have added a new `setEntitiesConfidenceResource` parameter to the `ChunkMergeApproach` annotator that enables filtering chunks according to confidence thresholds. All you need to do is provide a CSV file with the NER labels as keys and the confidence thresholds as values.
Example:
conf_dict = """DRUG,0.99
FREQUENCY,0.99
DOSAGE,0.99
DURATION,0.99
STRENGTH,0.99
"""
# write the label/threshold pairs to a CSV resource file
with open('conf_dict.csv', 'w') as f:
    f.write(conf_dict)
chunk_merger = ChunkMergeApproach()\
.setInputCols("posology_ner_chunk")\
.setOutputCol('merged_ner_chunk')\
.setEntitiesConfidenceResource("conf_dict.csv")
sample_text = 'The patient was prescribed 1 capsule of Advil for 5 days. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night.'
Detected chunks:
chunks | begin | end | entities | confidence |
---|---|---|---|---|
1 | 27 | 27 | DOSAGE | 0.99 |
capsule | 29 | 35 | FORM | 0.99 |
Advil | 40 | 44 | DRUG | 0.99 |
for 5 days | 46 | 55 | DURATION | 0.71 |
40 units | 125 | 132 | DOSAGE | 0.85 |
insulin glargine | 137 | 152 | DRUG | 0.83 |
at night | 154 | 161 | FREQUENCY | 0.81 |
Filtered by confidence scores:
chunks | begin | end | entities | confidence |
---|---|---|---|---|
1 | 27 | 27 | DOSAGE | 0.99 |
capsule | 29 | 35 | FORM | 0.99 |
Advil | 40 | 44 | DRUG | 0.99 |
Included Sentence ID Information In RelationExtractionModel Metadata
Our Relation Extraction Models have been upgraded to include `sentence` information in the metadata. This enhancement offers a deeper understanding of the extracted relationships and facilitates more precise analysis and interpretation of the results.
Example:
re_dl_model = RelationExtractionDLModel.pretrained('redl_bodypart_direction_biobert', "en", "clinical/models")\
.setInputCols(["re_ner_chunks", "sentences"]) \
.setOutputCol("relations_dl")\
.setPredictionThreshold(0.5)
text = '''MRI demonstrated infarction in the upper brain stem , and right basil ganglia.
No neurologic deficits other than some numbness in his left hand.
there is a problem at right chest.'''
Result:
idx | sentence | chunk1 | entity1 | chunk2 | entity2 | relation | confidence |
---|---|---|---|---|---|---|---|
0 | 0 | upper | Direction | brain stem | Internal_organ_or_component | 1 | 1.0 |
1 | 0 | upper | Direction | basil ganglia | Internal_organ_or_component | 0 | 0.99 |
2 | 0 | right | Direction | basil ganglia | Internal_organ_or_component | 1 | 1.0 |
3 | 1 | left | Direction | hand | External_body_part_or_region | 1 | 1.0 |
4 | 2 | right | Direction | chest | External_body_part_or_region | 1 | 1.0 |
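To access the new field, the relation metadata can be inspected directly from the output DataFrame. Below is a minimal sketch, assuming `result` holds the transformed DataFrame and that the metadata keys mirror the columns shown above:

from pyspark.sql import functions as F

# explode the relation annotations and pull the new sentence id out of the metadata map
result.select(F.explode("relations_dl").alias("rel"))\
    .select(
        F.col("rel.metadata")["sentence"].alias("sentence"),
        F.col("rel.metadata")["chunk1"].alias("chunk1"),
        F.col("rel.metadata")["chunk2"].alias("chunk2"),
        F.col("rel.result").alias("relation"),
        F.col("rel.metadata")["confidence"].alias("confidence"),
    ).show(truncate=False)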
Various Core Improvements: Bug Fixes, Enhanced Overall Robustness, And Reliability Of Spark NLP For Healthcare
- Improved deidentification regex pattern for the Romanian language
- Fixed exploded sentences issue in Relation Extraction DL (when `setExplodeSentences(True)` is used in `SentenceDetector`, `RelationExtractionDLModel`’s relation output contained only the `sentence#0` relations and the other sentences’ relations were not displayed)
Updated Notebooks And Demonstrations For Making Spark NLP For Healthcare Easier To Navigate And Understand
- Updated Clinical_Named_Entity_Recognition_Model notebook according to latest improvement in ChunkFilterer
- Updated Clinical_Assertion_Model notebook according to latest improvement in AssertionFilterer
- Updated Clinical_NER_Chunk_Merger notebook according to latest improvement in ChunkMergeApproach
- Updated Clinical_Relation_Extraction notebook according to latest improvement in RelationExtractionModel’s metadata
- Updated Calculate_Medicare_Risk_Adjustment_Score notebook according to latest improvement in HCC implementation
We Have Added And Updated A Substantial Number Of New Clinical Models And Pipelines, Further Solidifying Our Offering In The Healthcare Domain.
ner_section_header_diagnosis
ner_posology_langtest
ner_jsl_langtest
ner_sdoh_langtest
ner_ade_clinical_langtest
sbiobertresolve_loinc_numeric
sbiobertresolve_loinc_augmented
For all Spark NLP for Healthcare models, please check: Models Hub Page