Healthcare NLP v4.4.1 Release Notes

4.4.1

Highlights

We are pleased to announce the latest enhancements and features for Spark NLP for Healthcare. This release showcases significant improvements and updates, including:

Introducing a new biogpt-chat-jsl model (LLM), fine-tuned for clinical conversations in healthcare settings.
A specialized medical summarizer model (LLM) designed specifically for radiology report analysis.
Eight new Voice of Patient (VOP) Named Entity Recognition (NER) models for detecting clinical terms expressed in patients’ own words.
Advanced Chunk Mapper models for precise mapping of NDC and HCPCS codes.
Innovative Social Determinants of Health (SDOH) text classification models.
Enhanced NER profiling with updated pre-trained pipelines, enabling the simultaneous execution of 100 clinical NER models.
The implementation of a negative label feature for increased accuracy in the Zero Shot Relation Extraction Model.
Allowing users to select specific entity tags in NameChunkObfuscator,
Multi-mode deidentification & obfuscation support in a single pass with the streamlined Deid module.
Core improvements and bug fixes:
- Aligning ChunkMapperModel output metadata with SentenceEntityResolverModel metadata for seamless compatibility.
- Resolving issues with the MedicalNerApproach setTagsMapping parameter.
- Removing “non-sense” UNK tokens from text generators (e.g. BioGPt) for enhanced output quality.
New and updated notebooks
New and updated demos
17 new clinical models and pipelines added & updated in total

We are committed to delivering exceptional tools and resources for healthcare professionals and researchers, and we look forward to your valuable feedback on these latest updates.

Introducing A New biogpt-chat-jsl Model (LLM), Fine-Tuned For Clinical Conversations In Healthcare Settings.

We are excited to present the biogpt_chat_jsl_conversational text generator model, an advanced adaptation of the BioGPT-JSL model, meticulously fine-tuned with authentic medical conversations from clinical environments. This model is adept at answering clinical queries related to symptoms, medications, diagnostic tests, and various diseases.

In comparison to its predecessor, the biogpt_chat_jsl model, the new biogpt_chat_jsl_conversational model generates more succinct and focused responses, significantly enhancing the efficiency and user experience of our software. This cutting-edge model is poised to revolutionize the way healthcare professionals and researchers engage with clinical information.

Example:

gpt_qa = MedicalTextGenerator.pretrained("biogpt_chat_jsl_conversational", "en", "clinical/models")\
    .setInputCols("documents")\
    .setOutputCol("answer")\
    .setMaxNewTokens(100)

sample_text = "How to treat asthma?"

Result:

answer: You have to take montelukast + albuterol tablet once or twice in day according to severity of symptoms. Montelukast is used as a maintenance therapy to relieve symptoms of asthma. Albuterol is used as a rescue therapy when symptoms are severe. You can also use inhaled corticosteroids ( ICS ) like budesonide or fluticasone for long term treatment.

A Specialized Medical Summarizer Model (LLM) Designed Specifically For Radiology Report Analysis.

We are proud to unveil the summarizer_radiology model, a highly specialized tool engineered to efficiently distill radiology reports by pinpointing and retaining the most crucial information. This model enables users to rapidly access a succinct synopsis of a report’s key findings without compromising on essential details.

The summarizer_radiology model represents a significant advancement in the field of medical text analysis, offering unparalleled support to healthcare professionals in swiftly grasping the salient points of complex radiology reports and ultimately enhancing patient care outcomes.

Example:

summarizer = MedicalSummarizer.pretrained("summarizer_radiology", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("summary")\
    .setMaxTextLength(512)\
    .setMaxNewTokens(512)

sample_text = """INDICATIONS: Peripheral vascular disease with claudication.
RIGHT:
  1. Normal arterial imaging of right lower extremity.
  2. Peak systolic velocity is normal.
  3. Arterial waveform is triphasic.
  4. Ankle brachial index is 0.96.
LEFT:
  1. Normal arterial imaging of left lower extremity.
  2. Peak systolic velocity is normal.
  3. Arterial waveform is triphasic throughout except in posterior tibial artery where it is biphasic.
  4. Ankle brachial index is 1.06.
IMPRESSION:
  Normal arterial imaging of both lower lobes.
"""

Result:

The patient has peripheral vascular disease with claudication. The right lower extremity shows normal arterial imaging, but the peak systolic velocity is normal. The arterial waveform is triphasic throughout, except for the posterior tibial artery, which is biphasic. The ankle brachial index is 0.96. The impression is normal arterial imaging of both lower lobes.

Eight New Voice of Patient (VOP) Named Entity Recognition (NER) Models For Detecting Clinical Terms Expressed In Patients’ Own Words.

We are thrilled to introduce eight innovative Voice of Patient (VOP) Named Entity Recognition (NER) models, meticulously crafted to extract clinical terms from patients’ unique linguistic expressions. These models empower healthcare professionals to analyze patient data with enhanced accuracy and efficiency, paving the way for more precise diagnoses and tailored treatment plans.

By leveraging the capabilities of these VOP NER models, healthcare providers can better understand patients’ perspectives, bridging the communication gap and fostering more effective patient-centered care.

model name	description	predicted entities
`ner_vop_anatomy_wip`	Detecting anatomical terms expressed in patients’ own words.	`Laterality`, `BodyPart`
`ner_vop_clinical_dept_wip`	Detecting medical devices and clinical department mentions terms expressed in patients’ own words.	`MedicalDevice`, `AdmissionDischarge`, `ClinicalDept`
`ner_vop_demographic_wip`	Detecting demographic terms expressed in patients’ own words.	`SubstanceQuantity`, `RaceEthnicity`, `RelationshipStatus`, `Substance`, `Age`, `Employment`, `Gender`
`ner_vop_problem_reduced_wip`	Detecting clinical condition terms expressed in patients’ own words.	`Modifier`, `HealthStatus`, `Problem`
`ner_vop_problem_wip`	Detecting clinical condition terms expressed in patients’ own words using a granular taxonomy.	`InjuryOrPoisoning`, `Modifier`, `HealthStatus`, `Symptom`, `Disease`, `PsychologicalCondition`
`ner_vop_temporal_wip`	Detecting temporal references terms expressed in patients’ own words.	`Frequency`, `Duration`, `DateTime`
`ner_vop_test_wip`	Detecting test mention terms expressed in patients’ own words.	`Measurements`, `TestResult`, `Test`, `VitalTest`
`ner_vop_treatment_wip`	Detecting treatment terms expressed in patients’ own words.	`Treatment`, `Frequency`, `Procedure`, `Route`, `Duration`, `Dosage`, `Drug`, `Form`

Example:

ner = MedicalNerModel.pretrained("ner_vop_problem_wip", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

sample_text = """I"ve been experiencing joint pain and fatigue lately, so I went to the rheumatology department. After some tests, they diagnosed me with rheumatoid arthritis and started me on a treatment plan to manage the symptoms."""

Results:

chunk	ner_label
pain	Symptom
fatigue	Symptom
rheumatoid arthritis	Disease

Advanced Chunk Mapper Models For Precise Mapping Of NDC And HCPCS Codes:

We are delighted to present our cutting-edge chunk mapper models, meticulously crafted for the accurate mapping of National Drug Code (NDC) and Healthcare Common Procedure Coding System (HCPCS) codes. These innovative models enable users to swiftly and effortlessly identify the relevant codes, optimizing the coding and billing process while bolstering accuracy.

The introduction of these advanced chunk mapper models demonstrates our commitment to delivering state-of-the-art solutions that streamline healthcare administration tasks, ultimately contributing to improved efficiency and patient care outcomes.

hcpcs_ndc_mapper model maps Healthcare Common Procedure Coding System (HCPCS) codes to their corresponding National Drug Codes (NDC) and their drug brand names.

Example:

...
chunkerMapper = DocMapperModel.pretrained("hcpcs_ndc_mapper", "en", "clinical/models")\
      .setInputCols(["hcpcs_chunk"])\
      .setOutputCol("mappings")\
      .setRels(["ndc_code", "brand_name"])

text= ["Q5106", "J9211", "J7508"]

Result:

hcpcs_chunk	mappings	relation
Q5106	59353-0003-10	ndc_code
Q5106	RETACRIT (PF) 3000 U/1 ML	brand_name
J9211	59762-2596-01	ndc_code
J9211	IDARUBICIN HYDROCHLORIDE (PF) 1 MG/ML	brand_name
J7508	00469-0687-73	ndc_code
J7508	ASTAGRAF XL 5 MG	brand_name

ndc_hcpcs_mapper model maps NDC with their corresponding HCPCS codes and their descriptions.

Example:

...
chunkerMapper = DocMapperModel.pretrained("ndc_hcpcs_mapper", "en", "clinical/models")\
      .setInputCols(["ndc_chunk"])\
      .setOutputCol("hcpcs")\
      .setRels(["hcpcs_code", "hcpcs_description"])

text= ["16714-0892-01", "00990-6138-03", "43598-0650-11"]

Result:

ndc_chunk	mappings	relation
16714-0892-01	J0878	hcpcs_code
16714-0892-01	INJECTION, DAPTOMYCIN, 1 MG	hcpcs_description
00990-6138-03	A4217	hcpcs_code
00990-6138-03	STERILE WATER/SALINE, 500 ML	hcpcs_description
43598-0650-11	J9340	hcpcs_code
43598-0650-11	INJECTION, THIOTEPA, 15 MG	hcpcs_description

We are excited to announce the release of three new Social Determinants of Health (SDOH) text classification models, specifically tailored to analyze and classify information related to insurance status, insurance coverage, and SDOH insurance type. These cutting-edge models enable healthcare professionals and researchers to better understand the nuanced interplay of insurance factors that influence health outcomes and access to care.

By leveraging these innovative SDOH classification models, stakeholders can gain valuable insights into the insurance landscape and its impact on health disparities, ultimately informing more targeted interventions and policies to improve patient care and well-being.

model name	description	predicted entities
`genericclassifier_sdoh_insurance_status_sbiobert_cased_mli`	Detecting whether the patient has insurance or not	`Insured`, `Uninsured`, `Unknown`
`genericclassifier_sdoh_insurance_coverage_sbiobert_cased_mli`	Detecting insurance coverage	`Good`, `Poor`, `Unknown`
`genericclassifier_sdoh_insurance_type_sbiobert_cased_mli`	Detecting insurance type	`Employer`, `Medicaid`, `Medicare`, `Military`, `Private`, `Other`

Example:

features_asm = FeaturesAssembler()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("features")

 generic_classifier = GenericClassifierModel.pretrained("genericclassifier_sdoh_insurance_type_sbiobert_cased_mli", 'en', 'clinical/models')\
    .setInputCols(["features"])\
    .setOutputCol("prediction")

text_list = [
"""The patient has VA insurance.""", 
"""She is under Medicare insurance""",
"""The patient has good coverage of Private insurance""",
"""Medical File for John Smith, Male, Age 42 Chief Complaint: Patient complains of nausea, vomiting, and shortness of breath...""",
"""Certainly, here is an example case study for a patient with private insurance: Case Study for Emily Chen, Female, Age 38 ...""",
"""Medical File for John Doe, Male, Age 72 Chief Complaint: Patient reports shortness of breath and fatigue. History of Pres..."""]

Result:

text	result
The patient has VA insurance.	Military
She is under Medicare insurance	Medicare
Medical File for John Smith, Male, Age 42 Chief Complaint: Patient complains of nausea, vomiti…	Medicaid
Certainly, here is an example case study for a patient with private insurance: Case Study for …	Private
Medical File for John Doe, Male, Age 72 Chief Complaint: Patient reports shortness of breath…	Medicare

Updated NER Profiling Pretrained Pipelines With New NER Models to Allow Running 100 Clinical NER Models At Once

We are proud to announce the latest updates to our ner_profiling_clinical and ner_profiling_biobert pre-trained pipelines, which now feature the integration of new Named Entity Recognition (NER) models. When executing these pipelines on your text, you can now benefit from the predictions generated by an impressive 100 clinical NER models in ner_profiling_clinical and 22 clinical NER models in ner_profiling_biobert.

These enhancements to our pre-trained pipelines showcase our commitment to providing healthcare professionals and researchers with state-of-the-art tools, enabling more efficient and accurate analysis of clinical text to support data-driven decision-making and improved patient care outcomes.

You can check ner_profiling_clinical and ner_profiling_biobert Models Hub pages for more details and the NER model lists that these pipelines include.

The Implementation Of A Negative Label Feature For Increased Accuracy In The Zero Shot Relation Extraction Model

We are pleased to introduce the addition of a new setNegativeRelationships parameter to the ZeroShotRelationExtractionModel annotator, empowering users to exercise more effective control over the model’s predictions for enhanced accuracy. This innovative parameter generates negative examples of relations and subsequently removes them, resulting in improved precision for positive labels.

This advanced feature demonstrates our ongoing commitment to delivering state-of-the-art solutions for healthcare professionals and researchers, facilitating more accurate analysis of complex relationships within clinical text and ultimately contributing to better patient care and outcomes.

Example:

re_model = sparknlp_jsl.annotator.ZeroShotRelationExtractionModel \
    .pretrained() \
    .setRelationalCategories({
        "CURE": ["{TREATMENT} cures {PROBLEM}."],
        "IMPROVE": ["{TREATMENT} improves {PROBLEM}.", "{TREATMENT} cures {PROBLEM}."],
        "REVEAL": ["{TEST} reveals {PROBLEM}."]})\
    .setMultiLabel(False)\
    .setInputCols(["re_ner_chunks", "sentences"]) \
    .setOutputCol("relations")\
    .setNegativeRelationships(["IMPROVE"])

sample_text = "Paracetamol can alleviate headache or sickness. An MRI test can be used to find cancer."

Without Setting setNegativeRelationships:

relation	chunk1	entity1	chunk2	entity2	hypothesis	confidence
REVEAL	An MRI test	TEST	cancer	PROBLEM	An MRI test reveals cancer.	0.9760039
IMPROVE	Paracetamol	TREATMENT	sickness	PROBLEM	Paracetamol improves sickness.	0.98819494
IMPROVE	Paracetamol	TREATMENT	headache	PROBLEM	Paracetamol improves headache.	0.9929625

After Setting setNegativeRelationships:

relation	chunk1	entity1	chunk2	entity2	hypothesis	confidence
REVEAL	An MRI test	TEST	cancer	PROBLEM	An MRI test reveals cancer.	0.9760039

Allowing users To Select Specific Entity Tags In NameChunkObfuscator

We are excited to introduce the new setNameEntities parameter for the NameChunkObfuscator annotator, enabling users to specify the labels they wish to obfuscate using an array list. The default value is set to [“NAME”], offering greater flexibility and customization when working with sensitive information.

This enhancement to the NameChunkObfuscator reflects our dedication to providing user-centric tools that cater to the diverse needs of healthcare professionals and researchers, ensuring the protection of sensitive data while maintaining the utility of the information for analysis and decision-making.

Example:

nameChunkObfuscator = NameChunkObfuscatorApproach()\
  .setInputCols("ner_chunk")\
  .setOutputCol("replacement")\
  .setNameEntities(["PATIENT","DOCTOR","NAME"])

sample_text = '''John Davies Hendrickson is a 62 y.o. patient admitted. 
Dr. Lorand was scheduled for emergency assessment. 
John Davies Hendrickson is a teacher and Dr. Lorand is a Doctor.
Olivera is 25 years-old.
Dr. Roland offered his patient Olivera a healthy diet.  
John Davies Hendrickson Lorand has biggest name'''

As can be seen in the table below, DOCTOR and PATIENT chunks are consistently replaced with the same obfuscation chunks.

ner_chunk	label	replacement
John Davies Hendrickson	PATIENT	Aesculapius Amalasuntha
Lorand	DOCTOR	Fulvia
John Davies Hendrickson	PATIENT	Aesculapius Amalasuntha
Lorand	DOCTOR	Fulvia
Olivera	PATIENT	Killian
Roland	DOCTOR	Rudolf
Olivera	DOCTOR	Killian
John Davies Hendrickson Lorand	PATIENT	Deipnosophistae Hermaphroditus

	Sentence	Obfuscated
0	John Davies Hendrickson is a 62 y.o. patient admitted.	Aesculapius Amalasuntha is a 62 y.o. patient admitted.
1	Dr. Lorand was scheduled for emergency assessment.	Dr. Fulvia was scheduled for emergency assessment.
2	John Davies Hendrickson is a teacher and Dr. Lorand is a Doctor.	Aesculapius Amalasuntha is a teacher and Dr. Fulvia is a Doctor.
3	Olivera is 25 years-old.	Killian is 25 years-old.
4	Dr. Roland offered his patient Olivera a healthy diet.	Dr. Rudolf offered his patient Killian a healthy diet.
5	John Davies Hendrickson Lorand has biggest name	Deipnosophistae Hermaphroditus has biggest name

Multi-mode Deidentification And Obfuscation Support In A Single Pass With The Streamlined Deid Module

We are proud to announce the enhancement of our Deid module with the introduction of a one-pass, multi-mode deidentification feature. This powerful new capability significantly improves the module’s functionality, enabling users to deidentify their data with increased efficiency, accuracy, and flexibility.

To utilize this feature for a single column, simply set the multi_mode_file_path parameter with the JSON file path describing the desired multi-mode configuration. This streamlined approach demonstrates our commitment to providing state-of-the-art tools that cater to the evolving needs of healthcare professionals and researchers, ensuring the protection of sensitive information while maintaining data utility for analysis and decision-making.

Example:

#json to choose deid mode
sample_json= {
	"obfuscate": ["NAME", "PHONE"] ,
	"mask_entity_labels": ["AGE"],
	"skip": ["SSN"],
	"mask_same_length_chars":["DATE"],
	"mask_fixed_length_chars":["ZIP", "LOCATION"]
}

import json
with open('sample_multi-mode.json', 'w', encoding='utf-8') as f:
    json.dump(sample_json, f, ensure_ascii=False, indent=4)

#Deidentification with multi mode for one column
deid_implementor= Deid(spark,
                       input_file_path="deid_data.csv",
                       output_file_path="deidentified.csv",
                       custom_pipeline=model,
                       multi_mode_file_path="sample_multi-mode.json")
                       

For multiple columns, we can set one specific JSON file path multi mode for each column.

Example:

#json to choose deid mode for the 2nd column
sample_json_column2= {
	"obfuscate": ["SSN", "AGE"] ,
	"mask_entity_labels": ["DATE"],
	"skip": ["ID"],
	"mask_same_length_chars":["NAME"],
	"mask_fixed_length_chars":["ZIP", "LOCATION"]
}

import json
with open('sample_multi-mode_column2.json', 'w', encoding='utf-8') as f:
    json.dump(sample_json_column2, f, ensure_ascii=False, indent=4)

#Deidentification with multi mode for multiple columns
deid_implementor= Deid(spark,
                       input_file_path="deid_multiple_data.csv",
                       output_file_path="deidentified.csv",
                       custom_pipeline=model,
                       fields={"text": "sample_multi-mode.json", "text_1":"sample_multi-mode_column2.json"}, masking_policy="fixed_length_chars", 
                       fixed_mask_length=2, separator=",")
                       

For more detail please check Clinical Deidentification Utility Module

Core Improvements and Bug Fixes

Aligning ChunkMapperModel output metadata with SentenceEntityResolverModel metadata for seamless compatibility.
Resolving issues with the MedicalNerApproach setTagsMapping parameter.
Removing “non-sense” UNK tokens from text generators (e.g. BioGPt) for enhanced output quality

New and Updated Notebooks

New Comparison Medical Text Summarization Notebook for summarization of clinical context can be used with new MedicalSummarizer annotator.
New Biogpt Chat JSL Notebook for test generation of clinical context can be used with new MedicalTextGenerator annotator.
New Text Classification with Contextual Window Splitting for text classification with contextual window splitting can be used with the new WindowedSentenceModel annotator.
New Review Functions of ALab Module Notebook for ALAB module review functions.
Updated Clinical Deidentification Utility Module with the latest improvement.

New and Updated Demos

Medical Large Language Modeling demo
Medical Summarization Radiology demo
BIOGPT CHAT JSL demo
MODELS demo
- With MODELS demos, you can select all healthcare models as Task and Annotator based and you can see information about the models.

17 New Clinical Models and Pipelines Added & Updated in Total

biogpt_chat_jsl_conversational
summarizer_radiology
ner_profiling_biobert
ner_profiling_clinical
ner_vop_anatomy_wip
ner_vop_clinical_dept_wip
ner_vop_demographic_wip
ner_vop_problem_reduced_wip
ner_vop_problem_wip
ner_vop_temporal_wip
ner_vop_test_wip
ner_vop_treatment_wip
ndc_hcpcs_mapper
hcpcs_ndc_mapper
genericclassifier_sdoh_insurance_status_sbiobert_cased_mli
genericclassifier_sdoh_insurance_coverage_sbiobert_cased_mli
genericclassifier_sdoh_insurance_type_sbiobert_cased_mli

For all Spark NLP for Healthcare models, please check: Models Hub Page

Versions

Version
Version
Version

PREVIOUSVersion Compatibility

4.4.1

Highlights

Introducing A New biogpt-chat-jsl Model (LLM), Fine-Tuned For Clinical Conversations In Healthcare Settings.

A Specialized Medical Summarizer Model (LLM) Designed Specifically For Radiology Report Analysis.

Eight New Voice of Patient (VOP) Named Entity Recognition (NER) Models For Detecting Clinical Terms Expressed In Patients’ Own Words.

Advanced Chunk Mapper Models For Precise Mapping Of NDC And HCPCS Codes:

Innovative Social Determinants Of Health (SDOH) Text Classification Models.

Updated NER Profiling Pretrained Pipelines With New NER Models to Allow Running 100 Clinical NER Models At Once

The Implementation Of A Negative Label Feature For Increased Accuracy In The Zero Shot Relation Extraction Model

Allowing users To Select Specific Entity Tags In NameChunkObfuscator

Multi-mode Deidentification And Obfuscation Support In A Single Pass With The Streamlined Deid Module

Core Improvements and Bug Fixes

New and Updated Notebooks

New and Updated Demos

17 New Clinical Models and Pipelines Added & Updated in Total

Versions