Healthcare NLP v5.5.2 Release Notes

5.5.2

Highlights

We are delighted to announce remarkable enhancements and updates in our latest release of Healthcare NLP. This release comes with brand-new support for de-identification in relational databases, improved context awareness for chunk embeddings, new customization parameters for flexible output modifications, and 59 new and updated clinical pretrained models and pipelines.

Calculate the embeddings of the neighboring context of a named entity (not just the chunk) with the BertSentenceChunkEmbeddings annotator for improved context awareness
De-identifying sensitive data in relational databases with a few lines of code
Reduce false positives returned by NER models via possible and impossible context using ContextualEntityFilterer. This also refines entity extraction by leveraging regex-based contextual filtering
Enhance named entities with specific keywords by allowing greater control over pattern matching via ContextualEntityRuler
10 New PretrainedZeroShotNER named entity recognition models that are already finetuned on in-house annotations
Introducing clinical document analysis with one-liner pretrained pipelines for specific clinical tasks and concepts
Introducing 2 new named entity recognition models and an assertion model for extracting gene and phenotype features
Introducing 2 new named entity recognition models for extracting mentions of cancer types and biomarkers
Updated human phenotype ontology resolver model
Updated all unified medical language system® (UMLS) models.
New blog posts on various topics
Various core improvements; bug fixes, enhanced overall robustness and reliability of Spark NLP for Healthcare
- New filtering parameters for Assertion annotators: whiteList, blackList, and caseSensitive
- Bugfixes in StructuredDeidentification for improved fake chunk handling and formatting
- Bug fix for save and load functionality in DateNormalizer annotator
- PipelineTracer Improvements: Recursive support for ChunkMerger and AssertionMerger, and bug fix for getReplaceDict issue
- Corrected begin index calculation in exclude mode for ContextualEntityRuler
- Length-Controlled Fake Text Generation in Deidentification for Better Consistency
Updated notebooks and demonstrations for making Spark NLP for Healthcare easier to navigate and understand
- New Oncology Use Cases Notebook
- New Clinical Deidentification for Structured Data Notebook
- Updated CLINICAL TEXT SUMMARIZATION Demo
- Updated DEID PHI TEXT MULTI Demo
- Updated NER GENE PHENOTYPES Demo
The addition and update of numerous new clinical models and pipelines continue to reinforce our offering in the healthcare domain

These enhancements will elevate your experience with Spark NLP for Healthcare, enabling more efficient, accurate, and streamlined analysis of healthcare-related natural language data.

Calculate the Embeddings of the Neighboring Context of a Named Entity (not just the chunk) with the `BertSentenceChunkEmbeddings` Annotator for Improved Context Awareness

The BertSentenceChunkEmbeddings annotator now includes advanced features and expanded support for ONNX models:

strategy: Defines how embeddings are computed, with the following options:
- “sentence_average”: Average of sentence and chunk embeddings.
- “scope_average”: Average of scope (defined by scopeWindow) and chunk embeddings.
- “chunk_only”: Embeddings based solely on chunks.
- “scope_only”: Embeddings based solely on scope (requires scopeWindow).
scopeWindow: Specifies the range of tokens used for scope embeddings, which are defined as two non-negative integers. The first integer indicates tokens before the chunk, and the second indicates tokens after. The default is (0, 0), meaning only chunk embeddings are used.
ONNX Model Support: The annotator now supports models in ONNX format.
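
To make the strategy and scopeWindow parameters more concrete, here is a minimal pure-Python sketch of the idea. It is not the annotator's implementation: the helper names (`scope_tokens`, `combine`), the toy 2-dimensional vectors, the choice to include the chunk tokens in the scope, and the way the chunk weight is applied as a weighted average are all assumptions made for illustration.

```python
def scope_tokens(tokens, chunk_start, chunk_end, scope_window):
    """Return the tokens covered by the scope window.

    The chunk occupies tokens[chunk_start:chunk_end] (end exclusive).
    scope_window is (tokens_before, tokens_after), both non-negative.
    Assumption: the scope includes the chunk tokens themselves.
    """
    before, after = scope_window
    lo = max(0, chunk_start - before)
    hi = min(len(tokens), chunk_end + after)
    return tokens[lo:hi]


def combine(chunk_vec, context_vec, strategy, chunk_weight=0.5):
    """Combine a chunk embedding with a context embedding (hypothetical helper).

    context_vec stands for the sentence embedding ("sentence_average")
    or the scope embedding ("scope_average" / "scope_only").
    """
    if strategy == "chunk_only":
        return list(chunk_vec)
    if strategy == "scope_only":
        return list(context_vec)
    if strategy in ("sentence_average", "scope_average"):
        # Assumed weighting: chunk_weight for the chunk, the rest for the context.
        return [chunk_weight * c + (1.0 - chunk_weight) * s
                for c, s in zip(chunk_vec, context_vec)]
    raise ValueError(f"unknown strategy: {strategy}")


tokens = "has diabetes mellitus with diabetic neuropathy since four years".split()
# "diabetes mellitus" is tokens[1:3]; look 1 token back and 3 tokens ahead.
window = scope_tokens(tokens, 1, 3, (1, 3))
print(window)
print(combine([1.0, 0.0], [0.0, 1.0], "scope_average", chunk_weight=0.5))
```

With a (0, 0) window the scope collapses to the chunk itself, which is consistent with the default described above.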

Example:

chunk_only_embeddings = BertSentenceChunkEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "ner_chunk"])\
    .setOutputCol("sentence_embeddings")\
    .setCaseSensitive(False)\
    .setChunkWeight(0.5)\
    .setStrategy("chunk_only")

scope_average_embeddings = BertSentenceChunkEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "ner_chunk"])\
    .setOutputCol("sentence_embeddings")\
    .setCaseSensitive(False)\
    .setChunkWeight(0.5)\
    .setStrategy("scope_average")\
    .setScopeWindow([5,5])

icd_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_augmented_billable_hcc","en", "clinical/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("icd10cm_code")\
    .setDistanceFunction("EUCLIDEAN")

text = """The patient is a 42-year-old female and has diabetes mellitus with diabetic neuropathy since four years and she was treated by Center Hospital."""

Results:

Parameter	Chunk	ICD-10-CM Code	Resolution
chunk_only	diabetes mellitus	E10.9	diabetes mellitus [type 1 diabetes mellitus without complications]
scope_average	diabetes mellitus	E11.40	nervous system disorder due to diabetes mellitus [type 2 diabetes mellitus with diabetic neuropathy, unspecified]

De-identifying Sensitive Data in Relational Databases with a Few Lines of Code

The RelationalDBDeidentification class provides a robust solution for de-identifying sensitive data in relational databases. It supports a variety of obfuscation techniques and integrates seamlessly with database systems. Key features include:

End-to-End De-Identification:
- deidentify(): Automates the de-identification process by:
  - Fetching tables.
  - Extracting schema information.
  - Detecting sensitive columns.
  - Applying obfuscation and masking techniques.
  - Exporting de-identified data as CSV files.
Database Connectivity:
- connect_to_db(): Establishes a connection to the MySQL database.
- get_all_tables(): Retrieves all table names from the connected database.
Schema and Data Processing:
- get_schema_info(table_name): Extracts schema details, including date columns, primary keys, and foreign keys, for a specified table.
Data Obfuscation:
- obfuscate_dates(df, date_columns): Shifts dates by a specified number of days.
- obfuscate_ages(df, age_columns, use_hipaa): Obfuscates age columns using HIPAA rules or predefined age groups.
- mask_other_sensitive_columns(df, other_columns): Masks sensitive columns by replacing their values with asterisks.

This class provides a complete framework for protecting sensitive information while maintaining data integrity for relational databases.
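
The obfuscation steps listed above can be pictured with a small standalone sketch. This is not the code inside RelationalDBDeidentification; the function names (`shift_date`, `shift_key`, `mask_value`, `obfuscate_age`) are hypothetical, and the behavior (a fixed-width asterisk mask, a random age drawn from the same age group) is an assumption inferred from the example tables in this section.

```python
import random
from datetime import date, timedelta

# Same age groups as in the example configuration of this section.
AGE_GROUPS = {
    "child": (0, 12),
    "teen": (13, 19),
    "adult": (20, 64),
    "senior": (65, 90),
}


def shift_date(iso_date, days_to_shift):
    """Shift an ISO-formatted date string by a fixed number of days."""
    shifted = date.fromisoformat(iso_date) + timedelta(days=days_to_shift)
    return shifted.isoformat()


def shift_key(key, pk_fk_shift_value):
    """Shift a primary/foreign key by a constant so joins still line up."""
    return key + pk_fk_shift_value


def mask_value(_value):
    """Replace a sensitive value with asterisks (fixed width is an assumption)."""
    return "*****"


def obfuscate_age(age, age_groups, rng):
    """Replace an age with a random age from the same group (assumed behavior)."""
    for low, high in age_groups.values():
        if low <= age <= high:
            return rng.randint(low, high)
    # Ages outside every configured group are returned unchanged in this sketch.
    return age


rng = random.Random(0)
print(shift_date("2024-01-15", 10))
print(shift_key(1, 100))
print(mask_value("Dr. Emily Carter"))
print(obfuscate_age(38, AGE_GROUPS, rng))
```

Shifting every key by the same constant keeps primary and foreign keys consistent across tables, which is why patient 1 appears as 101 in both example tables.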

Example:

from sparknlp_jsl.utils.database_deidentification import RelationalDBDeidentification

config = {
    "db_config": {
        "host": "localhost",
        "user": "root",
        "password": "root",
        "database": "healthcare_db"
    },
    "deid_options": {
        "days_to_shift": 10,
        "age_groups": {
            "child": (0, 12),
            "teen": (13, 19),
            "adult": (20, 64),
            "senior": (65, 90)
        },
        "pk_fk_shift_value": 100,
        "use_hipaa": False,
        "output_path": "deidentified_output/"
    },
    "logging": {
        "level": "INFO",
        "file": "deidentification.log"
    }
}

deidentifier = RelationalDBDeidentification(spark, config)
deidentifier.deidentify()

Example for appointments:

appointment_id	patient_id	doctor_name	appointment_date	reason
1	1	Dr. Emily Carter	2024-01-15	Annual Checkup
2	2	Dr. Sarah Johnson	2024-02-10	Flu Symptoms
3	1	Dr. Emily Carter	2024-02-15	Follow-up Visit
4	1	Dr. James Wilson	2024-03-20	Routine Blood Test

Result for appointments (De-identified table):

appointment_id	patient_id	doctor_name	appointment_date	reason
101	101	*****	2024-01-25	Annual Checkup
102	102	*****	2024-02-20	Flu Symptoms
103	101	*****	2024-02-25	Follow-up Visit
104	101	*****	2024-03-30	Routine Blood Test

Example for patients:

patient_id	name	address	ssn	email	dob	age
1	John Doe	123 Main St, Springfield	123-45-6789	john.doe@example.com	1985-04-15	38
2	Jane Smith	456 Elm St, Shelbyville	987-65-4321	jane.smith@example.com	1990-07-20	33

Result for patients (De-identified table):

patient_id	name	address	ssn	email	dob	age
101	*****	*****	*****	*****	1985-04-25	39
102	*****	*****	*****	*****	1990-07-30	62

Please check the 4.8.Clinical_Deidentification_for_Structured_Data Notebook for more information

Reduce False Positives Returned by NER Models via Possible and Impossible Context Using `ContextualEntityFilterer`. This also Refines Entity Extraction by Leveraging Regex-Based Contextual Filtering

The ContextualEntityFilterer now includes two new parameters, possibleRegexContext and impossibleRegexContext, providing advanced filtering options for contextual entity recognition. These parameters offer granular control for refining entity extraction by leveraging regex-based contextual filtering.

possibleRegexContext: The possible regex context to filter the chunks. If the regex is found in the context(chunk), the chunk is kept.
impossibleRegexContext: The impossible regex context to filter the chunks. If the regex is found in the context(chunk), the chunk is removed. Important Note: When defining regex patterns in code, use double escape characters (e.g., \\b instead of \b) to ensure proper handling of special characters.
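
As a rough illustration of these two parameters, the sketch below applies the same kind of check with Python's `re` module. It is not the annotator's code: `keep_chunk` is a hypothetical helper, the way the two parameters interact when both are set is an assumption, and it assumes the pattern behaves the same in Python as in the annotator's regex engine.

```python
import re

# Double-escaped in a regular string, as recommended above.
IMPOSSIBLE_AGE = "\\b(1[2-9]\\d|[2-9]\\d{2,}|\\d{4,})\\b"


def keep_chunk(context_text, possible_regex=None, impossible_regex=None):
    """Decide whether a chunk survives filtering (hypothetical helper).

    context_text is whatever text is inspected for the chunk.
    - impossible_regex found -> the chunk is removed
    - possible_regex given but not found -> the chunk is removed (assumption)
    """
    if impossible_regex and re.search(impossible_regex, context_text):
        return False
    if possible_regex and not re.search(possible_regex, context_text):
        return False
    return True


# The double-escaped string is the same pattern as the raw-string form.
print(IMPOSSIBLE_AGE == r"\b(1[2-9]\d|[2-9]\d{2,}|\d{4,})\b")
print(keep_chunk("366", impossible_regex=IMPOSSIBLE_AGE))
print(keep_chunk("37", impossible_regex=IMPOSSIBLE_AGE))
```

The pattern flags three-digit numbers from 120 upward and any number with four or more digits, so an age such as 366 is treated as impossible while 37 is kept.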

Example:

contextual_entity_filterer = ContextualEntityFilterer() \
	.setInputCols("sentence", "token", "ner_chunks") \
	.setOutputCol("filtered_ner_chunks") \
	.setRules([{
		"entity": "AGE",
		"scopeWindow": [3, 3],                
		"scopeWindowLevel": "token",
		"impossibleRegexContext" : "\\b(1[2-9]\\d|[2-9]\\d{2,}|\\d{4,})\\b"
	}])\
	.setRuleScope("sentence")\
	.setCaseSensitive(False)

text = "California, known for its beautiful beaches,and he is 366 years old. " \
        "The Grand Canyon in Arizona,  where the age is 37, is a stunning natural landmark." \
        "It was founded on September 9, 1850, and Arizona on February 14, 1912."

Result:

# NER Result
|            chunk|begin|end|ner_label|
|-----------------|-----|---|---------|
|       California|    0|  9| LOCATION|
|              366|   54| 56|      AGE| # this is an impossible age
|     Grand Canyon|   73| 84| LOCATION|
|          Arizona|   89| 95| LOCATION|
|               37|  116|117|      AGE|
|September 9, 1850|  169|185|     DATE|
|February 14, 1912|  203|219|     DATE|

# Filtered Result
|            chunk|begin|end|ner_label|
|-----------------|-----|---|---------|
|       California|    0|  9| LOCATION|
|     Grand Canyon|   73| 84| LOCATION|
|          Arizona|   89| 95| LOCATION|
|               37|  116|117|      AGE|
|September 9, 1850|  169|185|     DATE|
|February 14, 1912|  203|219|     DATE|

Please check the ContextualEntityFilterer Notebook for more information

Enhance Named Entities with Specific Keywords by Allowing Greater Control Over Pattern Matching via `ContextualEntityRuler`

The ContextualEntityRuler has been updated with a new parameter, allowTokensInBetween, to enhance matching flexibility and address a bug in exclude mode’s begin indexes:

allowTokensInBetween: When True, tokens are allowed between the prefix/suffix patterns and the entity, enabling extended matches. When False, any token between a pattern and the entity prevents a match. Default: False.
mode: A new “replace_label_only” option has been added to the mode parameter.

This update provides greater control over pattern matching while ensuring robust performance in entity recognition workflows.
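
The effect of allowTokensInBetween can be sketched in plain Python. This is only an illustration of the matching idea, not the ContextualEntityRuler implementation; `extend_with_suffix` and its token-index conventions are hypothetical.

```python
def extend_with_suffix(tokens, entity_end, suffix, window, allow_tokens_in_between):
    """Return the new (exclusive) end index if the suffix pattern matches, else None.

    The entity ends at tokens[entity_end - 1]; up to `window` tokens after it
    are searched. Without allow_tokens_in_between the suffix must start
    immediately after the entity.
    """
    scope = [t.lower() for t in tokens[entity_end:entity_end + window]]
    pattern = suffix.lower().split()
    max_offset = len(scope) - len(pattern)
    if allow_tokens_in_between:
        offsets = range(max_offset + 1)
    else:
        offsets = range(min(1, max_offset + 1))
    for offset in offsets:
        if scope[offset:offset + len(pattern)] == pattern:
            return entity_end + offset + len(pattern)
    return None


tokens = "history of the diabetes mellitus with complications in May".split()
# "diabetes mellitus" is tokens[3:5]; the token "with" sits before the suffix.
strict = extend_with_suffix(tokens, 5, "complications", 3, False)
relaxed = extend_with_suffix(tokens, 5, "complications", 3, True)
print(strict)
print(" ".join(tokens[3:relaxed]))
```

In this sketch the strict setting finds no match because "with" separates the entity from the suffix, while the relaxed setting extends the chunk across it.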

Example:

rules = [
	{
		"entity": "Age",
		"scopeWindow" : [15,15],
		"scopeWindowLevel" : "char",
		"suffixPatterns" : ["years old", "year old", "months"],
		"replaceEntity": "Modified_Age",
		"mode": "exclude"
	},
	{
		"entity": "Diabetes",
		"scopeWindow" : [3,3],
		"scopeWindowLevel"  : "token",
		"suffixPatterns" : ["complications"],
		"replaceEntity": "Modified_Diabetes",
		"mode": "include"
	},
	{
		"entity": "NAME",
		"scopeWindow" : [3,3],
		"scopeWindowLevel" : "token",
		"prefixPatterns" : ["MD", "M.D", "Dr"],
		"replaceEntity": "Doctor_Name",
		"mode": "replace_label_only"
	}   
]

contextual_entity_ruler = ContextualEntityRuler() \
            .setInputCols("sentence", "token", "ner_chunk") \
            .setOutputCol("ruled_ner_chunk") \
            .setRules(rules) \
            .setCaseSensitive(False)\
            .setDropEmptyChunks(True)\
            .setAllowPunctuationInBetween(False)\
            .setAllowTokensInBetween(True)
text = """ Dr. John Snow assessed the 36 years old who has a history of the diabetes mellitus with complications in May, 2006"""

NER Result:

chunk	begin	end	ner_label
John Snow	5	13	NAME
36 years old	28	39	Age
diabetes mellitus	66	82	Diabetes

Replaced Result:

chunk	begin	end	ner_label
Dr. John Snow	1	13	Doctor_Name
36	28	29	Modified_Age
diabetes mellitus with complications	66	101	Modified_Diabetes

10 New `PretrainedZeroShotNER` Named Entity Recognition Models that are Already Finetuned on In-house Annotations

Pretrained-Zero-Shot Named Entity Recognition (NER) enables the identification of entities in text with minimal effort. By leveraging pre-trained language models and contextual understanding, zero-shot NER extends entity recognition capabilities to new domains and languages. While the model card includes default labels as examples, it is important to highlight that users are not limited to these labels. The model is designed to support any set of entity labels, allowing users to adapt it to their specific use cases. For best results, it is recommended to use labels that are conceptually similar to the provided defaults.

Model Name	Description	Predicted Entities
`zeroshot_ner_oncology_biomarker_large`	This model extracts oncology biomarkers entities	`Biomarker`, `Biomarker_Result`
`zeroshot_ner_oncology_biomarker_medium`	This model extracts oncology biomarkers entities	`Biomarker`, `Biomarker_Result`
`zeroshot_ner_deid_generic_multi_large_xx`	This model extracts demographic entities	`AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION`
`zeroshot_ner_deid_generic_multi_medium_XX`	This model extracts demographic entities	`AGE`, `CONTACT`, `DATE`, `ID`, `LOCATION`, `NAME`, `PROFESSION`
`zeroshot_ner_deid_subentity_merged_large`	This model extracts demographic entities	`DOCTOR`, `PATIENT`, `AGE`, `DATE`, `HOSPITAL`, `CITY`, `STREET`, `STATE`, `COUNTRY`, `PHONE`, `IDNUM`, `EMAIL`, `ZIP`, `ORGANIZATION`, `PROFESSION`, `USERNAME`
`zeroshot_ner_jsl_large`	This model extracts general entities	`Admission_Discharge`, `Alcohol`, `Body_Part`, `Disease_Syndrome_Disorder`, `Drug`, `Injury_or_Poisoning`, `Oncological`, `Procedure`, `Section_Header`, `Smoking`, `Symptom`, `Test`, `Test_Result`, `Treatment`, …
`zeroshot_ner_jsl_medium`	This model extracts general entities	`Admission_Discharge`, `Alcohol`, `Body_Part`, `Disease_Syndrome_Disorder`, `Drug`, `Injury_or_Poisoning`, `Oncological`, `Procedure`, `Section_Header`, `Smoking`, `Symptom`, `Test`, `Test_Result`, `Treatment`, …
`zeroshot_ner_ade_clinical_large`	This model extracts general entities	`DRUG`, `ADE`, `PROBLEM`
`zeroshot_ner_sdoh_medium`	This model extracts general entities	`Access_To_Care`, `Alcohol`, `Disability`, `Financial_Status`, `Insurance_Status`, `Legal_Issues`, `Marital_Status`, `Mental_Health`, `Quality_Of_Life`, `Smoking`, `Social_Exclusion`, `Social_Support`, `Violence_Or_Abuse`, …
`zeroshot_ner_sdoh_large`	This model extracts general entities	`Access_To_Care`, `Alcohol`, `Disability`, `Financial_Status`, `Insurance_Status`, `Legal_Issues`, `Marital_Status`, `Mental_Health`, `Quality_Of_Life`, `Smoking`, `Social_Exclusion`, `Social_Support`, `Violence_Or_Abuse`, …

Example:

# You can change the labels
labels = ['Biomarker', 'Biomarker_Result']
pretrained_zero_shot_ner = PretrainedZeroShotNER.pretrained("zeroshot_ner_oncology_biomarker_medium", "en", "clinical/models")\
    .setInputCols("sentence", "token")\
    .setOutputCol("ner")\
    .setPredictionThreshold(0.5)\
    .setLabels(labels)

text = """The results of immunohistochemical examination showed that she tested negative for CK7, synaptophysin (Syn), chromogranin A (CgA),
Muc5AC, human epidermal growth factor receptor-2 (HER2), and Muc6; positive for CK20, Muc1, Muc2, E-cadherin, and p53; the Ki-67 index was about 87% .
"""

Result:

chunk	begin	end	ner_label	confidence
negative	71	78	Biomarker_Result	0.96627086
CK7	84	86	Biomarker	0.98598194
synaptophysin	89	101	Biomarker	0.97052944
Syn	104	106	Biomarker	0.5375477
chromogranin A	110	123	Biomarker	0.95293134
Muc5AC	132	137	Biomarker	0.9601343
human epidermal growth factor receptor-2	140	179	Biomarker	0.95500314
HER2	182	185	Biomarker	0.87689865
Muc6	193	196	Biomarker	0.9785201
positive	199	206	Biomarker_Result	0.99296826
CK20	212	215	Biomarker	0.99122345
Muc1	218	221	Biomarker	0.97516555
Muc2	224	227	Biomarker	0.9656944
E-cadherin	230	239	Biomarker	0.98840755
p53	246	248	Biomarker	0.9895884
Ki-67 index	255	265	Biomarker	0.90272933
87%	277	279	Biomarker_Result	0.84652114
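
The role of the prediction threshold can be pictured with a few rows from the table above. This is a simplified sketch, not the model's internals: `filter_by_threshold` is a hypothetical helper, and treating the threshold as an inclusive lower bound on confidence is an assumption.

```python
# (chunk, label, confidence) rows copied from the result table above.
predictions = [
    ("CK7", "Biomarker", 0.98598194),
    ("Syn", "Biomarker", 0.5375477),
    ("HER2", "Biomarker", 0.87689865),
    ("87%", "Biomarker_Result", 0.84652114),
]


def filter_by_threshold(rows, threshold):
    """Keep only rows whose confidence reaches the threshold (assumed >=)."""
    return [row for row in rows if row[2] >= threshold]


kept_default = filter_by_threshold(predictions, 0.5)
kept_strict = filter_by_threshold(predictions, 0.9)
print([chunk for chunk, _, _ in kept_default])
print([chunk for chunk, _, _ in kept_strict])
```

Under this assumption, a higher threshold trades recall for precision: low-confidence chunks such as "Syn" would no longer be returned.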

Please check the ZeroShot Clinical NER Notebook for more information

Introducing Clinical Document Analysis with One-Liner Pretrained Pipelines for Specific Clinical Tasks and Concepts

We introduce a suite of advanced, hybrid pretrained pipelines, specifically designed to streamline the clinical document analysis process. These pipelines are built upon multiple state-of-the-art (SOTA) pretrained models, delivering a comprehensive solution for quickly extracting vital information.

What sets this release apart is the elimination of complexities typically involved in building and chaining models. Users no longer need to navigate the intricacies of constructing intricate pipelines from scratch or the uncertainty of selecting the most effective model combinations. Our new pretrained pipelines simplify these processes, offering a seamless, user-friendly experience.

Model Name	Description
`clinical_deidentification_zeroshot_large`	This pipeline is designed to extract all clinical/medical entities which may be considered as Deidentification entities from text.
`clinical_deidentification_zeroshot_medium`	This pipeline is designed to extract all clinical/medical entities which may be considered as Deidentification entities from text.
`clinical_deidentification_docwise_large_wip`	This pipeline is designed to extract all clinical/medical entities which may be considered as Deidentification entities from text.
`clinical_deidentification_docwise_medium_wip`	This pipeline is designed to extract all clinical/medical entities which may be considered as Deidentification entities from text.
`clinical_deidentification_light`	This pipeline can be used to deidentify PHI information from medical texts. The PHI information will be masked and obfuscated in the resulting text.
`clinical_deidentification_docwise_benchmark`	This pipeline can be used to deidentify PHI information from medical texts. The PHI information will be masked and obfuscated in the resulting text. This pipeline is prepared for benchmarking with cloud providers.

Example:

from sparknlp.pretrained import PretrainedPipeline

deid_pipeline = PretrainedPipeline("clinical_deidentification_zeroshot_medium", "en", "clinical/models")

text = """Dr. John Lee, from Royal Medical Clinic in Chicago,  attended to the patient on 11/05/2024.
The patient’s medical record number is 56467890.
The patient, Emma Wilson, is 50 years old,  her Contact number: 444-456-7890 .
"""

# Run the pipeline on the sample text
result = deid_pipeline.fullAnnotate(text)

Result:

Masked with entity labels
------------------------------
Dr. <DOCTOR>, from <HOSPITAL> in <CITY>,  attended to the patient on <DATE>.
The patient’s medical record number is <MEDICALRECORD>
patient, <PATIENT>, is <AGE> years old,  her Contact number: <PHONE> .

Obfuscated
------------------------------
Dr. Edwardo Graft, from MCBRIDE ORTHOPEDIC HOSPITAL in CLAMART,  attended to the patient on 14/06/2024.
The patient’s medical record number is 78295621.
The patient, Nathaneil Bakes, is 43 years old,  her Contact number: 308-657-8469 .

Please check the Task Based Clinical Pretrained Pipelines model for more information

Introducing 2 New Named Entity Recognition Models and an Assertion Model for Gene and Phenotype Features

These Named Entity Recognition and Assertion models are specifically trained to extract critical information related to genetics, their phenotypes, and associated information contained within any medical document.

NER Models

Model Name	Description
`ner_genes_phenotypes`	This model is designed to extract clinical/medical entities related to genetics and their phenotypes from text.
`ner_genes_phenotypes_wip`	This model is designed to extract clinical/medical entities related to genetics and their phenotypes from text.

Example:

ner_model = MedicalNerModel.pretrained('ner_genes_phenotypes', "en", "clinical/models")\
    .setInputCols(["sentence", "token","embeddings"])\
    .setOutputCol("ner")

text = """"The CFTR gene, situated on chromosome 7, encodes a chloride channel protein crucial for epithelial salt and water regulation. This gene is associated with cystic fibrosis, demonstrating autosomal recessive inheritance. Mutations like the classic ΔF508 (deletion of phenylalanine at position 508) significantly impair protein folding and cellular transport. The gene shows incomplete penetrance, with variable clinical manifestations ranging from mild respiratory complications to severe multi-organ dysfunction. Diagnostic approaches include genetic testing, sweat chloride analysis, and pulmonary function assessments. Treatment modalities have evolved, incorporating targeted therapies like CFTR modulators that address specific molecular defects. Gene interactions with environmental factors and modifier genes influence disease progression and severity. Prevalence is notably higher in populations of Northern European descent, with approximately 1 in 2,500-3,500 live births affected.
The FMR1 gene, located on the X chromosome, is critical in neurological development and synaptic function. This gene is associated with Fragile X syndrome, exhibiting X-linked dominant inheritance with variable penetrance. Molecular characterization reveals CGG trinucleotide repeat expansions causing potential intellectual disability and neurodevelopmental challenges. Penetrance is complex, with males typically more severely affected than females due to X-chromosome inactivation patterns. Clinical presentations include developmental delays, characteristic facial features, and potential autism spectrum disorder associations. Diagnostic strategies involve molecular genetic testing to quantify CGG repeat expansions. Treatment approaches are multidisciplinary, focusing on educational interventions, behavioral therapies, and management of associated neurological symptoms. Environmental interactions and epigenetic modifications significantly influence phenotypic expressions."""

Result:

chunk	begin	end	ner_label
CFTR gene	5	13	MPG
chromosome 7	28	39	Site
chloride channel protein	52	75	MPG
epithelial salt and water regulation	89	124	Gene_Function
cystic fibrosis	156	170	Phenotype_Disease
autosomal recessive	187	205	Inheritance_Pattern
ΔF508	247	251	Gene
deletion	254	261	Type_Of_Mutation
phenylalanine	266	278	MPG
incomplete penetrance	373	393	Gene_Penetrance
multi-organ dysfunction	488	510	Other_Disease
CFTR	694	697	MPG
Northern European descent	906	930	Prevalence
1 in 2,500-3,500	952	967	Incidence
FMR1 gene	996	1004	MPG
X chromosome	1022	1033	Site
neurological development and synaptic function	1051	1096	Gene_Function
Fragile X syndrome	1128	1145	Phenotype_Disease
X-linked dominant	1159	1175	Inheritance_Pattern
variable penetrance	1194	1212	Gene_Penetrance
CGG	1250	1252	Gene
intellectual disability	1304	1326	Clinical_Presentation
Penetrance is complex	1363	1383	Gene_Penetrance
males	1391	1395	Prevalence
females	1435	1441	Prevalence
X-chromosome	1450	1461	Site
developmental delays	1517	1536	Clinical_Presentation
autism spectrum disorder	1585	1608	Other_Disease
CGG	1692	1694	Gene

Assertion Models

Model Name	Assertion Status	Description
`assertion_genomic_abnormality_wip`	`Normal`, `Affected`, `Variant`	This assertion status detection model is trained to classify entities (Gene and MPG) extracted by the NER model `ner_genes_phenotypes`

Example:

assertion = AssertionDLModel.pretrained("assertion_genomic_abnormality_wip", "en", "clinical/models")\
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")

sample_texts = ["""
The ATP7B gene provides instructions for a copper-transporting ATPase essential for copper homeostasis. Mutations in the ATP7B gene cause Wilson disease, an autosomal recessive disorder of copper metabolism. 
Over 500 mutations have been identified, including missense, nonsense, and splice site mutations. The variant ATP7B protein leads to impaired copper excretion and accumulation in various organs, particularly the liver and brain. 
Clinical presentations of Wilson disease include hepatic dysfunction, neurological symptoms (e.g., tremors, dystonia), and psychiatric disturbances. 
Kayser-Fleischer rings, copper deposits in the cornea, are a characteristic sign. Gene-environment interactions are significant, with dietary copper intake and other environmental factors influencing disease progression. 
Diagnosis involves a combination of clinical symptoms, low serum ceruloplasmin, high urinary copper, and genetic testing. 
Treatment focuses on reducing copper accumulation through chelation therapy with drugs like penicillamine or trientine, and zinc supplementation to block copper absorption. 
Liver transplantation may be necessary in severe cases. The worldwide prevalence of Wilson disease is estimated at 1 in 30,000, with higher rates in certain isolated populations.
"""]

Result:

chunk	begin	end	ner_label	assertion	confidence
ATP7B gene	5	14	MPG	Normal	0.9835
ATPase	64	69	MPG	Normal	0.9979
ATP7B gene	122	131	MPG	Affected	0.9974
ATP7B protein	319	331	MPG	Affected	0.9713
ceruloplasmin	873	885	MPG	Affected	0.9707

Introducing 2 New Named Entity Recognition Models for Extracting Mentions of Cancer Types and Biomarkers

Explore two advanced NER models specifically trained to extract critical oncology-related information from clinical and biomedical texts. The ner_cancer_types model identifies mentions of six primary cancer types as well as other tumors. Meanwhile, the ner_oncology_biomarker_docwise model focuses on extracting biomarkers and biomarker results.

Model Name	Description
`ner_cancer_types`	This model is designed to extract critical information from clinical and biomedical text related to oncology. The model recognizes 6 main cancer types (`CNS Tumor`, `Carcinoma`, `Leukemia`, `Lymphoma`, `Melanoma`, `Sarcoma`) as well as `Other_Tumors`.
`ner_oncology_biomarker_docwise`	This model is designed to extract mentions of biomarkers and biomarker results from oncology texts. During training, a doc-wise method was used.

Example:

ner_model = MedicalNerModel.pretrained('ner_cancer_types', "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

text =  """
    We report a case of CD3 negative, CD20 positive T-cell prolymphocytic leukemia (T-PLL). The leukemic cells were negative for surface CD3, CD2, and CD7 and strongly positive for CD20. 
    T-cell lineage markers such as CD4, CD5, and cytoplasmic CD3 were also positive. A monoclonal rearrangement of the T-cell receptor (TCR) β chain gene was detected. 
    CD3 negative T-PLL has been reported often, but CD20 positive T-PLL has not. We reviewed seven cases of CD20 positive immature and mature T-cell leukemias, including the present case. 
    Three were immature T-cell leukemias (acute lymphoblastic leukemia), and four were mature T-cell leukemias (granular lymphocytic leukemia, small lymphocytic lymphoma/chronic lymphocytic leukemia, 
    adult T-cell leukemia, and the present case). 
"""

Result:

chunk	begin	end	ner_label
CD3	21	23	Biomarker
negative	25	32	Biomarker_Result
CD20	35	38	Biomarker
positive	40	47	Biomarker_Result
T-cell prolymphocytic leukemia	49	78	Leukemia_Type
T-PLL	81	85	Leukemia_Type
negative	113	120	Biomarker_Result
CD3	134	136	Biomarker
CD2	139	141	Biomarker
CD7	148	150	Biomarker
positive	165	172	Biomarker_Result
CD20	178	181	Biomarker
CD4	215	217	Biomarker
CD5	220	222	Biomarker
CD3	241	243	Biomarker
positive	255	262	Biomarker_Result
CD3	348	350	Biomarker
negative	352	359	Biomarker_Result
T-PLL	361	365	Leukemia_Type
CD20	396	399	Biomarker
positive	401	408	Biomarker_Result
T-PLL	410	414	Leukemia_Type
CD20	452	455	Biomarker
positive	457	464	Biomarker_Result
mature T-cell leukemias	479	501	Leukemia_Type
T-cell leukemias	552	567	Leukemia_Type
acute lymphoblastic leukemia	570	597	Leukemia_Type
mature T-cell leukemias	615	637	Leukemia_Type
granular lymphocytic leukemia	640	668	Leukemia_Type
small lymphocytic lymphoma/chronic lymphocytic leukemia	671	725	Leukemia_Type
adult T-cell leukemia	728	748	Leukemia_Type

Updated Human Phenotype Ontology Resolver Model

This model maps phenotypic abnormalities and medical terms associated with hereditary diseases to Human Phenotype Ontology (HPO) codes using sbiobert_base_cased_mli Sentence BERT embeddings. It loads about 6x faster than previous versions. The load process is also more memory-friendly: peak memory during loading is lower, which reduces the chance of out-of-memory (OOM) exceptions and relaxes hardware requirements.

This model returns Human Phenotype Ontology (HPO) codes for phenotypic abnormalities encountered in human diseases. It also returns associated codes from the following vocabularies for each HPO code:
- SNOMEDCT_US
- UMLS (Unified Medical Language System)
- ORPHA (international reference resource for information on rare diseases and orphan drugs)
- EPCC (European Paediatric Cardiac Code, another region-specific or discipline-specific coding system related to healthcare or medical classification)
- Fyler (unique identifier used within a specific coding system or database)

Example:

resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_HPO", "en", "clinical/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("hpo")\
    .setDistanceFunction("EUCLIDEAN")

text =  """She is followed by Dr. X in our office and has a history of severe tricuspid regurgitation. On 05/12/08, preserved left and right ventricular systolic function, aortic sclerosis with apparent mild aortic stenosis. She has previously had a Persantine Myoview nuclear rest-stress test scan completed at ABCD Medical Center in 07/06 that was negative. She has had significant mitral valve regurgitation in the past being moderate, but on the most recent echocardiogram on 05/12/08, that was not felt to be significant. She does have a history of significant hypertension in the past. She has had dizzy spells and denies clearly any true syncope. She has had bradycardia in the past from beta-blocker therapy."""

Result:

chunk	begin	end	ner_label	resolution	description	all_codes
tricuspid regurgitation	67	89	HP	HP:0005180	tricuspid regurgitation	Fyler:1161	SNOMEDCT_US:111287006	UMLS:C0040961:::EPCC:06.01.92	ICD-10:Q22….
aortic stenosis	197	211	HP	HP:0001650	aortic stenosis	Fyler:1411	SNOMEDCT_US:60573004	UMLS:C0003507:::SNOMEDCT_US:204368006	UMLS…
mitral valve regurgitation	373	398	HP	HP:0001653	mitral valve regurgitation	Fyler:1151	SNOMEDCT_US:48724000	UMLS:C0026266	UMLS:C3551535:::EPCC:06.02.9…
hypertension	555	566	HP	HP:0000822	hypertension	SNOMEDCT_US:24184005	SNOMEDCT_US:38341003	UMLS:C0020538	UMLS:C0497247:::-:…
bradycardia	655	665	HP	HP:0001662	bradycardia	SNOMEDCT_US:48867003	UMLS:C0428977:::Fyler:7013	SNOMEDCT_US:49710005	UMLS:…

Please check the sbiobertresolve_HPO model for more information

Updated all Unified Medical Language System® (UMLS) Models.

All resolvers, mappers, and pretrained pipelines related to the Unified Medical Language System® (UMLS) have been updated to the 2024AB release.

Resolver Model:

Model Name	Description
`biolordresolve_umls_general_concepts`	This model maps clinical entities and concepts to 4 UMLS CUI code categories
`sbiobertresolve_umls_major_concepts`	This model maps clinical entities and concepts to 4 major categories of UMLS CUI codes
`sbiobertresolve_umls_clinical_drugs`	This model maps drug entities to UMLS CUI codes.
`sbiobertresolve_umls_disease_syndrome`	This model maps clinical entities to UMLS CUI codes.
`sbiobertresolve_umls_drug_substance`	This model maps clinical entities to UMLS CUI codes.
`sbiobertresolve_umls_findings`	This model maps clinical entities to UMLS CUI codes.
`sbiobertresolve_umls_general_concepts`	This model maps clinical entities to UMLS CUI codes.

Mapper Model:

Model Name	Description
`umls_clinical_drugs_mapper`	This model maps entities (Clinical Drugs) with their corresponding UMLS CUI codes.
`umls_icd10cm_mapper`	This model maps UMLS codes to corresponding ICD10CM codes.
`cpt_umls_mapper`	This model maps CPT codes to corresponding UMLS codes.
`icd10cm_umls_mapper`	This model maps ICD10CM codes to corresponding UMLS codes under the Unified Medical Language System (UMLS).
`umls_cpt_mapper`	This model maps UMLS codes to corresponding CPT codes.
`rxnorm_umls_mapper`	This pretrained model maps RxNorm codes to corresponding UMLS codes.
`snomed_umls_mapper`	This model maps SNOMED codes to corresponding UMLS codes.
`umls_rxnorm_mapper`	This model maps UMLS codes to corresponding RxNorm codes.
`umls_snomed_mapper`	This model maps UMLS codes to corresponding SNOMED codes.
`mesh_umls_mapper`	This model maps MESH codes to corresponding UMLS codes.
`umls_mesh_mapper`	This model maps UMLS codes to corresponding MESH codes.
`umls_disease_syndrome_mapper`	This model maps entities (Disease or Syndrome) with corresponding UMLS CUI codes.
`umls_clinical_findings_mapper`	This model maps clinical entities and concepts to 4 major categories of UMLS CUI codes.
`umls_drug_substance_mapper`	This model maps entities (Drug Substances) with their corresponding UMLS CUI codes.
`umls_major_concepts_mapper`	This model maps entities (Major Clinical Concepts) with corresponding UMLS CUI codes.
`loinc_umls_mapper`	This model maps LOINC codes to corresponding UMLS codes.
`umls_loinc_mapper`	This model maps UMLS codes to corresponding LOINC codes.
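
The code-to-code mappers in the table come in pairs whose names appear to encode the mapping direction as `<source>_<target>_mapper`. As a small sketch under that naming assumption, the lookup below picks a mapper name for a given direction. It uses only the model names listed above. The helper `find_mapper` is hypothetical and not part of the library.

```python
# Code-to-code mapper names copied from the table above.
CODE_MAPPERS = [
    "umls_icd10cm_mapper", "icd10cm_umls_mapper",
    "umls_cpt_mapper", "cpt_umls_mapper",
    "umls_rxnorm_mapper", "rxnorm_umls_mapper",
    "umls_snomed_mapper", "snomed_umls_mapper",
    "umls_mesh_mapper", "mesh_umls_mapper",
    "umls_loinc_mapper", "loinc_umls_mapper",
]


def build_direction_index(names):
    """Index mapper names by (source, target), assuming '<source>_<target>_mapper'."""
    index = {}
    for name in names:
        source, target = name.rsplit("_", 1)[0].split("_")
        index[(source, target)] = name
    return index


def find_mapper(source: str, target: str):
    """Return the listed mapper name for this direction, or None if none is listed."""
    return build_direction_index(CODE_MAPPERS).get((source.lower(), target.lower()))


print(find_mapper("ICD10CM", "UMLS"))  # icd10cm_umls_mapper
print(find_mapper("UMLS", "LOINC"))    # umls_loinc_mapper
print(find_mapper("LOINC", "CPT"))     # None
```

Every listed pair has UMLS on one side, so a direct mapping between two non-UMLS vocabularies is not covered by these models.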

Pretrained Pipeline:

Model Name	Description
`medication_resolver_pipeline`	This pipeline extracts medications and resolves their adverse reactions (ADE), RxNorm, UMLS, NDC, SNOMED CT codes, and action/treatments in clinical text.
`medication_resolver_transform_pipeline`	This pipeline extracts medications and resolves their adverse reactions (ADE), RxNorm, UMLS, NDC, SNOMED CT codes, and action/treatments in clinical text.
`snomed_multi_mapper_pipeline`	This pipeline maps SNOMED codes to their corresponding ICD-10, ICD-O, and UMLS codes.
`umls_clinical_findings_resolver_pipeline`	This pipeline maps entities (Clinical Findings) with their corresponding UMLS CUI codes.
`umls_drug_substance_resolver_pipeline`	This pipeline maps entities (Drug Substances) with their corresponding UMLS CUI codes.
`umls_disease_syndrome_resolver_pipeline`	This pipeline maps entities (Diseases and Syndromes) with their corresponding UMLS CUI codes.
`umls_drug_resolver_pipeline`	This pipeline maps entities (Clinical Drugs) with their corresponding UMLS CUI codes.
`umls_major_concepts_resolver_pipeline`	This pipeline maps entities (Clinical Major Concepts) with their corresponding UMLS CUI codes.

Example:

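from sparknlp.pretrained import PretrainedPipeline

# Load the pretrained medication resolver pipeline
resolver_pipeline = PretrainedPipeline("medication_resolver_pipeline", "en", "clinical/models")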
text = """The patient was prescribed Amlodopine Vallarta 10-320mg, Eviplera. The other patient is given Lescol 40 MG and Everolimus 1.5 mg tablet"""

Result:

chunk	ner_label	ADE	RxNorm	Action	Treatment	UMLS	SNOMED_CT	NDC_Product	NDC_Package
Amlodopine Vallarta 10-320mg	DRUG	Gynaecomastia	722131	NONE	NONE	C1949334	1153435009	00093-7693	00093-7693-56
Eviplera	DRUG	Anxiety	217010	Inhibitory Bone Resorption	Osteoporosis	C0720318	NONE	NONE	NONE
Lescol 40 MG	DRUG	NONE	103919	Hypocholesterolemic	Heterozygous Familial Hypercholesterolemia	C0353573	NONE	00078-0234	00078-0234-05
Everolimus 1.5 mg tablet	DRUG	Acute myocardial infarction	2056895	NONE	NONE	C4723581	1029521000202102	00054-0604	00054-0604-21

Please check the Task_Based_Clinical_Pretrained_Pipelines Notebook for more information
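
In the result table above, `NONE` appears to mark fields for which the pipeline returned no resolution. As a minimal sketch under that assumption, the helper below turns tab-separated result rows into dictionaries and drops the `NONE` placeholders, so downstream code can test for a missing field with a plain key lookup. The function name `parse_result_rows` is hypothetical. The sample rows are copied from the table.

```python
def parse_result_rows(header: str, lines: list, missing: str = "NONE") -> list:
    """Convert tab-separated rows to dicts, omitting fields equal to `missing`."""
    columns = header.split("\t")
    rows = []
    for line in lines:
        values = line.split("\t")
        rows.append({c: v for c, v in zip(columns, values) if v != missing})
    return rows


header = "chunk\tner_label\tADE\tRxNorm\tAction\tTreatment\tUMLS\tSNOMED_CT\tNDC_Product\tNDC_Package"
lines = [
    "Eviplera\tDRUG\tAnxiety\t217010\tInhibitory Bone Resorption\tOsteoporosis\tC0720318\tNONE\tNONE\tNONE",
    "Lescol 40 MG\tDRUG\tNONE\t103919\tHypocholesterolemic\tHeterozygous Familial Hypercholesterolemia\tC0353573\tNONE\t00078-0234\t00078-0234-05",
]

rows = parse_result_rows(header, lines)
for row in rows:
    print(row["chunk"], "->", sorted(row))
```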

New Blog Posts On Various Topics

Dive into our latest blog series exploring cutting-edge advancements in healthcare NLP. Discover how innovative technologies like LangTest are transforming the field by enhancing the robustness of AI models. From ensuring precision and stability in foundation models to leveraging Databricks for robust LLM evaluation, these posts offer valuable insights into creating resilient, reliable, and impactful AI applications in healthcare and beyond.

For Foundation Models, Precision Matters — But Stability Matters More: This blog post discusses the crucial role of robustness in AI models, particularly foundation models, which are essential for applications like healthcare, finance, and autonomous systems. It emphasizes that while accuracy is important, robustness—ensuring models perform well under various conditions and adversarial inputs—is paramount for safe and reliable AI deployment. The article introduces LangTest by John Snow Labs, a tool that helps test and enhance model robustness through simulations of real-world variations like typos and slang. By prioritizing robustness alongside accuracy, the article advocates for a more comprehensive approach to AI model evaluation to ensure they are not only intelligent but also resilient and trustworthy in real-world applications.
Robustness Testing of LLM Models Using LangTest in Databricks: This blog post highlights the significance of evaluating the robustness of large language models (LLMs) like GPT-4 in NLP applications. These models power various tools, from chatbots to advanced data analysis systems, and ensuring their reliability with diverse, unpredictable inputs is critical. LangTest, an open-source evaluation tool, is introduced as a solution for assessing and enhancing the robustness of these models. The post explains how to leverage LangTest within the Databricks environment to evaluate and improve the performance of foundation models effectively.

Various Core Improvements: Bug Fixes, Enhanced Overall Robustness, and Reliability of Spark NLP for Healthcare

New filtering parameters for Assertion annotators: whiteList, blackList, and caseSensitive
Bug fixes in StructuredDeidentification for improved fake chunk handling and formatting
Bug fix for save and load functionality in DateNormalizer annotator
PipelineTracer Improvements: Recursive support for ChunkMerger and AssertionMerger, and bug fix for getReplaceDict issue
Corrected begin index calculation in exclude mode for ContextualEntityRuler
Length-Controlled Fake Text Generation in Deidentification for Better Consistency

Updated Notebooks And Demonstrations For Making Spark NLP For Healthcare Easier To Navigate And Understand

New Oncology Use Cases Notebook
New Clinical Deidentification for Structured Data Notebook
Updated CLINICAL TEXT SUMMARIZATION Demo
Updated DEID PHI TEXT MULTI Demo
Updated NER GENE PHENOTYPES Demo

We have added and updated a substantial number of clinical models and pipelines, further solidifying our offering in the healthcare domain.

zeroshot_ner_oncology_biomarker_large
zeroshot_ner_oncology_biomarker_medium
clinical_deidentification_zeroshot_large
clinical_deidentification_zeroshot_medium
clinical_deidentification_docwise_wip
clinical_deidentification_v2_wip
clinical_deidentification_docwise_large_wip
clinical_deidentification_docwise_medium_wip
zeroshot_ner_deid_generic_multi_large
zeroshot_ner_deid_generic_multi_medium
zeroshot_ner_deid_generic_multi_large -> xx
zeroshot_ner_deid_generic_multi_medium -> xx
biolordresolve_umls_general_concepts
sbiobertresolve_umls_major_concepts
sbiobertresolve_umls_clinical_drugs
sbiobertresolve_umls_disease_syndrome
sbiobertresolve_umls_drug_substance
sbiobertresolve_umls_findings
sbiobertresolve_umls_general_concepts
umls_clinical_drugs_mapper
umls_icd10cm_mapper
cpt_umls_mapper
icd10cm_umls_mapper
umls_cpt_mapper
rxnorm_umls_mapper
snomed_umls_mapper
umls_rxnorm_mapper
umls_snomed_mapper
mesh_umls_mapper
umls_mesh_mapper
umls_disease_syndrome_mapper
umls_clinical_findings_mapper
umls_drug_substance_mapper
umls_major_concepts_mapper
loinc_umls_mapper
umls_loinc_mapper
medication_resolver_pipeline
medication_resolver_transform_pipeline
snomed_multi_mapper_pipeline
umls_clinical_findings_resolver_pipeline
umls_drug_substance_resolver_pipeline
umls_disease_syndrome_resolver_pipeline
umls_drug_resolver_pipeline
umls_major_concepts_resolver_pipeline
zeroshot_ner_jsl_large
zeroshot_ner_jsl_medium
ner_genes_phenotypes_wip
ner_genes_phenotypes
zeroshot_ner_ade_clinical_large
zeroshot_ner_deid_subentity_merged_large
clinical_deidentification_multi_mode_output
clinical_deidentification_light
zeroshot_ner_sdoh_medium
zeroshot_ner_sdoh_large
sbiobertresolve_HPO
ner_oncology_biomarker_docwise
ner_cancer_types
clinical_deidentification_docwise_benchmark
assertion_genomic_abnormality_wip

For all Spark NLP for Healthcare models, please check: Models Hub Page

