Healthcare NLP v5.3.0 Release Notes

5.3.0

Highlights

We are delighted to announce remarkable enhancements and updates in our latest release of Spark NLP for Healthcare. This release comes with 3 new rule-based entity recognition/matcher modules to extract named entities with rules, a Flattener module to flatten pipeline outputs effortlessly, and 41 new clinical pretrained models and pipelines, including new De-Identification pipelines in various sizes.

4 new Opioid NER models to extract opioid-related entities from 22 classes and 3 assertion models to detect the status of opioid drug usage and underlying symptoms.
Multi-Lingual NER model for Deidentification to detect sensitive entities (name, date, location etc.) from multiple languages.
New Age classification model to detect age groups from clinical texts without any mention of age.
Biomarker text classification model to detect sentences/ phrases that may contain biomarker-related terms.
New NER model for SNOMED term extraction regardless of its type.
6 new ChunkMapper models for medical code mapping between various medical terminologies.
Curated pretrained pipelines to analyze clinical documents for specific clinical tasks and concepts at once.
Enhanced data exploration with the new Flattener annotator to prettify the pipeline outputs in a tabulated format.
Rule-based entity recognition/matcher modules (TextMatcher, RegexMatcher and EntityRuler) to extract named entities with rules and dictionaries
Deidentification now supports masking and obfuscation at the same time without an additional stage
ChunkMerger now supports dictionary format for the selective merging
MedicalQuestionAnswering returns Score in metadata
New speed benchmarks for various pipelines across different platforms under various settings (EMR, Databricks, etc.)
Various core improvements: bug fixes and enhanced overall robustness and reliability of Spark NLP for Healthcare
- Consistent obfuscation is supported in StructuredDeidentification too
- Added deid_source field to the metadata to infer the source of entity chunks coming from internal or external NER models and stages
- Refactoring the Deidentification module for improved functionality
- Flushing the temporary files dumped by the SentenceEntityResolver
- Fixed IOBTagger returning zero instead of ‘O’
- Lighter jars for Spark NLP, leading to optimized Spark sessions
- Resolution2Chunk documentation updated
- Updated the default value of the customBoundsStrategy parameter in the InternalDocumentSplitter
- Enhanced InternalDocumentSplitter with UUID Metadata Field
Updated notebooks and demonstrations for making Spark NLP for Healthcare easier to navigate and understand
- New Flattener Notebook
- New Rule Based Entity Matchers Notebook
- New Opioid Notebook
- Updated Clinical Deidentification Improvement Notebook
- New OPIOID Demo
- New Biomarker Classification Demo
- New SNOMED Term NER Demo
- New Multi Language NER Demo
- Updated Age Classification Demo
The addition and update of numerous new clinical models and pipelines continue to reinforce our offering in the healthcare domain

These enhancements will elevate your experience with Spark NLP for Healthcare, enabling more efficient, accurate, and streamlined analysis of healthcare-related natural language data.

4 New `Opioid` NER Models to Extract Opioid-Related Entities from 22 Classes and 3 Assertion Models to Detect the Status of Opioid Drug Usage and Underlying Symptoms

NER Model

Model Name	Predicted Entities	Description
ner_opioid	`communicable_disease`, `general_symptoms`, `substance_use_disorder`, `drug_duration`, `psychiatric_issue`, `drug_strength`, `drug_quantity`, `other_drug`, `drug_form`, `drug_frequency`, `opioid_drug`, `drug_route`, `employment`, `violence`, `legal_issue`, `other_disease`, `alcohol_use`, `test`, `marital_status`, `test_result`, `antidote`, `sexual_orientation`	Detects opioid-related entities within text data

This ner_opioid model is designed to detect and label opioid-related entities within text data. Opioids are a class of drugs that include the illegal drug heroin, synthetic opioids such as fentanyl, and pain relievers available legally by prescription. The model has been trained using advanced deep-learning techniques on a diverse range of text sources and can accurately recognize and classify a wide range of opioid-related entities. The model’s accuracy and precision have been carefully validated against expert-labeled data to ensure reliable and consistent results.

Example:

ner_model = MedicalNerModel.pretrained("ner_opioid", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")


sample_texts = ["""The patient, unmarried and with a significant history of substance abuse involving the illicit consumption of various opioids such as heroin, fentanyl, and oxycodone, presented with a headache and was diagnosed PTSD. Despite denying the use of alcohol, smoking, or marijuana, the patient, who has been unemployed for several months, required administration of Narcan for suspected opioid overdose. A recent toxicology screen confirmed the presence of opioids, and showed negative results for benzodiazepines, cocaine, amphetamines, barbiturates, and tricyclic substances."""]

Result:

chunk	begin	end	ner_label
unmarried	13	21	marital_status
substance abuse	57	71	substance_use_disorder
opioids	118	124	opioid_drug
heroin	134	139	opioid_drug
fentanyl	142	149	opioid_drug
oxycodone	156	164	opioid_drug
headache	184	191	general_symptoms
PTSD	211	214	psychiatric_issue
alcohol	244	250	alcohol_use
marijuana	265	273	other_drug
unemployed	302	311	employment
Narcan	360	365	antidote
opioid	381	386	opioid_drug
overdose	388	395	other_disease
toxicology screen	407	423	test
opioids	451	457	test
negative	471	478	test_result
benzodiazepines	492	506	test
cocaine	509	515	test
amphetamines	518	529	test
barbiturates	532	543	test
tricyclic substances	550	569	test

Assertion Models

Model Name	Assertion Status	Description
assertion_opioid_wip	`present`, `history`, `absent`, `hypothetical`, `past`, `family_or_someoneelse`	Detects the assertion status of entities related to opioid
assertion_opioid_drug_status_wip	`opioid_medical_use`, `opioid_abuse`, `opioid_overdose`, `drug_medical_use`, `drug_abuse`, `drug_overdose`	Detects the assertion status of drug entities related to opioid (including opioid_drug and other_drug)
assertion_opioid_general_symptoms_status_wip	`underlying_pain`, `withdrawal_symptom`, `overdose_symptom`	Detects the assertion status of general symptoms entity related to opioid.

Example:

assertion = AssertionDLModel.pretrained("assertion_opioid_wip", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")

sample_texts = [
    """The patient with a history of substance abuse presented with clinical signs indicative of opioid overdose, including constricted pupils, cyanotic lips, drowsiness, and confusion. Immediate assessment and intervention were initiated to address the patient's symptoms and stabilize their condition. Close monitoring for potential complications, such as respiratory depression, was maintained throughout the course of treatment.""",
    """The patient presented to the rehabilitation facility with a documented history of opioid abuse, primarily stemming from misuse of prescription percocet pills intended for their partner's use. Initial assessment revealed withdrawal symptoms consistent with opioid dependency."""]

Result:

chunk	begin	end	ner_label	assertion	confidence
substance abuse	30	44	substance_use_disorder	history	0.9644
opioid	90	95	opioid_drug	hypothetical	0.7974
overdose	97	104	other_disease	hypothetical	0.9961
constricted pupils	117	134	general_symptoms	past	0.732
cyanotic lips	137	149	general_symptoms	past	0.8501
drowsiness	152	161	general_symptoms	past	0.9469
confusion	168	176	general_symptoms	past	0.9686
respiratory depression	351	372	other_disease	hypothetical	0.5921
opioid	82	87	opioid_drug	history	0.735
percocet	143	150	opioid_drug	present	0.905
pills	152	156	drug_form	present	0.9363
withdrawal	220	229	general_symptoms	present	0.9929
opioid	256	261	opioid_drug	present	0.9348

Please check Opioid Notebook for more information

`Multi-Lingual` NER Model for `Deidentification` to Detect Sensitive Entities (`name`, `date`, `location` etc.) from Multiple Languages.

Introducing our latest innovation: a Multilingual Named Entity Recognition (NER) model designed for deidentification purposes. This new model is capable of annotating text in English, German, French, Italian, Spanish, Portuguese, and Romanian. It excels at detecting sensitive entities such as AGE, CONTACT, DATE, ID, LOCATION, NAME, and PROFESSION. Using this model, data protection can be achieved in multiple languages and domains.

Example:

embeddings = XlmRoBertaEmbeddings.pretrained("xlm_roberta_base", "xx") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings")\
    .setMaxSentenceLength(512)

ner = MedicalNerModel.pretrained("ner_deid_multilingual", "xx", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

text_list = [
"""Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 .""",
"""J'ai vu en consultation Michel Martinez (49 ans) adressé au Centre Hospitalier De Plaisir pour un diabète mal contrôlé avec des symptômes datant de Mars 2015.""",
"""Michael Berger wird ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.""",
"""Ho visto Gastone Montanariello (49 anni) riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015.""",
"""Antonio Miguel Martínez, un varón de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España.""",
"""Detalhes do paciente. Nome do paciente:  Pedro Gonçalves. Endereço: Rua Das Flores 23. Cidade/ Província: Porto.""",
"""Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România"""
]

Result:

doc_id	begin	end	chunk	ner_label
1	14	23	2093-01-13	DATE
1	26	35	David Hale	NAME
1	51	61	Hendrickson	NAME
1	74	80	7194334	ID
1	89	96	01/13/93	DATE
2	24	38	Michel Martinez	NAME
2	41	46	49 ans	AGE
2	60	88	Centre Hospitalier De Plaisir	LOCATION
2	148	156	Mars 2015	DATE
3	0	13	Michael Berger	NAME
3	53	65	Bad Kissingen	LOCATION
3	86	91	Berger	NAME
3	128	129	76	AGE
4	9	29	Gastone Montanariello	NAME
4	32	33	49	AGE
4	55	74	Ospedale San Camillo	LOCATION
4	128	137	marzo 2015	DATE
5	0	22	Antonio Miguel Martínez	NAME
5	37	38	35	AGE
5	67	88	auxiliar de enfermería	PROFESSION
5	102	106	Cadiz	LOCATION
5	109	114	España	LOCATION
6	41	55	Pedro Gonçalves	NAME
6	68	71	Rua Das Flores	NAME
6	106	110	Porto	LOCATION
7	0	27	Spitalul Pentru Ochi de Deal	LOCATION
7	30	44	Drumul Oprea Nr	LOCATION
7	47	49	972	LOCATION
7	51	56	Vaslui	LOCATION
7	59	64	737405	LOCATION
7	66	72	România	LOCATION

Please see the model card ner_deid_multilingual for more information about the model

New `Age` Classification Model to Detect Age Groups from Clinical Texts without any Mention of Age

The new age classification model is a text classification tool tailored to identify and categorize text according to different age groups. It distinguishes among Old Adult, Adult, Child and Teen, and Other/Unknown contexts, providing valuable insights into textual data that references age-specific scenarios or concerns.

Example:

generic_classifier = GenericClassifierModel.pretrained("genericclassifier_age_e5", "en", "clinical/models")\
    .setInputCols(["features"])\
    .setOutputCol("prediction")

sample_texts = [
"""The patient presents with conditions often associated with the stresses and lifestyle of early career and possibly higher education stages, including sleep irregularities and repetitive stress injuries. There's a notable emphasis on preventative care, with discussions around lifestyle choices that can impact long-term health, such as smoking cessation, regular exercise, and balanced nutrition. The patient is also counseled on mental health, particularly in managing stress and anxiety that may arise from personal and professional responsibilities and ambitions at this stage of life.""",
"""The senior patient presents with age-related issues such as reduced hearing and vision, arthritis, and memory lapses. Emphasis is on managing chronic conditions, maintaining social engagement, and adapting lifestyle to changing physical abilities. Discussions include medication management, dietary adjustments to suit older digestion, and the importance of regular, low-impact exercise.""",
"""The late teenage patient is dealing with final growth spurts, the stress of impending adulthood, and decisions about higher education or career paths. Health discussions include maintaining a balanced diet, the importance of regular sleep patterns, and managing academic and social pressures. Mental health support is considered crucial at this stage, with a focus on building resilience and coping mechanisms.""",
"""The patient, faces adjustments to a new lifestyle with changes in daily routines and social interactions. Health concerns include managing the transition from an active work life to more leisure time, which may impact physical and mental health. Preventative health measures are emphasized, along with the importance of staying mentally and physically active and engaged in the community."""
]

Result:

text	result
The patient presents with conditions often associated with the stresses and lifestyle of early ca…	Adult
The senior patient presents with age-related issues such as reduced hearing and vision, arthritis…	Old Adult
The late teenage patient is dealing with final growth spurts, the stress of impending adulthood, …	Child and Teen
The patient, faces adjustments to a new lifestyle with changes in daily routines and social inter…	Other/Unknown

Please check: genericclassifier_age_e5

`Biomarker` Text Classification Model to Detect Sentences/Phrases that may Contain Biomarker-related Terms

We are thrilled to introduce our latest advancement: a cutting-edge text classification model specifically tailored for biomarkers. This state-of-the-art model is designed to analyze clinical sentences and accurately determine whether they contain terms associated with biomarkers.

1: Contains biomarker-related terms.
0: Doesn’t contain biomarker-related terms.

Example:

sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_biomarker", "en", "clinical/models")\
    .setInputCols(["sentence",'token'])\
    .setOutputCol("prediction")

sample_texts = [
"""In the realm of cancer research, several biomarkers have emerged as crucial indicators of disease progression and treatment response. For instance, the expression levels of HER2/neu, a protein receptor, have been linked to aggressive forms of breast cancer. Additionally, the presence of prostate-specific antigen (PSA) is often monitored to track the progression of prostate cancer. Moreover, in cardiovascular health, high-sensitivity C-reactive protein (hs-CRP) serves as a biomarker for inflammation and potential risk of heart disease. Meanwhile, elevated levels of troponin T are indicative of myocardial damage, commonly observed in acute coronary syndrome. In the field of diabetes management, glycated hemoglobin is a widely used to assess long-term blood sugar control. Its levels reflect the average blood glucose concentration over the past two to three months, offering valuable insights into disease management strategies."""
]

Result:

sentence	prediction
In the realm of cancer research, several biomarkers have emerged as crucial indicators of disease progression and treatment response.	0
For instance, the expression levels of HER2/neu, a protein receptor, have been linked to aggressive forms of breast cancer.	1
Additionally, the presence of prostate-specific antigen (PSA) is often monitored to track the progression of prostate cancer.	1
Moreover, in cardiovascular health, high-sensitivity C-reactive protein (hs-CRP) serves as a biomarker for inflammation and potential risk of heart disease.	1
Meanwhile, elevated levels of troponin T are indicative of myocardial damage, commonly observed in acute coronary syndrome.	0
In the field of diabetes management, glycated hemoglobin is a widely used to assess long-term blood sugar control.	0
Its levels reflect the average blood glucose concentration over the past two to three months, offering valuable insights into disease management strategies.	0

Please check: bert_sequence_classifier_biomarker

New NER Model for SNOMED Term Extraction Regardless of its Type

We are excited to introduce our latest Named Entity Recognition (NER) model, designed specifically to extract SNOMED terms from clinical text. This model identifies SNOMED concepts within medical documents regardless of their type, aiding in comprehensive data analysis and clinical decision-making processes.

Example:

embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols("sentence", "token")\
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_snomed_term", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

text_list = ["""The patient was diagnosed with acute appendicitis and scheduled for immediate surgery.""",
"""Due to experiencing chronic pain, the patient was referred to a fibromyalgia specialist for further evaluation.""",
"""His hypertension is currently managed with a combination of lifestyle modifications and medication.""",
"""The child was brought in with symptoms of acute otitis, including ear pain and fever.""",
"""Laboratory tests indicate the individual has hyperthyroidism, requiring further endocrinological assessment.""",
"""The radiograph showed evidence of a distal radius fracture from a recent fall."""]

Result:

doc_id	begin	end	chunk	ner_label
1	31	48	acute appendicitis	snomed_term
2	20	31	chronic pain	snomed_term
2	63	74	fibromyalgia	snomed_term
3	4	15	hypertension	snomed_term
4	48	53	otitis	snomed_term
4	65	72	ear pain	snomed_term
5	45	59	hyperthyroidism	snomed_term
6	4	13	radiograph	snomed_term
6	43	57	radius fracture	snomed_term

Please check: ner_snomed_term

6 New ChunkMapper Models for Medical Code Mapping Between Various Medical Terminologies

Introducing a suite of new ChunkMapper models designed to streamline medical code mapping tasks. These models include mappings between UMLS, LOINC, CPT, and SNOMED codes, offering a comprehensive solution for interoperability within medical systems.

Model Name	Description
`umls_loinc_mapper`	Maps UMLS codes to corresponding LOINC codes.
`umls_cpt_mapper`	Maps UMLS codes to corresponding CPT codes.
`umls_snomed_mapper`	Maps UMLS codes to corresponding SNOMED codes.
`snomed_umls_mapper`	Maps SNOMED codes to corresponding UMLS codes.
`cpt_umls_mapper`	Maps CPT codes to corresponding UMLS codes.
`loinc_umls_mapper`	Maps LOINC codes to corresponding UMLS codes.

Example:

chunkerMapper = ChunkMapperModel.pretrained("umls_loinc_mapper", "en", "clinical/models")\
    .setInputCols(["umls_code"])\
    .setOutputCol("mappings")\
    .setRels(["loinc_code"])

text = "acebutolol"

Result:

chunk	UMLS	LOINC	relation
acebutolol	C0000946	LP16015-7	loinc_code

Curated Pretrained Pipelines to Analyze Clinical Documents for Specific Clinical Tasks and Concepts at Once

We introduce a suite of advanced, hybrid pretrained pipelines, specifically designed to streamline the process of analyzing clinical documents. These pipelines are built upon multiple state-of-the-art (SOTA) pretrained models, delivering a comprehensive solution for extracting vital information with unprecedented ease.

What sets this release apart is the elimination of complexities typically involved in building and chaining models. Users no longer need to construct complex pipelines from scratch or face the uncertainty of selecting the most effective model combinations. Our new pretrained pipelines simplify these processes, offering a seamless, user-friendly experience.

Pipeline Name	Description
`hcpcs_resolver_pipeline`	This pipeline extracts `PROCEDURE` entities and maps them to their corresponding Healthcare Common Procedure Coding System (HCPCS) codes.
`hgnc_resolver_pipeline`	This pipeline extracts `GENE` entities and maps them to their corresponding HUGO Gene Nomenclature Committee (HGNC) codes.
`icd10cm_generalised_resolver_pipeline`	This pipeline extracts the following entities and maps them to their ICD-10-CM codes. It predicts ICD-10-CM codes up to 3 characters (according to ICD-10-CM code structure the first three characters represent the general type of injury or disease).
`loinc_numeric_resolver_pipeline`	This pipeline extracts `TEST` entities and maps them to their corresponding Logical Observation Identifiers Names and Codes (LOINC) codes. It is trained with the numeric LOINC codes, without the inclusion of LOINC “Document Ontology” codes starting with the letter “L”. It also provides the official resolution of the codes within the brackets.
`snomed_procedures_measurements_resolver_pipeline`	This pipeline extracts `Procedure` and measurement (`Test`) entities and maps them to their corresponding SNOMED codes.
`ncit_resolver_pipeline`	This advanced pipeline extracts oncological entities from clinical texts to map these entities to their corresponding National Cancer Institute Thesaurus (NCIt) codes.
`rxcui_resolver_pipeline`	This advanced pipeline extracts medication entities from clinical texts to map these entities to their corresponding RxNorm Concept Unique Identifier (RxCUI) codes.
`icd10pcs_resolver_pipeline`	This pipeline extracts `Procedure` entities from clinical texts and maps them to their corresponding ICD-10-PCS codes.
`icdo_resolver_pipeline`	This pipeline extracts oncological entities from clinical texts and maps them to their corresponding ICD-O codes.
`loinc_resolver_pipeline`	This pipeline extracts `Test` entities from clinical texts and maps them to their corresponding Logical Observation Identifiers Names and Codes (LOINC) codes.
`clinical_deidentification_obfuscation_medium`	This pipeline can be used to detect the PHI information from medical texts and obfuscate (replace them with fake ones) in the resulting text.
`clinical_deidentification_obfuscation_small`	This pipeline can be used to detect the PHI information from medical texts and obfuscate (replace them with fake ones) in the resulting text.

Enhanced Data Exploration with the New Flattener Annotator to Prettify the Pipeline Outputs in a Tabulated Format

Introducing the latest addition to our annotation toolkit: the Flattener Annotator. This powerful tool facilitates data exploration by returning exploded columns for each specified field containing annotation data. With customizable settings, users can select fields of interest and effortlessly explode their content for deeper analysis.

Example:

ner_converter = NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk") \
    .setWhiteList(["SYMPTOM","VS_FINDING","DISEASE_SYNDROME_DISORDER","ADMISSION_DISCHARGE","PROCEDURE"])
    
clinical_assertion = AssertionDLModel.pretrained("assertion_jsl_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")

# returns exploded columns for each specified field containing annotation data.
flattener = Flattener()\
    .setInputCols("ner_chunk", "assertion") \
    .setExplodeSelectedFields({"ner_chunk": ["result as ner_chunks",
                                             "begin as begins",
                                             "end as ends",
                                             "metadata.entity as entity"],
                               "assertion":["result as assertions",
                                            "metadata.confidence as confidence"]
                               })

text = """
GENERAL: He is an elderly gentleman in no acute distress. He is sitting up in bed eating his breakfast. He is alert and oriented and answering questions appropriately.
HEENT: Sclerae showed mild arcus senilis in the right. Left was clear. Pupils are equally round and reactive to light. Extraocular movements are intact. Oropharynx is clear.
NECK: Supple. Trachea is midline. No jugular venous pressure distention is noted. No adenopathy in the cervical, supraclavicular, or axillary areas.
ABDOMEN: Soft and not tender. There may be some fullness in the left upper quadrant, although I do not appreciate a true spleen with inspiration.
EXTREMITIES: There is some edema, but no cyanosis and clubbing .
"""

Result:

ner_chunks	begins	ends	entity	assertions	confidence
distress	49	56	SYMPTOM	Absent	0.9999
arcus senilis	196	208	DISEASE_SYNDROME_DISORDER	Past	1.0
jugular venous pressure distention	380	413	SYMPTOM	Absent	1.0
adenopathy	428	437	SYMPTOM	Absent	1.0
tender	514	519	SYMPTOM	Absent	1.0
fullness	540	547	SYMPTOM	Possible	1.0
edema	665	669	SYMPTOM	Present	1.0
cyanosis	679	686	VS_FINDING	Absent	1.0
clubbing	692	699	SYMPTOM	Absent	1.0

Please check Flattener Notebook for more information.
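
To make the "field as alias" syntax concrete, here is a minimal plain-Python sketch of the exploding idea. It does not use the Flattener annotator or Spark; the `resolve` and `flatten` helpers and the hand-written annotation dictionaries are hypothetical stand-ins, and the real annotator may handle mismatched column lengths differently.

```python
def resolve(annotation, path):
    """Follow a dotted path such as 'metadata.entity' through nested dicts."""
    value = annotation
    for key in path.split("."):
        value = value[key]
    return value


def flatten(row, selected_fields):
    """Explode one document's annotations into one flat dict per annotation."""
    columns = {}
    for input_col, specs in selected_fields.items():
        for spec in specs:
            path, alias = spec.split(" as ")
            columns[alias] = [resolve(a, path) for a in row[input_col]]
    # zip() pairs values by position, so this assumes every input column
    # holds the same number of annotations (one assertion per chunk here).
    return [dict(zip(columns, values)) for values in zip(*columns.values())]


# Hypothetical, hand-written annotations mirroring two rows of the result above.
row = {
    "ner_chunk": [
        {"result": "distress", "begin": 49, "end": 56, "metadata": {"entity": "SYMPTOM"}},
        {"result": "edema", "begin": 665, "end": 669, "metadata": {"entity": "SYMPTOM"}},
    ],
    "assertion": [
        {"result": "Absent", "metadata": {"confidence": "0.9999"}},
        {"result": "Present", "metadata": {"confidence": "1.0"}},
    ],
}

selected_fields = {
    "ner_chunk": ["result as ner_chunks", "begin as begins",
                  "end as ends", "metadata.entity as entity"],
    "assertion": ["result as assertions", "metadata.confidence as confidence"],
}

flat_rows = flatten(row, selected_fields)
for flat_row in flat_rows:
    print(flat_row)
```

Each printed dictionary corresponds to one row of the tabulated output, with the aliases as column names.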

Rule-based Entity Recognition/Matcher Modules (`TextMatcher`, `RegexMatcher` and `EntityRuler`) to Extract Named Entities with Rules and Dictionaries

Efficient Regex Matching with the new RegexMatcherInternal annotator

The RegexMatcherInternal annotator matches a set of regular expressions against the text and labels each match with the entity paired with that expression. This associates specific text patterns, such as dates, with predetermined entities.

The class allows users to define rules using regular expressions paired with entities, offering flexibility in customization. These rules can either be directly set using the setRules method, with a specified delimiter, or loaded from an external file using the setExternalRules method.

Example:

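import os

# Raw string so the regex backslashes are written to the file unchanged.
rules = r'''
(\d{1,3}\.){3}\d{1,3}~IPADDR
\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{2}|\d{2}/\d{2}/\d{2}~DATE
'''

# Make sure the target directory exists before writing the rules file.
os.makedirs('./rules', exist_ok=True)
with open('./rules/regex_rules.txt', 'w') as f:
    f.write(rules)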

regex_matcher_internal = RegexMatcherInternal()\
    .setInputCols('document')\
    .setStrategy("MATCH_ALL")\
    .setOutputCol("regex_matches")\
    .setExternalRules(path='./rules/regex_rules.txt',
                      delimiter='~')

text = """Name : Hendrickson, Ora, Record date: 2093-01-13, MR #719435.
Dr. John Green, ID: 1231511863, IP 203.120.223.13
He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93
Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no: A334455B.
Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com."""

Result:

regex_result	begin	end	ner_label
2093-01-13	38	47	DATE
203.120.223.13	97	110	IPADDR
01/13/93	188	195	DATE

Please check Rule Based Entity Matchers Notebook for more information.
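
For readers who want to check a rules file before wiring it into a pipeline, the following is a rough plain-Python sketch of the `pattern~LABEL` format using only the standard `re` module. It is not the RegexMatcherInternal implementation; `parse_rules` and `match_all` are hypothetical helper names, and splitting on the last delimiter is a choice made for this sketch only.

```python
import re

RULES = r"""
(\d{1,3}\.){3}\d{1,3}~IPADDR
\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{2}~DATE
"""


def parse_rules(rules_text, delimiter="~"):
    """Turn 'pattern<delimiter>LABEL' lines into (compiled_pattern, label) pairs."""
    parsed = []
    for line in rules_text.strip().splitlines():
        # Sketch-only choice: split on the last delimiter, so the pattern
        # itself may contain the delimiter character.
        pattern, label = line.rsplit(delimiter, 1)
        parsed.append((re.compile(pattern), label))
    return parsed


def match_all(text, rules):
    """Return (chunk, begin, end, label) tuples with inclusive end offsets."""
    hits = []
    for pattern, label in rules:
        for m in pattern.finditer(text):
            hits.append((m.group(0), m.start(), m.end() - 1, label))
    return sorted(hits, key=lambda hit: hit[1])


# First two lines of the example text above.
text = """Name : Hendrickson, Ora, Record date: 2093-01-13, MR #719435.
Dr. John Green, ID: 1231511863, IP 203.120.223.13"""

matches = match_all(text, parse_rules(RULES))
for hit in matches:
    print(hit)
```

On these two lines the sketch reports the same chunks and offsets as the first two rows of the result table above.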

Efficient Text Matching with the new TextMatcherInternal annotator

The TextMatcherInternal annotator provides a robust solution for matching exact phrases against a given document. Users can specify phrases of interest in a source file, where each phrase is paired with its corresponding label or entity, separated by a delimiter. The annotator allows for fine-tuned control over the matching process through various parameters. Users can choose to enable case sensitivity, merge overlapping matched chunks, and customize entity metadata fields. Additionally, options like setting the delimiter and specifying whether the matcher should operate on chunks or tokens offer further flexibility.

Text Matcher Pretrained Models:

model	entities
drug_matcher	`DRUG`
biomarker_matcher	`Biomarker`

Example:

matcher_drug = """
Aspirin 100mg#Drug
aspirin#Drug
paracetamol#Drug
amoxicillin#Drug
ibuprofen#Drug
lansoprazole#Drug
"""

with open('matcher_drug.csv', 'w') as f:
    f.write(matcher_drug)

entityExtractor = TextMatcherInternal()\
    .setInputCols(["document", "token"])\
    .setEntities("matcher_drug.csv")\
    .setOutputCol("matched_text")\
    .setCaseSensitive(False)\
    .setDelimiter("#")\
    .setMergeOverlapping(False)

text = """John's doctor prescribed aspirin 100mg for his heart condition, along with paracetamol for his fever, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD."""

Result:

chunk	begin	end	label
aspirin	25	31	Drug
aspirin 100mg	25	37	Drug
paracetamol	75	85	Drug
amoxicillin	102	112	Drug
ibuprofen	134	142	Drug
lansoprazole	170	181	Drug

Please check Rule Based Entity Matchers Notebook for more information.
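
The result above contains both `aspirin` and `aspirin 100mg` because overlapping matches are not merged in that configuration. The plain-Python sketch below illustrates that behaviour with the standard `re` module. It is not the TextMatcherInternal implementation: `parse_entries`, `match_phrases` and `merge_overlapping` are hypothetical helpers, and keeping only the longest hit is just one possible merging strategy, assumed here for illustration.

```python
import re

MATCHER_DRUG = """
Aspirin 100mg#Drug
aspirin#Drug
paracetamol#Drug
"""


def parse_entries(source, delimiter="#"):
    """Read 'phrase<delimiter>label' lines into (phrase, label) pairs."""
    return [tuple(line.rsplit(delimiter, 1)) for line in source.strip().splitlines()]


def match_phrases(text, entries, case_sensitive=False):
    """Find whole-word phrase matches; end offsets are inclusive."""
    flags = 0 if case_sensitive else re.IGNORECASE
    hits = set()
    for phrase, label in entries:
        pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", flags)
        for m in pattern.finditer(text):
            hits.add((m.group(0), m.start(), m.end() - 1, label))
    return sorted(hits, key=lambda hit: (hit[1], hit[2]))


def merge_overlapping(hits):
    """Assumed strategy: keep only the longest hit among overlapping ones."""
    merged = []
    for hit in hits:  # hits are sorted by begin, then end
        if merged and hit[1] <= merged[-1][2]:
            if hit[2] - hit[1] > merged[-1][2] - merged[-1][1]:
                merged[-1] = hit
        else:
            merged.append(hit)
    return merged


text = ("John's doctor prescribed aspirin 100mg for his heart condition, "
        "along with paracetamol for his fever.")

hits = match_phrases(text, parse_entries(MATCHER_DRUG))
print(hits)                     # overlapping hits kept separately
print(merge_overlapping(hits))  # only the longest overlapping hit kept
```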

Efficient Text Matching with the new EntityRulerInternal annotator

The EntityRulerInternal is a versatile annotator designed to match exact strings or regex patterns against a given document, assigning them named entities as specified. This powerful tool allows users to define custom rules in a file, accommodating any number of named entities. By leveraging this annotator, users can efficiently identify and classify specific text patterns within documents, enhancing the accuracy and efficiency of named entity recognition tasks.

Example:

import json

data = [
    {
        "id": "drug-words",
        "label": "Drug",
        "patterns": ["paracetamol", "aspirin", "ibuprofen", "lansoprazol"]
    },
    {
        "id": "disease-words",
        "label": "Disease",
        "patterns": ["heart condition","tonsilitis","GORD"]
    },
    {
        "id": "symptom-words",
        "label": "Symptom",
        "patterns": ["fever","headache"]
    },
]

with open("entities.json", "w") as f:
    json.dump(data, f)

entityRuler = EntityRulerInternalApproach()\
    .setInputCols(["document", "token"])\
    .setOutputCol("entities")\
    .setPatternsResource("entities.json")\
    .setCaseSensitive(False)

text = """John's doctor prescribed aspirin for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01."""

Result:

chunk	begin	end	label
aspirin	25	31	Drug
heart condition	41	55	Disease
paracetamol	69	79	Drug
fever	89	93	Symptom
headache	99	106	Symptom
tonsilitis	129	138	Disease
ibuprofen	141	149	Drug
lansoprazol	177	187	Drug
GORD	198	201	Disease

Please check Rule Based Entity Matchers Notebook for more information.

Deidentification now Supports Masking and Obfuscation at the Same Time without an Additional Stage

The setMetadataMaskingPolicy parameter lets Deidentification return a masked version of the text in the metadata while the main output is obfuscated, so both results come out of a single stage. Three masking policies are available: entity_labels, same_length_chars, and fixed_length_chars.

Example:

deidentification = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("deid") \
    .setMode("obfuscate") \
    .setObfuscateDate(True) \
    .setObfuscateRefSource('faker') \
    .setMetadataMaskingPolicy("entity_labels") # Options : 'entity_labels', 'same_length_chars', 'fixed_length_chars'

text = """Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 ."""

Result:

	sentence	deidentified	masked
0	Record date : 2093-01-13 , David Hale , M.D .	Record date : 2093-02-12 , Docia Chuck , M.D .	Record date : <DATE> , <DOCTOR> , M.D .
1	, Name : Hendrickson , Ora MR # 7194334 Date : 01/13/93 .	, Name : Marisue Humble MR # 7185162 Date : 02/12/93 .	, Name : <PATIENT> MR # <MEDICALRECORD> Date : <DATE> .
2	Patient : Oliveira, 25 years-old , Record date : 2079-11-09 .	Patient : Consuella Lose, 35 years-old , Record date : 2079-12-09 .	Patient : <PATIENT>, <AGE> years-old , Record date : <DATE> .
3	Cocke County Baptist Hospital .	SHELBY REGIONAL MEDICAL CENTER .	<HOSPITAL> .
4	0295 Keats Street	600 North Sioux Point Road	<STREET>
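
To make the three policy names concrete, here is a minimal plain-Python sketch of what each one could do to a sentence. It is not the DeIdentification implementation: the `mask_text` helper is hypothetical, the chunk list is written by hand, and the asterisk output for same_length_chars and fixed_length_chars is an assumption, since only the entity_labels output appears in the result above.

```python
def mask_text(text, chunks, policy="entity_labels", fixed_length=4):
    """Mask every (chunk_text, label) pair in text according to policy."""
    for chunk, label in chunks:
        if policy == "entity_labels":
            replacement = f"<{label}>"          # e.g. <DATE>
        elif policy == "same_length_chars":
            replacement = "*" * len(chunk)      # keeps the original length
        elif policy == "fixed_length_chars":
            replacement = "*" * fixed_length    # hides the original length
        else:
            raise ValueError(f"unknown masking policy: {policy}")
        # Simplification: replaces every occurrence of the chunk text.
        text = text.replace(chunk, replacement)
    return text

sentence = "Record date : 2093-01-13 , David Hale , M.D ."
chunks = [("2093-01-13", "DATE"), ("David Hale", "DOCTOR")]

for policy in ("entity_labels", "same_length_chars", "fixed_length_chars"):
    print(policy, "->", mask_text(sentence, chunks, policy))
```

With entity_labels the sketch reproduces the masked column of the first result row.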

Please check Clinical DeIdentification Notebook for more information.

`ChunkMerger` now Supports Dictionary Format for the Selective Merging

ChunkMergeModel and ChunkMergeApproach now accept rules in dictionary format, which makes chunk merging more selective and flexible. ChunkMergeModel provides setReplaceDict to replace entity labels and setFalsePositives to relabel or drop specific chunks. ChunkMergeApproach additionally provides setEntitiesConfidence, which lets users adjust entity confidence levels.

Example:

chunk_merge_model = ChunkMergeModel() \
    .setInputCols("clinical_ner_chunk","deid_ner_chunk") \
    .setOutputCol("merged_chunk") \
    .setReplaceDict({"DOCTOR": "NAME",
                     "PATIENT": "NAME"}) \
    .setFalsePositives([["metformin", "TREATMENT", "DRUG"],
                        ["glipizide","TREATMENT",""]])

text ='''
Jennifer is 58 years old. She was  seen by Dr. John Green and discharged on metformin, glipizide for T2DM and atorvastatin and gemfibrozil for HTG.
'''

Result without rules:

chunk	begin	end	entity	confidence
Jennifer	1	8	PATIENT	0.9993
58	13	14	AGE	1.0
John Green	48	57	DOCTOR	0.7381
metformin	77	85	TREATMENT	0.9999
glipizide	88	96	TREATMENT	1.0
T2DM	102	105	PROBLEM	0.9988
atorvastatin	111	122	TREATMENT	0.9999
gemfibrozil	128	138	TREATMENT	1.0
HTG	144	146	PROBLEM	0.9991

Result with rules:

In the example, “metformin” has been classified as a DRUG entity, while “glipizide” has been removed due to the setFalsePositives rules. Additionally, “Jennifer” and “John Green” chunks have been labeled as NAME according to the setReplaceDict rules.

chunk	begin	end	entity	confidence
Jennifer	1	8	NAME	0.9993
58	13	14	AGE	1.0
John Green	48	57	NAME	0.7381
metformin	77	85	DRUG	0.9999
T2DM	102	105	PROBLEM	0.9988
atorvastatin	111	122	TREATMENT	0.9999
gemfibrozil	128	138	TREATMENT	1.0
HTG	144	146	PROBLEM	0.9991
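
The effect of the two rule sets can be reproduced on the chunk list with a few lines of plain Python. This is a sketch of the behavior described above, not the ChunkMergeModel implementation: the `apply_rules` helper is hypothetical, and applying the false-positive rules before the replace dictionary is an assumption.

```python
# (chunk, entity) pairs copied from the result without rules.
chunks = [
    ("Jennifer", "PATIENT"), ("58", "AGE"), ("John Green", "DOCTOR"),
    ("metformin", "TREATMENT"), ("glipizide", "TREATMENT"), ("T2DM", "PROBLEM"),
    ("atorvastatin", "TREATMENT"), ("gemfibrozil", "TREATMENT"), ("HTG", "PROBLEM"),
]

replace_dict = {"DOCTOR": "NAME", "PATIENT": "NAME"}
false_positives = [["metformin", "TREATMENT", "DRUG"],
                   ["glipizide", "TREATMENT", ""]]

def apply_rules(chunks, replace_dict, false_positives):
    """Relabel or drop chunks: [text, old_label, new_label], empty new_label drops."""
    fp = {(text.lower(), old): new for text, old, new in false_positives}
    result = []
    for text, label in chunks:
        key = (text.lower(), label)
        if key in fp:
            if fp[key] == "":
                continue              # rule with an empty target removes the chunk
            label = fp[key]           # rule with a target relabels the chunk
        # Label-level replacement applies to whatever label remains.
        result.append((text, replace_dict.get(label, label)))
    return result

merged = apply_rules(chunks, replace_dict, false_positives)
for text, label in merged:
    print(text, label, sep="\t")
```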

`MedicalQuestionAnswering` Returns Score in Metadata

MedicalQuestionAnswering now returns a score in the metadata of each answer, which helps users gauge the relevance of the answer.

Example:

med_qa  = MedicalQuestionAnswering().pretrained("clinical_notes_qa_base_onnx", "en", "clinical/models")\
              .setInputCols(["document_question", "document_context"])\
              .setCustomPrompt("Context: {context} \n Question: {question} \n Answer: ")\
              .setOutputCol("answer")

context = '''
Patient with a past medical history of hypertension for 15 years.
(Medical Transcription Sample Report)\nHISTORY OF PRESENT ILLNESS:
The patient is a 74-year-old white woman who has a past medical history of hypertension for 15 years, history of CVA with no residual hemiparesis and uterine cancer with pulmonary metastases, who presented for evaluation of recent worsening of the hypertension. According to the patient, she had stable blood pressure for the past 12-15 years on 10 mg of lisinopril.
'''

question = "What is the primary issue reported by patient?"   

Result:

Question	Answer	metadata
What is the primary issue reported by patient?	The primary issue reported by the patient is hypertension.	{score -> 0.97722054}
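
Because the score arrives in the metadata, it can be used to filter answers downstream. The following is a minimal sketch, assuming the results have already been collected into Python dictionaries and that the score is stored as a string under the `score` key; the `keep_confident` helper and the 0.9 threshold are hypothetical choices for illustration.

```python
# One collected result, copied from the table above.
results = [
    {
        "question": "What is the primary issue reported by patient?",
        "answer": "The primary issue reported by the patient is hypertension.",
        "metadata": {"score": "0.97722054"},
    }
]

def keep_confident(rows, min_score=0.9):
    """Return only the rows whose metadata score is at least min_score."""
    return [row for row in rows if float(row["metadata"]["score"]) >= min_score]

for row in keep_confident(results):
    print(row["answer"], row["metadata"]["score"])
```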

New Speed Benchmarks for Various Pipelines Across Different Platforms Under Various Settings (EMR, Databricks etc.)

Performance Evaluation of AWS EMR Cluster for Clinical Text Analysis

This benchmark measures the performance and scalability of an AWS EMR (Elastic MapReduce) cluster configuration on clinical text analysis tasks.

Dataset: 340 Custom Clinical Texts, approx. 235 tokens per text

Versions:
- EMR Version: EMR 6.15.0
- spark-nlp Version: v5.2.2
- spark-nlp-jsl Version: v5.2.1
- Spark Version: v3.4.1

Instance Type:
- Primary: m4.4xlarge, 16 vCore, 64 GiB memory
- Worker: m4.4xlarge, 16 vCore, 64 GiB memory

ner_pipeline = Pipeline(stages = [
        document_assembler,
        sentence_detector,
        tokenizer,
        word_embeddings,
        ner_jsl,
        ner_jsl_converter])


resolver_pipeline = Pipeline(stages = [
        document_assembler,
        sentence_detector,
        tokenizer,
        word_embeddings,
        ner_jsl,
        ner_jsl_converter,
        chunk2doc,
        sbert_embeddings,
        snomed_resolver]) 

Results Table

partition	NER Timing	NER and Resolver Timing
4	24.7 seconds	1 minute 8.5 seconds
8	23.6 seconds	1 minute 7.4 seconds
16	22.6 seconds	1 minute 6.9 seconds
32	23.2 seconds	1 minute 5.7 seconds
64	22.8 seconds	1 minute 6.7 seconds
128	23.7 seconds	1 minute 7.4 seconds
256	23.9 seconds	1 minute 6.1 seconds
512	23.8 seconds	1 minute 8.4 seconds
1024	25.9 seconds	1 minute 10.2 seconds
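
The timings can be turned into throughput figures with some simple arithmetic. The sketch below is not part of the benchmark code: the `to_seconds` helper is hypothetical, only three rows of the table are included, and the tokens-per-second figure relies on the approximate token count stated above.

```python
import re

def to_seconds(timing):
    """Convert '24.7 seconds', '1 minute 8.5 seconds' or '7 min 46 sec' to seconds."""
    total = 0.0
    # Matching only the unit prefix accepts min/minute/minutes and sec/seconds.
    for value, unit in re.findall(r"(\d+(?:\.\d+)?)\s*(min|sec)", timing):
        total += float(value) * (60 if unit == "min" else 1)
    return total

NUM_TEXTS = 340         # dataset size stated above
TOKENS_PER_TEXT = 235   # approximate, as stated above

# (partition, NER timing, NER and resolver timing) for a few rows of the table.
rows = [
    (4, "24.7 seconds", "1 minute 8.5 seconds"),
    (16, "22.6 seconds", "1 minute 6.9 seconds"),
    (1024, "25.9 seconds", "1 minute 10.2 seconds"),
]

for partition, ner, ner_resolver in rows:
    ner_s = to_seconds(ner)
    both_s = to_seconds(ner_resolver)
    print(f"{partition:>5} partitions: "
          f"{NUM_TEXTS / ner_s:.1f} texts/s NER, "
          f"{NUM_TEXTS * TOKENS_PER_TEXT / ner_s:.0f} tokens/s NER, "
          f"{both_s - ner_s:.1f} s added by the resolver stages")
```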

Performance Evaluation of ONNX and Base Embeddings in Resolver Benchmark

This benchmark compares the performance and efficiency of the resolver component with ONNX and base embeddings on clinical text datasets.

Dataset: 100 Custom Clinical Texts, approx. 595 tokens per text

Versions:
- spark-nlp Version: v5.2.2
- spark-nlp-jsl Version: v5.2.1
- Spark Version: v3.2.1

Instance Type:
- 8 CPU Cores 52GiB RAM (Colab Pro - High RAM)

nlp_pipeline = Pipeline(
    stages = [
        document_assembler,
        sentenceDetectorDL,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter,
  ])

embedding_pipeline = PipelineModel(
    stages = [
        c2doc,
        sbiobert_embeddings # base or onnx version
  ])

resolver_pipeline = PipelineModel(
    stages = [
        rxnorm_resolver
  ])

Results Table

partition	preprocessing	embeddings	resolver	onnx_embeddings	resolver_with_onnx_embeddings
4	25 sec	25 sec	7 min 46 sec	9 sec	8 min 29 sec
8	21 sec	25 sec	5 min 12 sec	9 sec	4 min 53 sec
16	21 sec	25 sec	4 min 41 sec	9 sec	4 min 30 sec
32	20 sec	24 sec	5 min 4 sec	9 sec	4 min 34 sec
64	21 sec	24 sec	4 min 44 sec	9 sec	5 min 2 sec
128	20 sec	25 sec	5 min 4 sec	10 sec	4 min 51 sec
256	22 sec	26 sec	4 min 34 sec	10 sec	5 min 13 sec
512	24 sec	27 sec	4 min 46 sec	12 sec	4 min 22 sec
1024	29 sec	30 sec	4 min 24 sec	14 sec	4 min 29 sec
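
One way to summarize the embedding columns is the ratio of their mean timings. The snippet below is a small worked calculation on the numbers in the table, not part of the benchmark code, and averaging across partition counts is a choice made for this sketch.

```python
from statistics import mean

# Embedding stage timings in seconds, copied from the table above
# (partitions 4 through 1024, in order).
base_embeddings = [25, 25, 25, 24, 24, 25, 26, 27, 30]
onnx_embeddings = [9, 9, 9, 9, 9, 10, 10, 12, 14]

speedup = mean(base_embeddings) / mean(onnx_embeddings)

print(f"mean base embeddings: {mean(base_embeddings):.1f} s")
print(f"mean ONNX embeddings: {mean(onnx_embeddings):.1f} s")
print(f"speedup of the embedding stage: {speedup:.2f}x")
```

This ratio covers the embedding stage only. The resolver columns are measured in minutes and do not show a consistent difference between the two configurations, so the end-to-end gain is smaller.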

Please check Speed Benchmarks for more information.

Various Core Improvements: Bug Fixes, Enhanced Overall Robustness, And Reliability Of Spark NLP For Healthcare

Consistent obfuscation is supported in StructuredDeidentification too
Added deid_source field to the metadata to infer the source of entity chunks coming from internal or external NER models and stages
Refactoring the Deidentification module for improved functionality
Flushing the temporary files dumped by the SentenceEntityResolver
Fixed IOBTagger returning zero instead of ‘O’
Lighter jars for Spark NLP, leading to optimized Spark sessions
Resolution2Chunk documentation updated
Updated the default value of the customBoundsStrategy parameter in the InternalDocumentSplitter
Enhanced InternalDocumentSplitter with UUID Metadata Field

Updated Notebooks And Demonstrations For Making Spark NLP For Healthcare Easier To Navigate And Understand

New Flattener Notebook
New Rule Based Entity Matchers Notebook
New Opioid Notebook
Updated Clinical Deidentification Improvement Notebook for latest models
New Opioid Demo
New Biomarker Classification Demo
New SNOMED Term NER Demo
New Multi Language NER Demo
Updated Age Classification Demo with new genericclassifier_age_e5 model

We Have Added And Updated A Substantial Number Of New Clinical Models And Pipelines, Further Solidifying Our Offering In The Healthcare Domain.

drug_matcher
biomarker_matcher
umls_loinc_mapper
umls_cpt_mapper
umls_snomed_mapper
snomed_umls_mapper
cpt_umls_mapper
loinc_umls_mapper
ner_opioid
ner_snomed_term
ner_deid_multilingual
ner_deid_name_multilingual_clinical
assertion_opioid_wip
assertion_opioid_drug_status_wip
assertion_opioid_general_symptoms_status_wip
genericclassifier_age_e5
bert_sequence_classifier_biomarker
ncit_resolver_pipeline
rxcui_resolver_pipeline
hgnc_resolver_pipeline
hcpcs_resolver_pipeline
snomed_procedures_measurements_resolver_pipeline
icdo_resolver_pipeline
icd10pcs_resolver_pipeline
icd10cm_generalised_resolver_pipeline
loinc_resolver_pipeline
loinc_numeric_resolver_pipeline
medication_resolver_pipeline
medication_resolver_transform_pipeline
clinical_deidentification_obfuscation_medium
clinical_deidentification_obfuscation_small
clinical_deidentification
clinical_deidentification_generic
clinical_deidentification_subentity
sbiobertresolve_snomed_bodyStructure
sbiobertresolve_snomed_drug
sbiobertresolve_snomed_conditions
sbiobertresolve_snomed_auxConcepts
sbiobertresolve_snomed_findings
sbiobertresolve_snomed_findings_aux_concepts
sbiobertresolve_snomed_procedures_measurements

For all Spark NLP for Healthcare models, please check: Models Hub Page

