Healthcare NLP v5.4.0 Release Notes

5.4.0

Highlights

We are delighted to announce remarkable enhancements and updates in our latest release of Spark NLP for Healthcare. This release comes with a brand new LLM Loader to load and run LLMs of any size in gguf format, a few-shot assertion classifier, contextual assertion detection, and a demo showcasing accuracy differences between Healthcare NLP and GPT-4 for information extraction tasks. It also brings the first Menopause-specific medical models and 81 new and updated clinical pretrained models and pipelines.

Introducing a brand new LLMLoader annotator to load and run large language models in gguf format. We also announce 9 LLMs at various sizes and quantization levels (3x small-size medical summarizer and QA models, 3x medium-size general models, and 3x small-size zero-shot entity extractors)
Introducing a brand new FewshotAssertionClassifier annotator to train assertion detection models using a few samples with better accuracy
Introducing a rule-based ContextualAssertion annotator to detect assertion status using patterns and rules without any training or annotation
Introducing VectorDBPostProcessor annotator to filter and sort the document splits returned by vector databases in a RAG application
Introducing ContextSplitAssembler annotator to assemble the document post-processed splits as a context into an LLM stage in a RAG application
SNOMED entity resolver model for Veterinary domains
Voice of the Patients named entity recognition (NER) model
New rule-based entity matcher models to customize De-IDentification pipelines
New NER, assertion, relation extraction, and classification models to identify Alcohol and Smoking related Medical Entities
New NER and assertion models to extract Menopause related entities
Clinical document analysis with one-liner pretrained pipelines for specific clinical tasks and concepts
Formal release of oncological assertion status detection and relation extraction models
11 new sentence embedding models fine-tuned on medical assertion datasets
Significantly faster vector-db based entity resolution models than existing Sentence Entity Resolver models
RxNorm code mapping benchmarks and cost comparisons: Healthcare NLP, GPT-4, and Amazon Comprehend Medical
New blog posts on using NLP in opioid research and healthcare: harnessing NLP, knowledge graphs, and regex techniques for critical insights
New notebooks for medication and resolution concepts
Updated Udemy MOOC (our online courses) notebooks
Various core improvements: bug fixes and enhanced overall robustness and reliability of Spark NLP for Healthcare
- Resolved broken links in healthcare demos
- Added a unique ID field for each entity into the result of the pipeline_output_parser module
- Fixed deidentification AGE obfuscation hanging issue
- Added DatasetInfo parameter into the MedicalNerModel annotator
Updated notebooks and demonstrations for making Spark NLP for Healthcare easier to navigate and understand
- New Clinical Medication Use Case notebook
- New Resolving Medical Terms to Terminology Codes Directly notebook
- New Contextual Assertion notebook
- New VectorDB and PostProcessor for RAG Generative AI notebook
- New Analyse Veterinary Documents with Healthcare NLP notebook
- Updated FewShot Assertion Classifier notebook
- New ALCOHOL SMOKING Demo
- New JSL vs GPT4 Demo to showcase accuracy differences between Healthcare NLP and GPT-4 for information extraction tasks
The addition and update of numerous new clinical models and pipelines continue to reinforce our offering in the healthcare domain

These enhancements will elevate your experience with Spark NLP for Healthcare, enabling more efficient, accurate, and streamlined analysis of healthcare-related natural language data.

Introducing a Brand New `LLMLoader` Annotator to Load and Run Large Language Models in GGUF format

LLMLoader is designed to interact with LLMs that have been converted into gguf format. This module allows using John Snow Labs’ licensed LLMs at various sizes that are fine-tuned on medical context for certain tasks. It provides methods for loading models, generating text, and retrieving metadata, as well as for setting parameters such as input prefix, suffix, cache prompt, number of tokens to predict, sampling techniques, temperature, and penalties. Overall, the LLMLoader provides a flexible and extensible framework for interacting with language models in Python and Scala environments using PySpark and Java.

Model Name	Description
JSL_MedS_q16_v1	Summarization and Q&A
JSL_MedS_q8_v1	Summarization and Q&A
JSL_MedS_q4_v1	Summarization and Q&A
JSL_MedM_q16_v1	Summarization, Q&A, RAG, and Chat
JSL_MedM_q8_v1	Summarization, Q&A, RAG, and Chat
JSL_MedM_q4_v1	Summarization, Q&A, RAG, and Chat
JSL_MedSNer_ZS_q16_v1	Extract and link medical named entities
JSL_MedSNer_ZS_q8_v1	Extract and link medical named entities
JSL_MedSNer_ZS_q4_v1	Extract and link medical named entities

We recommend using the 8-bit quantized (q8) versions of the models, as the qualitative performance difference between the q16 and q8 versions is negligible.
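As a rough illustration of why a lower-bit variant is attractive, the sketch below estimates the memory taken by the weights alone as parameters × bits / 8. It assumes that q16, q8, and q4 mean roughly 16, 8, and 4 bits per weight, and it uses a hypothetical parameter count, since these notes do not state the size of any model. Real gguf files also store extra quantization metadata, so actual file sizes will differ.

```python
def approx_weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# HYPOTHETICAL parameter count, for illustration only.
# It is not a published figure for any model listed above.
HYPOTHETICAL_PARAMS = 7e9

for bits in (16, 8, 4):
    size = approx_weight_memory_gb(HYPOTHETICAL_PARAMS, bits)
    print(f"q{bits}: ~{size:.1f} GB")
```

Under these assumptions, halving the bits per weight halves the weight footprint, which is the trade-off behind the q8 recommendation.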

Example:

from sparknlp_jsl.llm import LLMLoader

llm_loader_pretrained = LLMLoader(spark).pretrained("jsl_meds_q16_v1", "en", "clinical/models")

llm_loader_pretrained.generate("What is the indication for the drug Methadone?")

Result:

Methadone is used to treat opioid addiction. It is a long-acting opioid agonist that is used to help individuals who are addicted to short-acting opioids such as heroin or other illicit opioids. It is also used to treat chronic pain in patients who have developed tolerance to other opioids.

Please check the LLMLoader Notebook for more information

Introducing a Brand New `FewshotAssertionClassifier` Annotator to Train Assertion Detection Models Using a Few Samples with Better Accuracy

The newly refactored FewShotAssertionClassifierModel and FewShotAssertionClassifierApproach simplify assertion annotation in clinical and biomedical texts. By leveraging sentence embeddings, these models deliver precise assertion annotations and integrate seamlessly with any SparkNLP sentence embedding model.

A key feature is the FewShotAssertionSentenceConverter, an annotator that formats documents/sentences and NER chunks for assertion classification. It must be added to the pipeline as an additional stage before the sentence embeddings.

This comprehensive approach significantly enhances the extraction, analysis, and processing of assertion-related data, making it an indispensable tool for healthcare text annotation.

The following table compares the accuracy of the FewShot Assertion model with the traditional AssertionDL model across various datasets. The FewShot Assertion model scores higher on every dataset, with the largest gains on oncology (0.55 to 0.90) and smoking (0.67 to 0.96).

Dataset Name	AssertionDL	FewShot Assertion
radiology	0.91	0.93
i2b2	0.86	0.93
oncology	0.55	0.90
jsl_augmented	0.85	0.90
smoking	0.67	0.96
sdoh	0.76	0.85
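To read the table at a glance, the short sketch below computes the absolute accuracy gain per dataset from the figures above. The numbers are copied from the table; nothing else is assumed.

```python
# (AssertionDL, FewShot Assertion) accuracy per dataset, copied from the table above.
scores = {
    "radiology": (0.91, 0.93),
    "i2b2": (0.86, 0.93),
    "oncology": (0.55, 0.90),
    "jsl_augmented": (0.85, 0.90),
    "smoking": (0.67, 0.96),
    "sdoh": (0.76, 0.85),
}

# Absolute gain of FewShot Assertion over AssertionDL, rounded to two decimals.
gains = {name: round(fewshot - dl, 2) for name, (dl, fewshot) in scores.items()}

# Print datasets from largest to smallest gain.
for name, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:15s} +{gain:.2f}")
```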

Annotator to Train Assertion Detection Models

FewShot Assertion Model Name	Predicted Classes
fewhot_assertion_jsl_e5_base_v2_jsl	`Present`, `Absent`, `Possible`, `Planned`, `Past`, `Family`, `Hypothetical`, `SomeoneElse`
fewhot_assertion_e5_base_v2	`absent`, `associated_with_someone_else`, `conditional`, `hypothetical`, `possible`, `present`
fewhot_assertion_sdoh_e5_base_v2_sdoh	`Absent`, `Past`, `Present`, `Someone_Else`, `Hypothetical`, `Possible`
fewhot_assertion_smoking_e5_base_v2_smoking	`Present`, `Absent`, `Past`
fewhot_assertion_oncology_e5_base_v2_oncology	`Absent`, `Past`, `Present`, `Family`, `Hypothetical`, `Possible`
fewhot_assertion_radiology_e5_base_v2_radiology	`Confirmed`, `Negative`, `Suspected`

Example:

few_shot_assertion_converter = FewShotAssertionSentenceConverter()\
    .setInputCols(["sentence","token", "ner_jsl_chunk"])\
    .setOutputCol("assertion_sentence")

e5_embeddings = E5Embeddings.pretrained("e5_base_v2_embeddings_medical_assertion_oncology", "en", "clinical/models")\
    .setInputCols(["assertion_sentence"])\
    .setOutputCol("assertion_embedding")

few_shot_assertion_classifier = FewShotAssertionClassifierModel()\
    .pretrained("fewhot_assertion_oncology_e5_base_v2_oncology", "en", "clinical/models")\
    .setInputCols(["assertion_embedding"])\
    .setOutputCol("assertion")

sample_text = """The patient is suspected to have colorectal cancer. Her family history is positive for other cancers. The result of the biopsy was positive. A CT scan was ordered to rule out metastases."""

Result:

	chunks	begin	end	entities	assertion	confidence
0	colorectal cancer	33	49	Cancer_Dx	Possible	0.581282
1	cancers	93	99	Cancer_Dx	Family	0.234656
2	biopsy	120	125	Pathology_Test	Past	0.957321
3	positive	131	138	Pathology_Result	Present	0.956439
4	CT scan	143	149	Imaging_Test	Past	0.95717
5	metastases	175	184	Metastasis	Possible	0.549866
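The `begin` and `end` columns are character offsets into the input text, with `end` inclusive. The following check, which uses only the sample text and the rows from the result table above, recovers each chunk by slicing with `end + 1`.

```python
sample_text = (
    "The patient is suspected to have colorectal cancer. "
    "Her family history is positive for other cancers. "
    "The result of the biopsy was positive. "
    "A CT scan was ordered to rule out metastases."
)

# (chunk, begin, end) rows copied from the result table above.
rows = [
    ("colorectal cancer", 33, 49),
    ("cancers", 93, 99),
    ("biopsy", 120, 125),
    ("positive", 131, 138),
    ("CT scan", 143, 149),
    ("metastases", 175, 184),
]


def chunk_at(text: str, begin: int, end: int) -> str:
    """Slice a chunk out of the text; `end` is inclusive, hence the + 1."""
    return text[begin:end + 1]


for chunk, begin, end in rows:
    print(f"{begin:>4} {end:>4} {chunk_at(sample_text, begin, end)!r}")
```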

Please check the FewShot Assertion Classifier Notebook for more information

Introducing a Rule-Based `ContextualAssertion` Annotator to Detect Assertion Status Using Patterns and Rules without any Training or Annotation

Contextual Assertion identifies contextual cues within text data, such as negation and uncertainty, and is used for clinical assertion detection. It annotates text chunks with assertions based on configurable rules, prefix and suffix patterns, and exception patterns.

Dataset: 253 Clinical Texts from in-house dataset

Assertion Label	Contextual Assertion	AssertionDL
Absent	0.88	0.78
Past	0.77	0.65

Dataset: in-house jsl_augmented dataset

Assertion Label	Contextual Assertion	AssertionDL
Absent	0.82	0.90
Family	0.63	0.73
Hypothetical	0.51	0.69
Past	0.73	0.77
Planned	0.57	0.62
Possible	0.49	0.74
SomeoneElse	0.61	0.81

Contextual Assertion, a powerful component within Spark NLP, extends beyond mere negation detection. Its ability to identify and classify a diverse range of contextual cues, including uncertainty, temporality, and sentiment, empowers healthcare professionals to extract deeper meaning from complex medical records.

Model Name	Description
`contextual_assertion_someone_else`	Identifies contextual cues within text data to detect `someone else` assertions
`contextual_assertion_absent`	Identifies contextual cues within text data to detect `absent` assertions
`contextual_assertion_past`	Identifies contextual cues within text data to detect `past` assertions

Example:

contextual_assertion = ContextualAssertion() \
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("assertion") \
    .setPrefixKeywords(["no", "not"]) \
    .setSuffixKeywords(["unlikely", "negative", "no"]) \
    .setPrefixRegexPatterns(["\\b(no|without|denies|never|none|free of|not include)\\b"]) \
    .setSuffixRegexPatterns(["\\b(free of|negative for|absence of|not|rule out)\\b"]) \
    .setExceptionKeywords(["without"]) \
    .setExceptionRegexPatterns(["\\b(not clearly)\\b"]) \
    .addPrefixKeywords(["negative for", "negative"]) \
    .addSuffixKeywords(["absent", "neither"]) \
    .setCaseSensitive(False) \
    .setPrefixAndSuffixMatch(False) \
    .setAssertion("absent") \
    .setScopeWindow([2, 2]) \
    .setIncludeChunkToScope(True)

example_text = """Patient resting in bed. Patient given azithromycin without any difficulty. Patient has audible wheezing, states chest tightness.
No evidence of hypertension. Patient denies nausea at this time. zofran declined. Patient is also having intermittent sweating associated with pneumonia.
"""

Result:

ner_chunk	begin	end	ner_label	Assertion
any difficulty	59	72	PROBLEM	absent
hypertension	149	160	PROBLEM	absent
nausea	178	183	PROBLEM	absent
zofran	199	204	TREATMENT	absent
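To make the rule-based idea concrete, here is a simplified, pure-Python sketch of prefix and suffix matching inside a scope window. It is not the annotator's implementation: the patterns are illustrative, the function name `is_absent` is hypothetical, and treating the scope window as a count of whitespace tokens before and after the chunk is an assumption that may not match how `setScopeWindow` behaves.

```python
import re

# Illustrative patterns only; the pretrained models ship their own rules.
PREFIX = re.compile(r"\b(no|without|denies|never|none)\b", re.IGNORECASE)
SUFFIX = re.compile(r"\b(absent|unlikely|negative)\b", re.IGNORECASE)


def is_absent(tokens, chunk_start, chunk_end, scope_window=(2, 2)):
    """Return True if a negation cue appears near the chunk.

    tokens: whitespace tokens of one sentence.
    chunk_start, chunk_end: token indices of the chunk (end exclusive).
    scope_window: (tokens to inspect before, tokens to inspect after); ASSUMED semantics.
    """
    before, after = scope_window
    left = " ".join(tokens[max(0, chunk_start - before):chunk_start])
    right = " ".join(tokens[chunk_end:chunk_end + after])
    return bool(PREFIX.search(left) or SUFFIX.search(right))


denies = "Patient denies nausea at this time".split()
wheeze = "Patient has audible wheezing".split()
fever = "Fever is absent today".split()

print(is_absent(denies, 2, 3))  # cue "denies" precedes "nausea"
print(is_absent(wheeze, 3, 4))  # no cue around "wheezing"
print(is_absent(fever, 0, 1))   # cue "absent" follows "Fever"
```

Because the rules are plain patterns, they can be adjusted for a new label or domain without any training or annotation.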

Please check the Contextual Assertion Notebook for more information

Introducing `VectorDBPostProcessor` Annotator to Filter and Sort the Document Splits Returned by `VectorDB` in a RAG Application

The VectorDBPostProcessor is a powerful tool designed to filter and sort output from the VectorDBModel (our own VectorDB implementations will be released soon). This processor refines VECTOR_SIMILARITY_RANKINGS input annotations and outputs enhanced VECTOR_SIMILARITY_RANKINGS annotations based on specified criteria.

Key Parameters:

filterBy (str): Select and prioritize filter options (metadata, diversity_by_threshold). Options can be given as a comma-separated string, determining the filtering order. Default: metadata
sortBy (str): Select sorting option (ascending, descending, lost_in_the_middle, diversity). Default: ascending
caseSensitive (bool): Determines if string operators’ criteria are case-sensitive. Default: False
diversityThreshold (float): Sets the threshold for the diversity_by_threshold filter. Default: 0.01
maxTopKAfterFiltering (int): Limits the number of annotations returned after filtering. Default: 20
allowZeroContentAfterFiltering (bool): Determines whether zero annotations are allowed after filtering. Default: False

This processor ensures precise and customizable annotation management, making it an essential component for advanced data processing workflows.

Example:

post_processor = VectorDBPostProcessor() \
    .setInputCols("vector_db") \
    .setOutputCol("post") \
    .setSortBy("ascending") \
    .setMaxTopKAfterFiltering(5) \
    .setFilterBy("metadata") \
    .setMetadataCriteria([
        {"field": "pubdate", "fieldType": "date", "operator": "greater_than", "value": "2017 May 11", "dateFormats": ["yyyy MMM dd", "yyyy MMM d"], "converterFallback": "filter"},
        {"field": "distance", "fieldType": "float", "operator": "less_than", "value": "0.5470"},
        {"field": "title", "fieldType": "string", "operator": "contains", "matchMode": "any", "matchValues": ["diabetes", "immune system"]}
      ])
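The filter, sort, and top-K steps can be pictured with a small pure-Python sketch. It is an illustration under stated assumptions, not the annotator's code: the `Split` class and `post_process` function are hypothetical, the sample splits are made up, and sorting "ascending" is assumed to mean by distance. The thresholds mirror two of the metadata criteria in the example above.

```python
from dataclasses import dataclass


@dataclass
class Split:
    """Hypothetical stand-in for one document split returned by a vector database."""
    text: str
    distance: float
    title: str


def post_process(splits, max_distance, title_terms, top_k):
    # Keep splits whose distance is below the threshold and whose title
    # contains any of the terms (case-insensitive).
    kept = [
        s for s in splits
        if s.distance < max_distance
        and any(term.lower() in s.title.lower() for term in title_terms)
    ]
    # ASSUMPTION: "ascending" orders by distance, closest first.
    kept.sort(key=lambda s: s.distance)
    # Cap the number of results, as maxTopKAfterFiltering does.
    return kept[:top_k]


# Made-up sample data for illustration.
splits = [
    Split("chunk A", 0.61, "Diabetes care update"),
    Split("chunk B", 0.32, "Immune system and ageing"),
    Split("chunk C", 0.18, "Cardiology review"),
    Split("chunk D", 0.25, "Type 2 diabetes in adults"),
]

result = post_process(
    splits, max_distance=0.5470, title_terms=["diabetes", "immune system"], top_k=5
)
print([s.text for s in result])
```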

Please check the VectorDB and PostProcessor for RAG Generative AI Notebook for more information

Introducing `ContextSplitAssembler` Annotator to Assemble the Document Post-processed Splits as a Context into an LLM Stage in a RAG Application

The ContextSplitAssembler is a versatile tool designed to work seamlessly with vector databases (our own VectorDB implementations will be released soon) and VectorDBPostProcessor. It combines and organizes annotation results with customizable delimiters and optional splitting.

Key Parameters:

joinString (str): Specifies the delimiter string inserted between annotations when combining them into a single result. Ensures proper separation and organization. Default: " " (a single space)
explodeSplits (bool): Determines whether to split the annotations into separate entries. Default: False

This assembler enhances the management and presentation of annotations, making it an essential tool for advanced data processing workflows.

Example:

context_split_assembler = ( ContextSplitAssembler()
  .setInputCols("vector_db")
  .setOutputCol("document")
  .setJoinString("\n")
  .setExplodeSplits(False))
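The effect of the two parameters can be sketched in plain Python. This is a simplified model rather than the annotator itself: the function `assemble_context` is hypothetical, and it assumes that `explodeSplits=True` keeps each split as its own entry while `False` joins them into a single context string.

```python
def assemble_context(splits, join_string=" ", explode_splits=False):
    """Combine post-processed splits into context entries (simplified sketch)."""
    if explode_splits:
        # ASSUMPTION: exploding keeps one entry per split.
        return list(splits)
    # Otherwise all splits are joined into a single context entry.
    return [join_string.join(splits)]


# Made-up splits for illustration.
splits = ["Metformin is first-line therapy.", "Monitor renal function."]

joined = assemble_context(splits, join_string="\n")
exploded = assemble_context(splits, explode_splits=True)

print(joined)
print(exploded)
```

A single joined entry is convenient when the whole context is passed to one LLM prompt, while separate entries let each split be processed on its own.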

Please check the VectorDB and PostProcessor for RAG Generative AI Notebook for more information

SNOMED Entity Resolver Model for `Veterinary` Domains

This advanced model facilitates the mapping of veterinary-related entities and concepts to SNOMED codes using sbiobert_base_cased_mli Sentence BERT embeddings. It is trained with an enhanced dataset derived from the sbiobertresolve_snomed_veterinary_wip model. The model ensures precise and reliable resolution of veterinary terms to standardized SNOMED codes, aiding in consistent and comprehensive veterinary data documentation and analysis.

Example:

snomed_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_snomed_veterinary", "en", "clinical/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("snomed_code")\
    .setDistanceFunction("EUCLIDEAN")

text = "The veterinary team is closely monitoring the patient for signs of lymphoblastic lymphoma, a malignant neoplasm of lymphoid origin. They are also treating the patient's osteoarthritis, a degenerative joint disease. Additionally, the team is vigilantly observing the facility for potential outbreaks of mink distemper."

Result:

ner_chunk	entity	snomed_code	description
lymphoblastic lymphoma	PROBLEM	312281000009102	lymphoblastic lymphoma
a malignant neoplasm of lymphoid origin	PROBLEM	443495005	neoplasm of lymphoid system structure
the patient’s osteoarthritis	PROBLEM	201826000	erosive osteoarthrosis
a degenerative joint disease	PROBLEM	201819000	degenerative joint disease involving multiple joints
mink distemper	PROBLEM	348361000009108	mink distemper

Please check the model card for more information

Voice of the Patients Named Entity Recognition (NER) Model

The Voice of the Patients NER Model is designed to extract healthcare-related terms from patient-generated documents. This model processes the natural language used by patients to identify and categorize medical terms, facilitating better understanding and documentation of patient-reported information.

Example:

ner_model = MedicalNerModel.pretrained("ner_vop_v2", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

text = "Hello,I'm 20 year old girl. I'm diagnosed with hyperthyroid 1 month ago. I was feeling weak, light headed,poor digestion, panic attacks, depression, left chest pain, increased heart rate, rapidly weight loss, from 4 months. Because of this, I stayed in the hospital and just discharged from hospital. I had many other blood tests, brain mri, ultrasound scan, endoscopy because of some dumb doctors bcs they were not able to diagnose actual problem."

Result:

chunk	begin	end	ner_label
20 year old	15	25	Age
girl	27	30	Gender
hyperthyroid	52	63	Disease
1 month ago	65	75	DateTime
weak	92	95	Symptom
light	98	102	Symptom
panic attacks	127	139	PsychologicalCondition
depression	142	151	PsychologicalCondition
left	154	157	Laterality
chest	159	163	BodyPart
pain	165	168	Symptom
increased	171	179	TestResult
heart rate	181	190	VitalTest
rapidly	193	199	Modifier
weight loss	201	211	Symptom
4 months	220	227	Duration
hospital	263	270	ClinicalDept
discharged	281	290	AdmissionDischarge
hospital	297	304	ClinicalDept
blood tests	324	334	Test
brain	337	341	BodyPart
mri	343	345	Test
ultrasound scan	348	362	Test
endoscopy	365	373	Procedure

Please check the model card for more information

New Rule-Based Entity Matcher Models to Customize De-IDentification Pipelines

We introduce a suite of text and regex matchers, specifically designed to enhance the deidentification and clinical document understanding process with rule-based methods.

Model Name	Description
`cancer_diagnosis_matcher`	This model extracts cancer diagnoses in clinical notes using a rule-based `TextMatcherInternal` annotator.
`country_matcher`	This model extracts countries in clinical notes using a rule-based `TextMatcherInternal` annotator.
`email_matcher`	This model extracts emails in clinical notes using a rule-based `RegexMatcherInternal` annotator.
`phone_matcher`	This model extracts phone entities in clinical notes using a rule-based `RegexMatcherInternal` annotator.
`state_matcher`	This model extracts states in clinical notes using a rule-based `RegexMatcherInternal` annotator.
`zip_matcher`	This model extracts zip codes in clinical notes using a rule-based `RegexMatcherInternal` annotator.
`city_matcher`	This model extracts city names in clinical notes using a rule-based `TextMatcherInternal` annotator.

Example:

text_matcher = TextMatcherInternalModel.pretrained("cancer_diagnosis_matcher", "en", "clinical/models") \
    .setInputCols(["sentence", "token"])\
    .setOutputCol("cancer_dx")\
    .setMergeOverlapping(True)

example_text = """A 65-year-old woman had a history of debulking surgery, bilateral oophorectomy with omentectomy, total anterior hysterectomy with radical pelvic lymph nodes dissection due to ovarian carcinoma (mucinous-type carcinoma, stage Ic) 1 year ago. The patient's medical compliance was poor and failed to complete her chemotherapy (cyclophosphamide 750 mg/m2, carboplatin 300 mg/m2). Recently, she noted a palpable right breast mass, 15 cm in size which nearly occupied the whole right breast in 2 months. Core needle biopsy revealed metaplastic carcinoma. Neoadjuvant chemotherapy with the regimens of Taxotere (75 mg/m2), Epirubicin (75 mg/m2), and Cyclophosphamide (500 mg/m2) was given for 6 cycles with poor response, followed by a modified radical mastectomy (MRM) with dissection of axillary lymph nodes and skin grafting. Postoperatively, radiotherapy was done with 5000 cGy in 25 fractions. The histopathologic examination revealed a metaplastic carcinoma with squamous differentiation associated with adenomyoepithelioma. Immunohistochemistry study showed that the tumor cells are positive for epithelial markers-cytokeratin (AE1/AE3) stain, and myoepithelial markers, including cytokeratin 5/6 (CK 5/6), p63, and S100 stains.  Expressions of hormone receptors, including ER, PR, and Her-2/Neu, were all negative."""

Result:

chunk	begin	end	label
ovarian carcinoma	176	192	Cancer_dx
mucinous-type carcinoma	195	217	Cancer_dx
metaplastic carcinoma	528	548	Cancer_dx
metaplastic carcinoma	937	957	Cancer_dx
adenomyoepithelioma	1005	1023	Cancer_dx
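For readers new to rule-based matching, the sketch below shows the two styles side by side in plain Python: a phrase-list matcher and a regex matcher. It is only an illustration of the approach. The functions, the phrase list, the email pattern, and the sample note are all hypothetical, and the pretrained matchers ship their own rules. Offsets follow the convention of the result tables, with an inclusive `end`.

```python
import re


def text_match(text, phrases, label):
    """Find every case-insensitive occurrence of each phrase; `end` is inclusive."""
    hits = []
    for phrase in phrases:
        for m in re.finditer(re.escape(phrase), text, flags=re.IGNORECASE):
            hits.append((m.group(0), m.start(), m.end() - 1, label))
    return sorted(hits, key=lambda h: h[1])


def regex_match(text, pattern, label):
    """Find every match of a regular expression; `end` is inclusive."""
    return [(m.group(0), m.start(), m.end() - 1, label)
            for m in re.finditer(pattern, text)]


# Made-up note and rules for illustration.
note = "Biopsy revealed metaplastic carcinoma. Contact: jane.doe@example.org"

cancer_hits = text_match(note, ["metaplastic carcinoma", "ovarian carcinoma"], "Cancer_dx")
email_hits = regex_match(note, r"[\w.+-]+@[\w-]+\.[\w.-]+", "EMAIL")

print(cancer_hits)
print(email_hits)
```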

New NER, Assertion, Relation Extraction, and Classification Models to Identify `Alcohol` and `Smoking` Related Medical Entities

A suite of models designed for the identification and analysis of alcohol and smoking related entities in text data. These models include Named Entity Recognition (NER), assertion status, relation extraction, and classification, providing a comprehensive toolkit for analyzing substance use information.

NER Model

Model Name	Predicted Entities	Description
ner_alcohol_smoking	`Drinking_Status`, `Alcohol_Type`, `Smoking_Status`, `Smoking_Type`, `Substance_Duration`, `Substance_Frequency`, `Substance_Quantity`, `Cardiovascular_Issues`, `Respiratory_Issues`, `GUT_Issues`, `Neurologic_Issues`, `Psychiatric_Issues`, `Other_Health_Issues`, `Drinking_Environment`, `Cessation_Treatment`, `Withdrawal_Treatment`	Detects alcohol and smoking related entities within text data

This ner_alcohol_smoking model is designed to detect and label alcohol and smoking-related entities within text data. Alcohol refers to beverages containing ethanol, a psychoactive substance that is widely consumed for its pleasurable effects. Smoking typically involves inhaling smoke from burning tobacco, a highly addictive substance. The model has been trained using advanced deep learning techniques on a diverse range of text sources and can accurately recognize and classify a wide range of entities related to alcohol and smoking. The model’s accuracy and precision have been carefully validated against expert-labeled data to ensure reliable and consistent results.

Example:

ner_model = MedicalNerModel.pretrained("ner_alcohol_smoking", "en", "clinical/models")\
    .setInputCols(["sentence", "token","embeddings"])\
    .setOutputCol("ner")

sample_texts = ["""The outpatient clinic addressed a complaint from the patient regarding severe anxiety and withdrawal symptoms. 
He disclosed a history of alcohol addiction, with weekly episodes of intense binge drinking over the past decade. 
However, due to recent challenges in his personal and professional life, he decided to quit drinking cold turkey a week ago. 
Since then, he has been experiencing escalating symptoms including tremors, sweating, nausea, and severe anxiety. 
The patient denies recent use drugs or smoking, focusing her struggles solely on alcohol.
He was placed on CIWA protocol w/ lorazepam for management. Scheduled for cognitive-behavioral therapy (CBT)."""]

Result:

chunk	begin	end	ner_label
anxiety	78	84	Psychoneurologic_Issue
alcohol addiction	138	154	Drinking_Status
weekly	162	167	Substance_Frequency
binge	189	193	Substance_Quantity
drinking	195	202	Drinking_Status
over the past decade	204	223	Substance_Duration
drinking	319	326	Drinking_Status
tremors	420	426	Psychoneurologic_Issue
sweating	429	436	Other_Health_Issues
nausea	439	444	GUT_Issues
anxiety	458	464	Psychoneurologic_Issue
smoking	507	513	Smoking_Status
alcohol	549	555	Drinking_Status
CIWA	575	578	Withdrawal_Treatment
lorazepam	592	600	Withdrawal_Treatment
cognitive-behavioral therapy	632	659	Cessation_Treatment
CBT	662	664	Cessation_Treatment

Assertion Models

Model Name	Assertion Status	Description
assertion_alcohol_smoking_wip	`Absent`, `Hypothetical_Possible`, `Past_History`, `Present_Planned`	This model detects the assertion status of entities related to alcohol-smoking.
assertion_alcohol_smoking_general_symptoms_wip	`Overdose_Symptom`, `Withdrawal_Symptom`	This model detects the assertion status of general symptom entities related to alcohol-smoking.

Example:

assertion = AssertionDLModel.pretrained("assertion_alcohol_smoking_wip", "en", "clinical/models")\
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")

sample_texts = ["""Per the patient, the last drink was on ___, prior to admission. The patient admits to having experienced tremors, palpitations, and diaphoresis during the past alcohol withdrawals, but he denies ever having experienced seizures. Mr. ___ did not report experiencing any symptoms of withdrawal throughout his hospital stay, and an examination revealed no evidence of withdrawal.""",
               """SUBSTANCE ABUSE: The patient admitted to occasional binge drinking, but admitted to normally consuming one pint of liquor a day in the week before her admission. Before she attempted suicide, she was heavily intoxicated and had a high blood alcohol level (BAL). Attending the AA meetings and expressing a desire to keep going to AA to support sobriety were two ways the patient showed motivation to stop drinking. The patient was put on the CIWA protocol upon admission, but no PRN Valium was needed for alcohol withdrawal."""]

Result:

chunk	begin	end	ner_label	assertion	confidence
drink	26	30	Drinking_Status	Past_History	0.8507
tremors	105	111	Psychoneurologic_Issue	Past_History	0.9315
palpitations	114	125	Cardiovascular_Issues	Past_History	0.9251
diaphoresis	132	142	Other_Health_Issues	Past_History	0.9181
alcohol	160	166	Drinking_Status	Past_History	0.9109
seizures	219	226	Psychoneurologic_Issue	Absent	0.5359
binge	52	56	Substance_Quantity	Present_Planned	0.5528
drinking	58	65	Drinking_Status	Present_Planned	0.5704
one pint	103	110	Substance_Quantity	Present_Planned	0.6838
liquor	115	120	Alcohol_Type	Present_Planned	0.6879
a day	122	126	Substance_Frequency	Present_Planned	0.8029
suicide	183	189	Psychoneurologic_Issue	Past_History	0.731
intoxicated	208	218	Psychoneurologic_Issue	Past_History	0.7832
alcohol	241	247	Drinking_Status	Past_History	0.507
AA	276	277	Cessation_Treatment	Present_Planned	0.4559
AA	329	330	Cessation_Treatment	Present_Planned	0.5112
drinking	404	411	Drinking_Status	Present_Planned	0.5385
CIWA	441	444	Withdrawal_Treatment	Present_Planned	0.5693
Valium	482	487	Withdrawal_Treatment	Absent	0.553
alcohol	504	510	Drinking_Status	Present_Planned	0.5135

Relation Extraction Model

Model Name	Predicted Entities	Description
re_alcohol_smoking_clinical_wip	`is_caused_by`, `is_used_for`	It recognizes relations between treatment cessation and withdrawal with drinking and smoking status, as well as relations between various health issues (Neurologic, Psychiatric, Cardiovascular, Respiratory, GUT, and Other Health Issues) and drinking and smoking status.

Example:

clinical_re_Model = RelationExtractionModel()\
    .pretrained("re_alcohol_smoking_clinical_wip", "en", "clinical/models")\
    .setInputCols(["embeddings", "pos_tags", "ner_chunk", "dependencies"])\
    .setOutputCol("relations")\
    .setMaxSyntacticDistance(4)\
    .setRelationPairs(["Cessation_Treatment-Drinking_Status",
    "Cessation_Treatment-Smoking_Status",
    "Respiratory_Issues-Drinking_Status",
    "Respiratory_Issues-Smoking_Status"])

text = ["""Pulmonary Function Tests: Demonstrates airflow limitation consistent with chronic obstructive pulmonary disease
 (COPD). Diagnosis: Acute exacerbation of COPD secondary to smoking.
 Diagnosis: Alcoholic fatty liver disease and smoking-related respiratory symptoms.Management: The patient received alcohol cessation counseling and support services to address her alcohol use disorder. She was also provided with smoking cessation pharmacotherapy and behavioral interventions to help her quit smoking."""]

Result:

	sentence	entity1_begin	entity1_end	chunk1	entity1	entity2_begin	entity2_end	chunk2	entity2	relation	confidence
0	2	154	157	COPD	Respiratory_Issues	172	178	smoking	Smoking_Status	is_caused_by	0.999902
2	4	297	303	alcohol	Drinking_Status	305	324	cessation counseling	Cessation_Treatment	is_used_for	0.999512
3	4	297	303	alcohol	Drinking_Status	330	345	support services	Cessation_Treatment	is_used_for	0.933377
4	5	411	417	smoking	Smoking_Status	419	443	cessation pharmacotherapy	Cessation_Treatment	is_used_for	0.996433
5	5	411	417	smoking	Smoking_Status	449	472	behavioral interventions	Cessation_Treatment	is_used_for	0.9565

Classification Models

Model Name	Predicted Classes	Description
genericclassifier_alcohol_mpnet_wip	`Current_Drinker`, `Others`	The primary goal of the model is to categorize texts into two main label categories: ‘Current_Drinker’ and ‘Others’ (past or non-drinker).
genericclassifier_smoking_mpnet_wip	`Current_Smoker`, `Others`	The primary goal of the model is to categorize texts into two main label categories: ‘Current_Smoker’ and ‘Others’ (past or non-smoker).

Example:

generic_classifier = GenericClassifierModel.pretrained('genericclassifier_alcohol_mpnet_wip', 'en', 'clinical/models')\
    .setInputCols("features")\
    .setOutputCol("prediction")

text_list = [
             "The patient, with a history of COPD and alcohol dependence, was initially admitted due to a COPD exacerbation and community-acquired pneumonia. The situation was further complicated by alcohol withdrawal. He was later transferred to another facility for treatment of left hand cellulitis, which raised concerns for necrotizing fasciitis.",
             "Until recently, the patient had maintained stability on his antidepressant regimen. However, he experienced a notable worsening of depressive symptoms last week, leading him to engage in heavy binge drinking as an ineffective way to suppress his emotional distress and feelings of despair.",
             "Ms. Jane Doe, a 60-year-old retired teacher, presented to the emergency department complaining of severe abdominal pain and vomiting. She has a history of gallstones but has been asymptomatic for years. Currently, she does not smoke or drink alcohol, focusing on a healthy lifestyle.",
             "Mr. John Smith, a 45-year-old accountant, came to the clinic reporting intense chest pain and shortness of breath. He has a history of hypertension but has managed it well with medication. He currently does not smoke or drink alcohol, maintaining a healthy lifestyle."]

Result:

text	result
The patient, with a history of COPD and alcohol dependence, was initially admitted due to a COPD …	Current_Drinker
Until recently, the patient had maintained stability on his antidepressant regimen. However, he e…	Current_Drinker
Ms. Jane Doe, a 60-year-old retired teacher, presented to the emergency department complaining of…	Others
Mr. John Smith, a 45-year-old accountant, came to the clinic reporting intense chest pain and sho…	Others

New NER and Assertion Models to Extract `Menopause` Related Entities

A set of sophisticated models aimed at extracting and analyzing menopause-related entities in text data. These models include a Named Entity Recognition (NER) model and assertion models, which identify and determine the status of various menopause-related terms, aiding in comprehensive menopause data analysis.

NER Model

Model Name	Predicted Entities	Description
ner_menopause_core	`Perimenopause`, `Menopause`, `Gynecological_Symptom`, `Gynecological_Disease`, `Other_Symptom`, `Irregular_Menstruation`, `G_P`, `Hypertension`, `Osteoporosis`, `Oncological`, `Fracture`, `Hormone_Replacement_Therapy`, `Osteporosis_Therapy`, `Antidepressants`, `Procedure`, `Hormone_Testing`, `Vaginal_Swab`, `Age`, `Test_Result`	Detects menopause related entities within text data

The `ner_menopause_core` model detects and labels core entities related to menopause and associated conditions within text data. Menopause-related terms and conditions are important factors in individuals’ health outcomes, especially among women undergoing the menopausal transition. The model was trained on a diverse range of text sources and recognizes a wide range of menopause-related entities. Its accuracy and precision have been validated against expert-labeled data.

Example:

ner_model = MedicalNerModel.pretrained("ner_menopause_core", "en", "clinical/models")\
    .setInputCols(["sentence", "token","embeddings"])\
    .setOutputCol("ner")

sample_texts = ["""The patient is a 52-year-old female, G3P2, who presents with complaints of irregular menstruation and symptoms suggestive of perimenopause. She reports experiencing hot flashes, night sweats, and vaginal dryness. Her medical history includes polycystic ovary syndrome (PCOS), fatigue, mood swings, hypertension diagnosed 5 years ago and currently managed with medication, and osteoporosis diagnosed 2 years ago with ongoing treatment. 
Current medications include estradiol for hormone replacement therapy, alendronate for osteoporosis therapy, and fluoxetine for depressive symptoms related to menopause. Recent tests and procedures include a bone density scan to monitor osteoporosis, blood tests for estradiol and follicle-stimulating hormone (FSH) levels, and a vaginal swab collected for routine infection screening. Test results showed elevated FSH levels indicating menopause."""]

Result:

chunk	begin	end	ner_label
irregular menstruation	76	97	Irregular_Menstruation
perimenopause	126	138	Perimenopause
hot flashes	166	176	Other_Symptom
night sweats	179	190	Other_Symptom
vaginal dryness	197	211	Gynecological_Symptom
polycystic ovary syndrome	243	267	Gynecological_Disease
PCOS	270	273	Gynecological_Disease
fatigue	277	283	Other_Symptom
hypertension	299	310	Hypertension
osteoporosis	377	388	Osteoporosis
estradiol	466	474	Hormone_Replacement_Therapy
hormone replacement therapy	480	506	Hormone_Replacement_Therapy
alendronate	509	519	Osteporosis_Therapy
osteoporosis	525	536	Osteoporosis
fluoxetine	551	560	Antidepressants
menopause	597	605	Menopause
osteoporosis	675	686	Osteoporosis
estradiol	705	713	Hormone_Testing
follicle-stimulating hormone	719	746	Hormone_Testing
FSH	749	751	Hormone_Testing
vaginal swab	768	779	Vaginal_Swab
elevated	844	851	Test_Result
FSH	853	855	Hormone_Testing
menopause	875	883	Menopause

Assertion Models

Model Name	Assertion Status	Description
assertion_menopause_wip	`Present`, `Absent`, `Possible`, `Past`, `Hypothetical`, `Planned`, `Family`, `Menarche_Age`, `Menopause_Age`	This model detects the assertion status of menopause-related entities.

Example:

assertion = AssertionDLModel.pretrained("assertion_menopause_wip", "en", "clinical/models")\
    .setInputCols(["sentence", "ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")

sample_texts = ["""A 50-year-old woman, G2P1, presents with symptoms of perimenopause including night sweats, irregular menstruation, and fatigue.She has previously been diagnosed with hypertension. She is taking hormone replacement therapy with estradiol and norethindrone acetate. Recent tests included a bone density scan, which confirmed osteoporosis and showed elevated FSH levels. She also underwent a vaginal swab test for routine screening. Her mother has a history of breast cancer. Her menarche age was 11."""]

Result:

chunk	begin	end	ner_label	assertion	confidence
G2P1	21	24	G_P	Present	0.9999
perimenopause	53	65	Perimenopause	Present	0.9999
night sweats	77	88	Other_Symptom	Present	0.9997
irregular menstruation	91	112	Irregular_Menstruation	Present	0.9997
fatigue	119	125	Other_Symptom	Present	0.9954
hypertension	166	177	Hypertension	Past	0.9916
hormone replacement therapy	194	220	Hormone_Replacement_Therapy	Present	0.9988
estradiol	227	235	Hormone_Replacement_Therapy	Present	0.9696
norethindrone acetate	241	261	Hormone_Replacement_Therapy	Present	0.9984
osteoporosis	323	334	Osteoporosis	Present	1.0
elevated	347	354	Test_Result	Present	1.0
FSH	356	358	Hormone_Testing	Present	0.9999
vaginal swab	389	400	Vaginal_Swab	Present	1.0
breast cancer	458	470	Oncological	Family	0.9843
11	494	495	Age	Menarche_Age	0.9891
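The `begin` and `end` columns in the result tables above are zero-based character offsets into the input text, and `end` points at the last character of the chunk rather than one past it. A minimal sketch of recovering a chunk from its offsets in plain Python follows; the `chunk_at` helper is hypothetical (not part of the library), and the text is the opening of the sample above.

```python
# Opening of the sample text used in the assertion example above.
text = ("A 50-year-old woman, G2P1, presents with symptoms of "
        "perimenopause including night sweats, irregular menstruation, and fatigue.")


def chunk_at(text: str, begin: int, end: int) -> str:
    """Return the chunk for inclusive [begin, end] character offsets.

    Hypothetical helper for illustration. Python slices exclude the end
    index, so the inclusive `end` offset needs a +1.
    """
    return text[begin:end + 1]


# Offsets copied from the result table above.
for begin, end in [(21, 24), (53, 65), (77, 88), (119, 125)]:
    print(begin, end, chunk_at(text, begin, end))
```

Forgetting the `+ 1` silently drops the last character of every chunk, which is easy to miss when chunks are only inspected by eye.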

Clinical Document Analysis with One-Liner Pretrained Pipelines for Specific Clinical Tasks and Concepts

We introduce a suite of advanced, hybrid pretrained pipelines, specifically designed to streamline the clinical document analysis process. These pipelines are built upon multiple state-of-the-art (SOTA) pretrained models, delivering a comprehensive solution for quickly extracting vital information.

Model Name	Description
`ner_deid_context_nameAugmented_pipeline`	This pipeline combines the `ner_deid_generic_augmented`, `ner_deid_subentity_augmented`, and `ner_deid_name_multilingual_clinical` NER models with several ContextualParser, RegexMatcher, and TextMatcher models.
`ner_profiling_vop`	This pipeline can be used to simultaneously evaluate various pre-trained named entity recognition (NER) models, enabling comprehensive analysis of text data pertaining to patient perspectives and experiences, also known as the “Voice of Patients”.
`ner_profiling_sdoh`	This pipeline can be used to simultaneously evaluate various pre-trained named entity recognition (NER) models, enabling comprehensive analysis of text data pertaining to the social determinants of health (SDOH). When you run this pipeline over your text, you will end up with the predictions coming out of each pretrained clinical NER model trained with the `embeddings_clinical`, which are specifically designed for clinical and biomedical text.

Example:

from sparknlp.pretrained import PretrainedPipeline

pipeline_sdoh = PretrainedPipeline("ner_profiling_sdoh", "en", "clinical/models")

text = """
The patient reported experiencing symptoms of anxiety and depression, which have been affecting his quality of life.
He reported a history of childhood trauma related to violence and abuse in his household, which has contributed to his smoking, alcohol use and current mental health struggles.
He denied any recent substance use or sexual activity and reported being monogamous in his relationship with his wife.
The patient is an immigrant and speaks English as a second language.
He reported difficulty accessing healthcare due to lack of medical insurance.
He has a herniated disc, hypertension, coronary artery disease (CAD) and diabetes mellitus.
The patient has a manic disorder, is presently psychotic and shows impulsive behavior. He has been disabled since 2001.
"""

Results:

******************** ner_sdoh_substance_usage_wip Model Results ******************** 

('smoking', 'Smoking') ('alcohol use', 'Alcohol') ('substance use', 'Substance_Use')

******************** ner_sdoh_health_behaviours_problems_wip Model Results ******************** 

('anxiety', 'Mental_Health') ('depression', 'Mental_Health') ('quality of life', 'Quality_Of_Life') ('mental health struggles', 'Mental_Health') ('sexual activity', 'Sexual_Activity') ('monogamous', 'Sexual_Activity') ('herniated disc', 'Other_Disease') ('hypertension', 'Hypertension') ('coronary artery disease', 'Other_Disease') ('CAD', 'Other_Disease') ('diabetes mellitus', 'Other_Disease') ('manic disorder', 'Mental_Health') ('psychotic', 'Mental_Health') ('impulsive behavior', 'Mental_Health') ('disabled', 'Disability')

******************** ner_jsl_greedy Model Results ******************** 

('anxiety', 'Psychological_Condition') ('depression', 'Psychological_Condition') ('smoking', 'Smoking') ('alcohol', 'Alcohol') ('substance', 'Substance') ('sexual activity', 'Symptom') ('difficulty accessing healthcare', 'Symptom') ('herniated disc', 'Disease_Syndrome_Disorder') ('hypertension', 'Hypertension') ('coronary artery disease', 'Heart_Disease') ('CAD', 'Heart_Disease') ('diabetes mellitus', 'Diabetes') ('manic disorder', 'Psychological_Condition') ('psychotic', 'Psychological_Condition') ('impulsive behavior', 'Symptom')

******************** ner_jsl_enriched Model Results ******************** 

('anxiety', 'Psychological_Condition') ('depression', 'Psychological_Condition') ('smoking', 'Smoking') ('alcohol', 'Alcohol') ('substance', 'Substance') ('difficulty accessing healthcare', 'Symptom') ('lack of medical insurance', 'Symptom') ('herniated disc', 'Disease_Syndrome_Disorder') ('hypertension', 'Hypertension') ('coronary artery disease', 'Heart_Disease') ('CAD', 'Heart_Disease') ('diabetes mellitus', 'Diabetes') ('manic disorder', 'Psychological_Condition') ('psychotic', 'Symptom') ('impulsive behavior', 'Symptom') ('disabled', 'Symptom')
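Because a profiling pipeline returns one set of predictions per NER model, a natural next step is to line the models up and see where they label the same chunk differently. A minimal sketch in plain Python follows, using a few (chunk, label) pairs copied from the results above. The dictionary layout and the `disagreements` helper are assumptions made for illustration, not the pipeline's actual output format.

```python
from collections import defaultdict

# A few (chunk, label) pairs copied from the results above.
# The dict-of-lists layout is an assumption made for this sketch.
predictions = {
    "ner_sdoh_health_behaviours_problems_wip": [
        ("anxiety", "Mental_Health"),
        ("hypertension", "Hypertension"),
        ("psychotic", "Mental_Health"),
        ("disabled", "Disability"),
    ],
    "ner_jsl_greedy": [
        ("anxiety", "Psychological_Condition"),
        ("hypertension", "Hypertension"),
        ("psychotic", "Psychological_Condition"),
    ],
    "ner_jsl_enriched": [
        ("anxiety", "Psychological_Condition"),
        ("hypertension", "Hypertension"),
        ("psychotic", "Symptom"),
        ("disabled", "Symptom"),
    ],
}


def disagreements(predictions):
    """Map each chunk to {model: label} when the models assign different labels."""
    by_chunk = defaultdict(dict)
    for model, pairs in predictions.items():
        for chunk, label in pairs:
            by_chunk[chunk][model] = label
    return {
        chunk: labels
        for chunk, labels in by_chunk.items()
        if len(set(labels.values())) > 1
    }


for chunk, labels in disagreements(predictions).items():
    print(chunk, labels)
```

Note that the models use different label schemes, so some of the reported differences are naming differences (`Mental_Health` vs `Psychological_Condition`) rather than real conflicts.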

Please check the Task Based Clinical Pretrained Pipelines Notebook for more information.

Formal Release of Oncological Assertion Status Detection and Relation Extraction Models

We are releasing the formal version of the “work-in-progress (WIP)” assertion status detection and relation extraction models in the Oncology domain.

Here is the reference table:

WIP Version	Formal Version	Task
assertion_oncology_demographic_binary_wip	assertion_oncology_demographic_binary	Assertion Status Detection
assertion_oncology_family_history_wip	assertion_oncology_family_history	Assertion Status Detection
assertion_oncology_problem_wip	assertion_oncology_problem	Assertion Status Detection
assertion_oncology_response_to_treatment_wip	assertion_oncology_response_to_treatment	Assertion Status Detection
assertion_oncology_smoking_status_wip	assertion_oncology_smoking_status	Assertion Status Detection
assertion_oncology_test_binary_wip	assertion_oncology_test_binary	Assertion Status Detection
assertion_oncology_treatment_binary_wip	assertion_oncology_treatment_binary	Assertion Status Detection
assertion_oncology_wip	assertion_oncology	Assertion Status Detection
re_oncology_biomarker_result_wip	re_oncology_biomarker_result	Relation Extraction
re_oncology_granular_wip	re_oncology_granular	Relation Extraction
re_oncology_location_wip	re_oncology_location	Relation Extraction
re_oncology_size_wip	re_oncology_size	Relation Extraction
re_oncology_temporal_wip	re_oncology_temporal	Relation Extraction
re_oncology_test_result_wip	re_oncology_test_result	Relation Extraction
re_oncology_wip	re_oncology	Relation Extraction
redl_oncology_biobert_wip	redl_oncology_biobert	Relation Extraction (DL)
redl_oncology_biomarker_result_biobert_wip	redl_oncology_biomarker_result_biobert	Relation Extraction (DL)
redl_oncology_granular_biobert_wip	redl_oncology_granular_biobert	Relation Extraction (DL)
redl_oncology_location_biobert_wip	redl_oncology_location_biobert	Relation Extraction (DL)
redl_oncology_size_biobert_wip	redl_oncology_size_biobert	Relation Extraction (DL)
redl_oncology_temporal_biobert_wip	redl_oncology_temporal_biobert	Relation Extraction (DL)
redl_oncology_test_result_biobert_wip	redl_oncology_test_result_biobert	Relation Extraction (DL)
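Every formal name in the table is the WIP name with the `_wip` suffix removed, so existing code that references the WIP names can be migrated mechanically. A minimal sketch in plain Python follows; `formal_name` is a hypothetical helper, not a library function.

```python
def formal_name(model_name: str) -> str:
    """Drop a trailing '_wip' from a model name; leave other names unchanged."""
    suffix = "_wip"
    if model_name.endswith(suffix):
        return model_name[: -len(suffix)]
    return model_name


# A few WIP names from the table above.
for wip in ["assertion_oncology_wip",
            "re_oncology_granular_wip",
            "redl_oncology_biobert_wip"]:
    print(wip, "->", formal_name(wip))
```

The helper should only be applied to names from this table, since other `_wip` models are not covered by this announcement.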

11 New Sentence Embedding Models Fine-Tuned with Medical Assertion Datasets

Discover our new fine-tuned transformer-based sentence embedding models, meticulously trained on a curated list of clinical and biomedical datasets. These models are specifically optimized for Few-Shot Assertion tasks but are versatile enough to be utilized for other applications, such as Classification and Retrieval-Augmented Generation (RAG). Our collection offers precise and reliable embeddings tailored for various medical domains, significantly enhancing the extraction, analysis, and processing of assertion-related data in healthcare texts.

Model Name	Description
`mpnet_embeddings_medical_assertion`	Fine-tuned on the in-house dataset using the MPNet architecture.
`mpnet_embeddings_medical_assertion_jsl`	Fine-tuned on the in-house dataset using the MPNet architecture.
`mpnet_embeddings_medical_assertion_oncology`	Fine-tuned on the oncology dataset using the MPNet architecture.
`mpnet_embeddings_medical_assertion_sdoh`	Fine-tuned on the social determinants of health dataset using the MPNet architecture.
`e5_base_v2_embeddings_medical_assertion_base`	Fine-tuned on the in-house dataset using the E5 architecture.
`e5_base_v2_embeddings_medical_assertion_jsl`	Fine-tuned on the in-house dataset using the E5 architecture.
`e5_base_v2_embeddings_medical_assertion`	Fine-tuned on the in-house dataset using the E5 architecture.
`e5_base_v2_embeddings_medical_assertion_sdoh`	Fine-tuned on the social determinants of health dataset using the E5 architecture.
`e5_base_v2_embeddings_medical_assertion_smoking`	Fine-tuned on the smoking dataset using the E5 architecture.
`e5_base_v2_embeddings_medical_assertion_oncology`	Fine-tuned on the oncology dataset using the E5 architecture.
`e5_base_v2_embeddings_medical_assertion_radiology`	Fine-tuned on the radiology dataset using the E5 architecture.

Example:

mpnet_embedding = MPNetEmbeddings.pretrained("mpnet_embeddings_medical_assertion_sdoh", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("mpnet_embeddings")

text = [
    ["I feel a bit drowsy after taking an insulin."],
    ["Peter Parker is a nice lad and lives in New York"]
]

Result:

embeddings
[{sentence_embeddings, 0, 43, I feel a bit drowsy after taking an insulin., {sentence -> 0}, [-0.09830807, 0.0137982415, -0.051585164, -0.0023749713, -0.017916167, 0.017543513, 0.025593378, 0.05106…
[{sentence_embeddings, 0, 47, Peter Parker is a nice lad and lives in New York, {sentence -> 0}, [-0.10453681, 0.010062916, -0.024983741, 0.009945293, -0.01242009, 0.018787898, 0.039723188, 0.04624…
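For retrieval-style uses such as RAG, sentence embeddings are typically compared with cosine similarity. A minimal sketch in plain Python follows. The three-dimensional vectors are made-up toy values for illustration, not outputs of any model listed above; the real embeddings are much longer, as the truncated rows in the result show.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors (hypothetical values, not real model output).
query = [1.0, 0.0, 1.0]
doc_a = [2.0, 0.0, 2.0]   # same direction as the query
doc_b = [0.0, 3.0, 0.0]   # orthogonal to the query

print(cosine_similarity(query, doc_a))
print(cosine_similarity(query, doc_b))
```

Vectors pointing in the same direction score close to 1 regardless of their length, while orthogonal vectors score 0.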

Significantly Faster Vector-DB Based Entity Resolution Models Than Existing Sentence Entity Resolver Models

We have developed vector database-based entity resolution models that are 10x faster on GPU and 2x as fast on CPU compared to the existing Sentence Entity Resolver models.

NOTE: These models are not available on the Models Hub page yet and cannot be used like the other Spark NLP for Healthcare models. They will be integrated into the marketplace and made available there soon.

RxNorm Code Mapping Benchmarks and Cost Comparisons: John Snow Labs, GPT-4, and Amazon Comprehend Medical

We have prepared an accuracy benchmark and a cost analysis comparing Healthcare NLP, GPT-4, and Amazon Comprehend Medical for mapping medications to their RxNorm terms. Here are the notes:

For the ground truth dataset, we used 79 in-house clinical notes annotated by the medical experts of John Snow Labs.
John Snow Labs: We used sbiobertresolve_rxnorm_augmented and biolordresolve_rxnorm_augmented models for this benchmark. These models can return up to 25 closest results sorted by their distances.
GPT-4: Both the GPT-4 (Turbo) and GPT-4o models were used. According to the official announcement, the performance of GPT-4 and GPT-4o is almost identical, and we used both versions for the accuracy calculation. GPT-4 returns only one result per query, so its scores are the same under both evaluation approaches.
Amazon Comprehend Medical: We used the RxNorm tool of this service, which returns up to 5 closest matches sorted by their distances.
We adopted two approaches for evaluating these tools, given that the model outputs may not precisely match the annotations:
- Top-3: Compare the annotations to see if they appear in the first three results.
- Top-5: Compare the annotations to see if they appear in the first five results.
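The two evaluation approaches amount to a top-k accuracy: a prediction counts as correct if the annotated code appears among the first k candidates returned by a tool. A minimal sketch in plain Python follows. The function name and the codes in the sample data are placeholders made up for illustration, not taken from the benchmark dataset.

```python
def top_k_accuracy(gold_codes, ranked_candidates, k):
    """Fraction of items whose gold code is among the first k candidates.

    gold_codes: one annotated code per item.
    ranked_candidates: one ranked list of candidate codes per item, best first.
    """
    if len(gold_codes) != len(ranked_candidates):
        raise ValueError("need one candidate list per gold code")
    if not gold_codes:
        raise ValueError("empty evaluation set")
    hits = sum(
        1 for gold, candidates in zip(gold_codes, ranked_candidates)
        if gold in candidates[:k]
    )
    return hits / len(gold_codes)


# Placeholder codes, not real RxNorm codes.
gold = ["A", "B", "C", "D"]
ranked = [
    ["A", "X", "Y", "Z", "W"],   # hit at rank 1
    ["X", "Y", "B", "Z", "W"],   # hit at rank 3
    ["X", "Y", "Z", "W", "C"],   # hit at rank 5
    ["X"],                       # single candidate, miss at any k
]

print(top_k_accuracy(gold, ranked, k=3))  # 0.5
print(top_k_accuracy(gold, ranked, k=5))  # 0.75
```

A tool that returns a single candidate per item, as in the last row, cannot gain anything from a larger k, so its Top-3 and Top-5 scores are identical.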

Here are the accuracy results:

Top-3 Results:

[Figure: Top-3 accuracy chart]

Top-5 Results:

[Figure: Top-5 accuracy chart]

Conclusion:

Based on the evaluation results:

The sbiobertresolve_rxnorm_augmented model of John Snow Labs consistently provides the most accurate results in each top_k comparison.
The biolordresolve_rxnorm_augmented model of John Snow Labs outperforms Amazon Comprehend Medical and GPT-4 in mapping terms to their RxNorm codes.
GPT-4 could return only one result, so its score is the same in both charts, and it proved to be the least accurate.

If you want to process 1M documents and extract RxNorm codes for medication entities (excluding the NER stage), the total cost is approximately:

$4,500 with John Snow Labs, including the infrastructure costs
$24,250 with Amazon Comprehend Medical
$44,000 with GPT-4 (Turbo) and $22,000 with GPT-4o

Therefore, John Snow Labs is almost 5 times cheaper than its closest alternative, not to mention the accuracy differences (Top 3: John Snow Labs 82.7% vs Amazon 55.8% vs GPT-4 8.9%).

Accuracy & Cost Table
Tool	Top-3 Accuracy	Top-5 Accuracy	Cost
John Snow Labs	82.7%	84.6%	$4,500
Amazon Comprehend Medical	55.8%	56.2%	$24,250
GPT-4 (Turbo)	8.9%	8.9%	$44,000
GPT-4o	8.9%	8.9%	$22,000
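The cost ratios follow directly from the table. A short sketch of the arithmetic in plain Python, using only the figures quoted above:

```python
# Estimated cost in USD for 1M documents, copied from the table above.
costs = {
    "John Snow Labs": 4_500,
    "Amazon Comprehend Medical": 24_250,
    "GPT-4 (Turbo)": 44_000,
    "GPT-4o": 22_000,
}

baseline = costs["John Snow Labs"]
ratios = {
    tool: round(cost / baseline, 1)
    for tool, cost in costs.items()
    if tool != "John Snow Labs"
}

# Print the alternatives from cheapest to most expensive.
for tool, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{tool}: {ratio}x the John Snow Labs cost")
```

The cheapest alternative, GPT-4o, comes out at roughly 4.9 times the John Snow Labs figure, which is where the "almost 5 times" statement comes from.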

If you want to see more details, please check Benchmarks Page and State-of-the-art RxNorm Code Mapping with NLP: Comparative Analysis between the tools by John Snow Labs, Amazon, and GPT-4 blog post.

New Blogposts on Using NLP in Opioid Research and Healthcare: Harnessing NLP, Knowledge Graphs, and Regex Techniques for Critical Insights

Explore the latest developments in healthcare NLP and Knowledge Graphs through our new blog posts, where we take a deep dive into the innovative technologies and methodologies transforming the medical field. These posts offer insights into how the latest tools are being used to analyze large amounts of unstructured data, identify critical medical assets, and extract meaningful patterns and correlations. Learn how these advances are not only improving our understanding of complex health issues but also contributing to more effective prevention, diagnosis, and treatment strategies.

Harnessing the Power of NLP and Knowledge Graphs for Opioid Research discusses how Natural Language Processing (NLP) and Knowledge Graphs (KG) are transforming opioid research. By using NLP to process large volumes of unstructured medical data and employing Knowledge Graphs to map intricate relationships, researchers can achieve greater insights into the opioid crisis.
Extracting Critical Insights on Opioid Use Disorder with Healthcare NLP Models discusses how John Snow Labs’ Healthcare NLP models are transforming the extraction of crucial insights on opioid use disorder. These advanced NLP techniques efficiently identify and categorize medical terminology related to opioid addiction, improving clinical understanding and treatment strategies.
Extract Medical Named Entities with Regex in Healthcare NLP at Scale explains that the RegexMatcherInternal class employs regular expressions to detect and associate specific text patterns with predefined entities like dates, SSNs, and email addresses. This method facilitates targeted entity extraction by matching text patterns to these predefined entities.
Extracting Medical Named Entities with Healthcare NLP’s EntityRulerInternal explains that EntityRulerInternal in Spark NLP extracts medical entities from text using regex patterns or exact matches defined in JSON or CSV files. This post explains how to set it up and use it in a Healthcare NLP pipeline, with practical examples.
Using Contextual Assertion for Clinical Text Analysis: A Comprehensive Guide dives into leveraging Healthcare NLP, a robust NLP library, for clinical text analysis, emphasizing the role of Contextual Assertion. Contextual Assertion markedly enhances the accuracy of detecting negation, possibility, and temporality in medical records. It surpasses deep learning-based assertion status detection in accurately categorizing health conditions. Benchmark comparisons reveal an average F1 score improvement of 10-15%, highlighting the superior precision and reliability of Contextual Assertion in healthcare data analysis.
State-of-the-art RxNorm Code Mapping with NLP: Comparative Analysis between the tools by John Snow Labs, Amazon, and GPT-4 compares RxNorm code mapping accuracy and a price analysis between John Snow Labs, GPT-4, and Amazon.

New Notebooks for Medication and Resolutions Concept

To better understand the Medication and Resolutions Concept, the following notebooks have been developed:

New Clinical Medication Use Case notebook: This notebook is designed to extract and analyze medication information from a clinical dataset. Its purpose is to identify commonly used medications, gather details on dosage, frequency, strength, and route, determine current and past usage, understand pharmacological actions, identify treatment purposes, retrieve relevant codes (RxNorm, NDC, UMLS, SNOMED), and find associated adverse events.
New Resolving Medical Terms to Terminology Codes Directly notebook: This notebook shows how to optimize the process of getting SentenceEntityResolverModel outputs.
New Analyse Veterinary Documents with Healthcare NLP notebook: In this notebook, we use Spark NLP for Healthcare to process veterinary documents. We focus on Named Entity Recognition (NER) to identify entities, Assertion Status to confirm their condition, Relation Extraction to understand their relationships, and Entity Resolution to standardize terms. This helps us efficiently extract and analyze critical information from unstructured veterinary texts.

Updated Udemy MOOC (Our Online Courses) Notebooks

We have recently updated the Udemy MOOC (Massive Open Online Course) notebooks that focus on using Spark NLP annotators for healthcare applications. These notebooks provide practical examples and exercises for learning how to implement and use the Spark NLP tools and techniques designed for processing and analyzing healthcare-related text data. The updates may include new features, improvements, or additional content to enhance the learning experience for students and professionals in the healthcare field.

Please check the Spark_NLP_Udemy_MOOC folder for all the Healthcare MOOC notebooks.

Various Core Improvements; Bug Fixes, Enhanced Overall Robustness, and Reliability of Spark NLP for Healthcare

Resolved broken links in healthcare demos
Added a unique ID field for each entity into the result of the pipeline_output_parser module
Fixed deidentification AGE obfuscation hanging issue
Added DatasetInfo parameter into the MedicalNerModel annotator

Updated Notebooks And Demonstrations For making Spark NLP For Healthcare Easier To Navigate And Understand

New Clinical Medication Use Case notebook
New Resolving Medical Terms to Terminology Codes Directly notebook
New Contextual Assertion notebook
New VectorDB and PostProcessor for RAG Generative AI notebook
New Analyse Veterinary Documents with Healthcare NLP notebook
LLMLoader notebook
Updated FewShot Assertion Classifier notebook
New ALCOHOL SMOKING Demo
New JSL vs GPT4 Demo

We Have Added And Updated A Substantial Number Of New Clinical Models And Pipelines, Further Solidifying Our Offering In The Healthcare Domain.

pdf_deid_subentity_context_augmented_pipeline
ner_deid_context_nameAugmented_pipeline
ner_profiling_vop
ner_vop_v2
ner_alcohol_smoking
sbiobertresolve_snomed_veterinary
cancer_diagnosis_matcher
country_matcher
email_matcher
phone_matcher
state_matcher
zip_matcher
contextual_assertion_someone_else
contextual_assertion_absent
contextual_assertion_past
assertion_alcohol_smoking_wip
assertion_alcohol_smoking_general_symptoms_wip
re_alcohol_smoking_clinical_wip
genericclassifier_alcohol_mpnet_wip
genericclassifier_smoking_mpnet_wip
ner_menopause_core
assertion_menopause_wip
fewhot_assertion_jsl_e5_base_v2_jsl
fewhot_assertion_e5_base_v2
fewhot_assertion_sdoh_e5_base_v2_sdoh
fewhot_assertion_smoking_e5_base_v2_smoking
fewhot_assertion_oncology_e5_base_v2_oncology
fewhot_assertion_radiology_e5_base_v2_radiology
mpnet_embeddings_medical_assertion
mpnet_embeddings_medical_assertion_jsl
mpnet_embeddings_medical_assertion_oncology
mpnet_embeddings_medical_assertion_sdoh
e5_base_v2_embeddings_medical_assertion_base
e5_base_v2_embeddings_medical_assertion_jsl
e5_base_v2_embeddings_medical_assertion
e5_base_v2_embeddings_medical_assertion_sdoh
e5_base_v2_embeddings_medical_assertion_smoking
e5_base_v2_embeddings_medical_assertion_oncology
e5_base_v2_embeddings_medical_assertion_radiology
ner_profiling_sdoh
ner_profiling_oncology
explain_clinical_doc_granular
explain_clinical_doc_radiology
explain_clinical_doc_medication
medication_resolver_pipeline
medication_resolver_transform_pipeline
rxnorm_resolver_pipeline
rxnorm_mapper
biolordresolve_rxnorm_augmented
sbiobertresolve_rxnorm_augmented
sbiobertresolve_umls_clinical_drugs
sbiobertresolve_umls_disease_syndrome
sbiobertresolve_umls_drug_substance
sbiobertresolve_umls_findings
sbiobertresolve_umls_general_concepts
sbiobertresolve_umls_major_concepts
clinical_deidentification
clinical_deidentification_multi_mode_output
classifierml_ade
assertion_dl_radiology
ner_oncology_wip
ner_sdoh_access_to_healthcare_wip
ner_sdoh_community_condition_wip
ner_sdoh_demographics_wip
ner_sdoh_health_behaviours_problems_wip
ner_sdoh_income_social_status_wip
ner_sdoh_slim_wip
ner_sdoh_social_environment_wip
ner_sdoh_substance_usage_wip
ner_sdoh_wip
JSL_MedS_q16_v1
JSL_MedS_q8_v1
JSL_MedS_q4_v1
JSL_MedM_q16_v1
JSL_MedM_q8_v1
JSL_MedM_q4_v1
JSL_MedSNer_ZS_q16_v1
JSL_MedSNer_ZS_q8_v1
JSL_MedSNer_ZS_q4_v1

For all Spark NLP for Healthcare models, please check: Models Hub Page
