Healthcare NLP v5.3.1 Release Notes

5.3.1

Highlights

We are delighted to announce remarkable enhancements and updates in our latest release of Spark NLP for Healthcare. This release comes with MedDRA resolver, 2x faster optimized Deidentification pipelines, Response to Treatment classifier for oncology, and 41 new clinical pretrained models and pipelines. It’s as big as a major release!

Welcoming MedDRA into the library. Releasing 8 new Entity Resolver and Mapper models to associate clinical entities with MedDRA LLT and PT codes.
Enhancing assertion annotation workflow with AssertionMerger annotator to allow using multiple assertion models within the same pipeline.
Adding new clinical deidentification pipelines that are 2x faster. Now we have 15 pretrained deidentification pipelines of various sizes and capabilities.
Efficiency analysis and cost evaluation of deidentification pipelines on cloud platforms
Updated Opioid NER model and Drug text matcher model
New text classifier for Response to Treatment to detect the response status/outcome of treatments applied to oncology patients.
2 new Entity Resolver models for associating SNOMED clinical entities.
Clinical document analysis with one-liner pretrained pipelines for specific clinical tasks and concepts.
A new augmented NER model for multilingual name extraction by leveraging the capabilities of the LangTest library to boost its robustness significantly.
DatasetInfo parameter added to SentenceEntityResolver annotator to track the source datasets’ versions.
Robust exception handling to allow skipping only the corrupted records processed via GenericClassifier, BertSentenceChunkEmbeddings, AssertionFilterer, ChunkFilterer, ContextualParser, ChunkMerge and Deidentification annotators.
Changed the license of CPT and MedDRA models in the ModelHub, and attempting to use them in Healthcare NLP now throws an error.
Various core improvements; bug fixes, enhanced overall robustness and reliability of Spark NLP for Healthcare
- Fixed sentence positions in MedicalBertForSequenceClassification
- Updated Deidentification Module according to the latest spark versions
- Updated ALAB Module for assertion result according to tokenization flexibility
- Deprecation of the setRel Method in ChunkMapper: Transitioning to the setRels parameter
- Enhancements in SentenceEntityResolver: Bug Fix and Annotator Refactor
- Added assertion_source, ner_chunk, and ner_label metadata fields to the AssertionDL and AssertionLogReg annotators
- Implemented fixes and enhancements related to entity handling and resolution in Resolver and ChunkMapper, including incorporating an entity field in resolver metadata from embeddings, rectifying the entity field assignment in ChunkMapper, and resolving a bug with all_k_resolutions when using setMultivaluesRelations(True) in ChunkMapper
- Updated `__init__.py` and the class path for BertSentenceChunkEmbeddings
Updated notebooks and demonstrations for making Spark NLP for Healthcare easier to navigate and understand
- New Task Based Clinical Pretrained Pipelines Notebook
- Updated Clinical Assertion Model Notebook
- Updated Snomed Entity Resolver Model Training Notebook
- New Response to Treatment Classification Demo
- Updated Opioid Demo
The addition and update of numerous new clinical models and pipelines continue to reinforce our offering in the healthcare domain

These enhancements will elevate your experience with Spark NLP for Healthcare, enabling more efficient, accurate, and streamlined analysis of healthcare-related natural language data.

Welcoming MedDRA into the Library. Releasing 10 New Entity Resolver, Mapper Models and Pretrained Pipelines to Associate Clinical Entities With MedDRA LLT and PT Codes

Introducing 2 new Sentence Entity Resolver models, sbiobertresolve_meddra_lowest_level_term and sbiobertresolve_meddra_preferred_term, which map medical entities to MedDRA codes.

Model Name	Description
`sbiobertresolve_meddra_lowest_level_term`	This model maps clinical terms to their corresponding MedDRA LLT (Lowest Level Term) codes.
`sbiobertresolve_meddra_preferred_term`	This model maps clinical terms to their corresponding MedDRA PT (Preferred Term) codes.

Example:

meddra_resolver = SentenceEntityResolverModel.load("sbiobertresolve_meddra_lowest_level_term") \
     .setInputCols(["sbert_embeddings"]) \
     .setOutputCol("meddra_llt_code")\
     .setDistanceFunction("EUCLIDEAN")

text= """This is an 82-year-old male with a history of prior tobacco use, benign hypertension, chronic renal insufficiency, chronic bronchitis, gastritis, and ischemic attack. He initially presented to Braintree with ST elevation and was transferred to St. Margaret’s Center. He underwent cardiac catheterization because of the left main coronary artery stenosis, which was complicated by hypotension and bradycardia. We describe the side effects of 5-FU in a colon cancer patient who suffered mucositis and dermatitis."""

Result:

ner_chunk	begin	end	entity	meddra_llt_code	resolution	all_k_results	all_k_resolutions
tobacco	52	58	Smoking	10067622	tobacco interaction	10067622:::10086359:::10057581:::10082288:::10009180:::10…	tobacco interaction:::tobaccoism:::tobacco user:::exposur…
hypertension	72	83	Hypertension	10020772	hypertension	10020772:::10020790:::10088636:::10081425:::10015488:::10…	hypertension:::hypertension secondary:::systemic hyperten…
chronic renal insufficiency	86	112	Kidney_Disease	10050441	chronic renal insufficiency	10050441:::10009122:::10009119:::10075441:::10038474:::10…	chronic renal insufficiency:::chronic renal impairment:::…
bronchitis	123	132	Disease_Syndrome_Disorder	10006451	bronchitis	10006451:::10006448:::10008841:::10085668:::10061736:::10…	bronchitis:::bronchiolitis:::chronic bronchitis:::capilla…
gastritis	135	143	Disease_Syndrome_Disorder	10017853	gastritis	10017853:::10060703:::10076492:::10070814:::10088553:::10…	gastritis:::verrucous gastritis:::antral gastritis:::corr…
ischemic attack	150	164	Cerebrovascular_Disease	10072760	transient ischemic attack	10072760:::10060848:::10060772:::10061216:::10055221:::10…	transient ischemic attack:::ischemic cerebral infarction:…
cardiac catheterization	280	302	Procedure	10048606	cardiac catheterization	10048606:::10007527:::10054343:::10007815:::10053451:::10…	cardiac catheterization:::cardiac catheterisation:::cathe…
left main coronary artery stenosis	319	352	Heart_Disease	10090240	left main coronary artery stenosis	10090240:::10072048:::10084343:::10011089:::10083430:::10…	left main coronary artery stenosis:::left anterior descen…
hypotension	380	390	VS_Finding	10021097	hypotension	10021097:::10021107:::10066331:::10066077:::10036433:::10…	hypotension:::hypotensive:::arterial hypotension:::diasto…
bradycardia	396	406	VS_Finding	10006093	bradycardia	10006093:::10040741:::10078310:::10064883:::10065585:::10…	bradycardia:::sinus bradycardia:::central bradycardia:::r…
colon cancer	451	462	Oncological	10009944	colon cancer	10009944:::10009989:::10009957:::10061451:::10007330:::10…	colon cancer:::colonic cancer:::colon carcinoma:::colorec…
mucositis	485	493	ADE	10028127	mucositis	10028127:::10065880:::10065900:::10006525:::10021960:::10…	mucositis:::laryngeal mucositis:::tracheal mucositis:::bu…
dermatitis	499	508	ADE	10012431	dermatitis	10012431:::10048768:::10003639:::10012470:::10073737:::10…	dermatitis:::dermatosis:::atopic dermatitis:::dermatitis …
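
The `all_k_results` and `all_k_resolutions` columns hold the ranked candidates as `:::`-delimited strings. Below is a minimal sketch, assuming the two columns are aligned by position, of pairing them in plain Python. `parse_candidates` is a hypothetical helper, not part of the library, and the sample strings are the first two candidates of the hypertension row above.

```python
def parse_candidates(all_k_results, all_k_resolutions, sep=":::"):
    """Pair each candidate code with its resolution text, in rank order."""
    codes = all_k_results.split(sep)
    terms = all_k_resolutions.split(sep)
    # zip stops at the shorter list, so unequal lengths cannot misalign pairs
    return list(zip(codes, terms))

# First two candidates of the "hypertension" row in the table above
codes = "10020772:::10020790"
terms = "hypertension:::hypertension secondary"

candidates = parse_candidates(codes, terms)
for code, term in candidates:
    print(code, term)
```

The first pair is the top-ranked candidate, which is the one reported in the `meddra_llt_code` and `resolution` columns.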

6 ChunkMapper Models for Medical Code Mapping to Map Various Medical Terminologies Across Each Other

Introducing a suite of new ChunkMapper models designed to streamline medical code mapping tasks. These models include mappings between RxNorm, ICD-10, MedDRA-LLT, and MedDRA-PT codes, offering a comprehensive solution for interoperability within medical systems.

Model Name	Description
`icd10_meddra_llt_mapper`	Maps ICD-10 codes to corresponding MedDRA LLT (Lowest Level Term) codes.
`meddra_llt_icd10_mapper`	Maps MedDRA-LLT (Lowest Level Term) codes to corresponding ICD-10 codes.
`icd10_meddra_pt_mapper`	Maps ICD-10 codes to corresponding MedDRA-PT (Preferred Term) codes.
`meddra_pt_icd10_mapper`	Maps MedDRA-PT (Preferred Term) codes to corresponding ICD-10 codes.
`meddra_llt_pt_mapper`	Maps MedDRA-LLT (Lowest Level Term) codes to their corresponding MedDRA-PT (Preferred Term) codes.
`meddra_pt_llt_mapper`	Maps MedDRA-PT (Preferred Term) codes to their corresponding MedDRA-LLT (Lowest Level Term) codes.

Example:

mapperModel = ChunkMapperModel.load('meddra_llt_pt_mapper')\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("mappings")\
    .setRels(["pt_code"])  # relation matching the pt_code column in the result below

text = ["10002442", "10000007", "10003696"]

Result:

llt_code	pt_code
10002442	10002442:Angiogram pulmonary normal
10000007	10000007:17 ketosteroids urine decreased
10003696	10001324:Adrenal atrophy
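
Each mapped value in the result has the form `code:term`. As a rough sketch, assuming the code itself never contains a colon, the value can be split on the first colon only so that any colon inside the term text is preserved. `split_mapping` is a hypothetical helper; the rows are copied from the table above.

```python
def split_mapping(mapping):
    """Split 'code:term' on the first colon only."""
    code, _, term = mapping.partition(":")
    return code, term

# llt_code -> mapper output, copied from the result table above
rows = {
    "10002442": "10002442:Angiogram pulmonary normal",
    "10000007": "10000007:17 ketosteroids urine decreased",
    "10003696": "10001324:Adrenal atrophy",
}

llt_to_pt = {llt: split_mapping(value) for llt, value in rows.items()}
print(llt_to_pt["10003696"])
```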

Introducing 2 New Pretrained MedDRA Resolver Pipelines Designed For Effortless Integration With Just A Single Line Of Code

These pipelines are capable of extracting clinical entities and linking them to their respective MedDRA LLT and PT codes, while also facilitating mapping of these codes to LLT/PT or ICD-10 codes.

Pipeline Name	Description
`meddra_llt_resolver_pipeline`	This dedicated pipeline extracts clinical terms, links them to their corresponding MedDRA LLT (Lowest Level Term) codes, and maps those codes to their MedDRA PT (Preferred Term) codes and ICD-10 codes.
`meddra_pt_resolver_pipeline`	This dedicated pipeline extracts clinical terms, links them to their corresponding MedDRA PT (Preferred Term) codes, and maps those codes to their MedDRA LLT (Lowest Level Term) codes and ICD-10 codes.

Example:

from sparknlp.pretrained import PretrainedPipeline

meddra_llt_pipeline = PretrainedPipeline.from_disk("meddra_llt_resolver_pipeline")

result = meddra_llt_pipeline.fullAnnotate('This is an 82-year-old male with a history of prior tobacco use, benign hypertension, chronic renal insufficiency, chronic bronchitis, gastritis, and ischemic attack. He initially presented to Braintree with ST elevation and was transferred to St. Margaret’s Center. He underwent cardiac catheterization because of the left main coronary artery stenosis, which was complicated by hypotension and bradycardia. We describe the side effects of 5-FU in a colon cancer patient who suffered mucositis and dermatitis.')

Result:

chunk	label	meddra_llt_code	resolution	icd10_mappings	meddra_pt_mappings
tobacco	Smoking	10067622	tobacco interaction	NONE	10067622:Tobacco interaction
benign hypertension	PROBLEM	10004455	benign secondary hypertension	NONE	10039834:Secondary hypertension
chronic renal insufficiency	Kidney_Disease	10050441	chronic renal insufficiency	NONE	10064848:Chronic kidney disease
chronic bronchitis	PROBLEM	10008841	chronic bronchitis	NONE	10006458:Bronchitis chronic
gastritis	Disease_Syndrome_Disorder	10017853	gastritis	K29.6:Other gastritis	10017853:Gastritis
ischemic attack	Cerebrovascular_Disease	10072760	transient ischemic attack	NONE	10044390:Transient ischaemic attack
ST elevation	PROBLEM	10041887	st elevated	NONE	10014392:Electrocardiogram ST segment elevation
cardiac catheterization	Procedure	10048606	cardiac catheterization	Y84.0:Cardiac catheterization	10007815:Catheterisation cardiac
the left main coronary artery stenosis	PROBLEM	10090240	left main coronary artery stenosis	NONE	10011089:Coronary artery stenosis
hypotension	VS_Finding	10021097	hypotension	I95:Hypotension	10021097:Hypotension
bradycardia	VS_Finding	10006093	bradycardia	R00.1:Bradycardia, unspecified	10006093:Bradycardia
the side effects	PROBLEM	10044252	toxic symptom	NONE	10070863:Toxicity to various agents
a colon cancer	PROBLEM	10009944	colon cancer	NONE	10009944:Colon cancer
mucositis	ADE	10028127	mucositis	NONE	10028116:Mucosal inflammation
dermatitis	ADE	10012431	dermatitis	L27:Dermatitis due to substances taken internally	10012431:Dermatitis

Important note: A valid MedDRA license is required to use these MedDRA models and pipelines. Attempting to load them as pretrained models raises an error like the one below. If you hold a valid MedDRA license and wish to use these models, contact us at support@johnsnowlabs.com.

IllegalArgumentException: 'meddra_llt_pt_mapper' model cannot be used as a pretrained model.
To load this model locally via .load(), possession of a valid MedDRA / CPT license is required.
If you possess one thru corresponding agencies and wish to use this model, contact us at support@johnsnowlabs.com.

Enhancing Assertion Annotation Workflow with AssertionMerger Annotator to Allow Using Multiple Assertion Models within the Same Pipeline

Introducing the latest addition to our annotation toolkit, the AssertionMerger annotator, which merges assertion columns produced by different annotators such as AssertionDL and AssertionLogReg. It offers customizable parameters for filtering, prioritizing, and combining assertion annotations: merging overlapping annotations, applying filters before or after merging, and prioritizing by confidence level and assertion source.

Parameters:

mergeOverlapping: Whether to merge overlapping matched assertion annotations. Default: True
applyFilterBeforeMerge: Whether to apply filtering before the merging process. If True, filtering will be applied before merging; if False, filtering will be applied after the merging process. Default: False.
blackList: If defined, list of entities to ignore. The rest will be processed.
whiteList: If defined, list of entities to process. The rest will be ignored. Do not include the IOB prefix on labels.
caseSensitive: Determines whether the definitions of the white-listed and black-listed entities are case sensitive. Default: True.
assertionsConfidence: Pairs (assertion, confidenceThreshold) to filter assertions that have confidence lower than the confidence threshold.
orderingFeatures: Specifies the ordering features to use for overlapping entities. Possible values include: ‘begin’, ‘end’, ‘length’, ‘source’, and ‘confidence’. Default: ['begin', 'length', 'source']
selectionStrategy: Determines the strategy for selecting annotations. Annotations can be selected either sequentially based on their order (Sequential) or using a more diverse strategy (DiverseLonger). Currently, only Sequential and DiverseLonger options are available. Default: Sequential.
defaultConfidence: When the confidence value is included in the orderingFeatures and a given annotation does not have any confidence, this parameter determines the value to be used. The default value is 0.
assertionSourcePrecedence: Specifies the assertion sources to use for prioritizing overlapping annotations when the ‘source’ ordering feature is utilized. This parameter contains a comma-separated list of assertion sources that drive the prioritization. Annotations will be prioritized based on the order of the given string.
sortByBegin: Whether to sort the annotations by begin at the end of the merge and filter process. Default: False.
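
To make two of these parameters concrete, here is a plain-Python sketch of the ideas behind `assertionsConfidence` (drop assertions below a per-label threshold) and `assertionSourcePrecedence` (prefer one source among overlapping annotations). This illustrates the concepts only and is not the annotator's implementation. The `Assertion` class, both helper functions, and the sample annotations are hypothetical and invented for this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Assertion:
    chunk: str
    begin: int
    end: int
    label: str
    source: str
    confidence: float

def filter_by_confidence(assertions, thresholds):
    """Keep assertions whose confidence meets the threshold set for their label.

    Labels without a threshold are always kept.
    """
    return [a for a in assertions if a.confidence >= thresholds.get(a.label, 0.0)]

def pick_by_source(overlapping, precedence):
    """Return the overlapping assertion whose source comes first in `precedence`."""
    order = [name.strip() for name in precedence.split(",")]
    rank = {name: i for i, name in enumerate(order)}
    # Sources missing from the precedence string rank last
    return min(overlapping, key=lambda a: rank.get(a.source, len(order)))

# Hypothetical annotations, invented for this sketch
candidates = [
    Assertion("fever", 0, 4, "past", "assertion_jsl", 0.60),
    Assertion("cough", 10, 14, "present", "assertion_jsl", 0.95),
    Assertion("cough", 10, 14, "absent", "assertion_dl", 0.80),
]

kept = filter_by_confidence(candidates, {"past": 0.70})
overlapping = [a for a in kept if a.chunk == "cough"]
winner = pick_by_source(overlapping, "assertion_dl, assertion_jsl")
print(len(kept), winner.source, winner.label)
```

In this sketch the low-confidence "past" assertion is filtered out first, and the source listed first in the precedence string wins even though its confidence is lower.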

Example:

# Assertion model trained on i2b2 (sampled from MIMIC) dataset
assertion_jsl = AssertionDLModel.pretrained("assertion_jsl_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_jsl_chunk", "embeddings"]) \
    .setOutputCol("assertion_jsl")\
    .setEntityAssertionCaseSensitive(False)

# Assertion model trained on radiology dataset
assertion_dl = AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models") \
    .setInputCols(["sentence", "ner_clinical_chunk", "embeddings"]) \
    .setOutputCol("assertion_dl")

assertion_merger = AssertionMerger() \
    .setInputCols("assertion_jsl", "assertion_dl") \
    .setOutputCol("assertion_merger") \
    .setMergeOverlapping(True) \
    .setSelectionStrategy("sequential") \
    .setAssertionSourcePrecedence("assertion_dl, assertion_jsl") \
    .setCaseSensitive(False) \
    .setAssertionsConfidence({"past": 0.70}) \
    .setOrderingFeatures(["length", "source", "confidence"]) \
    .setDefaultConfidence(0.50)


text = [
    """Patient had a headache for the last 2 weeks, and appears anxious when she walks fast. No alopecia noted. She denies pain. Her father is paralyzed and it is a stressor for her. She got antidepressant. We prescribed sleeping pills for her current insomnia."""
]

Result:

ner_chunk	begin	end	ner_label	assertion	assertion_source	confidence
headache	14	21	Symptom	Past	assertion_jsl	0.9999
anxious	57	63	PROBLEM	present	assertion_dl	0.9392
alopecia	89	96	PROBLEM	absent	assertion_dl	0.9992
pain	116	119	PROBLEM	absent	assertion_dl	0.9884
paralyzed	136	144	Symptom	Family	assertion_jsl	0.9995
stressor	158	165	Symptom	Family	assertion_jsl	1.0
antidepressant	184	197	TREATMENT	present	assertion_dl	0.9628
sleeping pills	214	227	TREATMENT	present	assertion_dl	0.998
insomnia	245	252	Symptom	Past	assertion_jsl	0.9862

Please check the Clinical Assertion Model Notebook for more information.

Adding New Clinical Deidentification Pipelines for Flexible Deployments

The Comprehensive Clinical Deidentification Pipeline offers a robust solution for anonymizing sensitive PHI (Protected Health Information) in medical texts. This versatile pipeline can mask and obfuscate a wide array of PHI entities, including AGE, CONTACT, DATE, ID, LOCATION, NAME, PROFESSION, CITY, COUNTRY, DOCTOR, HOSPITAL, IDNUM, MEDICALRECORD, ORGANIZATION, PATIENT, PHONE, STREET, USERNAME, ZIP, ACCOUNT, LICENSE, VIN, SSN, DLN, PLATE, IPADDR, and more. With support for multiple languages, including Arabic, German, French, English, Spanish, Italian, Portuguese, and Romanian, this pipeline supports compliance with privacy regulations across diverse healthcare settings. Choose from various models optimized for different use cases, such as obfuscation levels and subentity recognition, to tailor the deidentification process to specific requirements.

index	model	lang
1	clinical_deidentification	ar, de, en, es, fr, it, pt, ro
2	clinical_deidentification_augmented	es
3	clinical_deidentification_generic	en
4	clinical_deidentification_generic_optimized	en
5	clinical_deidentification_glove	en
6	clinical_deidentification_glove_augmented	en
7	clinical_deidentification_langtest	en
8	clinical_deidentification_multi_mode_output	en
9	clinical_deidentification_obfuscation_medium	en
10	clinical_deidentification_obfuscation_small	en
11	clinical_deidentification_slim	en
12	clinical_deidentification_subentity	en
13	clinical_deidentification_subentity_nameAugmented	en
14	clinical_deidentification_subentity_optimized	en
15	clinical_deidentification_wip	en

We will share a detailed table on our wiki page soon to explain the pros and cons of each model, along with tips and tricks for using them effectively.

Efficiency Analysis and Cost Evaluation of Deidentification Pipelines on Cloud Platforms

These results show speed benchmarks and cost evaluations for deidentification pipelines deployed on cloud platforms, including AWS EMR and EC2. Results for Databricks are forthcoming and will extend the analysis.

Partition	EMR Base Pipeline	EMR Optimized Pipeline	EC2 Instance Base Pipeline	EC2 Instance Optimized Pipeline
1024	5 min 1 sec	2 min 45 sec	7 min 6 sec	3 min 26 sec
512	4 min 52 sec	2 min 30 sec	6 min 56 sec	3 min 41 sec
256	4 min 50 sec	2 min 30 sec	9 min 10 sec	5 min 18 sec
128	4 min 55 sec	2 min 30 sec	14 min 30 sec	7 min 51 sec
64	6 min 24 sec	3 min 8 sec	18 min 59 sec	9 min 9 sec
32	7 min 15 sec	3 min 43 sec	18 min 47.2 sec	9 min 18 sec
16	11 min 6 sec	4 min 57 sec	12 min 47.5 sec	6 min 14 sec
8	19 min 13 sec	8 min 8 sec	16 min 52 sec	8 min 48 sec
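
The speedup of the optimized pipeline over the base pipeline can be read off the table by converting each duration to seconds and taking the ratio. The snippet below is a small sketch of that arithmetic for three EMR rows; `to_seconds` is a hypothetical helper written for this example.

```python
import re

def to_seconds(duration):
    """Convert a string like '5 min 1 sec' or '18 min 47.2 sec' to seconds."""
    match = re.fullmatch(r"(\d+) min ([\d.]+) sec", duration.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {duration!r}")
    return int(match.group(1)) * 60 + float(match.group(2))

# (partition, EMR base, EMR optimized) rows copied from the table above
emr_rows = [
    (1024, "5 min 1 sec", "2 min 45 sec"),
    (256, "4 min 50 sec", "2 min 30 sec"),
    (64, "6 min 24 sec", "3 min 8 sec"),
]

speedups = {p: to_seconds(base) / to_seconds(opt) for p, base, opt in emr_rows}
for partition, ratio in speedups.items():
    print(f"{partition} partitions: {ratio:.2f}x")
```

For these rows the ratio falls between roughly 1.8x and 2.0x, in line with the "2x faster" figure quoted in the highlights.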

Estimated Minimum Costs:

EMR Base Pipeline: partition number: 256, 10K cost:$1.04, 1M cost:$104.41
EMR Optimized Pipeline: partition number: 256, 10K cost:$0.54, 1M cost:$54.04
EC2 Instance Base Pipeline: partition number: 512, 10K cost:$0.36, 1M cost:$35.70
EC2 Instance Optimized Pipeline: partition number: 1024, 10K cost:$0.18, 1M cost:$17.85
Databricks results will be published soon.

Updated Opioid NER Model and Drug Text Matcher Model

ner_opioid: This updated opioid-related Named Entity Recognition model has been enhanced with new annotated text data. Opioids are a class of drugs that includes the illegal drug heroin, synthetic opioids such as fentanyl, and pain relievers available legally by prescription. The model detects and labels opioid-related entities in text. It has been retrained using deep learning techniques on an expanded and more diverse range of text sources, including newly annotated text focused on opioid-related content.

Example:

ner_model = MedicalNerModel.pretrained("ner_opioid", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

sample_texts = """History of Present Illness: A 20-year-old male was transferred from an outside hospital for evaluation for liver transplant following a Percocet overdose. On Sunday, March 27th, he experienced a stressful day and consumed approximately 20 Percocet (5/325) tablets throughout the day following a series of family arguments. He denies any intent to harm himself, although his parents confirm past suicidal attempts. On Monday, he felt he was experiencing a Percocet withdrawal "hangover" and took an additional 5 Percocet. He was admitted to the Surgical Intensive Care Unit (SICU) and received care from Liver, Transplant, Toxicology. Treatment included Naloxone every 4 hours, resulting in a gradual improvement in liver function tests (LFTs) and INR. During recovery, he developed hypertension and was initiated on clonidine."""

Result:

chunk	begin	end	ner_label
Percocet	136	143	opioid_drug
overdose	145	152	other_disease
20	236	237	drug_quantity
Percocet	239	246	opioid_drug
tablets	256	262	drug_form
harm himself	347	358	violence
suicidal attempts	395	411	psychiatric_issue
Percocet	455	462	opioid_drug
withdrawal	464	473	general_symptoms
hangover	476	483	general_symptoms
5	509	509	drug_quantity
Percocet	511	518	opioid_drug
Naloxone	653	660	antidote
every 4 hours	662	674	drug_frequency
hypertension	782	793	other_disease
clonidine	816	824	other_drug

Benchmark:

                 label  precision    recall  f1-score   support
           alcohol_use       0.92      0.95      0.94       353
              antidote       1.00      0.99      0.99       141
  communicable_disease       0.76      0.88      0.82       224
         drug_duration       0.81      0.71      0.75       238
             drug_form       0.97      0.95      0.96       614
        drug_frequency       0.94      0.97      0.96      1527
         drug_quantity       0.96      0.94      0.95      2169
            drug_route       0.95      0.98      0.97       903
         drug_strength       0.84      0.95      0.89       388
            employment       0.79      0.63      0.70       306
      general_symptoms       0.90      0.84      0.87      4483
           legal_issue       0.73      0.52      0.61        84
        marital_status       0.95      0.95      0.95        57
           opioid_drug       0.98      0.96      0.97       725
         other_disease       0.91      0.90      0.90      4145
            other_drug       0.94      0.93      0.94      2617
     psychiatric_issue       0.88      0.85      0.86      1356
    sexual_orientation       1.00      0.78      0.88        23
substance_use_disorder       0.91      0.88      0.90       276
                  test       0.97      0.93      0.95       102
           test_result       1.00      0.93      0.97        30
              violence       0.81      0.71      0.76       542
             micro-avg       0.92      0.89      0.90     21303
             macro-avg       0.91      0.87      0.89     21303
          weighted-avg       0.91      0.89      0.90     21303
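
The `f1-score` column is the harmonic mean of precision and recall. A short sketch of that formula, checked against three rows of the benchmark above, follows. Because the table reports precision and recall rounded to two decimals, a recomputed F1 can differ from the listed value in the last digit for some labels.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the benchmark above
rows = {
    "antidote": (1.00, 0.99),
    "legal_issue": (0.73, 0.52),
    "opioid_drug": (0.98, 0.96),
}

recomputed = {label: round(f1_score(p, r), 2) for label, (p, r) in rows.items()}
print(recomputed)
```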

drug_matcher: The latest iteration of the Drug-related Text Matcher Model has been enhanced significantly, boasting an expanded database with the inclusion of an additional 100 thousand drugs. Through meticulous curation, the model has undergone refinement by strategically eliminating words that may have previously led to false positives.

Example:

text_matcher = TextMatcherInternalModel.pretrained("drug_matcher","en","clinical/models") \
    .setInputCols(["sentence", "token"])\
    .setOutputCol("matched_text")

sample_texts = """John's doctor prescribed aspirin for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsilitis and lansoprazole for his GORD on 2023-12-01."""

Result:

chunk	begin	end	label
aspirin	25	31	DRUG
paracetamol	69	79	DRUG
amoxicillin	109	119	DRUG
lansoprazole	144	155	DRUG
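
The `begin` and `end` columns are character offsets into the input text, with `end` inclusive. A quick way to sanity-check a result table is to slice the text with `end + 1`, as in this sketch. `chunk_at` is a hypothetical helper, and the sample text and rows are copied from the example above.

```python
text = ("John's doctor prescribed aspirin for his heart condition, along with "
        "paracetamol for his fever and headache, amoxicillin for his tonsilitis "
        "and lansoprazole for his GORD on 2023-12-01.")

# (chunk, begin, end) rows from the result table above
matches = [
    ("aspirin", 25, 31),
    ("paracetamol", 69, 79),
    ("amoxicillin", 109, 119),
    ("lansoprazole", 144, 155),
]

def chunk_at(text, begin, end):
    """Return the substring for an annotation whose `end` offset is inclusive."""
    return text[begin:end + 1]

recovered = [chunk_at(text, begin, end) for _, begin, end in matches]
print(recovered)
```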

New Oncological Response to Treatment Classification Model

The Oncological Response to Treatment classifier was trained on a diverse dataset and provides accurate label assignments with confidence scores for its predictions. It categorizes text into two labels: Yes and No.

Example:

sequenceClassifier = MedicalBertForSequenceClassification\
    .pretrained("bert_sequence_classifier_response_to_treatment", "en", "clinical/models")\
    .setInputCols(["document", "token"])\
    .setOutputCol("prediction")


sample_texts = [
    "The breast ultrasound after neoadjuvant chemotherapy displayed a decrease in the primary lesion size from 3 cm to 1 cm, suggesting a favorable response to treatment. The skin infection is also well controlled with a multi-antibiotic approach. ",
    "MRI of the pelvis indicated no further progression of endometriosis after laparoscopic excision and six months of hormonal suppression therapy.",
    "A repeat endoscopy revealed healing gastric ulcers with new signs of malignancy or H. pylori infection. Will discuss the PPI continuum.",
    "Dynamic contrast-enhanced MRI of the liver revealed no significant reduction in the size and number of hepatic metastases following six months of targeted therapy with sorafenib."
]

Result:

text	result
The breast ultrasound after neoadjuvant chemotherapy displayed a decrease in the primary lesion s…	Yes
MRI of the pelvis indicated no further progression of endometriosis after laparoscopic excision a…	Yes
A repeat endoscopy revealed healing gastric ulcers with new signs of malignancy or H. pylori infe…	No
Dynamic contrast-enhanced MRI of the liver revealed no significant reduction in the size and numb…	No

Benchmark:

       label  precision    recall  f1-score   support
          No     0.9927    0.9875    0.9901      3031
         Yes     0.8430    0.9027    0.8718       226
    accuracy          -         -    0.9816      3257
   macro-avg     0.9178    0.9451    0.9309      3257
weighted-avg     0.9823    0.9816    0.9819      3257
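
The summary rows of this benchmark follow from the per-class rows. The sketch below recomputes the macro average (unweighted mean over classes), the weighted average (mean weighted by support), and accuracy, which for single-label classification equals the support-weighted recall. The helper names are hypothetical, and small differences from the table in the last digit come from rounding of the per-class inputs.

```python
# label: (precision, recall, f1, support), copied from the benchmark above
rows = {
    "No": (0.9927, 0.9875, 0.9901, 3031),
    "Yes": (0.8430, 0.9027, 0.8718, 226),
}

total = sum(r[3] for r in rows.values())

def macro(idx):
    """Unweighted mean of one metric column across classes."""
    return sum(r[idx] for r in rows.values()) / len(rows)

def weighted(idx):
    """Mean of one metric column weighted by class support."""
    return sum(r[idx] * r[3] for r in rows.values()) / total

macro_f1 = macro(2)
weighted_f1 = weighted(2)
# Accuracy equals support-weighted recall when every sample has one true label
accuracy = weighted(1)
print(f"{macro_f1:.4f} {weighted_f1:.4f} {accuracy:.4f}")
```

The gap between the macro and weighted F1 reflects the class imbalance: the Yes class has far fewer samples and a lower score, which pulls the macro average down more than the weighted one.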

2 New Sentence Entity Resolver Models for Associating SNOMED Clinical Entities

Introducing 2 new Sentence Entity Resolver models, sbiobertresolve_snomed_no_class and sbiobertresolve_snomed_conditions, which map medical entities to SNOMED codes.

Model Name	Description
`sbiobertresolve_snomed_no_class`	This model maps extracted medical entities (no concept class) to SNOMED codes.
`sbiobertresolve_snomed_conditions`	This model maps clinical conditions to their corresponding SNOMED (domain: Conditions) codes.

Example:

resolver = SentenceEntityResolverModel\
    .pretrained("sbiobertresolve_snomed_conditions", "en", "clinical/models")\
    .setInputCols(["sbert_embeddings"]) \
    .setOutputCol("resolution")\
    .setDistanceFunction("EUCLIDEAN")

text = """Medical professionals rushed in the bustling emergency room to attend to the patient with alarming symptoms.
The attending physician immediately noted signs of respiratory distress, including stridor, a high-pitched sound indicative of upper respiratory tract obstruction.
The patient, struggling to breathe, exhibited dyspnea. Concern heightened when they began experiencing syncope,
a sudden loss of consciousness likely stemming from inadequate oxygenation. Further examination revealed a respiratory tract hemorrhage."""

Result:

chunk	label	snomed_code	resolution	all_codes	all_resolutions
respiratory distress	VS_Finding	271825005	respiratory distress	271825005,418092006…	respiratory distress,respiratory tract cong…
stridor	Symptom	70407001	stridor	70407001,301826004:…	stridor,intermittent stridor,inhalatory str…
high-pitched sound	Symptom	51406002	high pitched voice	51406002,271661003:…	high pitched voice,heart sounds exaggerated…
upper respiratory tract obstruction	Disease_Syndrome_Disorder	68372009	upper respiratory tract obstruction	68372009,79688008::…	upper respiratory tract obstruction,respira…
struggling to breathe	Symptom	289105003	difficulty controlling breathing	289105003,230145002…	difficulty controlling breathing,difficulty…
dyspnea	Symptom	267036007	dyspnea	267036007,60845006:…	dyspnea,exertional dyspnea,inspiratory dysp…
syncope	Symptom	271594007	syncope	271594007,234167006…	syncope,situational syncope,tussive syncope…
loss of consciousness	Symptom	419045004	loss of consciousness	419045004,44077006:…	loss of consciousness,loss of sensation,los…
inadequate oxygenation	Symptom	238161004	impaired oxygen delivery	238161004,70944005:…	impaired oxygen delivery,impaired gas excha…
respiratory tract hemorrhage	Disease_Syndrome_Disorder	95431003	respiratory tract hemorrhage	95431003,233783005:…	respiratory tract hemorrhage,tracheal hemor…

Clinical Document Analysis with One-Liner Pretrained Pipelines for Specific Clinical Tasks and Concepts

We introduce a suite of advanced, hybrid pretrained pipelines, specifically designed to streamline the process of analyzing clinical documents. These pipelines are built upon multiple state-of-the-art (SOTA) pretrained models, delivering a comprehensive solution for extracting vital information with unprecedented ease.

What sets this release apart is that it removes the complexity typically involved in building and chaining models. Users no longer need to construct intricate pipelines from scratch or guess which model combinations are most effective. Our new pretrained pipelines simplify these processes, offering a seamless, user-friendly experience.

Pipeline Name	Description
`icd10cm_rxnorm_resolver_pipeline`	This pipeline can extract clinical conditions and medication entities, map the clinical conditions to their respective ICD-10-CM codes, and medication entities to RxNorm codes.
`snomed_term_resolver_pipeline`	This pretrained resolver pipeline extracts SNOMED terms and maps them to their corresponding SNOMED codes.
`snomed_findings_resolver_pipeline`	This pipeline extracts clinical findings and maps them to their corresponding SNOMED (CT version) codes.
`snomed_body_structure_resolver_pipeline`	This pipeline extracts anatomical structure entities and maps them to their corresponding SNOMED (body structure version) codes.
`snomed_auxConcepts_resolver_pipeline`	This pipeline extracts `Morph Abnormality`, `Clinical Drug`, `Clinical Drug Form`, `Procedure`, `Substance`, `Physical Object`, and `Body Structure` concepts from clinical notes, then maps them to their corresponding SNOMED codes.
`snomed_conditions_resolver_pipeline`	This advanced pipeline extracts clinical conditions from clinical texts and maps these entities to their corresponding SNOMED codes.
`snomed_drug_resolver_pipeline`	This advanced pipeline extracts drug entities from clinical texts and maps these entities to their corresponding SNOMED codes.
`snomed_resolver_pipeline`	This pipeline extracts `Clinical Findings` , `Morph Abnormality`, `Clinical Drug`, `Clinical Drug Form`, `Procedure`, `Substance`, `Physical Object`, and `Body Structure` concepts from clinical notes and maps them to their corresponding SNOMED codes.
`clinical_deidentification_generic_optimized`	This pipeline can be used to deidentify PHI information from medical texts.
`clinical_deidentification_subentity_nameAugmented`	This pipeline can be used to deidentify PHI information from medical texts. The PHI information will be masked and obfuscated in the resulting text.
`clinical_deidentification_subentity_optimized`	This pipeline can be used to deidentify PHI information from medical texts. The PHI information will be obfuscated in the resulting text and also masked with entity labels in the metadata.
`explain_clinical_doc_public_health`	This specialized public health pipeline can extract public health-related entities, assign assertion status to the extracted entities, and establish relations between the extracted entities from clinical documents. It uses five NER models, one assertion model, and one relation extraction model.
`explain_clinical_doc_biomarker`	This specialized biomarker pipeline can extract biomarker entities, classify sentences by whether they contain biomarker entities, and establish relations between the extracted biomarkers and biomarker results from clinical documents.
`explain_clinical_doc_risk_factors`	This pipeline is designed to extract all clinical/medical entities that may be considered risk factors from text, assign assertion status to the extracted entities, and establish relations between the extracted entities.
`clinical_deidentification_multi_mode_output`	This pipeline simultaneously produces four versions of the text: masked with entity labels, masked with fixed-length characters, masked with same-length characters, and obfuscated.
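As a minimal sketch of the one-liner usage, assuming a licensed Spark NLP for Healthcare session is already running: the pipeline name is taken from the table above, while the `"en"` language code, the `ner_chunk` output key, and the helper names `run_pipeline` and `chunks_to_rows` are illustrative assumptions rather than anything these notes confirm.

```python
from types import SimpleNamespace
from typing import Any, Dict, List


def run_pipeline(name: str, text: str, lang: str = "en") -> Dict[str, List[Any]]:
    """Download a pretrained pipeline and annotate a single text.

    Assumes a licensed Spark NLP for Healthcare session has been started.
    The import is local so the helper below stays usable without Spark.
    """
    from sparknlp.pretrained import PretrainedPipeline

    pipeline = PretrainedPipeline(name, lang, "clinical/models")
    # fullAnnotate returns one dict per input text: output column -> annotations
    return pipeline.fullAnnotate(text)[0]


def chunks_to_rows(annotations: List[Any]) -> List[Dict[str, Any]]:
    """Flatten annotation-like objects into plain dicts for inspection."""
    return [
        {
            "chunk": a.result,
            "begin": a.begin,
            "end": a.end,
            "label": a.metadata.get("entity"),
        }
        for a in annotations
    ]


# Offline stand-in for one annotation, so the helper can be tried without Spark.
# The values are made up for illustration; they are not real pipeline output.
stub = SimpleNamespace(result="metformin", begin=18, end=26, metadata={"entity": "DRUG"})
example_rows = chunks_to_rows([stub])

# With a live session, usage might look like this ("ner_chunk" is an assumed key):
# result = run_pipeline("icd10cm_rxnorm_resolver_pipeline",
#                       "The patient takes metformin for type 2 diabetes.")
# rows = chunks_to_rows(result["ner_chunk"])
```

The output column names differ between pipelines, so it is worth inspecting `result.keys()` before picking a column to flatten.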

A New Augmented NER Model for Multilingual Name Extraction by Leveraging the Capabilities of the LangTest Library to Boost Its Robustness Significantly

The newly introduced NER model, ner_deid_name_multilingual_clinical_langtest, extracts names from clinical text in multiple languages. It was augmented with the LangTest library, an NLP toolkit for testing and improving model robustness.

Example:

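from sparknlp_jsl.annotator import MedicalNerModel

# Load the multilingual ("xx") name-extraction NER model
ner = MedicalNerModel.pretrained("ner_deid_name_multilingual_clinical_langtest", "xx", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")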


text_list = [
    """Record date: 2093-01-13, David Hale, M.D., Name: Hendrickson, Ora MR. # 7194334 Date: 01/13/93 PCP: Oliveira, 25 years old, Record date: 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. The patient's complaints first surfaced when he started working for Brothers Coal-Mine.""",
    """J'ai vu en consultation Michel Martinez (49 ans) adressé au Centre Hospitalier De Plaisir pour un diabète mal contrôlé avec des symptômes datant de Mars 2015.""",
    """Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.""",
    """Ho visto Gastone Montanariello (49 anni) riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015.""",
    """Antonio Miguel Martínez, un varón de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos.""",
    """Detalhes do paciente:
    Nome do paciente: Pedro Gonçalves NHC: 2569870 Endereço: Rua Das Flores 23. Cidade/ Província: Porto Código Postal: 21754-987 Dados de cuidados Data de nascimento: 10/10/1963 Idade: 53 anos Data de admissão: 17/06/2016 Doutora: Maria Santos""",
    """Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România Tel: +40(235)413773 Data setului de analize: 25 May 2022 15:36:00 Nume&Prenume: BUREAN MARIA, Varsta: 77 CNP: 2450502264401"""
]

Result:

ner_chunk	begin	end	ner_label
David Hale	25	34	NAME
Hendrickson, Ora	49	64	NAME
Brothers Coal-Mine	291	308	NAME
Michel Martinez	24	38	NAME
Michael Berger	0	13	NAME
Berger	117	122	NAME
Gastone Montanariello	9	29	NAME
Antonio Miguel Martínez	0	22	NAME
Pedro Gonçalves	41	55	NAME
Maria Santos	251	262	NAME
BUREAN MARIA	154	165	NAME

Please check: ner_deid_name_multilingual_clinical_langtest
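The `begin` and `end` columns in the result table appear to be character offsets into each input text, with `end` inclusive. The short check below illustrates this against the German example; the helper name `extract` is hypothetical, and the two rows are copied from the table above.

```python
from typing import List, Tuple

german_text = (
    "Michael Berger wird am Morgen des 12 Dezember 2018 ins "
    "St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. "
    "Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen."
)

# (chunk, begin, end) rows copied from the result table
reported: List[Tuple[str, int, int]] = [
    ("Michael Berger", 0, 13),
    ("Berger", 117, 122),
]


def extract(text: str, begin: int, end: int) -> str:
    """Slice a chunk out of text, treating `end` as an inclusive offset."""
    return text[begin:end + 1]


# Each reported span should reproduce its chunk exactly
for chunk, begin, end in reported:
    assert extract(german_text, begin, end) == chunk
```

Because `end` is inclusive, a plain Python slice needs `end + 1`; slicing with `text[begin:end]` would drop the last character of every name.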

`DatasetInfo` Parameter Added to `SentenceEntityResolver` Annotator to Track the Source Datasets’ Versions.

A new setDatasetInfo parameter on the SentenceEntityResolverApproach annotator lets users record dataset information (version, year, etc.) in the model metadata. This information is stored with the model only; it is not added to the annotation output.

Example:

from sparknlp_jsl.annotator import SentenceEntityResolverApproach

bertExtractor = SentenceEntityResolverApproach()\
  .setNeighbours(25)\
  .setThreshold(1000)\
  .setInputCols("bert_embeddings")\
  .setNormalizedCol("concept_name")\
  .setLabelCol("conceptId")\
  .setOutputCol("snomed_code")\
  .setDistanceFunction("EUCLIDIAN")\
  .setCaseSensitive(False)\
  .setDatasetInfo("the model version:531")  # free-text note stored in the model metadata

Robust Exception Handling to Allow Skipping only the Corrupted Records Processed via `GenericClassifier`, `BertSentenceChunkEmbeddings`, `AssertionFilterer`, `ChunkFilterer`, `ContextualParser`, `ChunkMerge` and `Deidentification` Annotators

We added the doExceptionHandling parameter to the GenericClassifier, BertSentenceChunkEmbeddings, AssertionFilterer, ChunkFilterer, ContextualParser, ChunkMerge and Deidentification annotators to handle failures caused by corrupted inputs. When it is set to True, the annotator processes records as usual; if a record causes an exception (e.g. a corrupted record or document), the annotator emits a warning containing the exception message and moves on to the next record, so the rest of the records in the same batch are processed without interruption. The default is False, in which case the annotator throws the exception and stops the process to inform users.

Example:

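from sparknlp_jsl.annotator import DeIdentification

deidentification = DeIdentification() \
    .setInputCols(["sentence", "token", "ner_chunk"]) \
    .setOutputCol("deidentified") \
    .setMode("mask")\
    .setDoExceptionHandling(True)  # warn and skip corrupted records instead of failing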
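The skip-and-continue behavior described above can be illustrated with a small pure-Python sketch. This is not the library's implementation: `process_batch` and `toy_annotate` are hypothetical names, and using `None` as the placeholder for a skipped record is an assumption made only for this illustration.

```python
import warnings
from typing import Callable, Iterable, List, Optional


def process_batch(
    records: Iterable[str],
    annotate: Callable[[str], str],
    do_exception_handling: bool = False,
) -> List[Optional[str]]:
    """Annotate each record; optionally skip records that raise."""
    results: List[Optional[str]] = []
    for record in records:
        try:
            results.append(annotate(record))
        except Exception as exc:
            if not do_exception_handling:
                # Default behavior: surface the error and stop the whole batch
                raise
            # Handling enabled: warn with the exception message, keep going
            warnings.warn(f"Skipping corrupted record: {exc}")
            results.append(None)  # assumed placeholder for the skipped record
    return results


def toy_annotate(record: str) -> str:
    """Stand-in annotator that fails on empty input."""
    if not record:
        raise ValueError("empty record")
    return record.upper()
```

With handling enabled, `process_batch(["a", "", "b"], toy_annotate, do_exception_handling=True)` returns a result for every healthy record and emits one warning; with the default, the same call raises `ValueError` at the empty record.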

Changed the license of CPT and MedDRA Models in the ModelHub, and Attempting to Use Them in Healthcare NLP now Throws an Error

The CPT and MedDRA models have been removed from the S3 storage. Attempting to use these models in Spark NLP now throws a new error stating that the specified model (e.g., ‘aaa’) cannot be used as a pretrained model, and that loading it locally with the .load() method requires a valid MedDRA or CPT license. Users who hold such a license from the corresponding agency should contact the support team at support@johnsnowlabs.com to ask about using the model.

Various Core Improvements: Bug Fixes, Enhanced Overall Robustness, And Reliability Of Spark NLP For Healthcare

Fixed sentence positions in MedicalBertForSequenceClassification
Updated Deidentification Module according to the latest Spark versions
Updated ALAB Module for assertion result according to tokenization flexibility
Deprecated the setRel method in ChunkMapper in favor of the setRels parameter
Enhancements in SentenceEntityResolver: Bug Fix and Annotator Refactor
Added assertion_source, ner_chunk, and ner_label metadata fields to the AssertionDL and AssertionLogReg annotators
Implemented fixes and enhancements related to entity handling and resolution in Resolver and ChunkMapper, including incorporating an entity field in resolver metadata from embeddings, rectifying the entity field assignment in ChunkMapper, and resolving a bug with all_k_resolutions when using setMultivaluesRelations(True) in ChunkMapper
Updated __init__.py and the class path for BertSentenceChunkEmbeddings
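For code that must run on both older and newer versions during the `setRel` to `setRels` transition listed above, one option is a small compatibility shim. This is a hedged sketch: `set_relations` is a hypothetical helper, the two stub classes only mimic the setter names for offline demonstration, and it assumes `setRels` takes a list of relation names while `setRel` takes a single one.

```python
from typing import Any, List


def set_relations(mapper: Any, rels: List[str]) -> Any:
    """Prefer setRels; fall back to the deprecated setRel when it is absent."""
    if hasattr(mapper, "setRels"):
        return mapper.setRels(rels)
    if len(rels) != 1:
        raise ValueError("setRel accepts exactly one relation")
    return mapper.setRel(rels[0])


class NewStyleMapper:
    """Stub mimicking a mapper that exposes setRels (illustration only)."""

    def setRels(self, rels: List[str]) -> "NewStyleMapper":
        self.rels = list(rels)
        return self


class OldStyleMapper:
    """Stub mimicking a mapper that only exposes setRel (illustration only)."""

    def setRel(self, rel: str) -> "OldStyleMapper":
        self.rels = [rel]
        return self


new_mapper = set_relations(NewStyleMapper(), ["action", "treatment"])
old_mapper = set_relations(OldStyleMapper(), ["action"])
```

New code can simply call `setRels` directly; the shim is only useful while both versions are in use.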

Updated Notebooks And Demonstrations For Making Spark NLP For Healthcare Easier To Navigate And Understand

New Task Based Clinical Pretrained Pipelines Notebook
Updated Clinical Assertion Model Notebook with AssertionMerger example
Updated Snomed Entity Resolver Model Training Notebook with new parameter
New Response to Treatment Classification Demo
Updated Opioid Demo with GPT4 comparison

We Have Added And Updated A Substantial Number Of New Clinical Models And Pipelines, Further Solidifying Our Offering In The Healthcare Domain.

sbiobertresolve_snomed_no_class
sbiobertresolve_snomed_conditions
sbiobertresolve_meddra_lowest_level_term
sbiobertresolve_meddra_preferred_term
sbiobertresolve_snomed_bodyStructure
sbiobertresolve_snomed_drug
sbiobertresolve_snomed_findings_aux_concepts
ner_deid_name_multilingual_clinical_langtest
explain_clinical_doc_ade
explain_clinical_doc_biomarker
explain_clinical_doc_public_health
explain_clinical_doc_risk_factors
meddra_llt_resolver_pipeline
meddra_pt_resolver_pipeline
medication_resolver_pipeline
medication_resolver_transform_pipeline
ner_medication_pipeline
icd10cm_rxnorm_resolver_pipeline
snomed_term_resolver_pipeline
snomed_findings_resolver_pipeline
snomed_body_structure_resolver_pipeline
snomed_auxConcepts_resolver_pipeline
snomed_conditions_resolver_pipeline
snomed_drug_resolver_pipeline
snomed_resolver_pipeline
clinical_deidentification_generic_optimized
clinical_deidentification_subentity_nameAugmented
clinical_deidentification_subentity_optimized
umls_rxnorm_mapper
icd10_meddra_llt_mapper
meddra_llt_icd10_mapper
icd10_meddra_pt_mapper
meddra_pt_icd10_mapper
meddra_llt_pt_mapper
meddra_pt_llt_mapper
rxnorm_umls_mapper
drug_matcher
ner_opioid
clinical_deidentification
clinical_deidentification_multi_mode_output
bert_sequence_classifier_response_to_treatment

For all Spark NLP for Healthcare models, please check: Models Hub Page

