Spark NLP for Healthcare Release Notes 5.2.1

 

5.2.1

Highlights

We are delighted to announce a suite of remarkable enhancements and updates in our latest release of Spark NLP for Healthcare. This release comes with a new Opioid NER model as well as 23 new clinical pretrained models and pipelines.

  • Introducing a new named entity recognition (NER) model for extracting information regarding Opioid usage
  • Introducing a new multilingual NER model to extract NAME entities for Deidentification purposes
  • Clinical document analysis with state-of-the-art Pretrained Pipelines for specific clinical tasks and concepts
  • Returning text embeddings within sentence entity resolution models
  • Setting entity pairs for relation labels in RelationExtractionDLModel to reduce false positives
  • Cluster and CPU speed benchmarks for Chunk Mapper, Entity Resolver, and Deidentification pipelines
  • ONNX support for ZeroShotNerModel, MedicalBertForSequenceClassification, MedicalBertForTokenClassification, and MedicalDistilBertForSequenceClassification
  • Various core improvements: bug fixes and enhanced overall robustness and reliability of Spark NLP for Healthcare
    • Resolved the error caused by splitChars in NerConverterInternal
    • Fixed a loading-from-disk issue for the ChunkConverter, AnnotationMerger, and GenericRE annotators
    • ContextualParser now supports unlimited document size
    • Updated settings in the sparknlp_jsl.start() function for Spark configuration
  • Updated notebooks and demonstrations to make Spark NLP for Healthcare easier to navigate and understand
  • The addition and update of numerous new clinical models and pipelines continue to reinforce our offering in the healthcare domain

We believe that these enhancements will elevate your experience with Spark NLP for Healthcare, enabling more efficient, accurate, and streamlined analysis of healthcare-related natural language data.

Introducing a New Opioid Named Entity Recognition (NER) Model for Extracting Information Regarding Opioid Usage

This model is designed to detect and label opioid-related entities within text data. Opioids are a class of drugs that include the illegal drug heroin, synthetic opioids such as fentanyl, and pain relievers available legally by prescription. The model has been trained using advanced deep-learning techniques on a diverse range of text sources and can accurately recognize and classify a wide range of opioid-related entities. The model’s accuracy and precision have been carefully validated against expert-labeled data to ensure reliable and consistent results.

Please see the model card ner_opioid_small_wip for more information about the model.

Example:

ner_model = MedicalNerModel.pretrained("ner_opioid_small_wip", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

sample_texts = """20 year old male transferred from [**Hospital1 112**] for liver transplant evaluation after percocet overdose. On Sunday [**3-27**] had a stressful day and pt took approximately 20 percocet (5/325) throughout the day after a series of family arguments. Denies trying to hurt himself. Parents confirm to suicidal attempts in the past. Pt felt that he had a hangover on Monday secondary to "percocet withdrawal" and took an additional 5 percocet.  Pt was admitted to the SICU and followed by Liver, Transplant, Toxicology, and [**Month/Year (2) **].  He was started on NAC q4hr with gradual decline in LFT's and INR.  His recovery was c/b hypertension, for which he was started on clonidine.  Pt was  transferred to the floor on [**4-1**].

Past Medical History:
Bipolar D/o (s/p suicide attempts in the past)
ADHD
S/p head injury [**2160**]: s/p MVA with large L3 transverse process
fx, small right frontal epidural hemorrhage-- with
post-traumatic seizures (was previously on dilantin, now dc'd)

Social History:
Father is HCP, student in [**Name (NI) 108**], Biology major, parents and brother live in [**Name (NI) 86**], single without children, lived in a group home for 3 years as a teenager, drinks alcohol 1 night a week, denies illict drug use, pt in [**Location (un) 86**] for neuro eval
"""

Result:

| chunk | begin | end | ner_label |
| percocet | 92 | 99 | opioid_drug |
| 20 | 178 | 179 | drug_quantity |
| percocet | 181 | 188 | opioid_drug |
| 5/325 | 191 | 195 | drug_strength |
| suicidal attempts | 303 | 319 | psychiatric_issue |
| hangover | 356 | 363 | general_symptoms |
| percocet | 389 | 396 | opioid_drug |
| withdrawal | 398 | 407 | general_symptoms |
| 5 | 433 | 433 | drug_quantity |
| percocet | 435 | 442 | opioid_drug |
| NAC | 567 | 569 | other_drug |
| q4hr | 571 | 574 | drug_frequency |
| decline in LFT's | 589 | 604 | general_symptoms |
| clonidine | 679 | 687 | other_drug |
| Bipolar | 761 | 767 | psychiatric_issue |
| suicide attempts | 778 | 793 | psychiatric_issue |
| ADHD | 808 | 811 | psychiatric_issue |
| dilantin | 976 | 983 | other_drug |
| illict drug use | 1236 | 1250 | substance_use_disorder |

Please check the Opioid Demo.

Introducing a New Multilingual NER Model to Extract NAME Entities for Deidentification Purposes

We introduce our latest multilingual Named Entity Recognition model, which annotates English, German, French, Italian, Spanish, Portuguese, and Romanian text to find NAME entities that may need to be de-identified. It was trained on in-house annotated datasets. We plan to expand this multilingual NER model to other PHI entities in upcoming releases.

Example:

embeddings = XlmRoBertaEmbeddings.pretrained("xlm_roberta_base", "xx") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings")\
    .setMaxSentenceLength(512)\
    .setCaseSensitive(False)

ner = MedicalNerModel.pretrained("ner_deid_name_multilingual", "xx", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")


text = ["""Record date: 2093-01-13, David Hale, M.D., Name: Hendrickson, Ora MR. # 7194334 Date: 01/13/93 PCP: Oliveira, 25 years old, Record date: 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. The patient's complaints first surfaced when he started working for Brothers Coal-Mine.""",
"""J'ai vu en consultation Michel Martinez (49 ans) adressé au Centre Hospitalier De Plaisir pour un diabète mal contrôlé avec des symptômes datant de Mars 2015.""",
"""Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.""",
"""Ho visto Gastone Montanariello (49 anni) riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015."""]

Result:

| doc_id | chunks | begin | end | entities |
| 0 | David Hale | 26 | 35 | NAME |
| 0 | Hendrickson, Ora | 51 | 66 | NAME |
| 0 | Oliveira | 104 | 111 | NAME |
| 1 | Michel Martinez | 24 | 38 | NAME |
| 2 | Michael Berger | 0 | 13 | NAME |
| 2 | Berger | 117 | 122 | NAME |
| 3 | Gastone Montanariello | 9 | 29 | NAME |

Please see the model card ner_deid_name_multilingual for more information about the model.

Clinical Document Analysis with State-of-the-Art Pretrained Pipelines for Specific Clinical Tasks and Concepts

We introduce a suite of advanced, hybrid pretrained pipelines, specifically designed to streamline the process of analyzing clinical documents. These pipelines are built upon multiple state-of-the-art (SOTA) pretrained models, delivering a comprehensive solution for extracting vital information with unprecedented ease.

What sets this release apart is the elimination of complexities typically involved in building and chaining models. Users no longer need to navigate the intricacies of constructing intricate pipelines from scratch or the uncertainty of selecting the most effective model combinations. Our new pretrained pipelines simplify these processes, offering a seamless, user-friendly experience.

| Pipeline Name | Description |
| explain_clinical_doc_generic | This pipeline is designed to extract all clinical/medical entities, assign assertion status to the extracted entities, and establish relations between them in clinical texts. |
| explain_clinical_doc_oncology | This specialized oncology pipeline extracts oncological entities, assigns assertion status to the extracted entities, and establishes relations between them in clinical documents. |
| explain_clinical_doc_vop | This pipeline is designed to extract healthcare-related entities, assign assertion status to the extracted entities, and establish relations between them in documents written in the patient's own words. |
| ner_vop_pipeline | This pipeline includes the full-taxonomy Named Entity Recognition model to extract information from health-related text in colloquial language. It extracts diagnoses, treatments, tests, anatomical references, and demographic entities. |
| ner_oncology_pipeline | This pipeline extracts more than 40 oncology-related entities, including therapies, tests, and staging. |
| oncology_diagnosis_pipeline | This pipeline includes Named Entity Recognition, Assertion Status, Relation Extraction, and Entity Resolution models to extract information from oncology texts, focusing on entities related to oncological diagnosis. |
| clinical_deidentification | This pipeline can be used to deidentify PHI information in medical texts. The PHI information is masked and obfuscated in the resulting text. |
| clinical_deidentification_langtest | This pipeline can be used to deidentify PHI information in medical texts. The PHI information is masked and obfuscated in the resulting text. |
| summarizer_clinical_laymen_onnx_pipeline | This pipeline is a modified version of an LLM-based summarization model, fine-tuned by John Snow Labs on a custom dataset to avoid clinical jargon in the summaries. |
| clinical_notes_qa_base_onnx_pipeline | This pipeline is capable of open-book question answering on medical notes. |
| clinical_notes_qa_large_onnx_pipeline | This pipeline is capable of open-book question answering on medical notes. |
| medical_qa_biogpt_pipeline | This pipeline is based on a model trained on PubMed abstracts and then fine-tuned on the PubMedQA dataset. |
| flan_t5_base_jsl_qa_pipeline | This pipeline provides a powerful and efficient solution for accurately answering medical questions and delivering insightful information in the medical domain. |
| atc_resolver_pipeline | This pipeline extracts DRUG entities from clinical texts and maps them to their corresponding Anatomic Therapeutic Chemical (ATC) codes. |
| cpt_procedures_measurements_resolver_pipeline | This pipeline extracts Procedure and Measurement entities and maps them to their corresponding Current Procedural Terminology (CPT) codes. |
| hcc_resolver_pipeline | This advanced pipeline extracts clinical conditions from clinical texts and maps them to their corresponding Hierarchical Condition Categories (HCC) codes. |
| hpo_resolver_pipeline | This advanced pipeline extracts human phenotype entities from clinical texts and maps them to their corresponding HPO codes. |
| snomed_body_structure_resolver_pipeline | This pipeline extracts anatomical structure entities and maps them to their corresponding SNOMED (body structure version) codes. |
| snomed_findings_resolver_pipeline | This pipeline extracts clinical findings and maps them to their corresponding SNOMED (CT version) codes. |
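Loading any of the pipelines in the table follows the same one-liner pattern. A brief example with explain_clinical_doc_generic is shown below; the sample sentence is illustrative only.

from sparknlp.pretrained import PretrainedPipeline

# download and load the pretrained clinical pipeline
pipeline = PretrainedPipeline("explain_clinical_doc_generic", "en", "clinical/models")

# fullAnnotate returns the extracted entities, assertion statuses, and relations in a single call
results = pipeline.fullAnnotate("The patient was prescribed metformin for poorly controlled type 2 diabetes mellitus.")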

Returning Text Embeddings within Sentence Entity Resolution Models

Sentence entity resolution models now expose the setReturnResolvedTextEmbeddings parameter. When it is set to True, the resolver includes the embeddings of the resolved text candidates in its output, enabling a more comprehensive analysis and understanding of the resolved entities within the clinical text. This gives users the flexibility to include or exclude embeddings based on their requirements; the default is False.

Example:

rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented", "en", "clinical/models") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("rxnorm_code")\
    .setDistanceFunction("EUCLIDEAN")\
    .setReturnResolvedTextEmbeddings(True)

text = 'metformin 100 mg'

Result:

| text | embeddings |
| metformin 100 mg | -0.20578815, 0.25846115, -0.7783525, 0.80831814, 0.91270417, -0.43411028, 0.41243184, 0.2023627… |
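For completeness, the sentence_embeddings column consumed by the resolver is typically produced from the extracted chunks with Chunk2Doc followed by BERT sentence embeddings. A minimal sketch of that upstream step, assuming the sbiobert_base_cased_mli model commonly paired with this resolver and a ner_chunk column produced by a preceding NER stage:

# convert NER chunks into documents so they can be embedded
chunk2doc = Chunk2Doc()\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("ner_chunk_doc")

# sentence embeddings consumed by the resolver
sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
    .setInputCols(["ner_chunk_doc"])\
    .setOutputCol("sentence_embeddings")\
    .setCaseSensitive(False)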

Setting Entity Pairs for Relation Labels in RelationExtractionDLModel to Reduce False Positives

RelationExtractionDLModel now includes the ability to set entity pairs for each relation label, giving you more control over your results and even greater accuracy.

In the following example, we utilize entity pair restrictions to limit the results of Relation Extraction labels solely to relations that exist between specified entities, thus improving the accuracy and relevance of the extracted data. If we don’t set the setRelationTypePerPair parameter here, the REDL model may return different RE labels for these specified entities.

Example:

ade_redl_model = RelationExtractionDLModel()\
    .pretrained('redl_ade_biobert', 'en', "clinical/models") \
    .setPredictionThreshold(0.5)\
    .setInputCols(["re_ner_chunks", "sentence"]) \
    .setOutputCol("relations")\
    .setRelationPairsCaseSensitive(False)\
    .setRelationTypePerPair({
        "1":["drug-ade", "ade-drug"]
    })

text ="""Hypersensitivity to aspirin can be manifested as acute asthma, urticaria and/or angioedema, or a systemic anaphylactoid reaction."""

Result:

| chunk1 | entity1 | chunk2 | entity2 | relation | confidence |
| aspirin | DRUG | acute asthma | ADE | 1 | 0.99978 |
| aspirin | DRUG | urticaria | ADE | 1 | 0.999787 |
| angioedema | ADE | aspirin | DRUG | 1 | 0.999798 |
| systemic anaphylactoid reaction | ADE | aspirin | DRUG | 1 | 0.999821 |
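The re_ner_chunks column consumed by the model above is generated upstream by pairing NER chunks along the dependency tree. A minimal sketch of those stages, assuming an ADE NER model produced the ner_chunk column and using the pretrained clinical POS tagger and dependency parser (model names are illustrative, not mandated by this release):

# part-of-speech tags and dependency tree for the sentences
pos_tagger = PerceptronModel.pretrained("pos_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("pos_tags")

dependency_parser = DependencyParserModel.pretrained("dependency_conllu", "en")\
    .setInputCols(["sentence", "pos_tags", "token"])\
    .setOutputCol("dependencies")

# keep only DRUG-ADE candidate pairs before relation extraction
re_ner_chunk_filter = RENerChunksFilter()\
    .setInputCols(["ner_chunk", "dependencies"])\
    .setOutputCol("re_ner_chunks")\
    .setRelationPairs(["drug-ade", "ade-drug"])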

Please check the Clinical Relation Extraction Model Notebook for more information.

Cluster and CPU Speed Benchmarks for Chunk Mapper, Entity Resolver, and Deidentification Pipelines

Dive into the heart of healthcare data processing with our benchmark experiment meticulously designed for Mapper, Resolver, and Deidentification Pipelines. This benchmark provides crucial insights into the performance of these pipelines under varied configurations and dataset conditions.

These figures may differ based on the size of the mapper and resolver models: the larger the models, the higher the inference times. Depending on the success rate of the mapper (i.e., how many of the incoming chunks the mapper maps successfully), the combined mapper-and-resolver timing can be lower than the resolver-only timing.

If the resolver-only timing is equal to or very close to the combined mapper-and-resolver timing, it means that the mapper is not catching/mapping any chunks. In that case, try tuning the mapper's parameters or retraining/augmenting the mapper.

  • Mapper and Resolver Benchmark Experiment
    • Databricks Config: 32 CPU Cores, 128GiB RAM (8 workers, Standard_DS3_v2)
    • AWS EC2 Config: 32 CPU Cores, 58GiB RAM (c6a.8xlarge)
    • Dataset: 100 clinical texts from MTSamples, approx. 705 tokens and 11 chunks per text.
| partition | Databricks mapper timing | AWS EC2 mapper timing | Databricks resolver timing | AWS EC2 resolver timing | Databricks mapper and resolver timing | AWS EC2 mapper and resolver timing |
| 4 | 37.3 sec | 12 sec | 4.46 mins | 2.37 mins | 2.52 mins | 1.47 mins |
| 8 | 26.7 sec | 7 sec | 2.46 mins | 1.39 mins | 1.37 mins | 1.04 mins |
| 16 | 8.85 sec | 7 sec | 1.27 mins | 1.30 mins | 1.06 mins | 1.02 mins |
| 32 | 7.74 sec | 7 sec | 1.38 mins | 1.00 mins | 54.5 sec | 43 sec |
| 64 | 7.22 sec | 7 sec | 1.23 mins | 1.07 mins | 55.6 sec | 48 sec |
| 100 | 6.32 sec | 10 sec | 1.16 mins | 1.08 mins | 50.9 sec | 45 sec |
| 1000 | 8.37 sec | 10 sec | 59.6 sec | 1.02 mins | 49.3 sec | 41 sec |

Explore the efficiency of our clinical_deidentification pipeline through a dedicated benchmark experiment. Unearth performance metrics and make informed decisions to enhance your healthcare data processing workflows.

  • Deidentification Benchmark Experiment
    • Databricks Config: 32 CPU Cores, 128GiB RAM (8 workers)
    • AWS Config: 32 CPU Cores, 58GiB RAM (c6a.8xlarge)
    • Colab Config: 8 CPU Cores, 52GiB RAM (Colab Pro - High RAM)

Dataset: 1000 Clinical Texts from MTSamples, approx. 503 tokens and 21 chunks per text.

| partition | AWS EC2 result timing | Databricks result timing | Colab result timing |
| 1024 | 1 min 3 sec | 1 min 55 sec | 5 min 45 sec |
| 512 | 56 sec | 1 min 26 sec | 5 min 15 sec |
| 256 | 50 sec | 1 min 20 sec | 5 min 4 sec |
| 128 | 45 sec | 1 min 21 sec | 5 min 11 sec |
| 64 | 46 sec | 1 min 31 sec | 5 min 3 sec |
| 32 | 46 sec | 1 min 26 sec | 5 min 0 sec |
| 16 | 56 sec | 1 min 43 sec | 5 min 3 sec |
| 8 | 1 min 21 sec | 2 min 33 sec | 5 min 3 sec |
| 4 | 2 min 26 sec | 4 min 53 sec | 6 min 3 sec |
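In both tables, partition is read here as the number of Spark partitions of the input DataFrame before the pipeline is applied; this reading, and the file path below, are assumptions for the sketch, not part of the benchmark setup itself.

# repartition the input data before running the (pre-fitted) pipeline
data = spark.read.parquet("clinical_texts.parquet").repartition(64)
result = pipeline.transform(data)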

Please check the Cluster Speed Benchmarks page for more information.

ONNX Support for ZeroShotNerModel, MedicalBertForSequenceClassification, MedicalBertForTokenClassification, and MedicalDistilBertForSequenceClassification

We are thrilled to announce the integration of ONNX support for several critical annotators, enhancing the versatility of our healthcare models. The following models now benefit from ONNX compatibility:

  • ZeroShotNerModel
  • MedicalBertForSequenceClassification
  • MedicalBertForTokenClassification
  • MedicalDistilBertForSequenceClassification

This update opens doors to a wider range of deployment scenarios and interoperability with other systems that support the Open Neural Network Exchange (ONNX) format. Experience heightened efficiency and integration capabilities as you incorporate these models into your healthcare workflows. Stay at the forefront of healthcare AI with the latest in interoperable model support.
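From the user's perspective, loading these annotators follows the same pretrained() pattern regardless of the backend; an ONNX-exported model is simply loaded by name. A brief illustrative example (the model name below is an existing sequence classifier used here only as an example):

# loading a sequence classifier is unchanged with ONNX support
seq_classifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_ade", "en", "clinical/models")\
    .setInputCols(["document", "token"])\
    .setOutputCol("prediction")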

Various Core Improvements: Bug Fixes, Enhanced Overall Robustness, and Reliability of Spark NLP for Healthcare

  • Resolved the error caused by splitChars in NerConverterInternal
  • Fixed a loading-from-disk issue for the ChunkConverter, AnnotationMerger, and GenericRE annotators
  • ContextualParser now supports unlimited document size
  • Updated settings in the sparknlp_jsl.start() function for Spark configuration (see the example below)
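As an illustration of the last item, Spark configuration values can be passed to sparknlp_jsl.start() through its params argument; the keys and values below are placeholders, not recommended settings.

import sparknlp_jsl

spark = sparknlp_jsl.start(
    SECRET,  # Spark NLP for Healthcare license secret
    params={
        "spark.driver.memory": "32g",
        "spark.kryoserializer.buffer.max": "2000M",
    },
)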

Updated Notebooks and Demonstrations for Making Spark NLP for Healthcare Easier to Navigate and Understand

We Have Added and Updated a Substantial Number of New Clinical Models and Pipelines, Further Solidifying Our Offering in the Healthcare Domain

  • ner_deid_name_multilingual
  • ner_opioid_small_wip
  • ner_oncology_pipeline
  • ner_vop_pipeline
  • oncology_diagnosis_pipeline
  • summarizer_clinical_laymen_onnx_pipeline
  • clinical_notes_qa_base_onnx_pipeline
  • clinical_notes_qa_large_onnx_pipeline
  • medical_qa_biogpt_pipeline
  • flan_t5_base_jsl_qa_pipeline
  • clinical_deidentification
  • clinical_deidentification_langtest
  • explain_clinical_doc_generic
  • explain_clinical_doc_vop
  • explain_clinical_doc_oncology
  • explain_clinical_doc_radiology
  • atc_resolver_pipeline
  • cpt_procedures_measurements_resolver_pipeline
  • hcc_resolver_pipeline
  • hpo_resolver_pipeline
  • snomed_findings_resolver_pipeline
  • snomed_body_structure_resolver_pipeline
  • sbiobertresolve_rxnorm_augmented

For all Spark NLP for Healthcare models, please check: Models Hub Page
