5.2.1
Highlights
We are delighted to announce a suite of remarkable enhancements and updates in our latest release of Spark NLP for Healthcare. This release comes with a new Opioid NER model as well as 23 new clinical pretrained models and pipelines.
- Introducing a new named entity recognition (NER) model for extracting information regarding `Opioid` usage
- Introducing a new multilingual NER model to extract `NAME` entities for Deidentification purposes
- Clinical document analysis with state-of-the-art Pretrained Pipelines for specific clinical tasks and concepts
- Returning text embeddings within sentence entity resolution models
- Setting entity pairs for relation labels in `RelationExtractionDLModel` to reduce false positives
- Cluster and CPU speed benchmarks for Chunk Mapper, Entity Resolver, and Deidentification pipelines
- ONNX support for `ZeroShotNerModel`, `MedicalBertForSequenceClassification`, `MedicalBertForTokenClassification`, and `MedicalDistilBertForSequenceClassification`
- Various core improvements: bug fixes, enhanced overall robustness and reliability of Spark NLP for Healthcare
  - The error caused by `splitChars` in `NerConverterInternal` has been resolved
  - Fixed loading from disk issue for `ChunkConverter`, `AnnotationMerger`, and `GenericRE` annotators
  - `ContextualParser` now supports unlimited document size
  - Updated settings in `sparknlp_jsl.start()` function for Spark configuration
- Updated notebooks and demonstrations for making Spark NLP for Healthcare easier to navigate and understand
- The addition and update of numerous new clinical models and pipelines continue to reinforce our offering in the healthcare domain
We believe that these enhancements will elevate your experience with Spark NLP for Healthcare, enabling more efficient, accurate, and streamlined analysis of healthcare-related natural language data.
Introducing a New Opioid Named Entity Recognition (NER) Model for Extracting Information Regarding Opioid Usage
This model is designed to detect and label opioid-related entities within text data. Opioids are a class of drugs that include the illegal drug heroin, synthetic opioids such as fentanyl, and pain relievers available legally by prescription. The model has been trained using advanced deep-learning techniques on a diverse range of text sources and can accurately recognize and classify a wide range of opioid-related entities. The model’s accuracy and precision have been carefully validated against expert-labeled data to ensure reliable and consistent results.
Please see the model card `ner_opioid_small_wip` for more information about the model.
Example:
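from sparknlp_jsl.annotator import MedicalNerModel

# Upstream stages must already produce the "sentence", "token", and "embeddings" columns.
ner_model = MedicalNerModel.pretrained("ner_opioid_small_wip", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")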
sample_texts = """20 year old male transferred from [**Hospital1 112**] for liver transplant evaluation after percocet overdose. On Sunday [**3-27**] had a stressful day and pt took approximately 20 percocet (5/325) throughout the day after a series of family arguments. Denies trying to hurt himself. Parents confirm to suicidal attempts in the past. Pt felt that he had a hangover on Monday secondary to "percocet withdrawal" and took an additional 5 percocet. Pt was admitted to the SICU and followed by Liver, Transplant, Toxicology, and [**Month/Year (2) **]. He was started on NAC q4hr with gradual decline in LFT's and INR. His recovery was c/b hypertension, for which he was started on clonidine. Pt was transferred to the floor on [**4-1**].
Past Medical History:
Bipolar D/o (s/p suicide attempts in the past)
ADHD
S/p head injury [**2160**]: s/p MVA with large L3 transverse process
fx, small right frontal epidural hemorrhage-- with
post-traumatic seizures (was previously on dilantin, now dc'd)
Social History:
Father is HCP, student in [**Name (NI) 108**], Biology major, parents and brother live in [**Name (NI) 86**], single without children, lived in a group home for 3 years as a teenager, drinks alcohol 1 night a week, denies illict drug use, pt in [**Location (un) 86**] for neuro eval
"""
Result:
chunk | begin | end | ner_label |
---|---|---|---|
percocet | 92 | 99 | opioid_drug |
20 | 178 | 179 | drug_quantity |
percocet | 181 | 188 | opioid_drug |
5/325 | 191 | 195 | drug_strength |
suicidal attempts | 303 | 319 | psychiatric_issue |
hangover | 356 | 363 | general_symptoms |
percocet | 389 | 396 | opioid_drug |
withdrawal | 398 | 407 | general_symptoms |
5 | 433 | 433 | drug_quantity |
percocet | 435 | 442 | opioid_drug |
NAC | 567 | 569 | other_drug |
q4hr | 571 | 574 | drug_frequency |
decline in LFT’s | 589 | 604 | general_symptoms |
clonidine | 679 | 687 | other_drug |
Bipolar | 761 | 767 | psychiatric_issue |
suicide attempts | 778 | 793 | psychiatric_issue |
ADHD | 808 | 811 | psychiatric_issue |
dilantin | 976 | 983 | other_drug |
illict drug use | 1236 | 1250 | substance_use_disorder |
Please check the Opioid Demo
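The `begin` and `end` columns in the result table are inclusive character offsets into the input text. As a quick sanity check, the sketch below slices the first two sentences of the sample text with four of the offsets reported above. `chunk_at` is a hypothetical helper written for this illustration and is not part of Spark NLP.

```python
# Sketch only: `chunk_at` is an illustrative helper, not a Spark NLP API.

# First two sentences of the sample text, copied verbatim.
text = (
    "20 year old male transferred from [**Hospital1 112**] for liver "
    "transplant evaluation after percocet overdose. On Sunday [**3-27**] "
    "had a stressful day and pt took approximately 20 percocet (5/325) "
    "throughout the day after a series of family arguments."
)


def chunk_at(text, begin, end):
    """Return the substring for an inclusive (begin, end) offset pair."""
    return text[begin:end + 1]


# (begin, end, ner_label) rows taken from the result table.
rows = [
    (92, 99, "opioid_drug"),
    (178, 179, "drug_quantity"),
    (181, 188, "opioid_drug"),
    (191, 195, "drug_strength"),
]

chunks = [(chunk_at(text, begin, end), label) for begin, end, label in rows]
for chunk, label in chunks:
    print(f"{chunk!r:12} {label}")
```

Because `end` is inclusive, the slice uses `end + 1`; forgetting this drops the last character of every chunk.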
Introducing a New Multilingual NER Model To Extract `NAME` Entities for Deidentification Purposes
This new multilingual Named Entity Recognition model annotates English, German, French, Italian, Spanish, Portuguese, and Romanian text to find `NAME` entities that may need to be de-identified. It was trained on in-house annotated datasets. We plan to extend this multilingual NER model to other PHI entities in upcoming releases.
Example:
from sparknlp.annotator import XlmRoBertaEmbeddings
from sparknlp_jsl.annotator import MedicalNerModel

# Multilingual ("xx") embeddings feed the multilingual NER model.
embeddings = XlmRoBertaEmbeddings.pretrained("xlm_roberta_base", "xx")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")\
    .setMaxSentenceLength(512)\
    .setCaseSensitive(False)

ner = MedicalNerModel.pretrained("ner_deid_name_multilingual", "xx", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")
text = ["""Record date: 2093-01-13, David Hale, M.D., Name: Hendrickson, Ora MR. # 7194334 Date: 01/13/93 PCP: Oliveira, 25 years old, Record date: 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. The patient's complaints first surfaced when he started working for Brothers Coal-Mine.""",
"""J'ai vu en consultation Michel Martinez (49 ans) adressé au Centre Hospitalier De Plaisir pour un diabète mal contrôlé avec des symptômes datant de Mars 2015.""",
"""Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.""",
"""Ho visto Gastone Montanariello (49 anni) riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015."""]
Result:
doc_id | chunks | begin | end | entities |
---|---|---|---|---|
0 | David Hale | 26 | 35 | NAME |
0 | Hendrickson, Ora | 51 | 66 | NAME |
0 | Oliveira | 104 | 111 | NAME |
1 | Michel Martinez | 24 | 38 | NAME |
2 | Michael Berger | 0 | 13 | NAME |
2 | Berger | 117 | 122 | NAME |
3 | Gastone Montanariello | 9 | 29 | NAME |
Please see the model card `ner_deid_name_multilingual` for more information about the model.
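To illustrate how detected `NAME` spans could feed a downstream masking step, here is a minimal plain-Python sketch that replaces the two spans reported for the German document (doc_id 2) with a placeholder. `mask_spans` and the `<NAME>` placeholder are assumptions made for this example; in practice the library's own de-identification annotators would do this work.

```python
# Sketch only: `mask_spans` is a hypothetical helper, not a Spark NLP API.

def mask_spans(text, spans, placeholder="<NAME>"):
    """Replace each inclusive (begin, end) span with a placeholder."""
    # Process the rightmost span first so earlier offsets stay valid
    # even though the placeholder changes the length of the text.
    for begin, end in sorted(spans, reverse=True):
        text = text[:begin] + placeholder + text[end + 1:]
    return text


# German sample document, copied verbatim from the example above.
german = (
    "Michael Berger wird am Morgen des 12 Dezember 2018 ins St. "
    "Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist "
    "76 Jahre alt und hat zu viel Wasser in den Beinen."
)

# (begin, end) offsets reported for doc_id 2 in the result table.
name_spans = [(0, 13), (117, 122)]

masked = mask_spans(german, name_spans)
print(masked)
```

Replacing from right to left is the design choice that matters here: a left-to-right pass would shift every later offset as soon as the first name is replaced.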
Clinical Document Analysis With State-Of-The-Art Pretrained Pipelines for Specific Clinical Tasks and Concepts
We introduce a suite of hybrid pretrained pipelines designed to streamline the analysis of clinical documents. Each pipeline is built on multiple state-of-the-art (SOTA) pretrained models and offers a ready-to-use solution for extracting key information.
These pipelines remove the complexity of building and chaining models: users no longer need to construct pipelines from scratch or guess which model combinations work best.
Pipeline Name | Description |
---|---|
explain_clinical_doc_generic | This pipeline is designed to extract all clinical/medical entities, assign assertion status to the extracted entities, and establish relations between the extracted entities from clinical texts. |
explain_clinical_doc_oncology | This specialized oncology pipeline can extract oncological entities, assign assertion status to the extracted entities, and establish relations between the extracted entities from clinical documents. |
explain_clinical_doc_vop | This pipeline is designed to extract healthcare-related entities, assign assertion status to the extracted entities, and establish relations between the extracted entities from documents transcribed from patients' own sentences. |
ner_vop_pipeline | This pipeline includes the full taxonomy Named-Entity Recognition model to extract information from health-related text in colloquial language. It extracts diagnoses, treatments, tests, anatomical references, and demographic entities. |
ner_oncology_pipeline | This pipeline extracts more than 40 oncology-related entities, including therapies, tests, and staging. |
oncology_diagnosis_pipeline | This pipeline includes Named-Entity Recognition, Assertion Status, Relation Extraction, and Entity Resolution models to extract information from oncology texts. It focuses on entities related to oncological diagnosis. |
clinical_deidentification | This pipeline can be used to deidentify PHI information from medical texts. The PHI information will be masked and obfuscated in the resulting text. |
clinical_deidentification_langtest | This pipeline can be used to deidentify PHI information from medical texts. The PHI information will be masked and obfuscated in the resulting text. |
summarizer_clinical_laymen_onnx_pipeline | This model is a modified version of an LLM-based summarization model, finetuned by John Snow Labs on a custom dataset to avoid clinical jargon in the summaries. |
clinical_notes_qa_base_onnx_pipeline | This model is capable of open-book question answering on Medical Notes. |
clinical_notes_qa_large_onnx_pipeline | This model is capable of open-book question answering on Medical Notes. |
medical_qa_biogpt_pipeline | This pipeline is trained on Pubmed abstracts and then finetuned with the PubmedQA dataset. |
flan_t5_base_jsl_qa_pipeline | This pipeline answers medical questions and provides information in the medical domain. |
atc_resolver_pipeline | This pipeline extracts DRUG entities from clinical texts and maps these entities to their corresponding Anatomic Therapeutic Chemical (ATC) codes. |
cpt_procedures_measurements_resolver_pipeline | This pipeline extracts Procedure and Measurement entities and maps them to corresponding Current Procedural Terminology (CPT) codes. |
hcc_resolver_pipeline | This advanced pipeline extracts clinical conditions from clinical texts and maps these entities to their corresponding Hierarchical Condition Categories (HCC) codes. |
hpo_resolver_pipeline | This advanced pipeline extracts human phenotype entities from clinical texts and maps these entities to their corresponding HPO codes. |
snomed_body_structure_resolver_pipeline | This pipeline extracts anatomical structure entities and maps them to their corresponding SNOMED (body structure version) codes. |
snomed_findings_resolver_pipeline | This pipeline extracts clinical findings and maps them to their corresponding SNOMED (CT version) codes. |
Returning Text Embeddings within Sentence Entity Resolution Models
Sentence entity resolver models can now return the embeddings of the resolved text candidates through the `setReturnResolvedTextEmbeddings` parameter. Setting it to `True` includes these embeddings in the output, which allows further analysis of the resolved entities within the clinical text. The default is `False`, so users can include or exclude the embeddings as needed.
Example:
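from sparknlp_jsl.annotator import SentenceEntityResolverModel

# setReturnResolvedTextEmbeddings(True) adds the embeddings of the resolved text to the output.
rxnorm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_rxnorm_augmented", "en", "clinical/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("rxnorm_code")\
    .setDistanceFunction("EUCLIDEAN")\
    .setReturnResolvedTextEmbeddings(True)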
text = 'metformin 100 mg'
Result:
text | embeddings |
---|---|
metformin 100 mg | -0.20578815, 0.25846115, -0.7783525, 0.80831814, 0.91270417, -0.43411028, 0.41243184, 0.2023627… |
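Once the embeddings are returned, they can be compared outside the resolver. The following is a minimal sketch, assuming the embeddings have been collected into plain Python lists: it computes the Euclidean distance (the distance function set in the example above) between two vectors. The first vector reuses only the eight values visible in the truncated result row; the second vector is invented for illustration.

```python
import math

# First eight values from the (truncated) result row above.
metformin_prefix = [
    -0.20578815, 0.25846115, -0.7783525, 0.80831814,
    0.91270417, -0.43411028, 0.41243184, 0.2023627,
]

# Hypothetical comparison vector, invented for this example only.
other = [0.0] * len(metformin_prefix)


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


self_distance = euclidean(metformin_prefix, metformin_prefix)
distance_to_other = euclidean(metformin_prefix, other)
print(self_distance, distance_to_other)
```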
Setting Entity Pairs for Relation Labels in `RelationExtractionDLModel` to Reduce False Positives
`RelationExtractionDLModel` now lets you set entity pairs for each relation label, giving you more control over the results and reducing false positives.
In the following example, entity pair restrictions limit each relation label to relations between the specified entities, which improves the accuracy and relevance of the extracted data. If the `setRelationTypePerPair` parameter is not set here, the REDL model may return different RE labels for these specified entities.
Example:
from sparknlp_jsl.annotator import RelationExtractionDLModel

# Restrict relation label "1" to DRUG-ADE and ADE-DRUG entity pairs.
ade_redl_model = RelationExtractionDLModel.pretrained("redl_ade_biobert", "en", "clinical/models")\
    .setPredictionThreshold(0.5)\
    .setInputCols(["re_ner_chunks", "sentence"])\
    .setOutputCol("relations")\
    .setRelationPairsCaseSensitive(False)\
    .setRelationTypePerPair({
        "1": ["drug-ade", "ade-drug"]
    })

text = """Hypersensitivity to aspirin can be manifested as acute asthma, urticaria and/or angioedema, or a systemic anaphylactoid reaction."""
Result:
chunk1 | entity1 | chunk2 | entity2 | relation | confidence |
---|---|---|---|---|---|
aspirin | DRUG | acute asthma | ADE | 1 | 0.99978 |
aspirin | DRUG | urticaria | ADE | 1 | 0.999787 |
angioedema | ADE | aspirin | DRUG | 1 | 0.999798 |
systemic anaphylactoid reaction | ADE | aspirin | DRUG | 1 | 0.999821 |
Please check the Clinical Relation Extraction Model Notebook for more information.
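To make the effect of the entity pair restriction concrete, here is a plain-Python sketch of the filtering idea. It is not the library's internal implementation: `filter_by_pair` is a hypothetical helper, the DRUG-DRUG row is an invented false positive, and letting labels without a configured restriction pass through unchanged is an assumption of this sketch.

```python
# Sketch only: illustrates the filtering idea, not Spark NLP internals.

# Same mapping as passed to setRelationTypePerPair in the example above.
allowed_pairs = {"1": ["drug-ade", "ade-drug"]}

# (chunk1, entity1, chunk2, entity2, relation)
predictions = [
    ("aspirin", "DRUG", "acute asthma", "ADE", "1"),
    ("angioedema", "ADE", "aspirin", "DRUG", "1"),
    # Hypothetical false positive, invented for this example.
    ("aspirin", "DRUG", "ibuprofen", "DRUG", "1"),
]


def filter_by_pair(rows, allowed, case_sensitive=False):
    """Keep rows whose entity pair is allowed for their relation label."""
    kept = []
    for row in rows:
        _, entity1, _, entity2, relation = row
        pairs = allowed.get(relation)
        if pairs is None:
            # Assumption: labels with no restriction are left untouched.
            kept.append(row)
            continue
        pair = f"{entity1}-{entity2}"
        if not case_sensitive:
            pair = pair.lower()
            pairs = [p.lower() for p in pairs]
        if pair in pairs:
            kept.append(row)
    return kept


kept = filter_by_pair(predictions, allowed_pairs)
for row in kept:
    print(row)
```

The case-insensitive comparison mirrors `setRelationPairsCaseSensitive(False)` in the example, where the pairs are written in lowercase while the entity labels are uppercase.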
Cluster and CPU Speed Benchmarks for Chunk Mapper, Entity Resolver, and Deidentification Pipelines
This benchmark experiment measures the performance of the Mapper, Resolver, and Deidentification pipelines under varied configurations and dataset conditions.
These figures may differ depending on the size of the mapper and resolver models: the larger the models, the higher the inference times. Depending on the success rate of the mapper (the share of incoming chunks that it maps successfully), the combined mapper and resolver timing can be lower than the resolver-only timing.
If the resolver-only timing is equal or very close to the combined mapper and resolver timing, the mapper is not catching (mapping) any chunks. In that case, try tuning the mapper parameters, or retrain or augment the mapper.
- Mapper and Resolver Benchmark Experiment
- DataBricks Config: 32 CPU Cores, 128GiB RAM (8 workers, Standard_DS3_v2)
- AWS EC2 Config: 32 CPU Cores, 58GiB RAM (c6a.8xlarge)
Dataset: 100 Clinical Texts from MTSamples, approx. 705 tokens and 11 chunks per text.
partition | DataBricks mapper timing | AWS EC2 mapper timing | DataBricks resolver timing | AWS EC2 resolver timing | DataBricks mapper and resolver timing | AWS EC2 mapper and resolver timing |
---|---|---|---|---|---|---|
4 | 37.3 sec | 12 sec | 4.46 mins | 2.37 mins | 2.52 mins | 1.47 mins |
8 | 26.7 sec | 7 sec | 2.46 mins | 1.39 mins | 1.37 mins | 1.04 mins |
16 | 8.85 sec | 7 sec | 1.27 mins | 1.30 mins | 1.06 mins | 1.02 mins |
32 | 7.74 sec | 7 sec | 1.38 mins | 1.00 mins | 54.5 sec | 43 sec |
64 | 7.22 sec | 7 sec | 1.23 mins | 1.07 mins | 55.6 sec | 48 sec |
100 | 6.32 sec | 10 sec | 1.16 mins | 1.08 mins | 50.9 sec | 45 sec |
1000 | 8.37 sec | 10 sec | 59.6 sec | 1.02 mins | 49.3 sec | 41 sec |
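The rule of thumb above (a combined timing close to the resolver-only timing suggests the mapper catches few chunks) can be checked directly from the table. The sketch below is a small plain-Python calculation using two DataBricks rows; `to_seconds` and `mapper_saving` are hypothetical helpers written for this illustration.

```python
# Sketch only: helper names are illustrative, figures come from the table above.

def to_seconds(timing):
    """Convert strings such as '54.5 sec' or '1.38 mins' to seconds."""
    value, unit = timing.split()
    seconds = float(value)
    if unit.startswith("min"):
        seconds *= 60
    return seconds


def mapper_saving(resolver_only, combined):
    """Fraction of the resolver-only time saved by running the mapper first."""
    return 1 - to_seconds(combined) / to_seconds(resolver_only)


# DataBricks resolver-only vs. combined timings for 4 and 32 partitions.
saving_p4 = mapper_saving("4.46 mins", "2.52 mins")
saving_p32 = mapper_saving("1.38 mins", "54.5 sec")

print(f"4 partitions:  {saving_p4:.1%} saved")
print(f"32 partitions: {saving_p32:.1%} saved")
```

A saving near zero would correspond to the case described above, where the mapper is not mapping any chunks and the resolver still does all the work.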
The following benchmark experiment measures the speed of the `clinical_deidentification` pipeline across platforms and partition counts, to help you make informed decisions about your healthcare data processing workflows.
- Deidentification Benchmark Experiment
- DataBricks Config: 32 CPU Cores, 128GiB RAM (8 workers)
- AWS Config: 32 CPU Cores, 58GiB RAM (c6a.8xlarge)
- Colab Config: 8 CPU Cores, 52GiB RAM (Colab Pro - High RAM)
Dataset: 1000 Clinical Texts from MTSamples, approx. 503 tokens and 21 chunks per text.
partition | AWS EC2 result timing | DataBricks result timing | Colab result timing |
---|---|---|---|
1024 | 1 min 3 sec | 1 min 55 sec | 5 min 45 sec |
512 | 56 sec | 1 min 26 sec | 5 min 15 sec |
256 | 50 sec | 1 min 20 sec | 5 min 4 sec |
128 | 45 sec | 1 min 21 sec | 5 min 11 sec |
64 | 46 sec | 1 min 31 sec | 5 min 3 sec |
32 | 46 sec | 1 min 26 sec | 5 min 0 sec |
16 | 56 sec | 1 min 43 sec | 5 min 3 sec |
8 | 1 min 21 sec | 2 min 33 sec | 5 min 3 sec |
4 | 2 min 26 sec | 4 min 53 sec | 6 min 3 sec |
Please check the Cluster Speed Benchmarks page for more information.
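As a small worked example of reading the deidentification table, the sketch below converts the AWS EC2 timings to seconds and picks the partition count with the lowest timing. `to_seconds` is a hypothetical helper, and the result only reflects this particular dataset and hardware configuration.

```python
import re

# Sketch only: `to_seconds` is an illustrative helper.

def to_seconds(timing):
    """Convert strings such as '56 sec' or '1 min 3 sec' to seconds."""
    minutes = re.search(r"(\d+)\s*min", timing)
    seconds = re.search(r"(\d+)\s*sec", timing)
    total = 0
    if minutes:
        total += int(minutes.group(1)) * 60
    if seconds:
        total += int(seconds.group(1))
    return total


# partition -> AWS EC2 result timing, copied from the table above.
aws_timings = {
    1024: "1 min 3 sec",
    512: "56 sec",
    256: "50 sec",
    128: "45 sec",
    64: "46 sec",
    32: "46 sec",
    16: "56 sec",
    8: "1 min 21 sec",
    4: "2 min 26 sec",
}

best_partition = min(aws_timings, key=lambda p: to_seconds(aws_timings[p]))
print(best_partition, aws_timings[best_partition])
```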
ONNX Support for `ZeroShotNerModel`, `MedicalBertForSequenceClassification`, `MedicalBertForTokenClassification`, and `MedicalDistilBertForSequenceClassification`
We are thrilled to announce the integration of ONNX support for several critical annotators, enhancing the versatility of our healthcare models. The following models now benefit from ONNX compatibility:
- ZeroShotNerModel
- MedicalBertForSequenceClassification
- MedicalBertForTokenClassification
- MedicalDistilBertForSequenceClassification
This update enables a wider range of deployment scenarios and interoperability with other systems that support the Open Neural Network Exchange (ONNX) format.
Various Core Improvements: Bug Fixes, Enhanced Overall Robustness, And Reliability Of Spark NLP For Healthcare
- The error caused by `splitChars` in `NerConverterInternal` has been resolved
- Fixed loading issue for `ChunkConverter`, `AnnotationMerger`, and `GenericRE` annotators
- `ContextualParser` now supports unlimited document size
- Updated settings in `sparknlp_jsl.start()` function for Spark configuration
Updated Notebooks And Demonstrations For Making Spark NLP For Healthcare Easier To Navigate And Understand
We Have Added And Updated A Substantial Number Of New Clinical Models And Pipelines, Further Solidifying Our Offering In The Healthcare Domain.
ner_deid_name_multilingual
ner_opioid_small_wip
ner_oncology_pipeline
ner_vop_pipeline
oncology_diagnosis_pipeline
summarizer_clinical_laymen_onnx_pipeline
clinical_notes_qa_base_onnx_pipeline
clinical_notes_qa_large_onnx_pipeline
medical_qa_biogpt_pipeline
flan_t5_base_jsl_qa_pipeline
clinical_deidentification
clinical_deidentification_langtest
explain_clinical_doc_generic
explain_clinical_doc_vop
explain_clinical_doc_oncology
explain_clinical_doc_radiology
atc_resolver_pipeline
cpt_procedures_measurements_resolver_pipeline
hcc_resolver_pipeline
hpo_resolver_pipeline
snomed_findings_resolver_pipeline
snomed_body_structure_resolver_pipeline
sbiobertresolve_rxnorm_augmented
For all Spark NLP for Healthcare models, please check: Models Hub Page