5.5.0
Highlights
We are delighted to announce remarkable enhancements and updates in our latest release of Healthcare NLP. This release comes with a brand new LLM Loader module with GPU support, several other new modules (Contextual Entity Filterer, RE Chunk Merger, Replacer) for precise and improved information extraction as well as 71 new clinical pretrained models and pipelines.
- Introducing a brand new LLM loader called MedicalLLM to load and run LLMs in GGUF format that could scale within a Spark NLP pipeline
- Explore 6 new specialized LLMs at various sizes and quantization levels for healthcare applications (medical note summarization, Q&A, RAG, and Chat)
- Clinical document analysis with one-liner pretrained pipelines for specific clinical tasks and concepts
- Introducing 8 new Named Entity Recognition (NER) Models and pipelines to Detect PHI for Deidentification with minimal customization required
- Introducing a new mapper model designed to link ICD-10-CM codes with their corresponding chronicity indicators
- Introducing a new Named Entity Recognition (NER) Model and a new binary classification model to detect adverse drug events
- Introducing the REChunkMerger annotator to merge the entities in a relationship as a single entity
- Introducing a brand new ContextualEntityFilterer annotator to filter entities with context-specific rules
- Enhanced new sentence detector model for healthcare text segmentation in corrupted text
- Introducing new parameters to Replacer for data augmentation
- New speed benchmarks for multi-NER pipelines
- New blog posts on various topics (AI for equity, detecting stigmatizing language from medical texts, subcohort analysis for oncology patients, using small LLMs to extract structured named entities, …)
- Various core improvements; bug fixes, enhanced overall robustness and reliability of Spark NLP for Healthcare
- Using structured entity JSONs from various sources (pipeline, NLP Lab, etc.) within a new pipeline to merge/consolidate named entities
- Introducing new setReturnEntityMappings, setMappingsColumn, setStaticEntityMappings, and setStaticEntityMappingsFallback parameters for Replacer
- Added the RegexMatcherInternalModel trait to Scala to match the pretrained method available in Python
- Added a new parameter for the Flattener annotator that sets an array of column names to keep in the dataframe after the flattening process
- Added the resetSentenceIndices parameter to the ChunkMerger, NerConverterInternal, and ChunkConverter annotators to reset sentence indices so that the entire output is treated as if it originates from a single document
- Fixed the Generative AI Lab API task deletion endpoint: resolved an issue with the tasks_delete endpoint, enabling proper deletion of tasks via the API
- Added the chunk_validation_options dictionary to the dict_to_annotation_converter module for converting dictionary data to Spark NLP annotations
- Added the pretrained feature to InternalDocumentSplitter
- Deprecated the nlp_test module in spark-nlp-jsl; future development is being managed by LangTest
- Added support for ONNX models in the ChunkKeyPhraseExtraction annotator
- Updated notebooks and demonstrations for making Healthcare NLP easier to navigate and understand
- New REChunkMerger MOOC Notebook
- New ContextualEntityFilterer MOOC Notebook
- Updated Replacer MOOC Notebook
- Updated NerConverterInternal MOOC Notebook
- Updated ChunkConverter MOOC Notebook
- Updated ChunkMergeModel MOOC Notebook
- Updated Rule Based Entity Matchers notebook
- Updated Clinical DeIdentification notebook
- Updated Prepare CoNLL from Annotations for NER notebook
- Updated Contextual Parser Rule Based NER notebook
- Updated Loading Medical and Open Source LLMs notebook
- Updated ONCOLOGY Demo
- The addition and update of numerous new clinical models and pipelines continue to reinforce our offering in the healthcare domain
These enhancements will elevate your experience with Spark NLP for Healthcare, enabling more efficient, accurate, and streamlined analysis of healthcare-related natural language data.
Introducing a Brand New LLM Loader Called MedicalLLM to Load and Run LLMs in GGUF Format that Could Scale within a Spark NLP Pipeline
MedicalLLM is a brand new annotator in Spark NLP, designed to load and run large language models (LLMs) in GGUF format with scalable performance. Ideal for clinical and healthcare applications, MedicalLLM supports tasks like medical entity extraction, summarization, Q&A, Retrieval Augmented Generation (RAG), and conversational AI. With simple integration into Spark NLP pipelines, it allows for customizable batch sizes, prediction settings, and chat templates. GPU optimization is also available, enhancing its capabilities for high-performance environments. MedicalLLM empowers users to link medical entities and perform complex NLP tasks with efficiency and precision.
Model Name | Description |
---|---|
jsl_meds_ner_q4_v2 | Extract and link medical named entities |
jsl_meds_ner_q8_v2 | Extract and link medical named entities |
jsl_meds_ner_q16_v2 | Extract and link medical named entities |
jsl_medm_q4_v1 | Summarization and Q&A |
jsl_medm_q8_v1 | Summarization and Q&A |
jsl_medm_q16_v1 | Summarization and Q&A |
jsl_medsner_zs_q4_v1 | Extract and link medical named entities |
jsl_medsner_zs_q8_v1 | Extract and link medical named entities |
jsl_medsner_zs_q16_v1 | Extract and link medical named entities |
jsl_meds_q4_v1 | Summarization and Q&A |
jsl_meds_q8_v1 | Summarization and Q&A |
jsl_meds_q16_v1 | Summarization and Q&A |
jsl_meds_rag_q4_v1 | LLM component of Retrieval Augmented Generation (RAG) |
jsl_meds_rag_q8_v1 | LLM component of Retrieval Augmented Generation (RAG) |
jsl_meds_rag_q16_v1 | LLM component of Retrieval Augmented Generation (RAG) |
Example:
medical_llm = MedicalLLM.pretrained("jsl_meds_q16_v1", "en", "clinical/models")\
.setInputCols("document")\
.setOutputCol("completions")\
.setBatchSize(1)\
.setNPredict(100)\
.setUseChatTemplate(True)\
.setTemperature(0)\
#.setNGpuLayers(100) # if you have GPU
med_ner_prompt = """
Based on the following text, what age group is most susceptible to breast cancer?
## Text:
The exact cause of breast cancer is unknown. However, several risk factors can increase your likelihood of developing breast cancer, such as:
- A personal or family history of breast cancer
- A genetic mutation, such as BRCA1 or BRCA2
- Exposure to radiation
- Age (most commonly occurring in women over 50)
- Early onset of menstruation or late menopause
- Obesity
- Hormonal factors, such as taking hormone replacement therapy
"""
data = spark.createDataFrame([[med_ner_prompt]]).toDF("text")
Result:
The age group most susceptible to breast cancer, as mentioned in the text, is women over the age of 50.
Please check the Loading Medical and Open Source LLMs Notebook for more information
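The prompt in the example above follows a simple layout: a question, a `## Text:` marker, and then the source passage. When many notes need the same question, a small helper can produce that layout before the prompts are placed in the `text` column. The sketch below is plain Python; `build_prompt` and the sample notes are hypothetical names introduced for illustration, and nothing here claims that the models require this exact layout.

```python
def build_prompt(question: str, passage: str) -> str:
    """Lay out a question and its passage like the example prompt above."""
    return f"\n{question.strip()}\n## Text:\n{passage.strip()}\n"


# Hypothetical sample notes, used only to show the shape of the output.
notes = [
    "Age (most commonly occurring in women over 50) is a risk factor.",
    "Exposure to radiation can increase the likelihood of breast cancer.",
]
question = "Based on the following text, what risk factor is mentioned?"

# One single-column row per prompt, the shape that
# spark.createDataFrame(rows).toDF("text") expects.
rows = [[build_prompt(question, note)] for note in notes]

print(rows[0][0])
```

Keeping the layout in one function makes it easier to change the prompt wording for every row at once.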
Explore 6 New Specialized LLMs at Various Sizes and Quantisation Levels for Healthcare Applications (Medical Note Summarization, Q&A, RAG, and Chat)
Discover 6 new LLMs designed to tackle various tasks in the healthcare domain. These models include capabilities for summarization, question answering, retrieval-augmented generation (RAG), and chat functionalities. Each model is available at different quantization levels (q16, q8, q4) to balance performance and efficiency, catering to specific needs in medical data processing and analysis. Whether you need detailed summaries or precise Q&A, these models offer advanced solutions for healthcare professionals and researchers.
Model Name | Description |
---|---|
JSL_MedS_q16_v2 | Summarization, Q&A, RAG |
JSL_MedS_q8_v2 | Summarization, Q&A, RAG |
JSL_MedS_q4_v2 | Summarization, Q&A, RAG |
JSL_MedS_q16_v3 | Summarization, Q&A, RAG |
JSL_MedS_q8_v3 | Summarization, Q&A, RAG |
JSL_MedS_q4_v3 | Summarization, Q&A, RAG |
Note: Our current LLM loader implementation, based on llama.cpp, may lag behind in inference speed and output quality for certain use cases, depending on your hardware. We have other means of serving these models outside of the Healthcare NLP library; please get in touch with us if you have such a need. In a GPU-poor environment, we recommend the 8-bit quantized (q8) versions of the models, as the qualitative performance difference between the q16 and q8 versions is negligible.
Example:
from sparknlp_jsl.llm import LLMLoader
llm_loader_pretrained = LLMLoader(spark).pretrained("jsl_meds_q16_v2", "en", "clinical/models")
prompt = """
A 23-year-old pregnant woman at 22 weeks gestation presents with burning upon urination. She states it started 1 day ago and has been worsening despite drinking more water and taking cranberry extract. She otherwise feels well and is followed by a doctor for her pregnancy. Her temperature is 97.7°F (36.5°C), blood pressure is 122/77 mmHg, pulse is 80/min, respirations are 19/min, and oxygen saturation is 98% on room air. Physical exam is notable for an absence of costovertebral angle tenderness and a gravid uterus.
Which of the following is the best treatment for this patient?
A: Ampicillin
B: Ceftriaxone
C: Ciprofloxacin
D: Doxycycline
E: Nitrofurantoin
"""
llm_loader_pretrained.generate(prompt)
Result:
The best treatment for this patient is E: Nitrofurantoin. This medication is considered safe during pregnancy and is effective for treating urinary tract infections (UTIs). The other options listed are not recommended during pregnancy due to potential risks to the fetus. Ampicillin (A) and Ceftriaxone (B) are generally safe but may not be the first-line treatment for UTIs. Ciprofloxacin (C) and Doxycycline (D) are contraindicated in pregnancy due to potential adverse effects on fetal development. Nitrofurantoin (E) is a commonly used antibiotic for UTIs during pregnancy and has a good safety profile.
Please check the Loading Medical and Open Source LLMs Notebook for more information
Clinical Document Analysis with One-Liner Pretrained Pipelines for Specific Clinical Tasks and Concepts
We introduce a suite of advanced, hybrid pretrained pipelines, specifically designed to streamline the clinical document analysis process. These pipelines are built upon multiple state-of-the-art (SOTA) pretrained models, delivering a comprehensive solution for quickly extracting vital information.
What sets this release apart is that it removes the complexity typically involved in building and chaining models. Users no longer need to construct intricate pipelines from scratch or guess which model combinations are most effective. Our new pretrained pipelines simplify these processes, offering a seamless, user-friendly experience.
Model Name | Description |
---|---|
explain_clinical_doc_vop_small | This pipeline is designed to extract clinical/medical entities that may be considered Voice of Patient (VOP) entities from text, along with their assertion status and relations. |
explain_clinical_doc_cancer_type | This pipeline is designed to extract oncological and cancer type entities from text, along with their assertion status and relations. |
explain_clinical_doc_sdoh_small | This pipeline is designed to extract all social determinants of health (SDOH) entities from text, assign assertion status to the extracted entities, and establish relations between them. |
Example:
from sparknlp.pretrained import PretrainedPipeline
pipeline_vop = PretrainedPipeline("explain_clinical_doc_vop_small", "en", "clinical/models")
text = """It seems like my health troubles started a few years ago. I had been feeling really tired all the time and was losing weight without even trying. My doctor did some blood work and said my sugar levels were high - he diagnosed me with something called type 2 diabetes. He put me on two medications - I take a pill called metformin 500 mg twice a day, and another one called glipizide 5 mg before breakfast and dinner. Those are supposed to help lower my blood sugar. I also have to watch what I eat and try to exercise more even though it's hard with my energy levels. A couple years after the diabetes, I started having really bad heartburn all the time. I saw a specialist called a gastroenterologist who did an endoscopy procedure where they stick a camera down your throat. That test showed I have chronic acid reflux disease or GERD. Now I take a daily pill called omeprazole 20 mg to control the heartburn symptoms.Most recently, I've had a lot of joint pain in my shoulders and knees. My primary doctor ran some blood tests that showed something called rheumatoid arthritis. He referred me to a rheumatologist who started me on a weekly medication called methotrexate. I have to remember to take folic acid with that to help minimize side effects. It seems to be helping the joint pain so far."""
NER and Assertion Result:
index | chunks | begin | end | sentence | entities | assertion | confidence |
---|---|---|---|---|---|---|---|
0 | tired | 85 | 89 | 1 | Symptom | Present_Or_Past | 0.9959 |
1 | losing weight | 112 | 124 | 1 | Symptom | Present_Or_Past | 0.81445 |
2 | doctor | 150 | 155 | 2 | Employment | SomeoneElse | 0.9895 |
3 | blood work | 166 | 175 | 2 | Test | Present_Or_Past | 0.8835 |
4 | sugar levels | 189 | 200 | 2 | Test | Present_Or_Past | 0.8277 |
5 | high | 207 | 210 | 2 | TestResult | SomeoneElse | 0.9095 |
6 | type 2 diabetes | 252 | 266 | 2 | Disease | Hypothetical_Or_Absent | 0.379367 |
7 | metformin | 321 | 329 | 3 | Drug | Hypothetical_Or_Absent | 0.997 |
8 | glipizide | 374 | 382 | 3 | Drug | Hypothetical_Or_Absent | 0.9953 |
9 | blood sugar | 454 | 464 | 4 | Test | Present_Or_Past | 0.6415 |
10 | diabetes | 594 | 601 | 6 | Disease | Present_Or_Past | 0.9901 |
11 | heartburn | 632 | 640 | 6 | Symptom | Present_Or_Past | 0.988 |
12 | specialist | 664 | 673 | 7 | Employment | SomeoneElse | 0.9878 |
13 | gastroenterologist | 684 | 701 | 7 | Employment | SomeoneElse | 0.9866 |
14 | endoscopy procedure | 714 | 732 | 7 | Procedure | Hypothetical_Or_Absent | 0.75475 |
15 | chronic acid reflux disease | 802 | 828 | 8 | Disease | Present_Or_Past | 0.7071 |
16 | GERD | 833 | 836 | 8 | Disease | Hypothetical_Or_Absent | 0.9476 |
17 | omeprazole | 870 | 879 | 9 | Drug | Present_Or_Past | 0.9987 |
18 | heartburn | 902 | 910 | 9 | Symptom | Present_Or_Past | 0.9849 |
19 | pain | 961 | 964 | 10 | Symptom | Present_Or_Past | 0.9923 |
20 | primary doctor | 996 | 1009 | 11 | Employment | SomeoneElse | 0.75345 |
21 | blood tests | 1020 | 1030 | 11 | Test | Present_Or_Past | 0.93715 |
22 | rheumatoid arthritis | 1061 | 1080 | 11 | Disease | Hypothetical_Or_Absent | 0.74685 |
23 | rheumatologist | 1103 | 1116 | 12 | Employment | Present_Or_Past | 0.9913 |
24 | methotrexate | 1163 | 1174 | 12 | Drug | Present_Or_Past | 0.9995 |
25 | folic acid | 1204 | 1213 | 13 | Drug | Present_Or_Past | 0.7913 |
26 | pain | 1289 | 1292 | 14 | Symptom | Present_Or_Past | 0.9837 |
Relation Extraction Result:
sentence | entity1_begin | entity1_end | chunk1 | entity1 | entity2_begin | entity2_end | chunk2 | entity2 | relation | confidence |
---|---|---|---|---|---|---|---|---|---|---|
2 | 166 | 175 | blood work | Test | 207 | 210 | high | TestResult | Test-TestResult | 1 |
3 | 309 | 312 | pill | Form | 321 | 329 | metformin | Drug | Form-Drug | 1 |
3 | 321 | 329 | metformin | Drug | 331 | 336 | 500 mg | Dosage | Drug-Dosage | 1 |
3 | 321 | 329 | metformin | Drug | 338 | 348 | twice a day | Frequency | Drug-Frequency | 1 |
3 | 374 | 382 | glipizide | Drug | 384 | 387 | 5 mg | Dosage | Drug-Dosage | 1 |
6 | 571 | 588 | couple years after | DateTime | 594 | 601 | diabetes | Disease | DateTime-Disease | 1 |
6 | 594 | 601 | diabetes | Disease | 632 | 640 | heartburn | Symptom | Disease-Symptom | 1 |
7 | 714 | 732 | endoscopy procedure | Procedure | 770 | 775 | throat | BodyPart | Procedure-BodyPart | 1 |
9 | 852 | 856 | daily | Frequency | 870 | 879 | omeprazole | Drug | Frequency-Drug | 1 |
9 | 858 | 861 | pill | Form | 870 | 879 | omeprazole | Drug | Form-Drug | 1 |
9 | 870 | 879 | omeprazole | Drug | 881 | 885 | 20 mg | Dosage | Drug-Dosage | 1 |
10 | 927 | 934 | recently | DateTime | 961 | 964 | pain | Symptom | DateTime-Symptom | 1 |
10 | 955 | 959 | joint | BodyPart | 961 | 964 | pain | Symptom | BodyPart-Symptom | 1 |
10 | 961 | 964 | pain | Symptom | 972 | 980 | shoulders | BodyPart | Symptom-BodyPart | 1 |
10 | 961 | 964 | pain | Symptom | 986 | 990 | knees | BodyPart | Symptom-BodyPart | 1 |
12 | 1138 | 1143 | weekly | Frequency | 1163 | 1174 | methotrexate | Drug | Frequency-Drug | 1 |
14 | 1283 | 1287 | joint | BodyPart | 1289 | 1292 | pain | Symptom | BodyPart-Symptom | 1 |
Please check the Task Based Clinical Pretrained Pipelines model for more information
Introducing 8 New Named Entity Recognition (NER) Models and Pipelines to Detect PHI for Deidentification with Minimal Customisation Required
This release introduces five new NER models (ner_deid_generic_docwise, ner_deid_subentity_docwise, ner_deid_subentity_augmented_docwise, ner_deid_aipii, and ner_deid_subentity_augmented_v2) and three new pipelines (clinical_deidentification_docwise_wip, clinical_deidentification_nameAugmented_v2, and clinical_deidentification_v2_wip). These models work at the document level and are particularly useful for detecting Protected Health Information (PHI) for de-identification.
Model Name | Description |
---|---|
ner_deid_generic_docwise | This document-level model detects PHI entities for de-identification. (Generic) |
ner_deid_subentity_augmented_docwise | This document-level model detects PHI entities for de-identification. (Subentity_Augmented) |
ner_deid_subentity_docwise | This document-level model detects PHI entities for de-identification. (Subentity) |
ner_deid_aipii | This model is particularly effective in identifying and labeling various entities, making it useful for detecting protected health information (PHI) that may need to be masked or de-identified. |
ner_deid_subentity_augmented_v2 | This document-level model detects PHI entities for de-identification. (Subentity) |
clinical_deidentification_docwise_wip | This pipeline can be used to deidentify PHI information from medical texts. The PHI information will be masked and obfuscated in the resulting text. |
clinical_deidentification_nameAugmented_v2 | This pipeline can be used to deidentify PHI information from medical texts. The PHI information will be masked and obfuscated in the resulting text. |
clinical_deidentification_v2_wip | This pipeline can be used to deidentify PHI information from medical texts. The PHI information will be masked and obfuscated in the resulting text. |
Example:
ner_deid_generic = MedicalNerModel.pretrained("ner_deid_generic_docwise", "en", "clinical/models") \
.setInputCols(["document", "token", "embeddings"]) \
.setOutputCol("ner_deid_generic_docwise")
text= '''Dr. John Taylor, ID 982345, a cardiologist at St. Mary's Hospital in Boston, was contacted on 05/10/2023 regarding a 45-year-old male patient.'''
Result:
chunk | begin | end | ner_label |
---|---|---|---|
John Taylor | 5 | 15 | NAME |
982345 | 21 | 26 | CONTACT |
cardiologist | 31 | 42 | PROFESSION |
St. Mary’s Hospital | 47 | 65 | LOCATION |
Boston | 70 | 75 | LOCATION |
05/10/2023 | 95 | 104 | DATE |
45-year-old | 118 | 128 | AGE |
Please check the Clinical Deidentification Notebook for more information
Introducing a New Mapper Model Designed to Link ICD-10-CM Codes with Their Corresponding Chronicity Indicators
This release adds a new ChunkMapper model that maps ICD-10-CM codes to their chronicity indicators.
Model Name | Description |
---|---|
icd10cm_chronic_indicator_mapper | This mapper model links ICD-10-CM codes to their corresponding chronicity indicators. The chronic indicator can have three different values: 0 ("not chronic"), 1 ("chronic"), and 9 ("no determination"). |
Example:
mapperModel = ChunkMapperModel.pretrained("icd10cm_chronic_indicator_mapper","en", "clinical/models")\
.setInputCols(["chunk"])\
.setOutputCol("chronic_indicator_mapping")\
.setRels(["chronic_indicator"])
data = spark.createDataFrame([["""A 42-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus, associated with obesity with a body mass index (BMI) of 33.5 kg/m2, presented with a one-week history of polyuria, polydipsia, poor appetite, and vomiting. Two weeks prior to presentation, she was treated with a five-day course of amoxicillin for a respiratory tract infection."""]]).toDF("text")
Result:
sentence_id | entity | begin | end | label | icd10cm | resolution | chronic_indicator |
---|---|---|---|---|---|---|---|
0 | gestational diabetes mellitus | 39 | 67 | PROBLEM | O24.4 | gestational diabetes mellitus [gestational diabetes mellitus] | 0 |
0 | subsequent type two diabetes mellitus | 117 | 153 | PROBLEM | O24.11 | pre-existing type 2 diabetes mellitus [pre-existing type 2 diabetes mellitus, in pregna… | 1 |
0 | obesity | 172 | 178 | PROBLEM | E66.9 | obesity [obesity, unspecified] | 1 |
0 | a body mass index | 185 | 201 | PROBLEM | Z68.41 | finding of body mass index [body mass index [bmi] 40.0-44.9, adult] | 9 |
0 | polyuria | 261 | 268 | PROBLEM | R35 | polyuria [polyuria] | 0 |
0 | polydipsia | 271 | 280 | PROBLEM | R63.1 | polydipsia [polydipsia] | 0 |
0 | poor appetite | 283 | 295 | PROBLEM | R63.0 | poor appetite [anorexia] | 0 |
0 | vomiting | 302 | 309 | PROBLEM | R11.1 | vomiting [vomiting] | 0 |
1 | a respiratory tract infection | 403 | 431 | PROBLEM | J98.8 | respiratory tract infection [other specified respiratory disorders] | 0 |
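The chronic_indicator column in the result above holds the raw codes 0, 1, and 9. For reporting, it can help to turn them into the labels given in the model description. The following is a minimal plain-Python sketch; `CHRONIC_INDICATOR_LABELS` and `describe_chronicity` are hypothetical names introduced here, and the three sample rows are copied from the result table.

```python
# Labels taken from the model description: 0, 1, and 9 are the only
# documented values. Keys are strings so that both "1" and 1 can be looked up.
CHRONIC_INDICATOR_LABELS = {
    "0": "not chronic",
    "1": "chronic",
    "9": "no determination",
}


def describe_chronicity(indicator) -> str:
    """Translate a raw chronic indicator value into its documented label."""
    return CHRONIC_INDICATOR_LABELS.get(str(indicator).strip(), "unknown indicator")


# (icd10cm code, chronic_indicator) pairs copied from the result table.
rows = [("O24.4", "0"), ("E66.9", "1"), ("Z68.41", "9")]

for code, indicator in rows:
    print(f"{code}: {describe_chronicity(indicator)}")
```

Unexpected values fall back to "unknown indicator" rather than raising, which is a design choice of this sketch and not documented behavior of the mapper.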
Introducing a New Named Entity Recognition (NER) Model and a New Binary Classification Model to Detect Adverse Drug Events
- Named Entity Recognition (NER) Model: ner_ade_clinical_v2 is a pretrained NER model that detects adverse reactions of drugs and problems in reviews, tweets, and medical text.
- Binary Classification Model: bert_sequence_classifier_ade_augmented_v2 classifies texts/sentences into two categories:
  - True: The sentence is talking about a possible ADE.
  - False: The sentence doesn't have any information about an ADE.
Example:
ner_model = MedicalNerModel.pretrained("ner_ade_clinical_v2", "en", "clinical/models")\
.setInputCols(["sentence", "token","embeddings"])\
.setOutputCol("ner")
data = spark.createDataFrame([["""I have an allergic reaction to vancomycin so I have itchy skin, sore throat/burning/itching, numbness of tongue and gums.
I would not recommend this drug to anyone, especially since I have never had such an adverse reaction to any other medication."""]]).toDF("sentence")
Result:
chunk | begin | end | ner_label |
---|---|---|---|
allergic reaction | 10 | 26 | ADE |
vancomycin | 31 | 40 | DRUG |
itchy skin | 52 | 61 | ADE |
sore throat/burning/itching | 64 | 90 | ADE |
numbness of tongue and gums | 93 | 119 | ADE |
an adverse reaction | 204 | 222 | PROBLEM |
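In the result above, begin and end are character offsets into the input text, and end is inclusive. A chunk can therefore be recovered with the slice `text[begin:end + 1]`. The sketch below checks this against the table in plain Python; `chunk_at` is a hypothetical helper, and the text and rows are copied from the example.

```python
# Input text from the example; the two sentences are separated by a newline.
TEXT = (
    "I have an allergic reaction to vancomycin so I have itchy skin, "
    "sore throat/burning/itching, numbness of tongue and gums.\n"
    "I would not recommend this drug to anyone, especially since I have "
    "never had such an adverse reaction to any other medication."
)

# (chunk, begin, end, ner_label) rows copied from the result table.
ROWS = [
    ("allergic reaction", 10, 26, "ADE"),
    ("vancomycin", 31, 40, "DRUG"),
    ("itchy skin", 52, 61, "ADE"),
    ("sore throat/burning/itching", 64, 90, "ADE"),
    ("numbness of tongue and gums", 93, 119, "ADE"),
    ("an adverse reaction", 204, 222, "PROBLEM"),
]


def chunk_at(text: str, begin: int, end: int) -> str:
    """Return the span covered by inclusive [begin, end] character offsets."""
    return text[begin:end + 1]


for chunk, begin, end, label in ROWS:
    recovered = chunk_at(TEXT, begin, end)
    print(f"{label:8} {begin:>3}-{end:<3} {recovered!r} match={recovered == chunk}")
```

The same slicing is handy when highlighting entities in the original document or when aligning NER output with annotations from another tool.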
Introducing the REChunkMerger Annotator to Merge the Entities in a Relationship as a Single Entity
The REChunkMerger annotator merges related chunks into a new, single chunk. It specifically merges entities that are identified as being in a relationship, joining them with a separator, which by default is a single whitespace (" "). When two related entities are found in a text, the annotator combines them into one chunk with the specified separator, which makes the relationship easier to see.
Key Parameters:
- separator: Separator to add between the relation chunks (default is a whitespace: " ").
Example:
re_chunk_merger = REChunkMerger() \
.setInputCols(["re_chunk"]) \
.setOutputCol("relation_chunks") \
.setSeparator(" ")
data = spark.createDataFrame([["""The patient was prescribed 1 unit of Advil for 5 days after meals. The patient was also
given 1 unit of Metformin daily.
He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night ,
12 units of insulin lispro with meals , and metformin 1000 mg two times a day."""]]).toDF("sentence")
Relation Result:
sentence | chunk1 | entity1 | chunk2 | entity2 | relation | confidence |
---|---|---|---|---|---|---|
0 | 1 unit | DOSAGE | Advil | DRUG | DOSAGE-DRUG | 1.0 |
0 | Advil | DRUG | for 5 days | DURATION | DRUG-DURATION | 1.0 |
1 | 1 unit | DOSAGE | Metformin | DRUG | DOSAGE-DRUG | 1.0 |
1 | Metformin | DRUG | daily | FREQUENCY | DRUG-FREQUENCY | 1.0 |
2 | 40 units | DOSAGE | insulin glargine | DRUG | DOSAGE-DRUG | 1.0 |
2 | insulin glargine | DRUG | at night | FREQUENCY | DRUG-FREQUENCY | 1.0 |
2 | 12 units | DOSAGE | insulin lispro | DRUG | DOSAGE-DRUG | 1.0 |
2 | insulin lispro | DRUG | with meals | FREQUENCY | DRUG-FREQUENCY | 1.0 |
2 | metformin | DRUG | 1000 mg | STRENGTH | DRUG-STRENGTH | 1.0 |
2 | metformin | DRUG | two times a day | FREQUENCY | DRUG-FREQUENCY | 1.0 |
REChunkMerger Result:
result |
---|
1 unit Advil |
Advil for 5 days |
1 unit Metformin |
Metformin daily |
40 units insulin glargine |
insulin glargine at night |
12 units insulin lispro |
insulin lispro with meals |
metformin 1000 mg |
metformin two times a day |
Please check the REChunkMerger MOOC Notebook for more information
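The merge step described above can be pictured as joining the two chunks of each relation with the separator. The following plain-Python sketch illustrates that behavior only; it is not the annotator's implementation, `merge_relation_chunks` is a hypothetical name, and the sample pairs are taken from the relation table.

```python
from typing import List, Tuple


def merge_relation_chunks(
    relations: List[Tuple[str, str]], separator: str = " "
) -> List[str]:
    """Join chunk1 and chunk2 of every relation into one string."""
    return [separator.join((chunk1, chunk2)) for chunk1, chunk2 in relations]


# (chunk1, chunk2) pairs copied from the relation result table.
relations = [
    ("1 unit", "Advil"),
    ("Advil", "for 5 days"),
    ("metformin", "1000 mg"),
]

merged = merge_relation_chunks(relations)
print(merged)

# A different separator changes only the joining string.
print(merge_relation_chunks(relations, separator=" | "))
```

The order of the two chunks follows the relation (chunk1 first, then chunk2), which matches the merged strings shown in the REChunkMerger result.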
Introducing a Brand New ContextualEntityFilterer Annotator to Filter Entities with Context-Specific Rules
The ContextualEntityFilterer annotator filters CHUNK annotations according to context-specific rules. Each rule targets chunks of a specific entity type, identified by the entity field in the annotation metadata.
Key Parameters:
- ruleScope: The scope in which the filter is applied. Options: sentence, document.
- caseSensitive: Whether word matching is case sensitive.
- rules: The contextual rules used to filter chunks, given as a list of dictionaries. A dictionary should contain the following keys:
  - entity: The entity field to filter.
  - scopeWindow: The scope window around the entity, defined as a list of two integers [before, after], specifying how many chunks before and after should be considered.
  - whiteListEntities: The white list of entities. If an entity from this list appears within the scope window, the chunk will be kept.
  - blackListEntities: The black list of entities. If an entity from this list appears within the scope window, the chunk will be filtered out.
  - blackListWords: The black list of words. If a word from this list appears within the scope window, the chunk will be filtered out.
  - whiteListWords: The white list of words. If a word from this list appears within the scope window, the chunk will be kept.
  - confidenceThreshold: The confidence threshold to filter the chunks. Filtering is only applied if the confidence of the chunk is below the threshold.
  - scopeWindowLevel: The level to apply the scope window. Options: token, chunk.
Example:
rules =[
{
"entity": "LOCATION",
"scopeWindow": [2, 2],
"whiteList": ["AGE", "DATE"],
"blackList": ["ID", "NAME"],
"scopeWindowLevel": "token"
},
{
"entity": "DATE",
"scopeWindow": [2, 2],
"whiteList": ["AGE", "DATE"],
"blackList": ["ID", "NAME"],
"scopeWindowLevel": "chunk"
}
]
contextual_entity_filterer = ContextualEntityFilterer() \
.setInputCols("sentence", "token", "ner_chunks") \
.setOutputCol("filtered_ner_chunks") \
.setRules(rules)\
.setRuleScope("sentence")
text = "California, known for its beautiful beaches,and he is 36 years. " \
"The Grand Canyon in Arizona, where the age is 37, is a stunning natural landmark. " \
"It was founded on September 9, 1850, and Arizona on February 14, 1912."
df = spark.createDataFrame([[text]]).toDF("text")
Input DataFrame:
chunk | begin | end | ner_label |
---|---|---|---|
California | 0 | 9 | LOCATION |
36 | 54 | 55 | AGE |
Grand Canyon | 68 | 79 | LOCATION |
Arizona | 84 | 90 | LOCATION |
37 | 110 | 111 | AGE |
September 9, 1850 | 164 | 180 | DATE |
February 14, 1912 | 198 | 214 | DATE |
Result after filtering:
chunk | begin | end | ner_label | confidence |
---|---|---|---|---|
36 | 54 | 55 | AGE | 0.96 |
37 | 110 | 111 | AGE | 0.9921 |
September 9, 1850 | 164 | 180 | DATE | 0.964375 |
February 14, 1912 | 198 | 214 | DATE | 0.952525 |
Please check the ContextualEntityFilterer MOOC Notebook for more information
Enhanced New Sentence Detector Model for Healthcare Text Segmentation in Corrupted Text
The new Sentence Detector DL Model v2 significantly improves sentence segmentation accuracy compared to its predecessor. Optimized for clinical and healthcare domains, it ensures more precise detection of sentence boundaries, particularly in complex medical texts. The model delivers superior performance across various document formats and medical terminology. The new model offers enhanced support for edge cases, improved handling of abbreviations and punctuation, and better adaptability to diverse linguistic structures, ensuring more reliable results in medical NLP pipelines.
Example:
sentence_detector_v1 = SentenceDetectorDLModel\
.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \
.setInputCols(["document"]) \
.setOutputCol("sentence_v1")
sentence_detector_v2 = SentenceDetectorDLModel\
.pretrained("sentence_detector_dl_healthcare_v2_wip", "en", "clinical/models") \
.setInputCols(["document"]) \
.setOutputCol("sentence_v2")
pipeline = Pipeline(
stages=[
document_assembler,
sentence_detector_v1,
sentence_detector_v2])
Sentence Detector v1 Result:
sent_id | sentence |
---|---|
0 | He was given boluses of MS04 with some effect, he has since been placed on a PCA - \nhe take 80mg of oxycontin at home, his PCA dose is ~ 2 the morphine dose of the oxycontin, \nhe has also received ativan for anxiety. |
1 | Repleted with 20 meq kcl po, 30 mmol K-phos iv and 2 gms \nmag so4 iv. |
2 | Size: Prostate gland measures 10x1. |
3 | 1x4.9 cm (LS x AP x TS). |
4 | Estimated volume is 51.9 ml \nand is mildly enlarged in size. |
5 | Normal delineation pattern of the prostate gland is preserved. |
6 | ORs with 95% CI for family - based analyses were calculated in PLINK (http: / / pngu . mgh . \nharvard . |
7 | edu / purcell / plink /) [59]. |
8 | Classification and regression based quantitative \nstructure - toxicity relationship (QSTR) as well as toxicophore models were developed for \nthe first time on basal cytotoxicity data (in vitro 3T3 neutral red uptake data) of a diverse \nseries of chemicals (including drugs and environmental pollutants) collected from the ACuteTox \ndatabase (http: / / www . acutetox . eu /). |
9 | Here, we created a database, the Worm Developmental \nDynamics Database (http: / / so . qbic . riken . jp / wddd /), which stores a collection of \nquantitative information about cell division dynamics in early Caenorhabditis elegans embryos \nwith single genes silenced by RNA - mediated interference. |
10 | Worldwide prevalence figures estimate \nthat there are 280 million diabetic patients in 2011 and more than 500 million in 2030 (http: / \n/ www . diabetesatlas . |
11 | org /). |
12 | ESTs with homologues / orthologues in C . elegans and other \nnematodes were also subjected to analysis employing the KEGG Orthology - Based Annotation System (KOBAS) |
13 | (www . kobas . |
14 | cbi . pku . edu . cn), which predicts the biochemical pathways in which \nmolecules are involved. |
15 | We also included the TNF - 308G / A promoter SNP, together with nine \nfurther SNPs in the region of HLA - B and MICA obtained from the database, dbSNP (http: / / \nwww . ncbi . nlm . nih . gov / projects / SNP /). |
16 | Sequences were analyzed using DNASTAR 4 . 0 |
17 | (http: / / www . dnastar . |
18 | com), GeneDoc, and GCC (University of Wisconsin). |
19 | Peptides inferred \nfrom ESTs were classified functionally using Interproscan (available at http: / / www . ebi . \nac . uk / InterProScan /) employing the default search parameters. |
Sentence Detector v2 Result:
sent_id | sentence |
---|---|
0 | He was given boluses of MS04 with some effect, he has since been placed on a PCA - \nhe take 80mg of oxycontin at home, his PCA dose is ~ 2 the morphine dose of the oxycontin, \nhe has also received ativan for anxiety. |
1 | Repleted with 20 meq kcl po, 30 mmol K-phos iv and 2 gms \nmag so4 iv. |
2 | Size: Prostate gland measures 10x1.1x4.9 cm (LS x AP x TS). |
3 | Estimated volume is 51.9 ml \nand is mildly enlarged in size. |
4 | Normal delineation pattern of the prostate gland is preserved. |
5 | ORs with 95% CI for family - based analyses were calculated in PLINK (http: / / pngu . mgh . \nharvard . edu / purcell / plink /) [59]. |
6 | Classification and regression based quantitative \nstructure - toxicity relationship (QSTR) as well as toxicophore models were developed for \nthe first time on basal cytotoxicity data (in vitro 3T3 neutral red uptake data) of a diverse \nseries of chemicals (including drugs and environmental pollutants) collected from the ACuteTox \ndatabase (http: / / www . acutetox . eu /). |
7 | Here, we created a database, the Worm Developmental \nDynamics Database (http: / / so . qbic . riken . jp / wddd /), which stores a collection of \nquantitative information about cell division dynamics in early Caenorhabditis elegans embryos \nwith single genes silenced by RNA - mediated interference. |
8 | Worldwide prevalence figures estimate \nthat there are 280 million diabetic patients in 2011 and more than 500 million in 2030 (http: / \n/ www . diabetesatlas . org /). |
9 | ESTs with homologues / orthologues in C . elegans and other \nnematodes were also subjected to analysis employing the KEGG Orthology - Based Annotation System (KOBAS) |
10 | (www . kobas . cbi . pku . edu . cn), which predicts the biochemical pathways in which \nmolecules are involved. |
11 | We also included the TNF - 308G / A promoter SNP, together with nine \nfurther SNPs in the region of HLA - B and MICA obtained from the database, dbSNP (http: / / \nwww . ncbi . nlm . nih . gov / projects / SNP /). |
12 | Sequences were analyzed using DNASTAR 4 . 0 \n(http: / / www . dnastar . com), GeneDoc, and GCC (University of Wisconsin). |
13 | Peptides inferred \nfrom ESTs were classified functionally using Interproscan (available at http: / / www . ebi . \nac . uk / InterProScan /) employing the default search parameters. |
Introducing New Parameters to Replacer for Data Augmentation
Added new parameters to Replacer to replace identified tokens or patterns with predefined alternatives. Also added a new option for noneValuesTo: prioritize_static_entity. If a static entity mapping is available for the entity type, its value is used for the replacement. If not, the annotator acts according to the staticEntityMappingsFallback option.
- returnEntityMappings: Whether to return the mappings column.
- mappingsColumn: This column maps the annotations to their corresponding chunks before the entities are replaced.
- staticEntityMappings: A map of entity types to their replacement values.
- staticEntityMappingsFallback: Fallback option for static entity mappings. Allowed values: 'entity', 'place_holder', 'skip', 'error'. Default: 'error'.
Example:
replacer = Replacer() \
.setInputCols("chunk", "sentence")\
.setOutputCol("doc")\
.setUseReplacement(True)\
.setNoneValuesTo("prioritize_static_entity") \
.setPlaceHolder("******") \
.setPlaceHolderDelimiters(["<", ">"]) \
.setReturnEntityMappings(True) \
.setMappingsColumn("mappings") \
.setStaticEntityMappings({"TREATMENT": "MEDICATION", "TEST": "ACTIVITY"}) \
.setStaticEntityMappingsFallback("entity")
sample_text = "A 32-year-old woman with a history of type 2 diabetes, previously managed for gestational diabetes. Her treatment regimen included metformin. At her check-up, she was found to be dehydrated, though without abdominal discomfort. Important lab results showed a glucose level of 130 mg/dL, triglycerides at 450 mg/dL, and venous pH of 7.30."
data = spark.createDataFrame([[sample_text]]).toDF("text")
Result:
A 32-year-old woman with a history of <PROBLEM>, previously managed for <PROBLEM>., MEDICATION included MEDICATION., At ACTIVITY, she was found to be <PROBLEM>, though without <PROBLEM>., Important lab results showed ACTIVITY of 130 mg/dL, ACTIVITY at 450 mg/dL, and ACTIVITY of 7.30.
Please check the Replacer MOOC Notebook for more information.
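The static-mapping behaviour described above can be illustrated with a small plain-Python sketch. It is not the Replacer implementation and does not use Spark NLP. The function `replace_chunks`, the helper `span`, the `(begin, end, label)` chunk format, and the exact meaning given to each fallback value are assumptions made for illustration, chosen to be consistent with the result shown above (mapped types such as TREATMENT become MEDICATION, unmapped types such as PROBLEM become `<PROBLEM>` under the 'entity' fallback).

```python
from typing import Dict, List, Tuple

# Hypothetical chunk format for this sketch: (begin, end_exclusive, entity_label)
Chunk = Tuple[int, int, str]


def replace_chunks(
    text: str,
    chunks: List[Chunk],
    static_mappings: Dict[str, str],
    fallback: str = "error",
    place_holder: str = "******",
    delimiters: Tuple[str, str] = ("<", ">"),
) -> Tuple[str, List[Tuple[str, str]]]:
    """Replace each chunk with its static mapping, or apply the fallback (assumed semantics)."""
    pieces, mappings, cursor = [], [], 0
    for begin, end, label in sorted(chunks):
        original = text[begin:end]
        if label in static_mappings:
            # A static mapping exists for this entity type, so it takes priority.
            replacement = static_mappings[label]
        elif fallback == "entity":
            # Assumed: fall back to the entity label wrapped in the delimiters.
            replacement = f"{delimiters[0]}{label}{delimiters[1]}"
        elif fallback == "place_holder":
            # Assumed: fall back to the fixed placeholder string.
            replacement = place_holder
        elif fallback == "skip":
            # Assumed: leave the original chunk text untouched.
            continue
        else:
            # 'error' (and any unknown value) refuses to guess a replacement.
            raise ValueError(f"No static mapping for entity type {label!r}")
        pieces.append(text[cursor:begin])
        pieces.append(replacement)
        mappings.append((original, replacement))
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), mappings


def span(text: str, phrase: str, label: str) -> Chunk:
    """Build a chunk tuple by locating the first occurrence of phrase in text."""
    begin = text.index(phrase)
    return (begin, begin + len(phrase), label)


text = "Her treatment regimen included metformin for type 2 diabetes."
chunks = [span(text, "metformin", "TREATMENT"), span(text, "type 2 diabetes", "PROBLEM")]
static = {"TREATMENT": "MEDICATION", "TEST": "ACTIVITY"}

replaced, entity_mappings = replace_chunks(text, chunks, static, fallback="entity")
print(replaced)
print(entity_mappings)
```

The returned list of `(original, replacement)` pairs plays the role that the mappings column plays in the annotator: it keeps the link between each replaced chunk and its original text.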
New Speed Benchmarks for Multi-NER Pipelines
- 64 Cores:
  - Driver: Standard_D4s_v3, 4 core, 16 GB memory
  - Worker: Standard_D4s_v2, 8 core, 28 GB memory
  - Total worker number: 8
  - input_data_rows: 1000
action | partition | NER timing | 2_NER timing | 4_NER timing | NER+RE timing |
---|---|---|---|---|---|
write_parquet | 4 | 1 min 36 sec | 3 min 1 sec | 6 min 32 sec | 3 min 12 sec |
write_deltalake | 4 | 1 min 38 sec | 3 min 2 sec | 6 min 30 sec | 3 min 18 sec |
write_parquet | 8 | 48 sec | 1 min 32 sec | 3 min 21 sec | 1 min 38 sec |
write_deltalake | 8 | 51 sec | 1 min 36 sec | 3 min 26 sec | 1 min 43 sec |
write_parquet | 16 | 28 sec | 1 min 16 sec | 2 min 2 sec | 56 sec |
write_deltalake | 16 | 31 sec | 57 sec | 2 min 2 sec | 58 sec |
write_parquet | 32 | 20 sec | 39 sec | 1 min 22 sec | 50 sec |
write_deltalake | 32 | 22 sec | 41 sec | 1 min 45 sec | 35 sec |
write_parquet | 64 | 17 sec | 31 sec | 1 min 8 sec | 27 sec |
write_deltalake | 64 | 17 sec | 32 sec | 1 min 11 sec | 29 sec |
write_parquet | 100 | 18 sec | 33 sec | 1 min 13 sec | 30 sec |
write_deltalake | 100 | 20 sec | 33 sec | 1 min 32 sec | 32 sec |
write_parquet | 1000 | 22 sec | 36 sec | 1 min 12 sec | 31 sec |
write_deltalake | 1000 | 23 sec | 34 sec | 1 min 33 sec | 52 sec |
Please check NER (BiLSTM-CNN-Char Architecture) Benchmark Experiment for more details.
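To compare rows of the table numerically, the timings can be converted to seconds. The sketch below is an illustration only: the helper `to_seconds` is a hypothetical name, and the values are copied from the NER timing column of the write_parquet rows above.

```python
import re


def to_seconds(timing: str) -> int:
    """Convert a timing such as '1 min 36 sec' or '48 sec' to seconds."""
    match = re.fullmatch(r"\s*(?:(\d+)\s*min)?\s*(?:(\d+)\s*sec)?\s*", timing)
    if match is None or not any(match.groups()):
        raise ValueError(f"Unrecognized timing: {timing!r}")
    minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return minutes * 60 + seconds


# NER timing for write_parquet, keyed by partition count (copied from the table).
ner_write_parquet = {
    4: "1 min 36 sec",
    8: "48 sec",
    16: "28 sec",
    32: "20 sec",
    64: "17 sec",
    100: "18 sec",
    1000: "22 sec",
}

seconds = {partitions: to_seconds(t) for partitions, t in ner_write_parquet.items()}
best_partition = min(seconds, key=seconds.get)
speedup = seconds[4] / seconds[best_partition]

print(f"Fastest partition count: {best_partition} ({seconds[best_partition]} sec)")
print(f"Speedup over 4 partitions: {speedup:.2f}x")
```

For this column, the shortest time is reached at 64 partitions, which matches the total number of worker cores, and larger partition counts are slightly slower.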
New Blog Posts on Various Topics (AI for Equity, Detecting Stigmatizing Language from Medical Texts, Subcohort Analysis for Oncology Patients, Using Small LLMs to Extract Structured Named Entities, …)
Explore the latest developments in healthcare NLP through our new blog posts, where we take a deep dive into the innovative technologies and methodologies transforming the medical field. These posts offer insights into how the latest tools are being used to analyze large amounts of unstructured data, identify critical medical assets, and extract meaningful patterns and correlations. Learn how these advances are not only improving our understanding of complex health issues but also contributing to more effective prevention, diagnosis, and treatment strategies.
- AI for Equity: Extracting Stigmatizing Language from Medical Texts for Better Patient Care: The primary aim of medicine is to help patients manage and improve their health, but the language used in medical settings can sometimes have unintended negative effects. Stigmatizing language in medical records can reduce patients to their conditions or pass judgment on their illnesses, leading to poorer care and worsening health disparities. This article discusses how such language can be recognized and addressed using a model developed by John Snow Labs.
- Refining Entity Detection in Healthcare NLP: Precision Through Entity Filtering: This blog post discusses the ChunkFilterer annotator in Healthcare NLP, emphasizing its role in refining Named Entity Recognition (NER) for precise healthcare applications. It highlights the annotator’s capabilities to filter entities using whitelists, blacklists, regular expressions, and confidence scores, enabling users to focus on relevant information from unstructured clinical texts. By integrating ChunkFilterer into NLP pipelines, healthcare professionals can enhance data accuracy and efficiency, leading to improved patient diagnoses, treatment recommendations, and information retrieval from medical databases.
- Accurate Extracting of Cancer Biomarkers from Free-Text Clinical Notes: This blog post examines the role of biomarkers in enhancing cancer diagnosis through their extraction from unstructured clinical notes using advanced NLP techniques. It highlights the challenges posed by the inconsistent and complex nature of clinical documentation and how the Healthcare NLP library, developed by John Snow Labs, addresses these through specialized models for entity extraction, relation extraction, and sequence classification. By accurately identifying and analyzing biomarkers from clinical texts, the library supports personalized cancer treatment, improves diagnosis and prognosis, and accelerates cancer research, ultimately enhancing patient outcomes and advancing our understanding of cancer biology.
- From Diagnosis to Prognosis: Understanding 6 Common Cancers in Medical Records: This blog post details the application of John Snow Labs’ Healthcare NLP and LLM library in revolutionizing cancer care through enhanced analysis of medical records. Concentrating on six prevalent cancer types, it showcases how sophisticated natural language processing aids in refining diagnosis, prognosis, and personalized treatment plans by extracting and interpreting vital data from unstructured clinical texts. This technology is pivotal in tackling the rising global cancer rates and improving the overall efficiency and effectiveness of cancer treatments.
- Harnessing Healthcare-Specific LLMs for Clinical Entity Extraction: This blog post focuses on how JSL-MedS-NER models are optimized to extract clinical entities from unstructured medical text. These models identify critical information like drug names, diagnoses, side effects, and protected health information (PHI) using quantization options (q4, q8, q16) that balance speed and accuracy. It highlights the models’ application in pharmacovigilance, oncology reporting, and clinical data processing. By enabling the identification of drugs, adverse events, and medical conditions, the models support clinical decision-making and data privacy compliance across healthcare systems.
Various Core Improvements: Bug Fixes, Enhanced Overall Robustness, and Reliability of Spark NLP for Healthcare
- Added the RegexMatcherInternalModel trait to Scala to match the pretrained method available in Python.
- Added a new parameter to the Flattener annotator that sets an array of column names to keep in the dataframe after the flattening process.
- Added the resetSentenceIndices parameter to the ChunkMerger, NerConverterInternal, and ChunkConverter annotators to reset sentence indices so that the entire output is treated as if it originates from a single document.
- Fixed the Generative AI Lab API task deletion endpoint: resolved an issue with the tasks_delete endpoint, enabling proper deletion of tasks via the API.
- Added the chunk_validation_options dictionary to the dict_to_annotation_converter module for converting dictionary data to Spark NLP annotations.
- Added the pretrained feature to InternalDocumentSplitter.
- Deprecated the nlp_test module in spark-nlp-jsl; future development will now be managed by LangTest. Streamlined the spark-nlp-jsl package for improved efficiency.
- Added support for ONNX models in the ChunkKeyPhraseExtraction annotator, allowing for compatibility with ONNX-based models.
Updated Notebooks And Demonstrations For Making Spark NLP For Healthcare Easier To Navigate And Understand
- New REChunkMerger MOOC Notebook for merging the entities in a relationship.
- New ContextualEntityFilterer MOOC Notebook to filter chunks.
- Updated Replacer MOOC Notebook with new returnEntityMappings, mappingsColumn, staticEntityMappings, and staticEntityMappingsFallback parameters.
- Updated NerConverterInternal MOOC Notebook with the resetSentenceIndices parameter.
- Updated ChunkConverter MOOC Notebook with the resetSentenceIndices parameter.
- Updated ChunkMergeModel MOOC Notebook with the resetSentenceIndices parameter.
- Updated Rule Based Entity Matchers notebook with the country_matcher and state_matcher models.
- Updated Clinical DeIdentification notebook with new models in the model list.
- Updated Prepare CoNLL from Annotations for NER notebook with an alternative method for creating a CoNLL file.
- Updated Contextual Parser Rule Based NER notebook with the zip_parser model.
- Updated Loading Medical and Open Source LLMs notebook with new LLM models.
- Updated ONCOLOGY Demo with classification models.
We Have Added And Updated A Substantial Number Of New Clinical Models And Pipelines, Further Solidifying Our Offering In The Healthcare Domain.
biolordresolve_icd10cm_augmented_billable_hcc
sbiobertresolve_hcc_augmented
sbiobertresolve_meddra_lowest_level_term
sbiobertresolve_meddra_preferred_term
meddra_llt_pt_mapper
meddra_pt_llt_mapper
sbertresolve_icd10cm_augmented
icd10_meddra_llt_mapper
icd10_meddra_pt_mapper
meddra_llt_icd10_mapper
meddra_pt_icd10_mapper
sbertresolve_hcc_augmented
sbertresolve_icd10cm_augmented_billable_hcc
meddra_hlt_pt_mapper
meddra_llt_snomed_mapper
meddra_pt_hlt_mapper
sbiobertresolve_icd10cm_generalised_augmented
snomed_meddra_llt_mapper
meddra_llt_resolver_pipeline
meddra_pt_resolver_pipeline
explain_clinical_doc_vop_small
explain_clinical_doc_cancer_type
sentence_detector_dl_healthcare_v2_wip
explain_clinical_doc_oncology
ndc_resolver_pipeline
explain_clinical_doc_cancer_type
sbiobertresolve_ndc
state_matcher
country_matcher
zip_parser
jsl_meds_q16_v2
jsl_meds_q8_v2
jsl_meds_q4_v2
jsl_meds_q16_v3
jsl_meds_q8_v3
jsl_meds_q4_v3
oncology_biomarker_pipeline
rxnorm_resolver_pipeline
icd10cm_rxnorm_resolver_pipeline
medication_resolver_pipeline
medication_resolver_transform_pipeline
ner_deid_subentity_augmented_v2
ner_deid_generic_docwise
ner_deid_subentity_augmented_docwise
ner_deid_subentity_docwise
ner_deid_aipii
explain_clinical_doc_sdoh_small
icd10cm_chronic_indicator_mapper
jsl_meds_ner_q4_v2
jsl_meds_ner_q8_v2
jsl_meds_ner_q16_v2
jsl_medm_q4_v1
jsl_medm_q8_v1
jsl_medm_q16_v1
jsl_medsner_zs_q4_v1
jsl_medsner_zs_q8_v1
jsl_medsner_zs_q16_v1
jsl_meds_q4_v1
jsl_meds_q8_v1
jsl_meds_q16_v1
jsl_meds_rag_q4_v1
jsl_meds_rag_q8_v1
jsl_meds_rag_q16_v1
clinical_deidentification_docwise_wip
clinical_deidentification_v2_wip
clinical_deidentification_nameAugmented_v2
ner_ade_clinical_v2
bert_sequence_classifier_ade_augmented_v2
sbiobertresolve_loinc
sbiobertresolve_loinc_augmented
sbiobertresolve_loinc_numeric
For all Spark NLP for Healthcare models, please check: Models Hub Page