5.5.1
Highlights
We are delighted to announce remarkable enhancements and updates in our latest release of Spark NLP for Healthcare. This release comes with the brand new PretrainedZeroShotNER, ContextualEntityRuler, and StructuredJsonConverter annotators, as well as 39 new and updated clinical pretrained models and pipelines.
- Introducing a brand new PretrainedZeroShotNER annotator to extract named entities with no annotation or additional training, for any arbitrary label (coming with 12 zero-shot models that are already finetuned on in-house annotations)
- New ContextualEntityRuler annotator for customizing named entities based on contextual rules (modifying chunks via inclusion and exclusion criteria)
- New StructuredJsonConverter annotator for prettified annotation results and enhanced data processing (returning structured JSON outputs from Spark NLP pipelines)
- Majority voting for overlapping annotations in AssertionMerger (picking the optimal assertion status coming from multiple models)
- New rule-based contextual parser and entity matcher models to customize De-Identification pipelines
- Introducing 5 new named entity recognition (NER) models and pipelines to detect German PHI data for deidentification with minimal customization
- Introducing 2 new RxNorm resolution models for mapping medication entities to RxNorm terminology, using SOTA MedEmbed sentence embeddings
- Optimizing Spark driver memory allocation to utilize all available resources by default
- Databricks support for MedicalLLM and LLMLoader to load and run finetuned medical LLMs
- New blog posts on various topics
- Various core improvements: bug fixes, enhanced overall robustness and reliability of Spark NLP for Healthcare
  - The AnnotationLab supports the ssl_verification parameter, allowing users to disable SSL certificate verification
  - Fixed an entity filtering issue in PipelineTracer, which now supports PretrainedZeroShotNER and MedicalBertForTokenClassification
  - Fixed a hanging issue when setting setSelectMostDifferent(True) in the ChunkKeyPhraseExtraction annotator by updating the underlying algorithm
  - Updated the faker output generation algorithm in the Deidentification annotator, making it more sensitive for “names”
  - Deidentification supports new faker labels such as “location_other” and “company” entities
  - Added a new setKeepMonth parameter to Deidentification to keep month data intact
- Updated notebooks and demonstrations to make Spark NLP for Healthcare easier to navigate and understand
  - New Deidentification_Custom_Pretrained_Pipelines Notebook
  - New Contextual_Entity_Ruler Notebook
  - Updated Contextual_Parser_Rule_Based_NER Notebook
  - Updated ZeroShot_Clinical_NER Notebook
  - Updated Rule_Based_Entity_Matchers Notebook
  - Updated PipelineTracer_and_PipelineOutputParser Notebook
  - New NER_GENE_PHENOTYPES Demo
  - New NER_VACCANIES Demo
  - Updated HCC_RISK_SCORE Demo
- The addition and update of numerous new clinical models and pipelines continue to reinforce our offering in the healthcare domain
These enhancements will elevate your experience with Spark NLP for Healthcare, enabling more efficient, accurate, and streamlined analysis of healthcare-related natural language data.
PretrainedZeroShotNER Annotator to Extract Named Entities with No Annotation or Additional Training, for Any Arbitrary Label
Pretrained-Zero-Shot Named Entity Recognition (NER) enables the identification of entities in text with minimal effort. By leveraging pre-trained language models and contextual understanding, zero-shot NER extends entity recognition capabilities to new domains and languages. While the model card includes default labels as examples, it is important to highlight that users are not limited to these labels. The model is designed to support any set of entity labels, allowing users to adapt it to their specific use cases. For best results, it is recommended to use labels that are conceptually similar to the provided defaults.
Model Name | Description | Predicted Entities |
---|---|---|
zeroshot_ner_generic_large | This model extracts generic entities | AGE, DATE, DISEASE, DISORDER, DRUG, LOCATION, NAME, PHONE, RESULT, SYMPTOM, SYNDROME, TEST, TREATMENT |
zeroshot_ner_generic_medium | This model extracts generic entities | AGE, DATE, DISEASE, DISORDER, DRUG, LOCATION, NAME, PHONE, RESULT, SYMPTOM, SYNDROME, TEST, TREATMENT |
zeroshot_ner_clinical_large | This model extracts clinical entities | PROBLEM, TREATMENT, TEST |
zeroshot_ner_clinical_medium | This model extracts clinical entities | PROBLEM, TREATMENT, TEST |
zeroshot_ner_deid_generic_docwise_large | This model extracts demographic entities | AGE, CONTACT, DATE, ID, LOCATION, NAME, PROFESSION |
zeroshot_ner_deid_generic_docwise_medium | This model extracts demographic entities | AGE, CONTACT, DATE, ID, LOCATION, NAME, PROFESSION |
zeroshot_ner_oncology_large | This model extracts oncological entities | Adenopathy, Age, Biomarker, Biomarker_Result, Body_Part, Cancer_Dx, Cancer_Surgery, Cycle_Count, Cycle_Day, Date, Death_Entity, Direction, Dosage, Duration, Frequency, Gender, Grade, Histological_Type, Imaging_Test, Invasion, Metastasis, Oncogene, Pathology_Test, Race_Ethnicity, Radiation_Dose, Relative_Date, Response_To_Treatment, Route, Smoking_Status, Staging, Therapy, Tumor_Finding, Tumor_Size |
zeroshot_ner_oncology_medium | This model extracts oncological entities | Adenopathy, Age, Biomarker, Biomarker_Result, Body_Part, Cancer_Dx, Cancer_Surgery, Cycle_Count, Cycle_Day, Date, Death_Entity, Direction, Dosage, Duration, Frequency, Gender, Grade, Histological_Type, Imaging_Test, Invasion, Metastasis, Oncogene, Pathology_Test, Race_Ethnicity, Radiation_Dose, Relative_Date, Response_To_Treatment, Route, Smoking_Status, Staging, Therapy, Tumor_Finding, Tumor_Size |
zeroshot_ner_vop_medium | This model extracts Voice of the Patients (VOP) entities | AdmissionDischarge, Age, Allergen, BodyPart, ClinicalDept, DateTime, Disease, Dosage, Drug, Duration, Employment, Form, Frequency, Gender, Laterality, MedicalDevice, Modifier, Procedure, PsychologicalCondition, RaceEthnicity, Substance, Symptom, Test, Treatment, Vaccine |
zeroshot_ner_vop_large | This model extracts Voice of the Patients (VOP) entities | AdmissionDischarge, Age, Allergen, BodyPart, ClinicalDept, DateTime, Disease, Dosage, Drug, Duration, Employment, Form, Frequency, Gender, Laterality, MedicalDevice, Modifier, Procedure, PsychologicalCondition, RaceEthnicity, Substance, Symptom, Test, Treatment, Vaccine |
Example:
# You can change the labels
labels = ['DOCTOR', 'PATIENT', 'AGE', 'DATE', 'HOSPITAL', 'CITY', 'STREET', 'STATE', 'COUNTRY', 'PHONE', 'IDNUM', 'EMAIL', 'ZIP', 'ORGANIZATION', 'PROFESSION', 'USERNAME']
pretrained_zero_shot_ner = sparknlp_jsl.annotator.PretrainedZeroShotNER()\
.pretrained("zeroshot_ner_deid_subentity_merged_medium", "en", "clinical/models")\
.setInputCols("sentence", "token")\
.setOutputCol("ner")\
.setPredictionThreshold(0.5)\
.setLabels(labels)
text = """Dr. John Lee, from Royal Medical Clinic in Chicago, attended to the patient on 11/05/2024.
The patient’s medical record number is 56467890. The patient, Emma Wilson, is 50 years old, her Contact number: 444-456-7890 .
Dr. John Taylor, ID: 982345, a cardiologist at St. Mary's Hospital in Boston, was contacted on 05/10/2023 regarding a 45-year-old.
"""
Result:
chunk | begin | end | ner_label |
---|---|---|---|
John Lee | 4 | 11 | DOCTOR |
Royal Medical Clinic | 19 | 38 | HOSPITAL |
Chicago | 43 | 49 | CITY |
11/05/2024 | 80 | 89 | DATE |
56467890 | 131 | 138 | IDNUM |
Emma Wilson | 154 | 164 | PATIENT |
50 | 170 | 171 | AGE |
444-456-7890 | 205 | 216 | PHONE |
John Taylor | 224 | 234 | DOCTOR |
982345 | 241 | 246 | IDNUM |
cardiologist | 251 | 262 | PROFESSION |
St. Mary’s Hospital | 267 | 285 | HOSPITAL |
Boston | 290 | 295 | CITY |
05/10/2023 | 315 | 324 | DATE |
45-year-old | 338 | 348 | AGE |
Example for Changing the Labels:
# You can change the labels, e.g., by grouping them such as DOCTOR -> NAME, PATIENT -> NAME, ...
labels = ['NAME', 'AGE', 'DATE', 'LOCATION', 'IDNUM', 'ORGANIZATION', 'PROFESSION']
pretrained_zero_shot_ner = sparknlp_jsl.annotator.PretrainedZeroShotNER()\
.pretrained("zeroshot_ner_deid_subentity_merged_medium", "en", "clinical/models")\
.setInputCols("sentence", "token")\
.setOutputCol("ner")\
.setPredictionThreshold(0.5)\
.setLabels(labels)
Result:
chunk | begin | end | ner_label |
---|---|---|---|
John Lee | 4 | 11 | NAME |
Royal Medical Clinic | 19 | 38 | ORGANIZATION |
Chicago | 43 | 49 | LOCATION |
11/05/2024 | 80 | 89 | DATE |
56467890 | 131 | 138 | IDNUM |
Emma Wilson | 154 | 164 | NAME |
50 | 170 | 171 | AGE |
444-456-7890 | 205 | 216 | IDNUM |
John Taylor | 224 | 234 | NAME |
982345 | 241 | 246 | IDNUM |
cardiologist | 251 | 262 | PROFESSION |
St. Mary’s Hospital | 267 | 285 | ORGANIZATION |
Boston | 290 | 295 | LOCATION |
05/10/2023 | 315 | 324 | DATE |
45-year-old | 338 | 348 | AGE |
Please check the ZeroShot Clinical NER Notebook for more information
ContextualEntityRuler Annotator Customizing Named Entities Based on Contextual Rules (Modifying Chunks via Inclusion and Exclusion Criteria)
ContextualEntityRuler
is an annotator that updates chunks based on contextual rules. These rules are defined in the form of dictionaries and can include prefixes, suffixes, and the context within a specified scope window around the chunk.
This annotator modifies detected chunks by replacing content based on matching patterns and rules. It is particularly useful for refining entity recognition results in domain-specific text processing.
Example:
rules = [
{
"entity" : "Age",
"scopeWindow" : [15,15],
"scopeWindowLevel" : "char",
"suffixPatterns" : ["years old", "year old", "months",],
"replaceEntity" : "Modified_Age",
"mode" : "exclude"
},
{
"entity" : "Diabetes",
"scopeWindow" : [3,3],
"scopeWindowLevel" : "token",
"suffixPatterns" : ["with complications"],
"replaceEntity" : "Modified_Diabetes",
"mode" : "include"
},
{
"entity" : "Date",
"suffixRegexes" : ["\\d{4}"],
"replaceEntity" : "Modified_Date",
"mode" : "include"
},
{
"entity" : "Name",
"scopeWindow" : [3,3],
"scopeWindowLevel" : "token",
"prefixPatterns" : ["MD","M.D"],
"replaceEntity" : "Modified_Name",
"mode" : "include"
}
]
contextual_entity_ruler = ContextualEntityRuler() \
.setInputCols("sentence", "token", "ner_chunks") \
.setOutputCol("ruled_ner_chunks") \
.setRules(rules) \
.setCaseSensitive(False)\
.setDropEmptyChunks(True)\
.setAllowPunctuationInBetween(True)
text = """ M.D John Snow assessed the 36 years old who has a history of the diabetes mellitus with complications in May, 2006"""
data = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(data).transform(data)
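The pipeline referenced above is not defined in the snippet. A minimal sketch of how it might be assembled follows; the upstream NER stages (embeddings_clinical, ner_jsl, NerConverterInternal) are illustrative assumptions, and any NER model producing the ner_chunks column would work:
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, WordEmbeddingsModel
from sparknlp_jsl.annotator import MedicalNerModel, NerConverterInternal

# Upstream stages producing the "ner_chunks" column consumed by the ruler (illustrative choice of NER model)
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence_detector = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
tokenizer = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")
ner_model = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")
ner_converter = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunks")

pipeline = Pipeline(stages=[
    document_assembler, sentence_detector, tokenizer,
    embeddings, ner_model, ner_converter,
    contextual_entity_ruler
])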
NER Result:
entity | begin | end | ruled_ner_chunks_result |
---|---|---|---|
Name | 5 | 13 | John Snow |
Age | 28 | 39 | 36 years old |
Diabetes | 66 | 82 | diabetes mellitus |
Date | 106 | 108 | May |
Date | 111 | 114 | 2006 |
Result after ContextualEntityRuler:
entity | begin | end | ruled_ner_chunks_result | description |
---|---|---|---|---|
Modified_Name | 1 | 13 | M.D John Snow | M.D included |
Modified_Age | 28 | 29 | 36 | years old excluded |
Modified_Diabetes | 66 | 101 | diabetes mellitus with complications | with complications included |
Modified_Date | 106 | 114 | May, 2006 | 2006 included |
Please check the Contextual_Entity_Ruler Notebook for more information
StructuredJsonConverter Annotator for Prettified Annotation Results and Enhanced Data Processing (Returning Structured JSON Outputs from Spark NLP Pipelines)
This annotator integrates seamlessly with existing frameworks to process outputs from pretrained pipelines, delivering structured, easy-to-read results in a dictionary format. Optimized for API integration and user-friendly outputs, it converts raw annotations into a prettified, structured JSON format for streamlined data analysis workflows. With configurable schema mappings, defined via column_maps to align output columns with pipeline requirements, it handles diverse annotation types, including entities, assertions, resolutions, relations, summaries, deidentifications, and classifications.
Example:
biomarker_pipeline = PretrainedPipeline("explain_clinical_doc_biomarker", "en", "clinical/models")
text = """In the bone- marrow (BM) aspiration, blasts accounted for 88.1% of ANCs, which were positive for CD9 and CD10 on flow cytometry.
Measurements of serum tumor markers showed elevated level of Cyfra21-1: 4.77 ng/mL, NSE: 19.60 ng/mL, and SCCA: 2.58 ng/mL.
Immunohistochemical staining showed positive staining for CK5/6, P40, and negative staining for TTF-1 and weakly positive staining for ALK
"""
from pyspark.sql import types as T  # needed for T.StringType() below
data = spark.createDataFrame([text], T.StringType()).toDF("text")
result_df = biomarker_pipeline.transform(data)
column_maps = {
'document_identifier': '',
'document_text': 'document',
'entities': ['merged_chunk'],
'assertions': [],
'resolutions': [],
'relations': ['re_oncology_biomarker_result_wip'],
'summaries': [],
'deidentifications': [],
'classifications': [{
'classification_column_name': 'prediction',
'sentence_column_name': 'sentence'
}]
}
output_converter = StructuredJsonConverter()\
.setOutputCol("result")\
.setConverterSchema(column_maps)\
.setCleanAnnotations(False)\
.setReturnRelationEntities(True)
json_output = output_converter.transform(result_df).select("result")
json_output.show(truncate=200)
Result:
{
"document_identifier": "529c4f43-ea61-4e81-a0b1-3cdb87627cc2",
"document_text": ['In the bone- marrow (BM) aspiration, blasts accounted for 88.1% of ANCs, which were positive for CD9 and CD10 on flow cytometry. \n Measurements of serum tumor markers showed elevated level of Cyfra21-1: 4.77 ng/mL, NSE: 19.60 ng/mL, and SCCA: 2.58 ng/mL. \n Immunohistochemical staining showed positive staining for CK5/6, P40, and negative staining for TTF-1 and weakly positive staining for ALK.'
],
"entities": [
{'ner_label': 'Biomarker_Result', 'sentence': '0', 'chunk': 'positive', 'end': '91', 'ner_source': 'ner_oncology_chunk', 'ner_confidence': '0.9672', 'begin': '84', 'chunk_id': 'bc15add6'},
{'ner_label': 'Biomarker', 'sentence': '0', 'chunk': 'CD9', 'end': '99', 'ner_source': 'ner_oncology_chunk', 'ner_confidence': '0.992', 'begin': '97', 'chunk_id': 'b473fd80'},
{'ner_label': 'Biomarker', 'sentence': '0', 'chunk': 'CD10', 'end': '108', 'ner_source': 'ner_oncology_chunk', 'ner_confidence': '0.9987', 'begin': '105', 'chunk_id': '0252d08a'},
{'ner_label': 'Biomarker', 'sentence': '1', 'chunk': 'tumor markers', 'end': '170', 'ner_source': 'ner_oncology_chunk', 'ner_confidence': '0.48290002', 'begin': '158', 'chunk_id': 'ddab7cc4'},
{'ner_label': 'Biomarker_Result', 'sentence': '1', 'chunk': 'elevated level', 'end': '192', 'ner_source': 'ner_oncology_chunk', 'ner_confidence': '0.90779996', 'begin': '179', 'chunk_id': 'a9c78d75'},
...
],
"assertions": [],
"resolutions": [],
"relations": [
{'entity1_begin': '84', 'chunk1': 'positive', 'chunk2': 'CD9', 'entity2_begin': '97', 'confidence': '0.9944541', 'entity2_end': '99', 'chunk1_id': 'bc15add6', 'chunk2_id': 'b473fd80', 'relation': 'is_finding_of', 'entity1': 'Biomarker_Result', 'entity2': 'Biomarker', 'entity1_end': '91', 'direction': 'both'},
{'entity1_begin': '84', 'chunk1': 'positive', 'chunk2': 'CD10', 'entity2_begin': '105', 'confidence': '0.9989317', 'entity2_end': '108', 'chunk1_id': 'bc15add6', 'chunk2_id': '0252d08a', 'relation': 'is_finding_of', 'entity1': 'Biomarker_Result', 'entity2': 'Biomarker', 'entity1_end': '91', 'direction': 'both'},
{'entity1_begin': '158', 'chunk1': 'tumor markers', 'chunk2': 'elevated level', 'entity2_begin': '179', 'confidence': '0.8983218', 'entity2_end': '192', 'chunk1_id': 'ddab7cc4', 'chunk2_id': 'a9c78d75', 'relation': 'is_finding_of', 'entity1': 'Biomarker', 'entity2': 'Biomarker_Result', 'entity1_end': '170', 'direction': 'both'},
...
],
"summaries": [],
"deidentifications": [],
"classifications": [
{'classification': '1', 'sentence': 'In the bone- marrow (BM) aspiration, blasts accounted for 88.1% of ANCs, which were positive for CD9 and CD10 on flow cytometry.', 'sentence_id': '0'},
{'classification': '1', 'sentence': 'Measurements of serum tumor markers showed elevated level of Cyfra21-1: 4.77 ng/mL, NSE: 19.60 ng/mL, and SCCA: 2.58 ng/mL.', 'sentence_id': '1'},
{'classification': '1', 'sentence': 'Immunohistochemical staining showed positive staining for CK5/6, P40, and negative staining for TTF-1 and weakly positive staining for ALK.', 'sentence_id': '2'}
]
}
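To consume the converted output programmatically (e.g., in an API response), the result column can be collected into plain Python objects. A minimal sketch, assuming the structure shown above:
# Collect the structured results into Python objects for downstream use
records = [row["result"] for row in json_output.collect()]
first_doc = records[0]

print(first_doc["document_identifier"])
print(first_doc["entities"][:3])        # first few extracted entities
print(first_doc["classifications"])     # sentence-level classifications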
Please check the PipelineTracer and PipelineOutputParser Notebook for more information
Majority Voting for Overlapping Annotations in AssertionMerger (Picking the Optimal Assertion Status Coming from Multiple Models)
When there are multiple annotations in the same overlapping group, majority voting is used to resolve conflicts. This method helps determine the final selected annotation based on the most frequent or strongest assertion.
- If confidence is specified as an ordering feature, then the sum of confidence scores for each assertion type is used for majority voting. The assertion type with the highest total confidence score is selected.
Example-1:
In a pipeline consisting of AssertionDL, FewShotAssertion, and ContextualAssertion models:

Model | Assertion Type | Confidence Score |
---|---|---|
FewShotAssertion | possible | 0.90 |
AssertionDL | present | 0.85 |
ContextualAssertion | present | 0.80 |

- Here, confidence is used as an ordering feature.
- Total confidence score for present = 0.85 + 0.80 = 1.65.
- Total confidence score for possible = 0.90.
- Since the total confidence score for present is higher, the selected annotation is present.
Example-2:
Model | Assertion Type | Confidence Score |
---|---|---|
FewShotAssertion | possible | 0.90 |
AssertionDL | present | 0.30 |
ContextualAssertion | present | 0.50 |

- Total confidence score for present = 0.30 + 0.50 = 0.80.
- Total confidence score for possible = 0.90.
- Since the total confidence score for possible is higher, the selected annotation is possible.
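The arithmetic in Example-2 can be reproduced in a few lines of plain Python (illustrative only; this is not the library's internal implementation):
from collections import defaultdict

# Overlapping annotations from Example-2: (model, assertion type, confidence)
annotations = [
    ("FewShotAssertion", "possible", 0.90),
    ("AssertionDL", "present", 0.30),
    ("ContextualAssertion", "present", 0.50),
]

# Sum confidence scores per assertion type and pick the highest total
totals = defaultdict(float)
for _, assertion, confidence in annotations:
    totals[assertion] += confidence

selected = max(totals, key=totals.get)
print({k: round(v, 2) for k, v in totals.items()})  # {'possible': 0.9, 'present': 0.8}
print(selected)                                      # possible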
Example:
assertion_merger = AssertionMerger() \
.setInputCols(["assertion_dl","assertion_jsl", "assertion_fewshot"]) \
.setOutputCol("assertion_merger")\
.setMergeOverlapping(True)\
.setMajorityVoting(True)\
.setOrderingFeatures(["confidence"])
text ="""Intramuscular as well as intravenous Haldol were ordered p.r.n. for emergency use should he become acutely agitated."""
assertion_dl Result:
ner_chunk | begin | end | ner_label | assertion_dl | confidence_assertion_dl |
---|---|---|---|---|---|
he | 402 | 403 | Gender | Hypothetical | 0.947 |
acutely agitated | 412 | 427 | PROBLEM | Hypothetical | 0.974 |
assertion_jsl Result:
ner_chunk | begin | end | ner_label | assertion_jsl | confidence_assertion_jsl |
---|---|---|---|---|---|
he | 402 | 403 | Gender | Family | 1.0 |
acutely agitated | 412 | 427 | PROBLEM | Family | 1.0 |
assertion_fewshot Result:
ner_chunk | begin | end | ner_label | assertion_fewshot | confidence_assertion_fewshot |
---|---|---|---|---|---|
he | 402 | 403 | Gender | Family | 0.887 |
acutely agitated | 412 | 427 | PROBLEM | Hypothetical | 0.887 |
Merged Result:
ner_chunk | begin | end | ner_label | assertion | assertion_source | merge_confidence |
---|---|---|---|---|---|---|
he | 402 | 403 | Gender | Family | assertion_jsl | 1.0 |
acutely agitated | 412 | 427 | PROBLEM | Hypothetical | assertion_dl | 0.974 |
New Rule-Based Contextual Parser and Entity Matcher Models to Customize De-Identification Pipelines
We introduce a suite of rule-based models (contextual parsers, text matchers, and regex matchers) specifically designed to enhance the deidentification and clinical document understanding process with rule-based methods.
Model Name | Description |
---|---|
account_parser | This model extracts account number entities in clinical notes using a rule-based ContextualParserModel annotator. |
age_parser | This model extracts age entities in clinical notes using a rule-based ContextualParserModel annotator. |
country_matcher | This model extracts countries in clinical notes using a rule-based TextMatcherInternal annotator. |
date_matcher | This model extracts date entities in clinical notes using a rule-based RegexMatcherInternal annotator. |
dln_parser | This model extracts driver's license number entities in clinical notes using a rule-based ContextualParserModel annotator. |
license_parser | This model extracts license number entities in clinical notes using a rule-based ContextualParserModel annotator. |
medical_record_parser | This model extracts medical record entities in clinical notes using a rule-based ContextualParserModel annotator. |
phone_parser | This model extracts phone entities in clinical notes using a rule-based ContextualParserModel annotator. |
plate_parser | This model extracts plate number entities in clinical notes using a rule-based ContextualParserModel annotator. |
ssn_parser | This model extracts SSN (Social Security number) entities in clinical notes using a rule-based ContextualParserModel annotator. |
vin_parser | This model extracts vehicle identification number (VIN) entities in clinical notes using a rule-based ContextualParserModel annotator. |
Example:
account_contextual_parser = ContextualParserModel.pretrained("account_parser","en","clinical/models") \
.setInputCols(["sentence", "token"]) \
.setOutputCol("chunk_account")
chunk_converter = ChunkConverter() \
.setInputCols(["chunk_account"]) \
.setOutputCol("ner_chunk")
sample_text = """Name : Hendrickson, Ora, Record date: 2093-01-13, # 719435.
Dr. John Green, ID: 1231511863, IP 203.120.223.13. account: 1234567890120 route number: 123567
He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93.
Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no:A334455B.
Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com."""
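As with the other examples, the parser and chunk converter need the standard preprocessing stages around them. A minimal pipeline sketch (the preprocessing stages are assumed, not part of the pretrained model) is:
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer

# Standard preprocessing stages feeding the rule-based parser defined above
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence_detector = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
tokenizer = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")

parser_pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    account_contextual_parser,
    chunk_converter
])

data = spark.createDataFrame([[sample_text]]).toDF("text")
result = parser_pipeline.fit(data).transform(data)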
Result:
chunk | begin | end | label |
---|---|---|---|
1234567890120 | 120 | 132 | ACCOUNT |
123567 | 148 | 153 | ACCOUNT |
Please check the Rule Based Entity Matchers Notebook for more information
Introducing 5 New Named Entity Recognition (NER) Models and Pipelines to Detect German PHI Data for Deidentification with Minimal Customization
Introducing 5 new named entity recognition (NER) models and pipelines: ner_deid_generic_docwise, ner_deid_subentity_langtest, zeroshot_ner_deid_generic_docwise_large, clinical_deidentification_docwise_wip, and clinical_deidentification_docwise_large_wip. These models work at the document level and are particularly useful for detecting Protected Health Information (PHI) for de-identification in German.
Model Name | Description |
---|---|
ner_deid_generic_docwise_de | This document-level model detects PHI entities for de-identification. (Generic) |
zeroshot_ner_deid_generic_docwise_large_de | This document-level zero-shot NER model detects PHI entities for de-identification. (Generic) |
ner_deid_subentity_langtest_de | This langtest model detects PHI entities for de-identification. (Subentity - Langtest) |
clinical_deidentification_docwise_wip_de | This pipeline can be used to deidentify PHI information from medical texts. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask and obfuscate LOCATION, DATE, NAME, ID, AGE, PROFESSION, CONTACT, ORGANIZATION, DOCTOR, CITY, COUNTRY, STREET, PATIENT, PHONE, HOSPITAL, STATE, DLN, SSN, ZIP, ACCOUNT, LICENSE, PLATE, VIN, MEDICALRECORD, EXCLUDED, EMAIL, and URL entities. |
clinical_deidentification_docwise_large_wip_de | This pipeline can be used to deidentify PHI information from medical texts. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask and obfuscate LOCATION, DATE, NAME, ID, AGE, PROFESSION, CONTACT, ORGANIZATION, DOCTOR, CITY, COUNTRY, STREET, PATIENT, PHONE, HOSPITAL, STATE, DLN, SSN, ZIP, ACCOUNT, LICENSE, PLATE, VIN, MEDICALRECORD, EXCLUDED, EMAIL, and URL entities. |
Example:
deid_ner = MedicalNerModel.pretrained("ner_deid_generic_docwise", "de", "clinical/models")\
.setInputCols(["document", "token", "embeddings"])\
.setOutputCol("ner")
data = spark.createDataFrame([["""Michael Berger wird am Morgen des 12 Dezember 2018 ins St. Elisabeth-Krankenhaus
in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen."""]]).toDF("text")
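Because the document-wise model consumes document, token, and embeddings columns, it needs German word embeddings and the usual preprocessing stages upstream. A minimal sketch follows; the w2v_cc_300d German embeddings are an assumption here, so please confirm the exact embeddings on the model card:
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, WordEmbeddingsModel
from sparknlp_jsl.annotator import NerConverterInternal

document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

# German word embeddings (assumed pairing; check the model card for the required embeddings)
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "de", "clinical/models")\
    .setInputCols(["document", "token"])\
    .setOutputCol("embeddings")

ner_converter = NerConverterInternal()\
    .setInputCols(["document", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[document_assembler, tokenizer, embeddings, deid_ner, ner_converter])
result = pipeline.fit(data).transform(data)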
Result:
chunk | begin | end | ner_label |
---|---|---|---|
Michael Berger | 0 | 13 | NAME |
12 Dezember 2018 | 34 | 49 | DATE |
St. Elisabeth-Krankenhaus | 55 | 79 | LOCATION |
Bad Kissingen | 84 | 96 | LOCATION |
Herr Berger | 112 | 122 | NAME |
76 | 128 | 129 | AGE |
Introducing 2 New RxNorm Resolution Models for Mapping the Medication Entities to RxNorm Terminology, Using SOTA MedEmbed Sentence Embeddings
This release adds 3 cutting-edge resolver models (two RxNorm resolvers using MedEmbed sentence embeddings and one LOINC resolver) designed to enhance clinical entity mapping and coding accuracy. These models leverage advanced natural language processing to seamlessly map medical entities and concepts to standardized codes, facilitating streamlined data analysis and healthcare decision-making.
Model Name | Description |
---|---|
medembed_base_rxnorm_augmented | This model maps clinical entities and concepts (like drugs/ingredients) to RxNorm codes. |
medmebed_large_rxnorm_augmented | This model maps clinical entities and concepts (like drugs/ingredients) to RxNorm codes. |
biolordresolve_loinc_augmented | This model maps medical entities to Logical Observation Identifiers Names and Codes (LOINC) codes using mpnet_embeddings_biolord_2023_c embeddings. |
Example:
rxnorm_resolver = SentenceEntityResolverModel.pretrained("medmebed_large_rxnorm_augmented", "en", "clinical/models")\
.setInputCols(["embeddings"])\
.setOutputCol("rxnorm_code")\
.setDistanceFunction("EUCLIDEAN")
text = "The patient was prescribed aspirin and and Albuterol inhaler, two puffs every 4 hours as needed for asthma. He was seen by the endocrinology service and she was discharged on Coumadin 5 mg with meals , and metformin 1000 mg two times a day and Lisinopril 10 mg daily"
Result:
ner_chunk | entity | rxnorm_code | all_k_resolutions |
---|---|---|---|
aspirin | DRUG | 1191 | aspirin[aspirin]:::Empirin[Empirin]:::aluminum aspirin[aluminum aspirin]:::… |
Albuterol inhaler | DRUG | 1649559 | albuterol Dry Powder Inhaler[albuterol Dry Powder Inhaler]:::albuterol[albuterol]:::… |
Coumadin 5 mg | DRUG | 855333 | warfarin sodium 5 MG [Coumadin]:::warfarin sodium 7.5 MG [Coumadin]:::… |
metformin 1000 mg | DRUG | 316255 | metformin 1000 MG[metformin 1000 MG]:::metformin hydrochloride 1000 MG[metformin…] |
Lisinopril 10 mg | DRUG | 316151 | lisinopril 10 MG[lisinopril 10 MG]:::lisinopril 10 MG Oral Tablet:::… |
Databricks Support for MedicalLLM and LLMLoader to Load/Run Finetuned Medical LLMs
This release enables seamless support for MedicalLLM and LLMLoader on Databricks, facilitating the deployment and management of medical-focused large language models. This integration streamlines data preprocessing, model training, and inference workflows, providing an efficient platform for developing advanced medical AI solutions. Please note that some models may not be supported yet.
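As a rough illustration of the workflow (the loading pattern follows the LLMLoader usage shown in earlier releases, and the model name is illustrative; pick a supported model from the Models Hub for your cluster):
from sparknlp_jsl.llm import LLMLoader

# Load a finetuned medical LLM on the Databricks cluster
# (model name is illustrative; choose a supported one from the Models Hub)
llm_loader = LLMLoader(spark).pretrained("jsl_meds_q16_v2", "en", "clinical/models")

response = llm_loader.generate("Summarize the key findings of the clinical note above.")
print(response)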
Optimizing Spark Driver Memory Allocation to Utilize all the Available Resources by Default.
The Spark driver is optimized to utilize 100% of available system memory for spark.driver.memory
by default. If this allocation isn’t viable due to system constraints, Spark automatically reverts to a default memory allocation of 32GB, ensuring stability and adaptability in resource-limited environments.
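If the default allocation needs to be overridden, explicit Spark properties can still be passed when starting the session. A minimal sketch, assuming the params argument of sparknlp_jsl.start:
import sparknlp_jsl

# Start the Healthcare NLP session with an explicit driver memory setting
spark = sparknlp_jsl.start(
    SECRET,  # your Spark NLP for Healthcare license secret
    params={"spark.driver.memory": "32g"}
)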
New Blog Posts On Various Topics
Explore the latest developments in healthcare NLP through our new blog posts, where we take a deep dive into the innovative technologies and methodologies transforming the medical field. These posts offer insights into the transformative impact of NLP technologies in healthcare, showcasing how they streamline processes, improve patient outcomes, and pave the way for future innovations in medical research and practice.
- From Data Overload to Precision: How Medical Language Models Enhance Clinical Trials: This blog post explores how John Snow Labs’ Healthcare NLP & LLM library is transforming clinical trials by using advanced NER models to efficiently filter through large datasets of patient records. By automatically extracting cancer-related information from unstructured clinical notes, the solution enables researchers to quickly identify patients with specific cancer indications, accelerating trial enrollment and ensuring more accurate patient selection.
- The Power of Small LLMs in Healthcare: A RAG Framework Alternative to Large Language Models: This blog post explores the potential of smaller, fine-tuned language models like JSL’s Retrieval-Augmented Generation (RAG) models (e.g., jsl_med_rag_v1) to deliver performance comparable to larger LLMs in specialized clinical tasks such as question answering and medical summarization. It provides an in-depth look at these efficient models, demonstrating their high relevance and effectiveness in a RAG framework while challenging the dominance of larger general-purpose models like GPT-4.
Various Core Improvements: Bug Fixes, Enhanced Overall Robustness, and Reliability of Spark NLP for Healthcare
- The AnnotationLab supports the ssl_verification parameter, allowing users to disable SSL certificate verification
- Fixed an entity filtering issue in PipelineTracer, which now supports PretrainedZeroShotNER and MedicalBertForTokenClassification
- Fixed a hanging issue when setting setSelectMostDifferent(True) in the ChunkKeyPhraseExtraction annotator by updating the underlying algorithm
- Deidentification changed the faker generation algorithm and is now more sensitive for “names”
- Deidentification supports new faker labels such as “location_other” and “company”
- Added a new keepMonth parameter to Deidentification
- Added a new end-to-end RelationalDBDeidentification module into utils
Updated Notebooks and Demonstrations to Make Spark NLP for Healthcare Easier to Navigate and Understand
- New Deidentification_Custom_Pretrained_Pipelines Notebook
- New Contextual_Entity_Ruler Notebook
- Updated Contextual_Parser_Rule_Based_NER Notebook
- Updated ZeroShot_Clinical_NER Notebook
- Updated Rule_Based_Entity_Matchers Notebook
- Updated PipelineTracer_and_PipelineOutputParser Notebook
- New NER_GENE_PHENOTYPES Demo
- New NER_VACCANIES Demo
- Updated HCC_RISK_SCORE Demo
We have added and updated a substantial number of new clinical models and pipelines, further solidifying our offering in the healthcare domain.
zeroshot_ner_generic_large
zeroshot_ner_generic_medium
zeroshot_ner_clinical_large
zeroshot_ner_clinical_medium
zeroshot_ner_oncology_large
zeroshot_ner_oncology_medium
zeroshot_ner_deid_generic_docwise_large
zeroshot_ner_deid_generic_docwise_medium
zeroshot_ner_deid_subentity_docwise_large
zeroshot_ner_deid_subentity_docwise_medium
zeroshot_ner_vop_medium
zeroshot_ner_vop_large
sbiobertresolve_loinc_numeric_augmented
loinc_numeric_resolver_pipeline
loinc_resolver_pipeline
biolordresolve_loinc_augmented
account_parser
age_parser
country_matcher
date_matcher
dln_parser
license_parser
medical_record_parser
phone_parser
plate_parser
ssn_parser
vin_parser
clinical_deidentification_docwise_wip
clinical_deidentification_docwise_wip_v2
clinical_deidentification_v2_wip
ner_deid_generic_docwise (de)
zeroshot_ner_deid_generic_docwise_large (de)
clinical_deidentification_docwise_large_wip (de)
clinical_deidentification_docwise_wip (de)
ner_deid_subentity_langtest (de)
drug_action_treatment_mapper
medembed_base_rxnorm_augmented
medmebed_large_rxnorm_augmented
snomed_term_resolver_pipeline
For all Spark NLP for Healthcare models, please check: Models Hub Page