5.3.0
Highlights
We are delighted to announce remarkable enhancements and updates in our latest release of Spark NLP for Healthcare. This release comes with the new 3 rule-based entity recognition/matcher modules to extract named entities with rules and a Flattener module to flatten the pipeline outputs effortlessly, as well as 41
new clinical pretrained models and pipelines including new De-Identification pipelines at various sizes.
- 4 new
Opioid
NER models to extract opioid-related entities from 22 classes and 3 assertion models to detect the status of opioid drug usage and underlying symptoms. Multi-Lingual
NER model forDeidentification
to detect sensitive entities (name
,date
,location
etc.) from multiple languages.- New
Age
classification model to detect age groups from clinical texts without any mention of age. Biomarker
text classification model to detect sentences/ phrases that may contain biomarker-related terms.- New NER model for
SNOMED
term extraction regardless of its type. - New 6
ChunkMapper
models for medical code mapping to map various medical terminologies across each other. - Curated pretrained pipelines to analyze clinical documents for specific clinical tasks and concepts at once.
- Enhanced data exploration with the new
Flattener
annotator to prettify the pipeline outputs in a tabulated format. - Rule-based entity recognition/matcher modules (
TextMatcher
,RegexMatcher
andEntityRuler
) to extract named entities with rules and dictionaries - Deidentification now supports masking and obfuscation at the same time without an additional stage
ChunkMerger
now supports dictionary format for the selective mergingMedicalQuestionAnswering
returns Score in metadata- New speed benchmarks for various pipelines across different platforms under various settings (EMR, Databricks, etc.)
- Various core improvements; bug fixes, enhanced overall robustness and reliability of Spark NLP for Healthcare
- Consistent obfuscation is supported in
StructuredDeidentification
too - Added
deid_source
field to the metadata to infer the source of entity chunks coming from internal or external NER models and stages - Refactoring the Deidentification module for improved functionality
- Flushing the temporary files dumped by the
SentenceEntityResolver
- Fixed
IOBTagger
was returning zero instead of ‘O’ - Lighter jars for Spark NLP leading optimized Spark’s sessions
Resolution2Chunk
documentation updated- Updated the default value of the
customBoundsStrategy
parameter in theInternalDocumentSplitter
- Enhanced
InternalDocumentSplitter
with UUID Metadata Field
- Consistent obfuscation is supported in
- Updated notebooks and demonstrations for making Spark NLP for Healthcare easier to navigate and understand
- The addition and update of numerous new clinical models and pipelines continue to reinforce our offering in the healthcare domain
These enhancements will elevate your experience with Spark NLP for Healthcare, enabling more efficient, accurate, and streamlined analysis of healthcare-related natural language data.
4 New Opioid
NER Model to Extract Opioid-Related Entities from 22 Classes and 3 Assertion Models to Detect the Status of Opioid Drug Usage and Underlying Symptoms.
- NER Model
Model Name | Predicted Entities | Description |
---|---|---|
ner_opioid | communicable_disease , general_symptoms , substance_use_disorder , drug_duration , psychiatric_issue , drug_strength , drug_quantity , other_drug , drug_form , drug_frequency , opioid_drug , drug_route , employment , violence , legal_issue , other_disease , alcohol_use , test , marital_status , test_result , antidote , sexual_orientation |
Detects opioid-related entities within text data |
This ner_opioid
model is designed to detect and label opioid-related entities within text data. Opioids are a class of drugs that include the illegal drug heroin, synthetic opioids such as fentanyl, and pain relievers available legally by prescription. The model has been trained using advanced deep-learning techniques on a diverse range of text sources and can accurately recognize and classify a wide range of opioid-related entities. The model’s accuracy and precision have been carefully validated against expert-labeled data to ensure reliable and consistent results.
Example:
ner_model = MedicalNerModel.pretrained("ner_opioid", "en", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
sample_texts = ["""The patient, unmarried and with a significant history of substance abuse involving the illicit consumption of various opioids such as heroin, fentanyl, and oxycodone, presented with a headache and was diagnosed PTSD. Despite denying the use of alcohol, smoking, or marijuana, the patient, who has been unemployed for several months, required administration of Narcan for suspected opioid overdose. A recent toxicology screen confirmed the presence of opioids, and showed negative results for benzodiazepines, cocaine, amphetamines, barbiturates, and tricyclic substances."""]
Result:
chunk | begin | end | ner_label |
---|---|---|---|
unmarried | 13 | 21 | marital_status |
substance abuse | 57 | 71 | substance_use_disorder |
opioids | 118 | 124 | opioid_drug |
heroin | 134 | 139 | opioid_drug |
fentanyl | 142 | 149 | opioid_drug |
oxycodone | 156 | 164 | opioid_drug |
headache | 184 | 191 | general_symptoms |
PTSD | 211 | 214 | psychiatric_issue |
alcohol | 244 | 250 | alcohol_use |
marijuana | 265 | 273 | other_drug |
unemployed | 302 | 311 | employment |
Narcan | 360 | 365 | antidote |
opioid | 381 | 386 | opioid_drug |
overdose | 388 | 395 | other_disease |
toxicology screen | 407 | 423 | test |
opioids | 451 | 457 | test |
negative | 471 | 478 | test_result |
benzodiazepines | 492 | 506 | test |
cocaine | 509 | 515 | test |
amphetamines | 518 | 529 | test |
barbiturates | 532 | 543 | test |
tricyclic substances | 550 | 569 | test |
- Assertion Models
Model Name | Assertion Status | Description |
---|---|---|
assertion_opioid_wip | present , history , absent , hypothetical , past , family_or_someoneelse |
Detects the assertion status of entities related to opioid |
assertion_opioid_drug_status_wip | opioid_medical_use , opioid_abuse , opioid_overdose , drug_medical_use , drug_abuse , drug_overdose |
Detects the assertion status of drug entities related to opioid (including opioid_drug and other_drug) |
assertion_opioid_general_symptoms_status_wip | underlying_pain , withdrawal_symptom , overdose_symptom |
Detects the assertion status of general symptoms entity related to opioid. |
Example:
assertion = AssertionDLModel.pretrained("assertion_opioid_wip" "en", "clinical/models") \
.setInputCols(["sentence", "ner_chunk", "embeddings"]) \
.setOutputCol("assertion")
sample_texts = [
"""The patient with a history of substance abuse presented with clinical signs indicative of opioid overdose, including constricted pupils, cyanotic lips, drowsiness, and confusion. Immediate assessment and intervention were initiated to address the patient's symptoms and stabilize their condition. Close monitoring for potential complications, such as respiratory depression, was maintained throughout the course of treatment.""",
"""The patient presented to the rehabilitation facility with a documented history of opioid abuse, primarily stemming from misuse of prescription percocet pills intended for their partner's use. Initial assessment revealed withdrawal symptoms consistent with opioid dependency."""]
Result:
chunk | begin | end | ner_label | assertion | confidence |
---|---|---|---|---|---|
substance abuse | 30 | 44 | substance_use_disorder | history | 0.9644 |
opioid | 90 | 95 | opioid_drug | hypothetical | 0.7974 |
overdose | 97 | 104 | other_disease | hypothetical | 0.9961 |
constricted pupils | 117 | 134 | general_symptoms | past | 0.732 |
cyanotic lips | 137 | 149 | general_symptoms | past | 0.8501 |
drowsiness | 152 | 161 | general_symptoms | past | 0.9469 |
confusion | 168 | 176 | general_symptoms | past | 0.9686 |
respiratory depression | 351 | 372 | other_disease | hypothetical | 0.5921 |
opioid | 82 | 87 | opioid_drug | history | 0.735 |
percocet | 143 | 150 | opioid_drug | present | 0.905 |
pills | 152 | 156 | drug_form | present | 0.9363 |
withdrawal | 220 | 229 | general_symptoms | present | 0.9929 |
opioid | 256 | 261 | opioid_drug | present | 0.9348 |
Please check Opioid Notebook for more information
Multi-Lingual
NER Model for Deidentification
to Detect Sensitive Entities (name
, date
, location
etc.) from Multiple Languages.
Introducing our latest innovation: a Multilingual Named Entity Recognition (NER) model designed for deidentification purposes. This new model is capable of annotating text in English, German, French, Italian, Spanish, Portuguese, and Romanian. It excels at detecting sensitive entities such as AGE, CONTACT, DATE, ID, LOCATION, NAME, and PROFESSION. Using this model, data protection can be achieved in multiple languages and domains.
Example:
embeddings = XlmRoBertaEmbeddings.pretrained("xlm_roberta_base", "xx") \
.setInputCols("sentence", "token") \
.setOutputCol("embeddings")\
.setMaxSentenceLength(512)
ner = MedicalNerModel.pretrained("ner_deid_multilingual", "xx", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
text_list = [
"""Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 .""",
"""J'ai vu en consultation Michel Martinez (49 ans) adressé au Centre Hospitalier De Plaisir pour un diabète mal contrôlé avec des symptômes datant de Mars 2015.""",
"""Michael Berger wird ins St. Elisabeth-Krankenhaus in Bad Kissingen eingeliefert. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.""",
"""Ho visto Gastone Montanariello (49 anni) riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015.""",
"""Antonio Miguel Martínez, un varón de 35 años de edad, de profesión auxiliar de enfermería y nacido en Cadiz, España.""",
"""Detalhes do paciente. Nome do paciente: Pedro Gonçalves. Endereço: Rua Das Flores 23. Cidade/ Província: Porto.""",
"""Spitalul Pentru Ochi de Deal, Drumul Oprea Nr. 972 Vaslui, 737405 România"""
]
Result:
doc_id | begin | end | chunk | ner_label |
---|---|---|---|---|
1 | 14 | 23 | 2093-01-13 | DATE |
1 | 26 | 35 | David Hale | NAME |
1 | 51 | 61 | Hendrickson | NAME |
1 | 74 | 80 | 7194334 | ID |
1 | 89 | 96 | 01/13/93 | DATE |
2 | 24 | 38 | Michel Martinez | NAME |
2 | 41 | 46 | 49 ans | AGE |
2 | 60 | 88 | Centre Hospitalier De Plaisir | LOCATION |
2 | 148 | 156 | Mars 2015 | DATE |
3 | 0 | 13 | Michael Berger | NAME |
3 | 53 | 65 | Bad Kissingen | LOCATION |
3 | 86 | 91 | Berger | NAME |
3 | 128 | 129 | 76 | AGE |
4 | 9 | 29 | Gastone Montanariello | NAME |
4 | 32 | 33 | 49 | AGE |
4 | 55 | 74 | Ospedale San Camillo | LOCATION |
4 | 128 | 137 | marzo 2015 | DATE |
5 | 0 | 22 | Antonio Miguel Martínez | NAME |
5 | 37 | 38 | 35 | AGE |
5 | 67 | 88 | auxiliar de enfermería | PROFESSION |
5 | 102 | 106 | Cadiz | LOCATION |
5 | 109 | 114 | España | LOCATION |
6 | 41 | 55 | Pedro Gonçalves | NAME |
6 | 68 | 71 | Rua Das Flores | NAME |
6 | 106 | 110 | Porto | LOCATION |
7 | 0 | 27 | Spitalul Pentru Ochi de Deal | LOCATION |
7 | 30 | 44 | Drumul Oprea Nr | LOCATION |
7 | 47 | 49 | 972 | LOCATION |
7 | 51 | 56 | Vaslui | LOCATION |
7 | 59 | 64 | 737405 | LOCATION |
7 | 66 | 72 | România | LOCATION |
Please see the model card ner_deid_multilingual for more information about the model
New Age
Classification Model to Detect Age Groups from Clinical Texts without any Mention of Age
Introducing a new age classification model is a sophisticated text classification tool tailored to identify and categorize text according to different age groups. This model distinguishes among Old Adult
, Adult
, Child and Teen
, and Other/Unknown
contexts, providing valuable insights into textual data that references age-specific scenarios or concerns.
Example:
generic_classifier = GenericClassifierModel.pretrained("genericclassifier_age_e5", "en", "clinical/models")\
.setInputCols(["features"])\
.setOutputCol("prediction")
sample_texts = [
"""The patient presents with conditions often associated with the stresses and lifestyle of early career and possibly higher education stages, including sleep irregularities and repetitive stress injuries. There's a notable emphasis on preventative care, with discussions around lifestyle choices that can impact long-term health, such as smoking cessation, regular exercise, and balanced nutrition. The patient is also counseled on mental health, particularly in managing stress and anxiety that may arise from personal and professional responsibilities and ambitions at this stage of life.""",
"""The senior patient presents with age-related issues such as reduced hearing and vision, arthritis, and memory lapses. Emphasis is on managing chronic conditions, maintaining social engagement, and adapting lifestyle to changing physical abilities. Discussions include medication management, dietary adjustments to suit older digestion, and the importance of regular, low-impact exercise.""",
"""The late teenage patient is dealing with final growth spurts, the stress of impending adulthood, and decisions about higher education or career paths. Health discussions include maintaining a balanced diet, the importance of regular sleep patterns, and managing academic and social pressures. Mental health support is considered crucial at this stage, with a focus on building resilience and coping mechanisms.""",
"""The patient, faces adjustments to a new lifestyle with changes in daily routines and social interactions. Health concerns include managing the transition from an active work life to more leisure time, which may impact physical and mental health. Preventative health measures are emphasized, along with the importance of staying mentally and physically active and engaged in the community."""
]
Result:
text | result |
---|---|
The patient presents with conditions often associated with the stresses and lifestyle of early ca… | Adult |
The senior patient presents with age-related issues such as reduced hearing and vision, arthritis… | Old Adult |
The late teenage patient is dealing with final growth spurts, the stress of impending adulthood, … | Child and Teen |
The patient, faces adjustments to a new lifestyle with changes in daily routines and social inter… | Other/Unknown |
Please check: genericclassifier_age_e5
Biomarker
Text Classification Model to Detect Sentences/Phrases that may Contain Biomarker-related Terms
We are thrilled to introduce our latest advancement: a cutting-edge text classification model specifically tailored for biomarkers. This state-of-the-art model is designed to analyze clinical sentences and accurately determine whether they contain terms associated with biomarkers.
1
: Contains biomarker related terms.0
: Doesn’t contain biomarker related terms.
Example:
sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_biomarker", "en", "clinical/models")\
.setInputCols(["sentence",'token'])\
.setOutputCol("prediction")
sample_texts = [
"""In the realm of cancer research, several biomarkers have emerged as crucial indicators of disease progression and treatment response. For instance, the expression levels of HER2/neu, a protein receptor, have been linked to aggressive forms of breast cancer. Additionally, the presence of prostate-specific antigen (PSA) is often monitored to track the progression of prostate cancer. Moreover, in cardiovascular health, high-sensitivity C-reactive protein (hs-CRP) serves as a biomarker for inflammation and potential risk of heart disease. Meanwhile, elevated levels of troponin T are indicative of myocardial damage, commonly observed in acute coronary syndrome. In the field of diabetes management, glycated hemoglobin is a widely used to assess long-term blood sugar control. Its levels reflect the average blood glucose concentration over the past two to three months, offering valuable insights into disease management strategies."""
]
Result:
sentence | prediction |
---|---|
In the realm of cancer research, several biomarkers have emerged as crucial indicators of disease progression and treatment response. | 0 |
For instance, the expression levels of HER2/neu, a protein receptor, have been linked to aggressive forms of breast cancer. | 1 |
Additionally, the presence of prostate-specific antigen (PSA) is often monitored to track the progression of prostate cancer. | 1 |
Moreover, in cardiovascular health, high-sensitivity C-reactive protein (hs-CRP) serves as a biomarker for inflammation and potential risk of heart disease. | 1 |
Meanwhile, elevated levels of troponin T are indicative of myocardial damage, commonly observed in acute coronary syndrome. | 0 |
In the field of diabetes management, glycated hemoglobin is a widely used to assess long-term blood sugar control. | 0 |
Its levels reflect the average blood glucose concentration over the past two to three months, offering valuable insights into disease management strategies. | 0 |
Please check: bert_sequence_classifier_biomarker
New NER Model for SNOMED Term Extraction Regardless of its Type
We are excited to introduce our latest Name Entity Recognition (NER) model, designed specifically to extract SNOMED terms from clinical text. This cutting-edge model offers enhanced accuracy and efficiency in identifying and categorizing SNOMED concepts within medical documents, aiding in comprehensive data analysis and clinical decision-making processes. With its advanced capabilities, this NER model promises to revolutionize the way healthcare professionals extract valuable insights from clinical narratives.
Example:
embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols("sentence", "token")\
.setOutputCol("embeddings")
ner = MedicalNerModel.pretrained("ner_snomed_term", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
text_list = ["""The patient was diagnosed with acute appendicitis and scheduled for immediate surgery.""",
"""Due to experiencing chronic pain, the patient was referred to a fibromyalgia specialist for further evaluation.""",
"""His hypertension is currently managed with a combination of lifestyle modifications and medication.""",
"""The child was brought in with symptoms of acute otitis, including ear pain and fever.""",
"""Laboratory tests indicate the individual has hyperthyroidism, requiring further endocrinological assessment.""",
"""The radiograph showed evidence of a distal radius fracture from a recent fall."""]
Result:
doc_id | begin | end | chunk | ner_label |
---|---|---|---|---|
1 | 31 | 48 | acute appendicitis | snomed_term |
2 | 20 | 31 | chronic pain | snomed_term |
2 | 63 | 74 | fibromyalgia | snomed_term |
3 | 4 | 15 | hypertension | snomed_term |
4 | 48 | 53 | otitis | snomed_term |
4 | 65 | 72 | ear pain | snomed_term |
5 | 45 | 59 | hyperthyroidism | snomed_term |
6 | 4 | 13 | radiograph | snomed_term |
6 | 43 | 57 | radius fracture | snomed_term |
Please check: ner_snomed_term
New 6 ChunkMapper Models for Medical Code Mapping to Map Various Medical Terminologies Across Each Other
Introducing a suite of new ChunkMapper models designed to streamline medical code mapping tasks. These models include mappings between UMLS, LOINC, CPT, and SNOMED codes, offering a comprehensive solution for interoperability within medical systems.
Model Name | Description |
---|---|
umls_loinc_mapper |
Maps UMLS codes to corresponding LOINC codes. |
umls_cpt_mapper |
Maps UMLS codes to corresponding CPT codes. |
umls_snomed_mapper |
Maps UMLS codes to corresponding SNOMED codes. |
snomed_umls_mapper |
Maps SNOMED codes to corresponding UMLS codes. |
cpt_umls_mapper |
Maps CPT codes to corresponding UMLS codes. |
loinc_umls_mapper |
Maps LOINC codes to corresponding UMLS codes. |
Example:
chunkerMapper = ChunkMapperModel.pretrained("umls_loinc_mapper", "en", "clinical/models")\
.setInputCols(["umls_code"])\
.setOutputCol("mappings")\
.setRels(["loinc_code"])
text = "acebutolol"
Result:
chunk | UMLS | LOINC | relation |
---|---|---|---|
acebutolol | C0000946 | LP16015-7 | loinc_code |
Curated Pretrained Pipelines to Analyse Clinical Documents for Specific Clinical Tasks and Concepts at Once
We introduce a suite of advanced, hybrid pretrained pipelines, specifically designed to streamline the process of analyzing clinical documents. These pipelines are built upon multiple state-of-the-art (SOTA) pretrained models, delivering a comprehensive solution for extracting vital information with unprecedented ease.
What sets this release apart is the elimination of complexities typically involved in building and chaining models. Users no longer need to navigate the intricacies of constructing intricate pipelines from scratch or the uncertainty of selecting the most effective model combinations. Our new pretrained pipelines simplify these processes, offering a seamless, user-friendly experience.
Pipeline Name | Description |
---|---|
hcpcs_resolver_pipeline |
This pipeline extracts PROCEDURE entities and maps them to their corresponding Healthcare Common Procedure Coding System (HCPCS) codes. |
hgnc_resolver_pipeline |
This pipeline extracts GENE entities and maps them to their corresponding HUGO Gene Nomenclature Committee (HGNC) codes. |
icd10cm_generalised_resolver_pipeline |
This pipeline extracts the following entities and maps them to their ICD-10-CM codes. It predicts ICD-10-CM codes up to 3 characters (according to ICD-10-CM code structure the first three characters represent the general type of injury or disease). |
loinc_numeric_resolver_pipeline |
This pipeline extracts TEST entities and maps them to their corresponding Logical Observation Identifiers Names and Codes(LOINC) codes. It is trained with the numeric LOINC codes, without the inclusion of LOINC “Document Ontology” codes starting with the letter “L”. It also provides the official resolution of the codes within the brackets. |
snomed_procedures_measurements_resolver_pipeline |
This pipeline extracts Procedure and measurement (Test ) entities and maps them to their corresponding SNOMED codes. |
ncit_resolver_pipeline |
This advanced pipeline extracts oncological entities from clinical texts to map these entities to their corresponding National Cancer Institute Thesaurus (NCIt) codes. |
rxcui_resolver_pipeline |
This advanced pipeline extracts medication entities from clinical texts to map these entities to their corresponding RxNorm Concept Unique Identifier (RxCUI) codes. |
icd10pcs_resolver_pipeline |
This pipeline extracts Procedure entities from clinical texts and map them to their corresponding ICD-10-PCS codes. |
icdo_resolver_pipeline |
This pipeline extracts oncological entities from clinical texts and maps them to their corresponding ICD-O codes. |
loinc_resolver_pipeline |
This pipeline extracts Test entities from clinical texts and maps them to their corresponding Logical Observation Identifiers Names and Codes (LOINC) codes. |
clinical_deidentification_obfuscation_medium |
This pipeline can be used to detect the PHI information from medical texts and obfuscate (replace them with fake ones) in the resulting text. |
clinical_deidentification_obfuscation_small |
This pipeline can be used to detect the PHI information from medical texts and obfuscate (replace them with fake ones) in the resulting text. |
Enhanced Data Exploration with the New Flattener Annotator to Prettify the Pipeline Outputs in a Tabulated Format
Introducing the latest addition to our annotation toolkit: the Flattener Annotator. This powerful tool facilitates data exploration by returning exploded columns for each specified field containing annotation data. With customizable settings, users can select fields of interest and effortlessly explode their content for deeper analysis.
Example:
ner_converter = NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk") \
.setWhiteList(["SYMPTOM","VS_FINDING","DISEASE_SYNDROME_DISORDER","ADMISSION_DISCHARGE","PROCEDURE"])
clinical_assertion = AssertionDLModel.pretrained("assertion_jsl_augmented", "en", "clinical/models") \
.setInputCols(["sentence", "ner_chunk", "embeddings"]) \
.setOutputCol("assertion") \
# returns exploded columns for each specified field containing annotation data.
flattener = Flattener()\
.setInputCols("ner_chunk", "assertion") \
.setExplodeSelectedFields({"ner_chunk": ["result as ner_chunks",
"begin as begins",
"end as ends",
"metadata.entity as entity"],
"assertion":["result as assertions",
"metadata.confidence as confidence"]
})
text = """
GENERAL: He is an elderly gentleman in no acute distress. He is sitting up in bed eating his breakfast. He is alert and oriented and answering questions appropriately.
HEENT: Sclerae showed mild arcus senilis in the right. Left was clear. Pupils are equally round and reactive to light. Extraocular movements are intact. Oropharynx is clear.
NECK: Supple. Trachea is midline. No jugular venous pressure distention is noted. No adenopathy in the cervical, supraclavicular, or axillary areas.
ABDOMEN: Soft and not tender. There may be some fullness in the left upper quadrant, although I do not appreciate a true spleen with inspiration.
EXTREMITIES: There is some edema, but no cyanosis and clubbing .
"""
Result:
ner_chunks | begins | ends | entity | assertions | confidence |
---|---|---|---|---|---|
distress | 49 | 56 | SYMPTOM | Absent | 0.9999 |
arcus senilis | 196 | 208 | DISEASE_SYNDROME_DISORDER | Past | 1.0 |
jugular venous pressure distention | 380 | 413 | SYMPTOM | Absent | 1.0 |
adenopathy | 428 | 437 | SYMPTOM | Absent | 1.0 |
tender | 514 | 519 | SYMPTOM | Absent | 1.0 |
fullness | 540 | 547 | SYMPTOM | Possible | 1.0 |
edema | 665 | 669 | SYMPTOM | Present | 1.0 |
cyanosis | 679 | 686 | VS_FINDING | Absent | 1.0 |
clubbing | 692 | 699 | SYMPTOM | Absent | 1.0 |
Please check Flattener Notebook for more information.
Rule-based Entity Recognition/matcher Modules (TextMatcher
, RegexMatcher
and EntityRuler
) to Extract Mamed Entities with Rules and Dictionaries
- Efficient Regex Matching with the new
RegexMatcherInternal
annotator
The RegexMatcherInternal
class implements an internal annotator approach to match a set of regular expressions with a provided entity. This approach is utilized for associating specific patterns within text data with predetermined entities, such as dates, mentioned within the text.
The class allows users to define rules using regular expressions paired with entities, offering flexibility in customization. These rules can either be directly set using the setRules
method, with a specified delimiter, or loaded from an external file using the setExternalRules
method.
Example:
rules = '''
(\d{1,3}\.){3}\d{1,3}~IPADDR
\d{4}-\d{2}-\d{2}|\d{2}/\d{2}/\d{2}|\d{2}/\d{2}/\d{2}~DATE
'''
with open('./rules/regex_rules.txt', 'w') as f:
f.write(rules)
regex_matcher_internal = RegexMatcherInternal()\
.setInputCols('document')\
.setStrategy("MATCH_ALL")\
.setOutputCol("regex_matches")\
.setExternalRules(path='./rules/regex_rules.txt',
delimiter='~')
text = """Name : Hendrickson, Ora, Record date: 2093-01-13, MR #719435.
Dr. John Green, ID: 1231511863, IP 203.120.223.13
He is a 60-year-old male was admitted to the Day Hospital for cystectomy on 01/13/93
Patient's VIN : 1HGBH41JXMN109286, SSN #333-44-6666, Driver's license no: A334455B.
Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: smith@gmail.com."""
Result:
regex_result | begin | end | ner_label |
---|---|---|---|
2093-01-13 | 38 | 47 | DATE |
203.120.223.13 | 97 | 110 | IPADDR |
01/13/93 | 188 | 195 | DATE |
Please check Rule Based Entity Matchers Notebook for more information.
- Efficient Text Matching with the new
TextMatcherInternal
annotator
The TextMatcherInternal
annotator provides a robust solution for matching exact phrases against a given document. Users can specify phrases of interest in a source file, where each phrase is paired with its corresponding label or entity, separated by a delimiter
. The annotator allows for fine-tuned control over the matching process through various parameters. Users can choose to enable case sensitivity, merge overlapping matched chunks, and customize entity metadata fields. Additionally, options like setting the delimiter and specifying whether the matcher should operate on chunks or tokens offer further flexibility.
Text Matcher Pretrained Models:
model | entities |
---|---|
drug_matcher | DRUG |
biomarker_matcher | Biomarker |
Example:
matcher_drug = """
Aspirin 100mg#Drug
aspirin#Drug
paracetamol#Drug
amoxicillin#Drug
ibuprofen#Drug
lansoprazole#Drug
"""
with open ('matcher_drug.csv', 'w') as f:
f.write(matcher_drug)
entityExtractor = TextMatcherInternal()\
.setInputCols(["document", "token"])\
.setEntities("matcher_drug.csv")\
.setOutputCol("matched_text")\
.setCaseSensitive(False)\
.setDelimiter("#")\
.setMergeOverlapping(False)
text = """John's doctor prescribed aspirin 100mg for his heart condition, along with paracetamol for his fever, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD."""
Result:
chunk | begin | end | label |
---|---|---|---|
aspirin | 25 | 31 | Drug |
aspirin 100mg | 25 | 37 | Drug |
paracetamol | 75 | 85 | Drug |
amoxicillin | 102 | 112 | Drug |
ibuprofen | 134 | 142 | Drug |
lansoprazole | 170 | 181 | Drug |
Please check Rule Based Entity Matchers Notebook for more information.
- Efficient Text Matching with the new
EntityRulerInternal
annotator
The EntityRulerInternal
is a versatile annotator designed to match exact strings or regex patterns against a given document, assigning them named entities as specified. This powerful tool allows users to define custom rules in a file, accommodating any number of named entities. By leveraging this annotator, users can efficiently identify and classify specific text patterns within documents, enhancing the accuracy and efficiency of named entity recognition tasks.
Example:
data = [
{
"id": "drug-words",
"label": "Drug",
"patterns": ["paracetamol", "aspirin", "ibuprofen", "lansoprazol"]
},
{
"id": "disease-words",
"label": "Disease",
"patterns": ["heart condition","tonsilitis","GORD"]
},
{
"id": "symptom-words",
"label": "Symptom",
"patterns": ["fever","headache"]
},
]
with open("entities.json", "w") as f:
json.dump(data, f)
entityRuler = EntityRulerInternalApproach()\
.setInputCols(["document", "token"])\
.setOutputCol("entities")\
.setPatternsResource("entities.json")\
.setCaseSensitive(False)\
text = """John's doctor prescribed aspirin for his heart condition, along with paracetamol for his fever and headache, amoxicillin for his tonsilitis, ibuprofen for his inflammation, and lansoprazole for his GORD on 2023-12-01."""
Result:
chunk | begin | end | label |
---|---|---|---|
aspirin | 25 | 31 | Drug |
heart condition | 41 | 55 | Disease |
paracetamol | 69 | 79 | Drug |
fever | 89 | 93 | Symptom |
headache | 99 | 106 | Symptom |
tonsilitis | 129 | 138 | Disease |
ibuprofen | 141 | 149 | Drug |
lansoprazol | 177 | 187 | Drug |
GORD | 198 | 201 | Disease |
Please check Rule Based Entity Matchers Notebook for more information.
Deidentification now Supports Masking and Obfuscation at the Same Time without an Additional Stage
Explore how metadata masking is implemented in deidentification processes through the setMetadataMaskingPolicy function. This feature enables users to add a mask option to DEID within the metadata, providing enhanced data privacy measures. Options such as entity_labels
, same_length_chars
, and fixed_length_chars
offer flexibility in choosing the desired masking policy.
Example:
deidentification = DeIdentification()\
.setInputCols(["sentence", "token", "ner_chunk"]) \
.setOutputCol("deid") \
.setMode("obfuscate") \
.setObfuscateDate(True) \
.setObfuscateRefSource('faker') \
.setMetadataMaskingPolicy("entity_labels") # Options : 'entity_labels', 'same_length_chars', 'fixed_length_chars'
text = """Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 ."""
Result:
sentence | deidentified | masked | |
---|---|---|---|
0 | Record date : 2093-01-13 , David Hale , M.D . | Record date : 2093-02-12 , Docia Chuck , M.D . | Record date : <DATE> , <DOCTOR> , M.D . |
1 | , Name : Hendrickson , Ora MR # 7194334 Date : 01/13/93 . | , Name : Marisue Humble MR # 7185162 Date : 02/12/93 . | , Name : <PATIENT> MR # <MEDICALRECORD> Date : <DATE> . |
2 | Patient : Oliveira, 25 years-old , Record date : 2079-11-09 . | Patient : Consuella Lose, 35 years-old , Record date : 2079-12-09 . | Patient : <PATIENT>, <AGE> years-old , Record date : <DATE> . |
3 | Cocke County Baptist Hospital . | SHELBY REGIONAL MEDICAL CENTER . | <HOSPITAL> . |
4 | 0295 Keats Street | 600 North Sioux Point Road | <STREET> |
Please check Clinical DeIdentification Notebook for more information.
ChunkMerger
now Supports Dictionary Format for the Selective Merging
ChunkMergeModel and ChunkMergeApproach now offer support for the Dictionary format, enhancing its selectiveness and flexibility in chunk merging operations.
The ChunkMergeModel
includes setReplaceDict
for replacing entity labels and setFalsePositives
for enabling precise control over chunk merging outcomes.
Additionally, the ChunkMergeApproach
has setEntitiesConfidence
, allowing users to adjust entity confidence levels for further customization.
These enhancements empower users to optimize their data processing pipelines, improving accuracy and efficiency in chunk merging tasks.
Example:
chunk_merge_model = ChunkMergeModel() \
.setInputCols("clinical_ner_chunk","deid_ner_chunk") \
.setOutputCol("merged_chunk") \
.setReplaceDict({"DOCTOR": "NAME",
"PATIENT": "NAME"}) \
.setFalsePositives([["metformin", "TREATMENT", "DRUG"],
["glipizide","TREATMENT",""]])
text ='''
Jennifer is 58 years old. She was seen by Dr. John Green and discharged on metformin, glipizide for T2DM and atorvastatin and gemfibrozil for HTG.
'''
Result for without rules:
chunk | begin | end | entity | confidence |
---|---|---|---|---|
Jennifer | 1 | 8 | PATIENT | 0.9993 |
58 | 13 | 14 | AGE | 1.0 |
John Green | 48 | 57 | DOCTOR | 0.7381 |
metformin | 77 | 85 | TREATMENT | 0.9999 |
glipizide | 88 | 96 | TREATMENT | 1.0 |
T2DM | 102 | 105 | PROBLEM | 0.9988 |
atorvastatin | 111 | 122 | TREATMENT | 0.9999 |
gemfibrozil | 128 | 138 | TREATMENT | 1.0 |
HTG | 144 | 146 | PROBLEM | 0.9991 |
Result for with rules:
In the example, “metformin” has been classified as a DRUG entity, while “glipizide” has been removed due to the setFalsePositives rules. Additionally, “Jennifer” and “John Green” chunks have been labeled as NAME according to the setReplaceDict rules.
chunk | begin | end | entity | confidence |
---|---|---|---|---|
Jennifer | 1 | 8 | NAME | 0.9993 |
58 | 13 | 14 | AGE | 1.0 |
John Green | 48 | 57 | NAME | 0.7381 |
metformin | 77 | 85 | DRUG | 0.9999 |
T2DM | 102 | 105 | PROBLEM | 0.9988 |
atorvastatin | 111 | 122 | TREATMENT | 0.9999 |
gemfibrozil | 128 | 138 | TREATMENT | 1.0 |
HTG | 144 | 146 | PROBLEM | 0.9991 |
MedicalQuestionAnswering
Returns Score in Metadata
Our Medical Question Answering system now includes a significant enhancement: the ability to return a score in metadata. This update provides users with valuable additional information, allowing them to gauge the relevance of the provided answers.
Example:
med_qa = MedicalQuestionAnswering().pretrained("clinical_notes_qa_base_onnx", "en", "clinical/models")\
.setInputCols(["document_question", "document_context"])\
.setCustomPrompt("Context: {context} \n Question: {question} \n Answer: ")\
.setOutputCol("answer")\
context = '''
Patient with a past medical history of hypertension for 15 years.
(Medical Transcription Sample Report)\nHISTORY OF PRESENT ILLNESS:
The patient is a 74-year-old white woman who has a past medical history of hypertension for 15 years, history of CVA with no residual hemiparesis and uterine cancer with pulmonary metastases, who presented for evaluation of recent worsening of the hypertension. According to the patient, she had stable blood pressure for the past 12-15 years on 10 mg of lisinopril.
'''
question = "What is the primary issue reported by patient?"
Result:
Question | Answer | metadata |
---|---|---|
What is the primary issue reported by patient? | The primary issue reported by the patient is hypertension. | {score -> 0.97722054} |
New Speed Benchmarks for Various Pipelines Across Different Platforms Under Various Settings (EMR, Databricks etc.)
- Performance Evaluation of AWS EMR Cluster for Clinical Text Analysis
This study presents a benchmark assessment of an AWS EMR (Elastic MapReduce) cluster for analyzing clinical texts. The evaluation aims to assess the performance and scalability of the AWS EMR cluster configuration for clinical text analysis tasks.
Dataset: 340 Custom Clinical Texts, approx. 235 tokens per text Versions: - EMR Version: ERM.6.15.0 - spark-nlp Version: v5.2.2 - spark-nlp-jsl Version : v5.2.1 - Spark Version : v3.4.1 Instance Type: - Primary: m4.4xlarge, 16 vCore, 64 GiB memory - Worker : m4.4xlarge, 16 vCore, 64 GiB memory
ner_pipeline = Pipeline(stages = [
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner_jsl,
ner_jsl_converter])
resolver_pipeline = Pipeline(stages = [
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner_jsl,
ner_jsl_converter,
chunk2doc,
sbert_embeddings,
snomed_resolver])
Results Table
partition | NER Timing | NER and Resolver Timing |
---|---|---|
4 | 24.7 seconds | 1 minutes 8.5 seconds |
8 | 23.6 seconds | 1 minutes 7.4 seconds |
16 | 22.6 seconds | 1 minutes 6.9 seconds |
32 | 23.2 seconds | 1 minutes 5.7 seconds |
64 | 22.8 seconds | 1 minutes 6.7 seconds |
128 | 23.7 seconds | 1 minutes 7.4 seconds |
256 | 23.9 seconds | 1 minutes 6.1 seconds |
512 | 23.8 seconds | 1 minutes 8.4 seconds |
1024 | 25.9 seconds | 1 minutes 10.2 seconds |
- Performance Evaluation of ONNX and Base Embeddings in Resolver Benchmark
This study presents a benchmark evaluation of resolver performance using ONNX and base embeddings on clinical text datasets. The evaluation aims to assess the performance and efficiency of the resolver component under these different embedding configurations.
Dataset: 100 Custom Clinical Texts, approx. 595 tokens per text Versions: - spark-nlp Version: v5.2.2 - spark-nlp-jsl Version : v5.2.1 - Spark Version : v3.2.1 Instance Type: - 8 CPU Cores 52GiB RAM (Colab Pro - High RAM)
nlp_pipeline = Pipeline(
stages = [
document_assembler,
sentenceDetectorDL,
tokenizer,
word_embeddings,
clinical_ner,
ner_converter,
])
embedding_pipeline = PipelineModel(
stages = [
c2doc,
sbiobert_embeddings # base or onnx version
])
resolver_pipeline = PipelineModel(
stages = [
rxnorm_resolver
])
Results Table
partition | preprocessing | embeddings | resolver | onnx_embeddings | resolver_with_onnx_embeddings |
---|---|---|---|---|---|
4 | 25 sec | 25 sec | 7 min 46 sec | 9 sec | 8 min 29 sec |
8 | 21 sec | 25 sec | 5 min 12 sec | 9 sec | 4 min 53 sec |
16 | 21 sec | 25 sec | 4 min 41 sec | 9 sec | 4 min 30 sec |
32 | 20 sec | 24 sec | 5 min 4 sec | 9 sec | 4 min 34 sec |
64 | 21 sec | 24 sec | 4 min 44 sec | 9 sec | 5 min 2 sec |
128 | 20 sec | 25 sec | 5 min 4 sec | 10 sec | 4 min 51 sec |
256 | 22 sec | 26 sec | 4 min 34 sec | 10 sec | 5 min 13 sec |
512 | 24 sec | 27 sec | 4 min 46 sec | 12 sec | 4 min 22 sec |
1024 | 29 sec | 30 sec | 4 min 24 sec | 14 sec | 4 min 29 sec |
Please check Speed Benchmarks for more information.
Various Core Improvements: Bug Fixes, Enhanced Overall Robustness, And Reliability Of Spark NLP For Healthcare
- Consistent obfuscation is supported in
StructuredDeidentification
too - Added
deid_source
field to the metadata to infer the source of entity chunks coming from internal or external NER models and stages - Refactoring the Deidentification module for improved functionality
- Flushing the temporary files dumped by the
SentenceEntityResolver
- Fixed
IOBTagger
was returning zero instead of ‘O’ - Lighter jars for Spark NLP leading optimized Spark’s sessions
Resolution2Chunk
documentation updated- Updated the default value of the
customBoundsStrategy
parameter in theInternalDocumentSplitter
- Enhanced
InternalDocumentSplitter
with UUID Metadata Field
Updated Notebooks And Demonstrations For making Spark NLP For Healthcare Easier To Navigate And Understand
- New Flattener Notebook
- New Rule Based Entity Matchers Notebook
- New Opioid Notebook
- Updated Clinical Deidentification Improvement Notebook for latest models
- New Opioid Demo
- New Biomarker Classification Demo
- New SNOMED Term NER Demo
- New Multi Language NER Demo
- Updated Age Classification Demo with new
genericclassifier_age_e5
model
We Have Added And Updated A Substantial Number Of New Clinical Models And Pipelines, Further Solidifying Our Offering In The Healthcare Domain.
drug_matcher
biomarker_matcher
umls_loinc_mapper
umls_cpt_mapper
umls_snomed_mapper
snomed_umls_mapper
cpt_umls_mapper
loinc_umls_mapper
ner_opioid
ner_snomed_term
ner_deid_multilingual
ner_deid_name_multilingual_clinical
assertion_opioid_wip
assertion_opioid_drug_status_wip
assertion_opioid_general_symptoms_status_wip
genericclassifier_age_e5
bert_sequence_classifier_biomarker
ncit_resolver_pipeline
rxcui_resolver_pipeline
hgnc_resolver_pipeline
hcpcs_resolver_pipeline
snomed_procedures_measurements_resolver_pipeline
icdo_resolver_pipeline
icd10pcs_resolver_pipeline
icd10cm_generalised_resolver_pipeline
loinc_resolver_pipeline
loinc_numeric_resolver_pipeline
medication_resolver_pipeline
medication_resolver_transform_pipeline
clinical_deidentification_obfuscation_medium
clinical_deidentification_obfuscation_small
clinical_deidentification
clinical_deidentification_generic
clinical_deidentification_subentity
sbiobertresolve_snomed_bodyStructure
sbiobertresolve_snomed_drug
sbiobertresolve_snomed_conditions
sbiobertresolve_snomed_auxConcepts
sbiobertresolve_snomed_findings
sbiobertresolve_snomed_findings_aux_concepts
sbiobertresolve_snomed_procedures_measurements
For all Spark NLP for Healthcare models, please check: Models Hub Page
Versions
- 5.5.1
- 5.5.0
- 5.4.1
- 5.4.0
- 5.3.3
- 5.3.2
- 5.3.1
- 5.3.0
- 5.2.1
- 5.2.0
- 5.1.4
- 5.1.3
- 5.1.2
- 5.1.1
- 5.1.0
- 5.0.2
- 5.0.1
- 5.0.0
- 4.4.4
- 4.4.3
- 4.4.2
- 4.4.1
- 4.4.0
- 4.3.2
- 4.3.1
- 4.3.0
- 4.2.8
- 4.2.4
- 4.2.3
- 4.2.2
- 4.2.1
- 4.2.0
- 4.1.0
- 4.0.2
- 4.0.0
- 3.5.3
- 3.5.2
- 3.5.1
- 3.5.0
- 3.4.2
- 3.4.1
- 3.4.0
- 3.3.4
- 3.3.2
- 3.3.1
- 3.3.0
- 3.2.3
- 3.2.2
- 3.2.1
- 3.2.0
- 3.1.3
- 3.1.2
- 3.1.1
- 3.1.0
- 3.0.3
- 3.0.2
- 3.0.1
- 3.0.0
- 2.7.6
- 2.7.5
- 2.7.4
- 2.7.3
- 2.7.2
- 2.7.1
- 2.7.0
- 2.6.2
- 2.6.0
- 2.5.5
- 2.5.3
- 2.5.2
- 2.5.0
- 2.4.6
- 2.4.5
- 2.4.2
- 2.4.1
- 2.4.0