Spark NLP for Healthcare Release Notes

 

5.3.2

Highlights

We are delighted to announce remarkable enhancements and updates in our latest release of Spark NLP for Healthcare. This release comes with a brand new Few-Shot Assertion Classifier Model, a few new annotators, the first BioLord-based entity resolver, a Pipeline Output Parser module, and 4 new clinical pretrained models and pipelines.

  • Introducing new Few-Shot Assertion Classifier Model for higher accuracy with less data
  • Introducing a new lightweight deidentification annotator, LightDeIdentification, that avoids token input, verbose regex patterns, and unnecessary parameters.
  • BioLord-based sentence entity resolver model for RxNorm terminologies
  • Explore the new annotator MultiChunk2Doc, which merges chunks into a single document to concatenate named entities into one piece.
  • Improved version of Social Determinants of Health (SDoH) named entity recognition model.
  • New Profiling Pipeline for De-identification to run multiple deidentification models at once in a single line.
  • directionSensitive and filterByTokenDistance parameters added to the RelationExtraction annotator to allow further customization and reduce false positives (FPs).
  • Efficiency analysis and cost evaluation of deidentification pipelines on cloud platforms.
  • Output parser to prettify Pretrained Pipeline outputs into structured JSONs/dictionaries.
  • Various core improvements: bug fixes and enhanced overall robustness and reliability of Spark NLP for Healthcare
    • Fixed issue with WindowedSentenceModel not being imported correctly from the sparknlp_jsl.annotator module
    • Addressed inconsistent obfuscation behavior in the deidentification annotator on DataProc, where different fake names were produced despite using the same seed value across runs
    • Solved compatibility issue between MPNetEmbeddings and SentenceEntityResolverModel in Spark NLP for Healthcare, where the resolver fails due to missing token metadata from MPNetEmbeddings.
  • Updated notebooks and demonstrations for making Spark NLP for Healthcare easier to navigate and understand
  • The addition and update of numerous new clinical models and pipelines continue to reinforce our offering in the healthcare domain

These enhancements will elevate your experience with Spark NLP for Healthcare, enabling more efficient, accurate, and streamlined analysis of healthcare-related natural language data.

Introducing new Few-Shot Assertion Classifier Model for Higher Accuracy with Less Data

The Few-Shot Assertion Classifier Model is an annotator inspired by the SetFit framework and designed to reach higher accuracy with fewer training samples. Few-Shot Assertion models consist of a sentence embedding component paired with a classifier (or head). Current support is focused on MPNet-based Few-Shot Assertion models; future updates will extend compatibility to other popular models such as BERT, DistilBERT, and RoBERTa.

This classifier model supports various classifier types, including sklearn's LogisticRegression and custom PyTorch models, providing flexibility for different model setups. Users are required to specify the classifier type when exporting the model to Spark NLP.
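To make the "embedding plus head" structure concrete, here is a minimal, self-contained sketch. It is not the Spark NLP implementation: the bag-of-words `embed` function is a toy stand-in for MPNet sentence embeddings, and `CentroidHead` is a hypothetical nearest-centroid head used in place of LogisticRegression or a PyTorch model. The training sentences are invented for illustration.

```python
import math
from collections import Counter


def embed(sentence, vocab):
    """Toy stand-in for a sentence embedding: an L2-normalised bag-of-words vector."""
    counts = Counter(sentence.lower().split())
    vec = [float(counts[word]) for word in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class CentroidHead:
    """Hypothetical classifier head: keeps one mean embedding per label."""

    def fit(self, vectors, labels):
        grouped = {}
        for vec, label in zip(vectors, labels):
            grouped.setdefault(label, []).append(vec)
        # Average the embeddings of each label column by column.
        self.centroids = {
            label: [sum(col) / len(vecs) for col in zip(*vecs)]
            for label, vecs in grouped.items()
        }
        return self

    def predict(self, vector):
        # Pick the label whose centroid has the largest dot product with the input.
        def score(label):
            return sum(a * b for a, b in zip(vector, self.centroids[label]))

        return max(self.centroids, key=score)


# A handful of labelled examples (invented) plays the role of the few-shot training set.
train = [
    ("no evidence of fever", "absent"),
    ("denies chest pain", "absent"),
    ("patient has fever", "present"),
    ("reports chest pain", "present"),
]
vocab = sorted({word for text, _ in train for word in text.lower().split()})
head = CentroidHead().fit(
    [embed(text, vocab) for text, _ in train],
    [label for _, label in train],
)

prediction = head.predict(embed("no evidence of chest pain", vocab))
print(prediction)
```

The point of the split is that the embedding component carries most of the language knowledge, so the head can be small and trained on few examples.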

As you can see from the following chart, the few-shot assertion classifier can reach comparable performance metrics using only 50% of the training dataset.

[Image: few-shot assertion classifier performance chart]

Example:


# pretrained() is called on the class, so no throwaway instance is created first
few_shot_assertion_classifier = FewShotAssertionClassifierModel.pretrained("assertion_fewshotclassifier", "en", "clinical/models")\
    .setInputCols(["sentence", "ner_chunk"])\
    .setOutputCol("assertion_fewshot")

texts = [
    ["Includes hypertension and chronic obstructive pulmonary disease."],
    ["Her former vascular no arteriovenous malformations are identified; there is no evidence of recurrence of her former vascular malformation."],
    ["He is an elderly gentleman in no acute distress. He is sitting up in bed eating his breakfast."],
    ["Trachea is midline. No jugular venous pressure distention is noted. No adenopathy in the cervical, supraclavicular, or axillary areas."],
    ["Soft and not tender. There may be some fullness in the left upper quadrant, although I do not appreciate a true spleen with inspiration."]
]

Result:

chunk | begin | end | entity | assertion | confidence
hypertension | 9 | 20 | Hypertension | present | 1.0
chronic obstructive pulmonary disease | 26 | 62 | Disease_Syndrome_Disorder | present | 1.0
arteriovenous malformations | 23 | 49 | Disease_Syndrome_Disorder | absent | 1.0
vascular malformation | 116 | 136 | Disease_Syndrome_Disorder | absent | 0.99
distress | 39 | 46 | Symptom | absent | 1.0
jugular venous pressure distention | 23 | 56 | Symptom | absent | 1.0
adenopathy | 71 | 80 | Symptom | absent | 1.0
tender | 13 | 18 | Symptom | absent | 0.99
fullness | 39 | 46 | Symptom | present | 0.68

Please check the assertion_fewshotclassifier model and Few-Shot Assertion Classifier Notebook for more information.

Introducing a New Lightweight Deidentification Annotator LightDeIdentification that Avoids Token Input, Verbose Regex Patterns, and Unnecessary Parameters

LightDeIdentification is a lightweight version of the DeIdentification annotator. It replaces sensitive information in text with obfuscated (fake) or masked values. It is designed for healthcare data and can de-identify patient names, dates, doctor names, hospital names, and other types of sensitive information.

Additionally, it ships with millions of embedded fakers, and custom external fakers can be set with the setCustomFakers function if desired. It supports multiple languages, such as English, Spanish, French, German, and Arabic, and it can apply different de-identification modes to different entities at the same time through the setSelectiveObfuscationModes function.

Example:

light_deidentification = LightDeIdentification() \
    .setInputCols(["ner_chunk", "sentence"]) \
    .setOutputCol("obfuscated") \
    .setMode("obfuscate") \
    .setObfuscateDate(True) \
    .setDateFormats(["MM/dd/yyyy"]) \
    .setDays(5) \
    .setObfuscateRefSource('custom') \
    .setAgeRanges([1, 4, 12, 20, 40, 60, 80])\
    .setCustomFakers({"Doctor": ["John", "Mark"],
                      "Patient": ["John", "Mark"],
                      "Hospital": ["Medical Center"],
                      "Street" : ["Main Road"],
                      "MEDICALRECORD": ["XYZ", "ABC"],
                      }) \
    .setLanguage("en") \
    .setSeed(10) \
    .setDateEntities(["DATE"])

light_deidentification2 = LightDeIdentification() \
    .setInputCols(["ner_chunk", "sentence"]) \
    .setOutputCol("masked") \
    .setMode("mask") \
    .setMaskingPolicy("masked") \
    .setObfuscateDate(True) \
    .setDateFormats(["yyyy"]) \
    .setUnnormalizedDateMode("skip") \
    .setAgeRanges([1, 4, 12, 20, 40, 60, 80])\
    .setSelectiveObfuscationModes({"OBFUSCATE": ["DATE", "STREET", "DOCTOR", "Patient"],
                                    "mask_same_length_chars": ["MEDICALRECORD"],
                                    "mask_entity_labels": ["HOSPITAL"],
                                    })

text ='''
Record date : 2093-01-13 ,M.D David Hale.
Name: Hendrickson, Ora MR # 7194334 Date: 01/13/93.
Patient: Oliveira, 25 years-old.
Cocke County Baptist Hospital, 0295 Keats Street
'''

Result:

Original Sentence | Deidentified | Masked
Record date : 2093-01-13 ,M.D David Hale. | Record date : 28-01-1979 ,M.D John. | Record date : 2093-01-13 ,M.D <DOCTOR>.
Name: Hendrickson, Ora MR # 7194334 Date: 01/13/93. | Name: Mark MR # 1610960 Date: 19-08-1941. | Name: <PATIENT> MR # [*****] Date: 01/13/93.
Patient: Oliveira, 25 years-old. | Patient: Mark, years-old. | Patient: <PATIENT>, <AGE> years-old.
Cocke County Baptist Hospital, 0295 Keats Street | Medical Center, Main Road | , 9500 Euclid Avenue
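The two masking styles visible in the Masked column can be approximated in plain Python. This is only a sketch of the observable behaviour, not the annotator's code: `mask_entity` is a hypothetical helper, and the handling of chunks shorter than three characters is an assumption.

```python
def mask_entity(chunk, label, policy):
    """Hypothetical helper mimicking two masking policies seen in the results."""
    if policy == "mask_entity_labels":
        # Replace the chunk with its entity label in angle brackets.
        return f"<{label.upper()}>"
    if policy == "mask_same_length_chars":
        # Keep the original length: brackets plus asterisks.
        # Assumption: very short chunks get asterisks only.
        if len(chunk) < 3:
            return "*" * len(chunk)
        return "[" + "*" * (len(chunk) - 2) + "]"
    raise ValueError(f"unsupported policy: {policy}")


masked_record = mask_entity("7194334", "MEDICALRECORD", "mask_same_length_chars")
masked_doctor = mask_entity("David Hale", "DOCTOR", "mask_entity_labels")
print(masked_record)
print(masked_doctor)
```

Keeping the masked value the same length as the original is useful when downstream offsets must stay valid after de-identification.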

Please check the LightDeIdentification Notebook for more details.

BioLord-based Sentence Entity Resolver Model for RxNorm Terminologies

This model maps clinical entities and concepts (like drugs/ingredients) to RxNorm codes using BioLord sentence embeddings (mpnet_embeddings_biolord_2023_c). It is trained on the augmented version of the dataset used in previous RxNorm resolver models. Additionally, this model returns concept classes of the drugs in the all_k_aux_labels column.

Predicted Entities: RxNorm Codes and Concept Classes

Example:

rxnorm_resolver = SentenceEntityResolverModel.pretrained("biolordresolve_rxnorm_augmented", "en", "clinical/models")\
    .setInputCols(["embeddings"]) \
    .setOutputCol("rxnorm_code")\
    .setDistanceFunction("EUCLIDEAN")


text= "The patient was prescribed Albuterol inhaler, two puffs every 4 hours as needed for asthma. She was seen by the endocrinology service and was discharged on avandia 4 mg at night , Coumadin 5 mg with meals , and metformin 1000 mg two times a day and Lisinopril 10 mg daily"

Result:

ner_chunk | rxnorm_sbert code | rxnorm_sbert resolution | rxnorm_sbert code | rxnorm_sbert resolution
aspirin 10 meq/ 5 ml oral sol | 979446 | alupent 10 mg per 5 ml oral solution [metaproterenol sulfate 2 mg/ml oral solution [alupent]] | 685589 | aspirin 1.5 mg/ml oral solution [aspirin 1.5 mg/ml oral solution]
coumadin 50 mg oral tablet | 212233 | zoloft 25 mg oral tablet [sertraline 25 mg oral tablet [zoloft]] | 855314 | warfarin sodium 2.5 mg oral tablet [coumadin]
Fluorometholone 5 mg/ 2 ml | 328715 | fluorometholone 2.5 mg/ml [fluorometholone 2.5 mg/ml] | 328715 | fluorometholone 2.5 mg/ml [fluorometholone 2.5 mg/ml]
metformin 1000 mg | 316255 | metformin 1000 mg [metformin 1000 mg] | 316255 | metformin 1000 mg [metformin 1000 mg]

Please check the model card.

Explore the New Annotator MultiChunk2Doc That Merges Chunks into Documents to Concat Named Entities into One Piece

MultiChunk2Doc combines chunks into a single annotation of document type. It lets you set a separator via setSeparator(","), filter chunks by entity label, and keeps annotation details such as begin and end positions, result, and metadata. It fits into NER-to-resolver workflows as an intermediate step: Chunks, then MultiChunk2Doc, then Embeddings.

Example:

clinical_ner = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner_clinical")

multi_chunk2doc = MultiChunk2Doc() \
    .setInputCols(["ner_chunk"]) \
    .setOutputCol("multi_chunk_doc") \
    .setCaseSensitive(False) \
    .setPrefix("< ") \
    .setSeparator(", ") \
    .setSuffix(" >")\
    .setWhiteList(["Problem"])

text = """She was on metformin, glipizide, and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG. She had been on dapagliflozin for six months at the time of presentation.
Physical examination on presentation was significant for dry oral mucosa; significantly, her abdominal examination was benign with no tenderness, guarding, or rigidity."""

Ner Chunk Result:

ner_chunk | begin | end | ner_label
metformin | 11 | 19 | TREATMENT
glipizide | 22 | 30 | TREATMENT
dapagliflozin | 37 | 49 | TREATMENT
T2DM | 55 | 58 | PROBLEM
atorvastatin | 64 | 75 | TREATMENT
gemfibrozil | 81 | 91 | TREATMENT
HTG | 97 | 99 | PROBLEM
dapagliflozin | 118 | 130 | TREATMENT
Physical examination | 176 | 195 | TEST
dry oral mucosa | 233 | 247 | PROBLEM
her abdominal examination | 266 | 290 | TEST
tenderness | 311 | 320 | PROBLEM
guarding | 323 | 330 | PROBLEM
rigidity | 336 | 343 | PROBLEM

Multi Chunk Result:

multi_chunk_doc
[{document, 0, 81, < T2DM, HTG, dry oral mucosa, tenderness, guarding, rigidity > {document -> 0, chunk_count -> 6}, []}]

In the above result, you will see that only the problem entities are combined into a single document.
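The merge itself can be pictured with a few lines of plain Python. This is an approximation for illustration, not the annotator's implementation: `merge_chunks` is a hypothetical function whose arguments mirror the prefix, separator, suffix, whitelist, and case-sensitivity settings used in the example.

```python
def merge_chunks(chunks, white_list, prefix="", separator=",", suffix="",
                 case_sensitive=True):
    """Hypothetical sketch: filter (text, label) chunks by label and join them."""
    if case_sensitive:
        allowed = set(white_list)
        keep = lambda label: label in allowed
    else:
        allowed = {label.lower() for label in white_list}
        keep = lambda label: label.lower() in allowed
    kept = [text for text, label in chunks if keep(label)]
    return prefix + separator.join(kept) + suffix, len(kept)


# (chunk, label) pairs copied from the NER chunk result above.
ner_chunks = [
    ("metformin", "TREATMENT"), ("glipizide", "TREATMENT"),
    ("dapagliflozin", "TREATMENT"), ("T2DM", "PROBLEM"),
    ("atorvastatin", "TREATMENT"), ("gemfibrozil", "TREATMENT"),
    ("HTG", "PROBLEM"), ("dapagliflozin", "TREATMENT"),
    ("Physical examination", "TEST"), ("dry oral mucosa", "PROBLEM"),
    ("her abdominal examination", "TEST"), ("tenderness", "PROBLEM"),
    ("guarding", "PROBLEM"), ("rigidity", "PROBLEM"),
]

merged, chunk_count = merge_chunks(
    ner_chunks, white_list=["Problem"], prefix="< ", separator=", ",
    suffix=" >", case_sensitive=False,
)
print(merged)
print(chunk_count)
```

Because matching is case-insensitive here, the whitelist entry "Problem" selects chunks labelled PROBLEM.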

Please check the MultiChunk2Doc for more information.

Improved Version of Social Determinants of Health (SDoH) Named Entity Recognition Model

We are introducing our new Social Determinants of Health (SDoH) Named Entity Recognition model, which is designed to identify and extract entities related to various social determinants of health. This model improves on our previous version and is aimed at better understanding and tracking the impact of social factors on health.

Example:

ner_model = MedicalNerModel.pretrained("ner_sdoh_core", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

sample_texts = [["""Smith is 55 years old, living in New York, a divorced Mexcian American woman with financial problems. She speaks Spanish and Portuguese. She lives in an apartment. She has been struggling with diabetes for the past 10 years and has recently been experiencing frequent hospitalizations due to uncontrolled blood sugar levels. Smith works as a cleaning assistant and cannot access health insurance or paid sick leave. She has a son, a student at college. Pt with likely long-standing depression. She is aware she needs rehab. Pt reports having her catholic faith as a means of support as well.  She has a long history of etoh abuse, beginning in her teens. She reports she has been a daily drinker for 30 years, most recently drinking beer daily. She smokes a pack of cigarettes a day. She had DUI in April and was due to court this week."""]]

Result:

chunk | begin | end | label
New York | 33 | 40 | Geographic_Entity
financial problems | 82 | 99 | Financial_Status
apartment | 153 | 161 | Housing
hospitalizations | 268 | 283 | Other_SDoH_Keywords
access health insurance | 372 | 394 | Insurance_Status
son | 426 | 428 | Family_Member
student | 433 | 439 | Education
college | 444 | 450 | Education
depression | 482 | 491 | Mental_Health
rehab | 517 | 521 | Access_To_Care
catholic faith | 546 | 559 | Spiritual_Beliefs
support | 575 | 581 | Social_Support
daily | 682 | 686 | Substance_Frequency
30 years | 700 | 707 | Substance_Duration
daily | 738 | 742 | Substance_Frequency
a pack | 756 | 761 | Substance_Quantity
a day | 777 | 781 | Substance_Frequency
DUI | 792 | 794 | Legal_Issues

Please check the model card.

New Profiling Pipeline for De-identification to Run Multiple Deidentification Models at Once in a Single Line

We are excited to introduce a new profiling pipeline for the de-identification domain. Pretrained NER profiling pipelines let you run all the available pretrained NER models for a domain simultaneously and compare their outputs on the same text. This helps in extracting meaningful information from medical text and in uncovering patterns, trends, and insights for research, analysis, and decision-making in healthcare and related fields. Here is a brief overview of the pipeline and the NER models it includes:

Pipeline Name | Included NER Models
ner_profiling_deidentification | ner_deid_augmented, ner_deid_enriched, ner_deid_generic_augmented, ner_deid_name_multilingual_clinical, ner_deid_sd, ner_deid_subentity_augmented, ner_deid_subentity_augmented_i2b2, ner_deid_synthetic, ner_jsl, ner_jsl_enriched

Example:

from sparknlp.pretrained import PretrainedPipeline

ner_profiling_pipeline = PretrainedPipeline("ner_profiling_deidentification", 'en', 'clinical/models')

result = ner_profiling_pipeline.annotate("""Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555 .""")

Result:

******************** ner_deid_name_multilingual_clinical Model Results ******************** 

('David Hale', 'NAME') ('Hendrickson Ora', 'NAME') ('Oliveira', 'NAME')

******************** ner_deid_subentity_augmented_i2b2 Model Results ******************** 

('2093-01-13', 'DATE') ('David Hale', 'DOCTOR') ('Hendrickson Ora', 'PATIENT') ('7194334', 'MEDICALRECORD') ('01/13/93', 'DATE') ('Oliveira', 'PATIENT') ('25', 'AGE') ('2079-11-09', 'DATE') ('Cocke County Baptist Hospital', 'HOSPITAL') ('0295 Keats Street', 'STREET') ('55-555-5555', 'PHONE')

******************** ner_deid_large Model Results ******************** 

('2093-01-13', 'DATE') ('David Hale', 'NAME') ('Hendrickson Ora', 'NAME') ('7194334', 'ID') ('01/13/93', 'DATE') ('Oliveira', 'NAME') ('25', 'AGE') ('2079-11-09', 'DATE') ('Cocke County Baptist Hospital', 'LOCATION') ('0295 Keats Street', 'LOCATION') ('55-555-5555', 'CONTACT')

******************** ner_jsl_enriched Model Results ******************** 

('01/13/93', 'Date') ('25 years-old', 'Age') ('2079-11-09', 'Date')

******************** ner_deid_sd_large Model Results ******************** 

('2093-01-13', 'DATE') ('David Hale', 'NAME') ('Hendrickson Ora', 'NAME') ('7194334', 'ID') ('01/13/93', 'DATE') ('Oliveira', 'NAME') ('2079-11-09', 'DATE') ('Cocke County Baptist Hospital', 'LOCATION') ('0295 Keats Street', 'LOCATION') ('55-555-5555', 'CONTACT')

............

directionSensitive and filterByTokenDistance Parameters Added to the RelationExtraction Annotator to Allow Further Customization and Reduce FPs

The features directionSensitive and filterByTokenDistance have been implemented. The directionSensitive setting determines how entity relations are considered. If set to true, only relations in the form of ENTITY1-ENTITY2 are considered. If set to false, both ENTITY1-ENTITY2 and ENTITY2-ENTITY1 relations are considered. The filterByTokenDistance setting is a criterion for filtering based on the number of tokens between entities. The model only identifies relations where the entities are separated by fewer than the specified number of tokens.
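The combined effect of the two settings can be sketched as a simple filter over candidate entity pairs. This is a plain-Python illustration, not the annotator's code: `keep_pair` is a hypothetical helper, the token distances are illustrative, and treating the distance limit as a strict "fewer than" comparison follows the description above rather than a confirmed implementation detail.

```python
def keep_pair(label1, label2, token_distance, relation_pairs,
              direction_sensitive, max_token_distance=None):
    """Hypothetical filter: decide whether a candidate entity pair is considered."""
    forward = f"{label1.lower()}-{label2.lower()}"
    backward = f"{label2.lower()}-{label1.lower()}"
    if direction_sensitive:
        # Only the ENTITY1-ENTITY2 order listed in relation_pairs is accepted.
        allowed = forward in relation_pairs
    else:
        # Both ENTITY1-ENTITY2 and ENTITY2-ENTITY1 orders are accepted.
        allowed = forward in relation_pairs or backward in relation_pairs
    if not allowed:
        return False
    # Assumption: entities must be separated by fewer than the given number of tokens.
    if max_token_distance is not None and token_distance >= max_token_distance:
        return False
    return True


pairs = ["problem-treatment"]

# A TREATMENT chunk followed by a PROBLEM chunk, one token apart (illustrative).
strict = keep_pair("TREATMENT", "PROBLEM", 1, pairs, direction_sensitive=True)
relaxed = keep_pair("TREATMENT", "PROBLEM", 1, pairs, direction_sensitive=False)
too_far = keep_pair("TREATMENT", "PROBLEM", 5, pairs, direction_sensitive=False,
                    max_token_distance=2)
print(strict, relaxed, too_far)
```

With direction sensitivity on, the pair is dropped because only "problem-treatment" is listed; with it off, the reversed order is accepted unless the distance filter rejects it.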

directionSensitive Example:

clinical_re_Model = RelationExtractionModel()\
    .pretrained("re_clinical", "en", 'clinical/models')\
    .setInputCols(["embeddings", "pos_tags", "ner_chunk", "dependencies"])\
    .setOutputCol("relations")\
    .setRelationPairs(["problem-test", #"test-problem",
                       "treatment-problem", #"problem-treatment",
                       ])\
    .setMaxSyntacticDistance(4)\
    .setDirectionSensitive(True)

text ="""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ),
one prior episode of HTG-induced pancreatitis three years prior to presentation,  associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 ,
presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection .
She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation.
Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity .
Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 .
Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia .
The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission .
However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L .
The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again .
The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours .
Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use .
The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day .
It was determined that all SGLT2 inhibitors should be discontinued indefinitely .
She had close follow-up with endocrinology post discharge .
"""

directionSensitive Result:

index | sentence | entity1_begin | entity1_end | chunk1 | entity1 | entity2_begin | entity2_end | chunk2 | entity2 | relation | confidence
2 | 1 | 511 | 521 | amoxicillin | TREATMENT | 527 | 555 | a respiratory tract infection | PROBLEM | TrAP | 0.9999393
3 | 2 | 570 | 578 | metformin | TREATMENT | 616 | 619 | T2DM | PROBLEM | TrAP | 0.99999905
4 | 2 | 582 | 590 | glipizide | TREATMENT | 616 | 619 | T2DM | PROBLEM | TrAP | 0.9999999
5 | 2 | 598 | 610 | dapagliflozin | TREATMENT | 616 | 619 | T2DM | PROBLEM | TrAP | 0.99999976
6 | 2 | 598 | 610 | dapagliflozin | TREATMENT | 658 | 660 | HTG | PROBLEM | TrAP | 0.9998598
7 | 2 | 625 | 636 | atorvastatin | TREATMENT | 658 | 660 | HTG | PROBLEM | TrAP | 0.99999547
8 | 2 | 642 | 652 | gemfibrozil | TREATMENT | 658 | 660 | HTG | PROBLEM | TrAP | 1.0
10 | 10 | 1552 | 1565 | still elevated | PROBLEM | 1655 | 1660 | lipase | TEST | TeRP | 0.99998677
11 | 10 | 1552 | 1565 | still elevated | PROBLEM | 1669 | 1671 | U/L | TEST | TeRP | 0.9999951
12 | 12 | 1936 | 1950 | an insulin drip | TREATMENT | 1956 | 1960 | euDKA | PROBLEM | TrAP | 0.9996302
13 | 12 | 1936 | 1950 | an insulin drip | TREATMENT | 1966 | 1968 | HTG | PROBLEM | TrAP | 0.9839804
14 | 12 | 1956 | 1960 | euDKA | PROBLEM | 1990 | 2002 | the anion gap | TEST | PIP | 0.9738799
16 | 12 | 1966 | 1968 | HTG | PROBLEM | 1990 | 2002 | the anion gap | TEST | TeRP | 0.9764806
17 | 12 | 1966 | 1968 | HTG | PROBLEM | 2014 | 2026 | triglycerides | TEST | TeRP | 0.9998374

filterByTokenDistance Example:

clinical_re_Model = RelationExtractionModel()\
    .pretrained("re_clinical", "en", 'clinical/models')\
    .setInputCols(["embeddings", "pos_tags", "ner_chunk", "dependencies"])\
    .setOutputCol("relations")\
    .setRelationPairs(["problem-test", #"test-problem",
                       "problem-treatment",#"treatment-problem",
                       ])\
    .setFilterByTokenDistance(2)\
    .setDirectionSensitive(False)

text ="""A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ),
one prior episode of HTG-induced pancreatitis three years prior to presentation,  associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 ,
presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection .
She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG . She had been on dapagliflozin for six months at the time of presentation.
Physical examination on presentation was significant for dry oral mucosa ; significantly , her abdominal examination was benign with no tenderness , guarding , or rigidity .
Pertinent laboratory findings on admission were : serum glucose 111 mg/dl , bicarbonate 18 mmol/l , anion gap 20 , creatinine 0.4 mg/dL , triglycerides 508 mg/dL , total cholesterol 122 mg/dL , glycated hemoglobin ( HbA1c ) 10% , and venous pH 7.27 .
Serum lipase was normal at 43 U/L . Serum acetone levels could not be assessed as blood samples kept hemolyzing due to significant lipemia .
The patient was initially admitted for starvation ketosis , as she reported poor oral intake for three days prior to admission .
However , serum chemistry obtained six hours after presentation revealed her glucose was 186 mg/dL , the anion gap was still elevated at 21 , serum bicarbonate was 16 mmol/L , triglyceride level peaked at 2050 mg/dL , and lipase was 52 U/L .
The β-hydroxybutyrate level was obtained and found to be elevated at 5.29 mmol/L - the original sample was centrifuged and the chylomicron layer removed prior to analysis due to interference from turbidity caused by lipemia again .
The patient was treated with an insulin drip for euDKA and HTG with a reduction in the anion gap to 13 and triglycerides to 1400 mg/dL , within 24 hours .
Her euDKA was thought to be precipitated by her respiratory tract infection in the setting of SGLT2 inhibitor use .
The patient was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , and metformin 1000 mg two times a day .
It was determined that all SGLT2 inhibitors should be discontinued indefinitely .
She had close follow-up with endocrinology post discharge .
"""

filterByTokenDistance Result:

index | sentence | entity1_begin | entity1_end | chunk1 | entity1 | entity2_begin | entity2_end | chunk2 | entity2 | relation | confidence
1 | 1 | 511 | 521 | amoxicillin | TREATMENT | 527 | 555 | a respiratory tract infection | PROBLEM | TrAP | 0.9999393
2 | 2 | 598 | 610 | dapagliflozin | TREATMENT | 616 | 619 | T2DM | PROBLEM | TrAP | 0.99999976
4 | 2 | 642 | 652 | gemfibrozil | TREATMENT | 658 | 660 | HTG | PROBLEM | TrAP | 1.0
5 | 8 | 1245 | 1257 | blood samples | TEST | 1264 | 1273 | hemolyzing | PROBLEM | TrWP | 0.8292767
6 | 10 | 1534 | 1546 | the anion gap | TEST | 1552 | 1565 | still elevated | PROBLEM | TeRP | 1.0
7 | 11 | 1837 | 1844 | analysis | TEST | 1853 | 1879 | interference from turbidity | PROBLEM | TeRP | 0.9999999
8 | 12 | 1936 | 1950 | an insulin drip | TREATMENT | 1956 | 1960 | euDKA | PROBLEM | TrAP | 0.9996302

Please check Clinical Relation Extraction Notebook for more details.

Efficiency Analysis and Cost Evaluation of Deidentification Pipelines on Cloud Platforms

These results show speed benchmarks and cost evaluations for deidentification pipelines deployed on several cloud platforms: AWS EMR, EC2, and Databricks. For each platform, a base pipeline and an optimized pipeline are compared across different partition counts.

Partition | EMR Base Pipeline | EMR Optimized Pipeline | EC2 Instance Base Pipeline | EC2 Instance Optimized Pipeline | Databricks Base Pipeline | Databricks Optimized Pipeline
1024 | 5 min 1 sec | 2 min 45 sec | 7 min 6 sec | 3 min 26 sec | 10 min 10 sec | 6 min 2 sec
512 | 4 min 52 sec | 2 min 30 sec | 6 min 56 sec | 3 min 41 sec | 10 min 16 sec | 6 min 11 sec
256 | 4 min 50 sec | 2 min 30 sec | 9 min 10 sec | 5 min 18 sec | 10 min 22 sec | 6 min 14 sec
128 | 4 min 55 sec | 2 min 30 sec | 14 min 30 sec | 7 min 51 sec | 10 min 21 sec | 5 min 53 sec
64 | 6 min 24 sec | 3 min 8 sec | 18 min 59 sec | 9 min 9 sec | 12 min 42 sec | 6 min 50 sec
32 | 7 min 15 sec | 3 min 43 sec | 18 min 47.2 sec | 9 min 18 sec | 12 min 55 sec | 7 min 40 sec
16 | 11 min 6 sec | 4 min 57 sec | 12 min 47.5 sec | 6 min 14 sec | 15 min 59 sec | 9 min 18 sec
8 | 19 min 13 sec | 8 min 8 sec | 16 min 52 sec | 8 min 48 sec | 22 min 40 sec | 13 min 26 sec

Estimated Minimum Costs:

  • EMR Base Pipeline: partition number: 256, 10K cost: $1.04, 1M cost: $104.41
  • EMR Optimized Pipeline: partition number: 256, 10K cost: $0.54, 1M cost: $54.04
  • EC2 Instance Base Pipeline: partition number: 512, 10K cost: $0.36, 1M cost: $35.70
  • EC2 Instance Optimized Pipeline: partition number: 1024, 10K cost: $0.18, 1M cost: $17.85
  • Databricks Base Pipeline: partition number: 1024, 10K cost: $0.46, 1M cost: $45.76
  • Databricks Optimized Pipeline: partition number: 1024, 10K cost: $0.27, 1M cost: $27.13
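As a quick way to read the timing table, the sketch below converts the reported durations to seconds and computes the speedup of the optimized pipeline over the base pipeline at 1024 partitions. The `to_seconds` helper is hypothetical and only handles the "N min M sec" format used in the table; the durations are copied from the 1024-partition row.

```python
import re


def to_seconds(duration):
    """Hypothetical helper: parse strings such as '5 min 1 sec' into seconds."""
    match = re.fullmatch(r"(\d+) min (\d+(?:\.\d+)?) sec", duration)
    if match is None:
        raise ValueError(f"unrecognised duration: {duration!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# (base pipeline, optimized pipeline) runtimes at 1024 partitions.
runs_1024 = {
    "EMR": ("5 min 1 sec", "2 min 45 sec"),
    "EC2 Instance": ("7 min 6 sec", "3 min 26 sec"),
    "Databricks": ("10 min 10 sec", "6 min 2 sec"),
}

speedups = {
    platform: to_seconds(base) / to_seconds(optimized)
    for platform, (base, optimized) in runs_1024.items()
}

for platform, ratio in speedups.items():
    print(f"{platform}: {ratio:.2f}x faster with the optimized pipeline")
```

The same calculation can be repeated for any other row of the table to see how the gap changes with the partition count.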

Please check the benchmark.

Output Parser to Prettify Pretrained Pipeline Outputs into Structured JSONs/dictionaries

The output parser module integrates with existing systems and returns clear, easy-to-read, and easy-to-process results in dictionary format from pretrained pipelines. It is designed to ease API integration, improve user understanding, and streamline data analysis workflows.

Example:


from sparknlp.pretrained import PretrainedPipeline

oncology_pipeline = PretrainedPipeline("oncology_biomarker_pipeline", "en", "clinical/models")

text = ["""Immunohistochemistry was negative for thyroid transcription factor-1 and napsin A. The test was positive for ER and PR, and negative for HER2."""]

results = oncology_pipeline.fullAnnotate(text)


column_maps = {
    "document_identifier": "XYZ-123",
    "document_text": "document",
    "entities": [
        {
            "ner_chunk_column_name": "assertion_chunk",
            "assertion_column_name": "assertion_oncology_wip",
            "resolver_column_name": None
        },
        {
            "ner_chunk_column_name": "merged_chunk",
            "assertion_column_name": None,
            "resolver_column_name": None
        }
    ],
    "relations": ["re_oncology_wip", "re_oncology_granular_wip"],
    "summary": None, 
    "deidentification": [
        {
            "original": None,
            "obfuscated": None,
            "masked": None,
        }
    ]
}

from sparknlp_jsl.pipeline_output_parser import PipelineOutputParser
pipeline_parser = PipelineOutputParser(column_maps)
pipeline_parser.__dict__

parsed_result = pipeline_parser.run(results)

Result:


{'result': [{'document_identifier': 'XXXX',
   'document_text': 'Immunohistochemistry was negative for thyroid transcription factor-1 and napsin A. The test was positive for ER and PR, and negative for HER2.',
   'entities': [
    [{'chunk_id': '0',
      'begin': 0,
      'end': 19,
      'chunk': 'Immunohistochemistry',
      'label': 'Pathology_Test',
      'assertion': 'Past',
      'term_code': None},
      {'chunk_id': '1',
      'begin': 38,
      'end': 67,
      'chunk': 'thyroid transcription factor-1',
      'label': 'Biomarker',
      'assertion': 'Present',
      'term_code': None},
      ...],
    [{'chunk_id': '0',
      'begin': 0,
      'end': 19,
      'chunk': 'Immunohistochemistry',
      'label': 'Pathology_Test',
      'assertion': None,
      'term_code': None},
     {'chunk_id': '1',
      'begin': 25,
      'end': 32,
      'chunk': 'negative',
      'label': 'Biomarker_Result',
      'assertion': None,
      'term_code': None},
      ...]
    ],
   'relations': [
    [{'relation': 'O',
      'entity1': 'Pathology_Test',
      'entity1_begin': '0',
      'entity1_end': '19',
      'chunk1': 'Immunohistochemistry',
      'entity2': 'Biomarker_Result',
      'entity2_begin': '25',
      'entity2_end': '32',
      'chunk2': 'negative',
      'confidence': '0.97084755'},
      ...],
    [...
     {'relation': 'is_finding_of',
      'entity1': 'Biomarker_Result',
      'entity1_begin': '25',
      'entity1_end': '32',
      'chunk1': 'negative',
      'entity2': 'Biomarker',
      'entity2_begin': '38',
      'entity2_end': '67',
      'chunk2': 'thyroid transcription factor-1',
      'confidence': '0.9239829'},
      ...]
    ],
   'summary': [],
   'deidentification': [{'original': [], 'obfuscated': [], 'masked': []}]}]}
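Because the parser returns nested dictionaries and lists, downstream code usually needs to flatten them. The sketch below shows one way to do that; `flatten_entities` is a hypothetical helper, and the input is a shortened hand-written copy of the result above rather than real parser output.

```python
# Shortened copy of the parsed result shown above (only a few fields kept).
parsed_result = {
    "result": [{
        "document_identifier": "XXXX",
        "entities": [
            [
                {"chunk": "Immunohistochemistry", "label": "Pathology_Test",
                 "assertion": "Past"},
                {"chunk": "thyroid transcription factor-1", "label": "Biomarker",
                 "assertion": "Present"},
            ],
            [
                {"chunk": "negative", "label": "Biomarker_Result",
                 "assertion": None},
            ],
        ],
    }]
}


def flatten_entities(parsed):
    """Hypothetical helper: one flat row per entity, tagged with its source group."""
    rows = []
    for document in parsed["result"]:
        # Each inner list corresponds to one entry of column_maps["entities"].
        for group_index, group in enumerate(document["entities"]):
            for entity in group:
                rows.append({
                    "document_identifier": document["document_identifier"],
                    "entity_group": group_index,
                    "chunk": entity["chunk"],
                    "label": entity["label"],
                    "assertion": entity["assertion"],
                })
    return rows


rows = flatten_entities(parsed_result)
for row in rows:
    print(row)
```

Flat rows like these are straightforward to load into a DataFrame or serialize as JSON lines.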

Various Core Improvements: Bug Fixes, Enhanced Overall Robustness, and Reliability of Spark NLP for Healthcare

  • Fixed issue with WindowedSentenceModel not being imported correctly from the sparknlp_jsl.annotator module
  • Addressed inconsistent obfuscation behavior in the deidentification annotator on DataProc, where different fake names were produced despite using the same seed value across runs
  • Solved compatibility issue between MPNetEmbeddings and SentenceEntityResolverModel in Spark NLP for Healthcare, where the resolver fails due to missing token metadata from MPNetEmbeddings.

Updated Notebooks and Demonstrations for Making Spark NLP for Healthcare Easier to Navigate and Understand

We Have Added and Updated a Substantial Number of New Clinical Models and Pipelines, Further Solidifying Our Offering in the Healthcare Domain.

  • ner_sdoh_core
  • ner_profiling_deidentification
  • assertion_fewshotclassifier
  • biolordresolve_rxnorm_augmented

For all Spark NLP for Healthcare models, please check: Models Hub Page
