5.0.0
Highlights
We are delighted to announce a suite of remarkable enhancements and updates in our latest release of Spark NLP for Healthcare. This release comes with the first Few Shot Text Classifier module and ONNX-optimized sBioBert sentence embeddings as well as 21 new clinical pretrained models and pipelines. It is a testament to our commitment to continuously innovate and improve, furnishing you with a more sophisticated and powerful toolkit for healthcare natural language processing.
- Introducing the very first Few Shot Classifier model to our toolkit to train classifier models with limited labeled data.
- New ONNX Sentence BioBERT Embeddings model, designed to enhance performance and accuracy
- 2 New Medical Question Answering models based on SOTA LLMs, designed to provide accurate answers to your inquiries against clinical notes
- 7 new NER models for Social Determinants of Health (SDOH), broadening our ability to identify and analyze crucial factors that impact health outcomes.
- New profiling pipelines for Social Determinants Of Health (SDOH), Voice Of The Patient (VOP), and Oncology to run multiple models at once in a single line
- New clinical multi-class classifier models for classification of articles based on cancer hallmarks and Covid-19 topics
- New Patient Urgency Text Classifier model, designed to analyze the level of emergency in medical situations requiring immediate assistance
- Brand-new Dutch clinical NER models, empowering accurate recognition and extraction of clinical entities in Dutch language
- New German sentence entity resolver model exclusively tailored for ICD-10-GM codes
- New feature for `InternalResourceDownloader` to point to a cache folder
- `UpdateModels` is now more flexible and can be used to update existing models in the cache folder
- New feature for `ChunkFilterer` to enable filtering chunks according to confidence thresholds
- New feature for `StructuredDeidentification` to make it flexible for different languages
- Enhanced ALAB module with Relation Extraction model training data preparation ability using document-level annotations
- Various core improvements; bug fixes, enhanced overall robustness and reliability of Spark NLP for Healthcare
- Improved Deidentification performance with refactoring
- Updated `clinical_deidentification` pipeline by enhancing the `AGE` entity extraction capability
- Minor corrections have been made to the calculation formulas in the Medicare Risk Adjustment Module
- Updated notebooks and demonstrations for making Spark NLP for Healthcare easier to navigate and understand
- The addition and update of numerous new clinical models and pipelines continue to reinforce our offering in the healthcare domain
We believe that these enhancements will elevate your experience with Spark NLP for Healthcare, enabling more efficient, accurate, and streamlined analysis of healthcare-related natural language data.
Introducing The Very First Few Shot Classifier Model To Our Toolkit To Train Classifier Models With Limited Labeled Data
The `FewShotClassifierApproach` and `FewShotClassifierModel` annotators are new additions to the set of annotators available in the Spark NLP for Healthcare library. These annotators specifically target few-shot classification tasks, which involve training a model to make accurate predictions with limited labeled data.
These new annotators provide a valuable capability for handling scenarios where labeled data is scarce or expensive to obtain. By effectively utilizing limited labeled examples, the few-shot classification approach enables the creation of models that can generalize and classify new instances accurately, even with minimal training data.
In our experiment, we compared the Few-Shot Classifier trained on partial data, equivalent to 40% of our entire dataset, against the ClassifierDL trained on both full (80% of the dataset) and partial data. To maintain fairness, the test set was constant at 20% of the entire dataset for all cases, and the same sentence embeddings were employed across the board. The Few-Shot Classifier achieved a macro F1 score of 0.867, outperforming the ClassifierDL trained on the full dataset, which scored 0.847. The ClassifierDL using partial data also showed comparable results to its full data counterpart, demonstrating its robustness with less training data, but it was still surpassed by the Few-Shot Classifier. This superior performance from the Few-Shot Classifier with less data indicates that it is an efficient and effective choice for scenarios where data scarcity is a concern. We’re excited to see how this feature will enhance text classification tasks in our library. Stay tuned for more updates as we continue to optimize and improve our offerings.
model | macro-f1-score | weighted-f1-score | accuracy |
---|---|---|---|
ClassifierDL_full_Data | 0.85 | 0.84 | 0.84 |
ClassifierDL_partial_Data | 0.84 | 0.84 | 0.84 |
FewShot_partial_Data | 0.87 | 0.87 | 0.87 |
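To make the metric columns above concrete, here is a minimal sketch of how macro and weighted F1 are derived from per-class counts. The labels and predictions are toy values chosen for illustration, not the experiment's data, and the helper names (`f1_per_class`, `macro_f1`, `weighted_f1`) are hypothetical.

```python
def f1_per_class(y_true, y_pred):
    """Return {label: (f1, support)} computed from true and predicted labels."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (f1, tp + fn)  # support = number of true instances
    return scores


def macro_f1(scores):
    """Unweighted mean of per-class F1: every class counts equally."""
    return sum(f1 for f1, _ in scores.values()) / len(scores)


def weighted_f1(scores):
    """Mean of per-class F1 weighted by class support."""
    total = sum(support for _, support in scores.values())
    return sum(f1 * support for f1, support in scores.values()) / total


# Toy data (illustrative only): three positive and one negative example.
y_true = ["ADE_positive", "ADE_positive", "ADE_positive", "ADE_negative"]
y_pred = ["ADE_positive", "ADE_positive", "ADE_negative", "ADE_negative"]

scores = f1_per_class(y_true, y_pred)
print(round(macro_f1(scores), 3), round(weighted_f1(scores), 3))
```

Macro F1 treats both classes equally, while weighted F1 leans toward the larger class, which is why the two columns can differ on imbalanced data.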
The `FewShotClassifier` is designed to process sentence embeddings as input. It generates category annotations, providing labels along with confidence scores that range from 0 to 1. Input annotation types supported by this model include `SENTENCE_EMBEDDINGS`, while the output annotation type is `CATEGORY`.
Example:
few_shot_approach = FewShotClassifierApproach()\
    .setLabelColumn("label")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("prediction")\
    .setModelFile("/tmp/log_reg_graph.pb")\
    .setEpochsNumber(10)\
    .setBatchSize(1)\
    .setLearningRate(0.001)
pipeline = Pipeline(
    stages=[
        document_asm,
        sentence_embeddings,
        graph_builder,
        few_shot_approach
    ])
data = [
["ADE_positive", 'Both PAN and methotrexate have been independently demonstrated to cause sensorineural hearing loss.'],
["ADE_positive", 'Adrenal suppression in a fetus due to administration of methylprednisolone has hitherto been rarely published.'],
["ADE_negative", 'Pathogenic mechanisms for the development of pseudomembranous colitis and the epidemiology of this condition in patients with AIDS are discussed.'],
["ADE_negative", 'I report a patient who developed the syndrome during treatment for schizophrenia with the antipsychotic agent molindone hydrochloride.']
]
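# Assumption: `spark` is an active SparkSession and document_asm reads the "text" column
train_data = spark.createDataFrame(data, ["label", "text"])
model = pipeline.fit(train_data)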
tests = [
'Bleomycin pneumonitis potentiated by oxygen administration.',
'Enzymes derived from two different bacterial sources (Escherichia coli and Erwinia carotovora) are in common use.',
]
Result:
text | prediction category |
---|---|
Bleomycin pneumonitis potentiated by oxygen administration. | ADE_positive |
Enzymes derived from two different bacterial sources (Escherichia coli and Erwinia carotovora) are in common use. | ADE_negative |
Please check the Text Classification with FewShotClassifier Notebook for more information.
New ONNX Sentence BioBERT Embeddings Model, Designed To Enhance Performance And Accuracy
Spark NLP 5.0.0 introduced support for ONNX Runtime, which handles machine learning models in the ONNX format and has been shown to significantly boost inference performance across many models. This integration leads to substantial improvements when serving our LLM models, including BERT. We now introduce the first ONNX-optimized medical sentence embeddings model, `sbiobert_base_cased_mli_onnx`, which delivers two times faster inference.
Example:
sbiobert_embeddings = BertSentenceEmbeddings\
.pretrained("sbiobert_base_cased_mli_onnx", "en", "clinical/models")\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("sbert_embeddings")
Result:
The model produces a 768-dimensional vector representation of the sentence.
Please see the model card
2 New Medical Question Answering Models Based On SOTA LLMs, Designed To Provide Accurate Answers To Your Inquiries Against Clinical Notes
Now we have `clinical_notes_qa_base` and `clinical_notes_qa_large` models that are capable of open-book question answering on medical notes.
These new medical question answering models empower users to extract valuable information and insights from medical notes effectively. Whether you are a healthcare professional, researcher, or enthusiast, the `clinical_notes_qa_base` and `clinical_notes_qa_large` models offer advanced tools for retrieving targeted information from medical documents and enhancing your understanding of the medical domain.
Example:
med_qa = sparknlp_jsl.annotators.MedicalQuestionAnswering\
    .pretrained("clinical_notes_qa_base", "en", "clinical/models")\
    .setInputCols(["document_question", "document_context"])\
    .setCustomPrompt("Context: {context} \n Question: {question} \n Answer: ")\
    .setOutputCol("answer")
note_text = """Patient with a past medical history of hypertension for 15 years.\n(Medical Transcription Sample Report)\nHISTORY OF PRESENT ILLNESS:\nThe patient is a 74-year-old white woman who has a past medical history of hypertension for 15 years, history of CVA with no residual hemiparesis and uterine cancer with pulmonary metastases, who presented for evaluation of recent worsening of the hypertension. According to the patient, she had stable blood pressure for the past 12-15 years on 10 mg of lisinopril."""
question = "What is the primary issue reported by patient?"
Result:
"The primary issue reported by the patient is hypertension."
please check: MEDICAL LLM Demo
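The custom prompt in the example above contains two named placeholders, `{context}` and `{question}`. As a rough illustration of what such a template expands to, the sketch below fills it with plain Python string formatting. That the annotator performs the substitution exactly this way is an assumption, and `build_prompt` is a hypothetical helper, not part of the library.

```python
# Template copied from the setCustomPrompt call in the example above.
custom_prompt = "Context: {context} \n Question: {question} \n Answer: "


def build_prompt(template: str, context: str, question: str) -> str:
    """Hypothetical helper: substitute the two named placeholders."""
    return template.format(context=context, question=question)


context = "Patient with a past medical history of hypertension for 15 years."
question = "What is the primary issue reported by patient?"

prompt = build_prompt(custom_prompt, context, question)
print(prompt)
```

Keeping the trailing `Answer: ` at the end of the template leaves the model to continue the text with the answer itself.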
7 New NER Models For Social Determinants Of Health (SDOH), Broadening Our Ability To Identify And Analyze Crucial Factors That Impact Health Outcomes
Introducing our new set of SDOH NER models that are specifically designed to identify and extract entities related to various social determinants of health. Here is a brief overview of each model and the entities it predicts:
model name | description | predicted entities |
---|---|---|
ner_sdoh_access_to_healthcare | extract entities related to access to healthcare | Access_To_Care , Healthcare_Institution , Insurance_Status |
ner_sdoh_community_condition | identify and extract entities associated with different community conditions | Community_Safety , Environmental_Condition , Food_Insecurity , Housing , Transportation |
ner_sdoh_demographics | extract entities associated with different demographic factors | Age , Family_Member , Gender , Geographic_Entity , Language , Race_Ethnicity , Spiritual_Beliefs |
ner_sdoh_health_behaviours_problems | extract entities associated with health behaviors and problems | Communicable_Disease , Diet , Disability , Eating_Disorder , Exercise , Hyperlipidemia , Hypertension , Mental_Health , Obesity , Other_Disease , Quality_Of_Life , Sexual_Activity |
ner_sdoh_income_social_status | extract entities associated with income and social status | Education , Employment , Financial_Status , Income , Marital_Status , Population_Group |
ner_sdoh_social_environment | extract entities associated with different aspects of the social environment | Chidhood_Event , Legal_Issues , Social_Exclusion , Social_Support , Violence_Or_Abuse |
ner_sdoh_substance_usage | extract entities associated with substance usage | Alcohol , Smoking , Substance_Duration , Substance_Frequency , Substance_Quantity , Substance_Use |
Example:
ner_model = MedicalNerModel.pretrained("ner_sdoh_health_behaviours_problems", "en", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
sample_text = """The patient is a 54-year-old female with a complex medical history, including anxiety, depression, bulimia nervosa, elevated cholesterol, substance abuse, hypertension, and hyperlipidemia. Her partner has been diagnosed with hepatitis C. She reports a lack of regular exercise and a departure from a healthy diet for approximately two years due to chronic sciatic pain. Her sedentary lifestyle and poor diet have contributed to obesity, leading to a negative impact on her self-esteem. The patient is motivated to make lifestyle improvements, including weight loss, addressing her mental well-being, and enhancing her sexual satisfaction."""
Result:
chunk | begin | end | ner_label |
---|---|---|---|
anxiety | 78 | 84 | Mental_Health |
depression | 87 | 96 | Mental_Health |
bulimia nervosa | 99 | 113 | Eating_Disorder |
elevated cholesterol | 116 | 135 | Hyperlipidemia |
hypertension | 155 | 166 | Hypertension |
hyperlipidemia | 173 | 186 | Hyperlipidemia |
hepatitis C | 225 | 235 | Communicable_Disease |
regular exercise | 261 | 276 | Exercise |
healthy diet | 301 | 312 | Diet |
chronic sciatic pain | 349 | 368 | Other_Disease |
sedentary lifestyle | 375 | 393 | Exercise |
poor diet | 399 | 407 | Diet |
obesity | 429 | 435 | Obesity |
self-esteem | 474 | 484 | Quality_Of_Life |
lifestyle improvements | 521 | 542 | Quality_Of_Life |
mental well-being | 583 | 599 | Mental_Health |
sexual satisfaction | 620 | 638 | Sexual_Activity |
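The `begin` and `end` columns in the table above are zero-based character offsets into the input text, with `end` pointing at the last character of the chunk (inclusive). The short check below uses only the first sentence of the sample text and the first six rows of the table; `chunk_at` is a hypothetical helper written for this illustration.

```python
# First sentence of the sample text from the example above.
sample_text = (
    "The patient is a 54-year-old female with a complex medical history, "
    "including anxiety, depression, bulimia nervosa, elevated cholesterol, "
    "substance abuse, hypertension, and hyperlipidemia."
)

# (chunk, begin, end, ner_label) rows copied from the result table.
rows = [
    ("anxiety", 78, 84, "Mental_Health"),
    ("depression", 87, 96, "Mental_Health"),
    ("bulimia nervosa", 99, 113, "Eating_Disorder"),
    ("elevated cholesterol", 116, 135, "Hyperlipidemia"),
    ("hypertension", 155, 166, "Hypertension"),
    ("hyperlipidemia", 173, 186, "Hyperlipidemia"),
]


def chunk_at(text: str, begin: int, end: int) -> str:
    """Slice a chunk out of the text; `end` is inclusive, so add 1."""
    return text[begin:end + 1]


# Collect any row whose offsets do not reproduce the chunk text.
mismatches = [row for row in rows if chunk_at(sample_text, row[1], row[2]) != row[0]]
print(mismatches)
```

An empty list means every offset pair reproduces its chunk, which is a quick sanity check to run when aligning NER output with the original document.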
Please check the Social Determinant of Health Notebook for more information.
New Profiling Pipelines For Social Determinants Of Health (SDOH), Voice Of The Patient (VOP), and Oncology To Run Multiple Models At Once In A Single Line
We are excited to introduce our new profiling pipelines that focus on Social Determinants of Health (SDOH), Voice of Patient (VOP), and Oncology domains. We can use pretrained NER profiling pipelines for exploring all the available pretrained NER models at once. These profiling pipelines offer powerful tools for extracting meaningful information from medical text data in the respective domains. They assist in uncovering patterns, trends, and insights that are crucial for research, analysis, and decision-making in healthcare and related fields. Here’s a brief overview of each pipeline and the included NER models:
Pipeline Name | included NER Models |
---|---|
ner_profiling_oncology | ner_oncology_unspecific_posology , ner_oncology_tnm , ner_oncology_therapy , ner_oncology_test , ner_oncology_response_to_treatment , ner_oncology_posology , ner_oncology , ner_oncology_limited_80p_for_benchmarks , ner_oncology_diagnosis , ner_oncology_demographics , ner_oncology_biomarker , ner_oncology_anatomy_granular , ner_oncology_anatomy_general |
ner_profiling_sdoh | ner_sdoh , ner_sdoh_social_environment_wip , ner_sdoh_mentions , ner_sdoh_demographics , ner_sdoh_community_condition , ner_sdoh_substance_usage , ner_sdoh_access_to_healthcare , ner_sdoh_health_behaviours_problems , ner_sdoh_income_social_status |
ner_profiling_vop | ner_vop_clinical_dept , ner_vop_temporal , ner_vop_test , ner_vop , ner_vop_problem , ner_vop_problem_reduced , ner_vop_treatment , ner_vop_demographic , ner_vop_anatomy |
Example:
from sparknlp.pretrained import PretrainedPipeline
ner_profiling_pipeline = PretrainedPipeline("ner_profiling_oncology", 'en', 'clinical/models')
For results and different examples, please see
New Clinical Multi-Class Classifier Models for Classification Of Articles Based On Cancer Hallmarks And Covid-19 Topics
We are pleased to introduce our two new multi-classifier models. Here’s a brief overview of each model and the entities they predict:
model name | description | predicted entities |
---|---|---|
multiclassifierdl_hoc | This model makes a semantic classification of the article according to the hallmarks of cancer based on its abstract | Activating_Invasion_And_Metastasis , Avoiding_Immune_Destruction , Cellular_Energetics , Enabling_Replicative_Immortality , Evading_Growth_Suppressors , Genomic_Instability_And_Mutation , Inducing_Angiogenesis , Resisting_Cell_Death , Sustaining_Proliferative_Signaling , Tumor_Promoting_Inflammation |
multiclassifierdl_litcovid | This model determines the relevant COVID-19 topics of the article based on its abstract. | Mechanism , Transmission , Diagnosis , Treatment , Prevention , Case_Report , Epidemic_Forecasting |
These multi-classifier models enhance the classification and analysis of articles by providing predictions related to specific domains. They facilitate efficient information retrieval and assist researchers and practitioners in quickly identifying articles relevant to cancer hallmarks or specific COVID-19 topics based on abstract content.
Example:
multi_classifier_dl = MultiClassifierDLModel.pretrained("multiclassifierdl_litcovid", "en", "clinical/models")\
.setInputCols(["sentence_embeddings"])\
.setOutputCol("category")
text = """Low level of plasminogen increases risk for mortality in COVID-19 patients. The pathophysiology of coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), and especially of its complications is still not fully understood. In fact, a very high number of patients with COVID-19 die because of thromboembolic causes. A role of plasminogen, as precursor of fibrinolysis, has been hypothesized. In this study, we aimed to investigate the association between plasminogen levels and COVID-19-related outcomes in a population of 55 infected Caucasian patients (mean age: 69.8 +/- 14.3, 41.8% female). Low levels of plasminogen were significantly associated with inflammatory markers (CRP, PCT, and IL-6), markers of coagulation (D-dimer, INR, and APTT), and markers of organ dysfunctions (high fasting blood glucose and decrease in the glomerular filtration rate). A multidimensional analysis model, including the correlation of the expression of coagulation with inflammatory parameters, indicated that plasminogen tended to cluster together with IL-6, hence suggesting a common pathway of activation during disease's complication. Moreover, low levels of plasminogen strongly correlated with mortality in COVID-19 patients even after multiple adjustments for presence of confounding. These data suggest that plasminogen may play a pivotal role in controlling the complex mechanisms beyond the COVID-19 complications, and may be useful both as biomarker for prognosis and for therapeutic target against this extremely aggressive infection."""
Result:
text | result |
---|---|
Low level of plasminogen increases risk for mortality in COVID-19 patients. The pathophysiology of coronavirus diseas… | [Mechanism, Treatment, Diagnosis] |
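A multi-label classifier can return several topics for one abstract, as in the result above. A common way to do this is to keep every label whose score clears a threshold. The sketch below illustrates that idea only: the scores are invented for the example, the 0.5 threshold is arbitrary, and whether the pretrained model applies exactly this rule is an assumption.

```python
def select_labels(scores: dict, threshold: float = 0.5) -> list:
    """Return labels whose score is at least `threshold`, highest score first."""
    kept = [(label, score) for label, score in scores.items() if score >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [label for label, _ in kept]


# Hypothetical per-label scores for one abstract (not real model output).
hypothetical_scores = {
    "Mechanism": 0.91,
    "Treatment": 0.78,
    "Diagnosis": 0.64,
    "Prevention": 0.12,
    "Transmission": 0.08,
    "Case_Report": 0.03,
    "Epidemic_Forecasting": 0.01,
}

selected = select_labels(hypothetical_scores)
print(selected)
```

Raising the threshold trades recall for precision: fewer topics are returned, but each is more confidently assigned.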
New Patient Urgency Text Classifier Model, Designed To Analyze The Level Of Emergency In Medical Situations Requiring Immediate Assistance
The Patient Urgency Text Classifier model is designed to analyze the level of emergency in medical situations that demand immediate assistance from medical organizations.
`bert_sequence_classifier_patient_urgency`: This model has undergone training using a dataset of emergency calls, which have been labeled with three distinct classes (`High`, `Medium`, `Low`).
Example:
sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_patient_urgency", "en", "clinical/models")\
.setInputCols(["document", "token"])\
.setOutputCol("prediction")
sample_text_list = [
"I think my father is having a stroke. His face is drooping, he can’t move his right side and he’s slurring his speech. He is breathing, but it’s really ragged. And, he is not responding when I talk to him…he seems out of it.",
"My old neighbor has fallen and cannot get up. She is conscious, but she is in a lot of pain and cannot move.",
"My wife has been in pain all morning. She had an operation a few days ago. This morning, she woke up in pain and is having a hard time moving around. The pain is around the surgery area. It is not severe, but it’s making her uncomfortable. She does not have fever, nausea or vomiting. There’s some slight feeling of being bloated."
]
Result:
text | result |
---|---|
I think my father is having a stroke. His face is drooping, he can’t move his right side and he’s… | High |
My old neighbor has fallen and cannot get up. She is conscious, but she is in a lot of pain and c… | Medium |
My wife has been in pain all morning. She had an operation a few days ago. This morning, she woke… | Low |
Brand-new Dutch Clinical NER Models, Empowering Accurate Recognition And Extraction Of Clinical Entities In Dutch Language
`ner_clinical` and `bert_token_classifier_ner_clinical`: These two Dutch clinical NER models provide valuable tools for processing and analyzing Dutch clinical texts. They assist in automating the extraction of important clinical information, facilitating research, medical documentation, and other applications within the Dutch healthcare domain.
Example:
ner_model = MedicalNerModel.pretrained("ner_clinical", "nl", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner")
text = """Dhr. Van Dijk, 58 jaar oud, kwam naar de kliniek met klachten van aanhoudende hoest, koorts en kortademigheid. We hebben besloten om een röntgenfoto van de borst, bloedonderzoek en een CT-scan te laten uitvoeren. De resultaten wezen op een ernstige longontsteking, een verhoogd aantal witte bloedcellen en mogelijk COPD. Hem is een antibiotica kuur en een sterke hoestsiroop voorgeschreven. Daarnaast adviseren we hem een voedzaam dieet te volgen."""
Result:
chunk | begin | end | ner_label | confidence |
---|---|---|---|---|
aanhoudende hoest | 66 | 82 | PROBLEM | 0.82 |
koorts | 85 | 90 | PROBLEM | 0.99 |
kortademigheid | 95 | 108 | PROBLEM | 0.99 |
röntgenfoto van de borst | 137 | 160 | TEST | 0.61 |
bloedonderzoek | 163 | 176 | TEST | 0.92 |
een CT-scan | 181 | 191 | TEST | 0.73 |
ernstige longontsteking | 240 | 262 | PROBLEM | 0.78 |
een verhoogd aantal witte bloedcellen | 265 | 301 | PROBLEM | 0.45 |
COPD | 315 | 318 | PROBLEM | 0.98 |
antibiotica kuur | 332 | 347 | TREATMENT | 0.63 |
een sterke hoestsiroop | 352 | 373 | TREATMENT | 0.47 |
een voedzaam dieet | 418 | 435 | TREATMENT | 0.69 |
New German Sentence Entity Resolver Model Exclusively Tailored For ICD-10-GM Codes
`robertaresolve_icd10gm`: This model maps extracted medical entities to ICD-10-GM codes for the German language using `xlmroberta_embeddings_paraphrase_mpnet_base_v2` embeddings.
With this German Sentence Entity Resolver, you can efficiently analyze German medical texts and obtain the relevant ICD-10-GM codes associated with the extracted medical entities. This enables precise categorization and classification of medical data, enhancing medical research, coding, and analysis in the German healthcare domain.
Example:
icd10gm_resolver = SentenceEntityResolverModel.pretrained("robertaresolve_icd10gm", "de", "clinical/models") \
.setInputCols(["sentence_embeddings"]) \
.setOutputCol("icd10gm_code")
text = ["Dyspnoe", "Lymphknoten"]
Result:
chunks | code | resolutions | all_codes | all_distances |
---|---|---|---|---|
Dyspnoe | R06.0 | Dyspnoe:::Dysphagie:::Dysurie… | R06.0:::R13:::R30.0… | 0.00:::1.09:::1.17… |
Lymphknoten | D36.0 | Lymphknoten:::Lymphknotenvergrößerung… | D36.0:::R59:::Q82.0… | 0.00:::0.04:::0.12… |
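In the result table above, the `resolutions`, `all_codes`, and `all_distances` columns pack several candidates into one string separated by `:::`. The sketch below shows one way to unpack them into aligned tuples. It uses only the candidates visible in the first row (the table truncates the rest), and `parse_candidates` is a hypothetical helper.

```python
SEPARATOR = ":::"

# Visible part of the first result row; the table truncates further candidates.
row = {
    "chunk": "Dyspnoe",
    "resolutions": "Dyspnoe:::Dysphagie:::Dysurie",
    "all_codes": "R06.0:::R13:::R30.0",
    "all_distances": "0.00:::1.09:::1.17",
}


def parse_candidates(row: dict) -> list:
    """Split the packed columns and align them as (code, resolution, distance)."""
    names = row["resolutions"].split(SEPARATOR)
    codes = row["all_codes"].split(SEPARATOR)
    distances = [float(d) for d in row["all_distances"].split(SEPARATOR)]
    return list(zip(codes, names, distances))


candidates = parse_candidates(row)
# The candidate with the smallest distance is the closest match.
best = min(candidates, key=lambda candidate: candidate[2])
print(best)
```

Keeping the full candidate list around is useful when the top code needs review, since the runner-up codes and their distances show how clear-cut the match was.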
New Feature To InternalResourceDownloader To Point Cache Folder
By setting the `cache_folder_path`, you can control where the downloaded resources are stored, enabling easy access and reuse of the downloaded models in subsequent operations or workflows.
Example:
from sparknlp_jsl.pretrained import InternalResourceDownloader
#The first argument is the path to the zip file and the second one is the folder.
InternalResourceDownloader.downloadModelDirectly("clinical/models/ner_clinical_large_en_2.5.0_2.4_1590021302624.zip",
"clinical/models",
unzip=False,
cache_folder_path="/content")
UpdateModels Is Now More Flexible And Can Be Used To Update Existing Models In The Cache Folder
`UpdateModels` is a helper class that provides functionality to update existing pretrained models located in the cache folder. It offers two main methods: `updateCacheModels` and `updateModels`.
- `UpdateModels.updateCacheModels(cache_folder='')`: This method refreshes all the pretrained models located in the cache pretrained folder.
- `UpdateModels.updateModels()`: This method downloads all the new pretrained models that have been released within the specified date interval. It accepts the following parameters:
  - `model_names`: A list of names of the models to be downloaded.
  - `language`: The language of the models, with a default value of “en”.
  - `start_date`: The starting date used to filter the models, in the format “yyyy-MM-dd”.
  - `end_date`: The ending date used to filter the models, in the format “yyyy-MM-dd”.
  - `cache_folder`: The path indicating where the models will be downloaded and stored.
Example:
from sparknlp_jsl.updateModels import UpdateModels
UpdateModels.updateModels(start_date = "2021-01-01",
end_date = "2023-07-07",
model_names=["ner_clinical","ner_jsl"],
language="en",
remote_loc="clinical/models",
cache_folder="/content/jsl_models"
)
ls /content/jsl_models/
Result:
ner_clinical_en_3.0.0_3.0_1617208419368/
ner_jsl_en_4.2.0_3.0_1666181370373/
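The folder names in the result above appear to follow a `name_language_version_version_timestamp` layout. The sketch below parses them under two assumptions that are not stated in the release notes: that the two version fields are the library and Spark versions, and that the trailing number is a Unix timestamp in milliseconds. It then checks each date against the interval used in the example. Both helper functions are hypothetical.

```python
from datetime import date, datetime, timezone


def parse_model_folder(folder: str) -> dict:
    """Split a cached model folder name into its (assumed) components."""
    name, language, lib_version, spark_version, stamp = folder.rstrip("/").rsplit("_", 4)
    # Assumption: the trailing field is epoch time in milliseconds.
    released = datetime.fromtimestamp(int(stamp) / 1000, tz=timezone.utc).date()
    return {
        "name": name,
        "language": language,
        "lib_version": lib_version,      # assumed meaning
        "spark_version": spark_version,  # assumed meaning
        "released": released,
    }


def in_interval(released: date, start_date: str, end_date: str) -> bool:
    """Check a date against a yyyy-MM-dd interval (both ends inclusive)."""
    return date.fromisoformat(start_date) <= released <= date.fromisoformat(end_date)


folders = [
    "ner_clinical_en_3.0.0_3.0_1617208419368/",
    "ner_jsl_en_4.2.0_3.0_1666181370373/",
]

parsed = [parse_model_folder(folder) for folder in folders]
for model in parsed:
    print(model["name"], model["released"],
          in_interval(model["released"], "2021-01-01", "2023-07-07"))
```

Under these assumptions both folders carry dates inside the requested interval, which is consistent with them having been downloaded by the call above.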
New Feature For ChunkFilterer To Enable Filtering Chunks According To Confidence Thresholds
We have added a new `setEntitiesConfidence` parameter to the `ChunkFilterer` annotator that enables filtering the chunks according to the confidence thresholds. The only thing you need to do is provide a dictionary that has the NER labels as keys and the confidence thresholds as values.
Example:
posology_ner = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverterInternal()\
.setInputCols(["sentence","token","ner"])\
.setOutputCol("posology_ner_chunk")
chunk_filterer = ChunkFilterer()\
.setInputCols("sentence","posology_ner_chunk")\
.setOutputCol("chunk_filtered")\
.setFilterEntity("entity")\
.setEntitiesConfidence({"DRUG":0.9,
"FREQUENCY":0.9,
"DOSAGE":0.9,
"DURATION":0.9,
"STRENGTH":0.9})
sample_text = 'The patient was prescribed 1 capsule of Advil for 5 days. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night.'
Detected chunks:
sentence_id | chunks | entities | confidence |
---|---|---|---|
0 | 1 | DOSAGE | 0.99 |
0 | capsule | FORM | 0.99 |
0 | Advil | DRUG | 0.99 |
0 | for 5 days | DURATION | 0.71 |
1 | 40 units | DOSAGE | 0.85 |
1 | insulin glargine | DRUG | 0.83 |
1 | at night | FREQUENCY | 0.81 |
Filtered by confidence scores:
sentence_id | chunks | entities | confidence |
---|---|---|---|
0 | 1 | DOSAGE | 0.99 |
0 | capsule | FORM | 0.99 |
0 | Advil | DRUG | 0.99 |
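The filtering shown in the two tables above can be reproduced with a few lines of plain Python. This is only a sketch of the rule as it appears from the output: a chunk is kept when its label has no threshold (such as `FORM`) or when its confidence reaches the threshold. Whether the annotator compares with `>=` or `>` is an assumption, and the `keep` helper is hypothetical.

```python
# Thresholds copied from the setEntitiesConfidence call in the example above.
thresholds = {"DRUG": 0.9, "FREQUENCY": 0.9, "DOSAGE": 0.9,
              "DURATION": 0.9, "STRENGTH": 0.9}

# (sentence_id, chunk, entity, confidence) rows from the "Detected chunks" table.
detected = [
    (0, "1", "DOSAGE", 0.99),
    (0, "capsule", "FORM", 0.99),
    (0, "Advil", "DRUG", 0.99),
    (0, "for 5 days", "DURATION", 0.71),
    (1, "40 units", "DOSAGE", 0.85),
    (1, "insulin glargine", "DRUG", 0.83),
    (1, "at night", "FREQUENCY", 0.81),
]


def keep(chunk: tuple) -> bool:
    """Keep a chunk if its label has no threshold or its confidence reaches it."""
    _, _, entity, confidence = chunk
    # Assumption: labels missing from the dictionary are never filtered out.
    return confidence >= thresholds.get(entity, 0.0)


filtered = [chunk for chunk in detected if keep(chunk)]
for chunk in filtered:
    print(chunk)
```

The three surviving rows match the "Filtered by confidence scores" table, including the `FORM` chunk that has no threshold of its own.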
New Feature For StructuredDeidentification To Make It Flexible For Different Languages
The new language feature added to `StructuredDeidentification` enhances its flexibility by supporting different languages for deidentification tasks.
Example:
from sparknlp_jsl.structured_deidentification import StructuredDeidentification
obfuscator = StructuredDeidentification(spark,
{"NAME":"PATIENT",
"AGE":"AGE",
"ADDRESS":"LOCATION",
"DOB":"DATE"},
obfuscateRefSource = "faker",
language="de")
obfuscator_df = obfuscator.obfuscateColumns(df)
Original Dataframe:
NAME | DOB | AGE | ADDRESS |
---|---|---|---|
Cecilia Chapman | 04/02/1935 | 83 | 711-2880 Nulla St. Mankato Mississippi |
Iris Watson | 03/10/2009 | 9 | 283 8562 Fusce Rd. Frederick Nebraska |
Bryar Pitts | 11/01/1921 | 98 | 5543 Aliquet St. Fort Dodge GA |
Theodore Lowe | 13/02/2002 | 16 | Ap #867-859 Sit Rd. Azusa New York |
Calista Wise | 20/08/1942 | 76 | 7292 Dictum Av. San Antonio MI |
Obfuscated Result:
NAME | DOB | AGE | ADDRESS |
---|---|---|---|
Giesela Janzen | 19/03/1935 | 86 | Annie-Lübs-Platz 8/0 |
Folker Sonntag | 30/10/2009 | 5 | Georg-Albers-Platz 8/7 |
Matthäus Koch | 13/02/1921 | 99 | Annelore-Schmidt-Straße 6/2 |
Elly Metz | 23/03/2002 | 17 | Klemens-Thanel-Straße 4 |
Friederike Heinrich | 30/09/1942 | 75 | Rita-Süßebier-Weg 550 |
Enhanced ALAB Module With Relation Extraction Model Training Data Preparation Ability Using Document-Level Annotations
To facilitate the preparation of document-level annotated data for training Relation Extraction models, we have introduced a new parameter called `doc_wise_annot` to the `get_relation_extraction_data` method in the ALAB module. When `doc_wise_annot` is set to `True`, the method returns the dataframe with cross-sentence annotations, if they exist. The default value is `False`.
Example:
alab.get_relation_extraction_data(
spark=spark,
input_json_path='alab_demo.json',
ground_truth=True,
...
doc_wise_annot=True
)
Various Core Improvements: Bug Fixes, Enhanced Overall Robustness, And Reliability Of Spark NLP For Healthcare
- Improved Deidentification performance with refactoring
- Updated `clinical_deidentification` pipeline by enhancing the `AGE` entity extraction capability
- Minor corrections have been made to the calculation formulas in the Medicare Risk Adjustment Module
Updated Notebooks And Demonstrations For Making Spark NLP For Healthcare Easier To Navigate And Understand
- New Text Classification with Few Shot Classifier Notebook
- New Voice of Patient Notebook
- New Social Determinant of Health Notebook
- Updated Oncology Notebook for latest models
- New All-In-One Social Determinant of Health Demo
- Updated Medical LLM Demo
- Updated German ICD10GM Resolver Demo
We Have Added And Updated A Substantial Number Of New Clinical Models And Pipelines, Further Solidifying Our Offering In The Healthcare Domain.
- `clinical_notes_qa_base`
- `clinical_notes_qa_large`
- `ner_profiling_vop`
- `ner_profiling_sdoh`
- `ner_profiling_oncology`
- `ner_sdoh_access_to_healthcare`
- `ner_sdoh_community_condition`
- `ner_sdoh_demographics`
- `ner_sdoh_health_behaviours_problems`
- `ner_sdoh_income_social_status`
- `ner_sdoh_social_environment`
- `ner_sdoh_substance_usage`
- `multiclassifierdl_hoc`
- `multiclassifierdl_litcovid`
- `bert_sequence_classifier_patient_urgency`
- `ner_clinical` -> `nl`
- `bert_token_classifier_ner_clinical` -> `nl`
- `robertaresolve_icd10gm` -> `de`
- `icd10gm_resolver_pipeline` -> `de`
- `clinical_deidentification`
- `sbiobert_base_cased_mli_onnx`
For all Spark NLP for Healthcare models, please check: Models Hub Page