Spark NLP for Healthcare Release Notes 5.1.3

 

5.1.3

Highlights

We are delighted to announce a suite of remarkable enhancements and updates in our latest release of Spark NLP for Healthcare. This release comes with the very first Patient Frailty classification as well as 7 new clinical pretrained models and pipelines. It is a testament to our commitment to continuously innovate and improve, furnishing you with a more sophisticated and powerful toolkit for healthcare natural language processing.

  • 3 new augmented NER models by leveraging the capabilities of the LangTest library to boost their robustness significantly
  • New clinical NER models for extracting clinical entities in the Hebrew language
  • New Social Determinants of Healthcare (SDoH) model for patient frailty classification
  • New rule-based ContextualParserModel designed to identify date-of-death (DOD) entities with contextual awareness
  • RelationExtractionModel has undergone significant optimization, resulting in a substantial improvement in inference speed
  • OCR Deidentification Module now supports the deidentification of handwritten or printed text
  • Introducing a customizable random seed algorithm in Spark for enhanced data privacy
  • Various core improvements; bug fixes, enhanced overall robustness and reliability of SparkNLP for Healthcare
    • Some minor naming changes have been made in Deidentification
    • Added new unnormalized date formats into the date faker list in Deidentification annotator
    • Default locale operating system language set as English when starting the session
    • License validation process has been fastened now
    • Fixed the issues in ALAB Module; get_conll_data “missing sentence detector” failure and get_assertion_data method for getting more annotations from annotated JSON file
  • New and updated demos

These enhancements will elevate your experience with Spark NLP for Healthcare, enabling more efficient, accurate, and streamlined healthcare-related natural language data analysis.

3 New Augmented NER Models by Leveraging the Capabilities of the LangTest Library to Boost Their Robustness Significantly

Newly introduced 3 augmented NER models that are powered by the innovative LangTest library. This cutting-edge NLP toolkit is at the forefront of language processing advancements, incorporating state-of-the-art techniques and algorithms to enhance the capabilities of our models significantly.

Model Name Predicted Entities
ner_human_phenotype_gene_clinical_langtest GENE, HP
ner_human_phenotype_go_clinical_langtest GO, HP
ner_risk_factors_langtest CAD, DIABETES, FAMILY_HIST, HYPERLIPIDEMIA, HYPERTENSION, MEDICATION, OBESE, PHI, SMOKER
  • The table below shows the robustness of overall test results for these models.
model names original robustness new robustness
ner_human_phenotype_gene_clinical_langtest 48.79% 82.60%
ner_human_phenotype_go_clinical_langtest 43.48% 85.57%
ner_risk_factors_langtest 89.69% 93.75%

New Clinical Named Entity Recognition (NER) Models for Extracting Clinical Entities in the Hebrew Language

We have a new clinical NER model specifically designed for the Hebrew language. This model excels at identifying clinical entities, enabling the automation of critical clinical data extraction. It proves invaluable for various healthcare-related tasks, such as research, medical record documentation, and other applications within the Hebrew-speaking healthcare sector.

Model Name Predicted Entities Language
ner_clinical PROBLEM TEST TREATMENT he

New Social Determinants of Healthcare (SDoH) Model for Frailty Classification

Introducing a new frailty classification model trained on a diverse dataset and it provides accurate label assignments and confidence scores for its predictions. The primary goal of this model is to categorize text into two key labels: Frail and Non_Frail.

Example:

sequenceClassifier = MedicalBertForSequenceClassification.pretrained("bert_sequence_classifier_sdoh_frailty", "en", "clinical/models")\
    .setInputCols(["document", "token"])\
    .setOutputCol("prediction")

sample_texts=[  
  "Patient demonstrates a marked decrease in muscle strength and endurance, requiring assistance for basic activities.",
  "Clinical evaluation indicates robust health with no signs of physical debilitation.",
  "Noted significant weight loss and diminished muscle mass over the past several months.",
  "Follow-up examinations show complete remission of previous oncological concerns.",
  "The patient exhibits increased susceptibility to skin tears and bruising with minimal contact.",
  "Laboratory results reveal the patient's complete recovery from hepatitis, with normal liver function tests."
]

Result:

text result
Patient demonstrates a marked decrease in muscle strength and endurance, requiring assistance for… Frail
Clinical evaluation indicates robust health with no signs of physical debilitation. Non_Frail
Noted significant weight loss and diminished muscle mass over the past several months. Frail
Follow-up examinations show complete remission of previous oncological concerns. Non_Frail
The patient exhibits increased susceptibility to skin tears and bruising with minimal contact. Frail
Laboratory results reveal the patient’s complete recovery from hepatitis, with normal liver funct… Non_Frail

Please check Social Determinant Sequence Classification Demo

New Rule-Based ContextualParserModel Designed to Identify Date-Of-Death (DOD) Entities With Contextual Awareness

We are releasing a new ContextualParserModel that can extract date-of-death (DOD) entities in clinical texts.

Example:

dod_contextual_parser = ContextualParserModel.pretrained("date_of_death_parser", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("chunk_dod")

sample_text = """
Record date : 2081-01-04
DB : 11.04.1962
DT : 12-03-1978
DOD : 10.25.23

SOCIAL HISTORY:
Jane Doe was born on November 4, 1962, in London, and she got married on April 5, 1979.
When she got pregnant on May 15, 1979, the doctor wanted to verify her date of birth, which was confirmed to be November 4, 1962.
Jane was 45 years old when she sadly passed away on September 25, 2007.

PROCEDURES:
Patient Jane Doe was evaluated on March 15, 1988, for allergies. She was seen by the endocrinology service and was discharged on September 23, 1988.

MEDICATIONS:
1. Coumadin 1 mg daily. Jane's last INR was measured on August 14, 2007, and it was 2.3."""

Result:

sentence_id chunk begin end ner_label
3 10.25.23 64 71 DOD
5 September 25, 2007 360 377 DOD

Please check Model Card and Contextual Parser Rule Based NER Notebook for more information.

RelationExtractionModel Has Undergone Significant Optimization, Resulting In A Substantial Improvement In Inference Speed

Through meticulous optimization efforts, we have significantly improved the RelationExtractionModel annotator, ensuring a notably expedited process for extracting relations between clinical entities.

OCR Deidentification Module Now Supports The Deidentification Of Handwritten Or Printed Text

In this update, a new text_type parameter has been introduced for the ocr_nlp_processor module, allowing users to choose the type of text to be handled and deidentified, supporting all available styles: colored_box, bounding_box as well as highlight. The available text_type values are described as follows:

- `printed`: just the detected printed entities will be deidentified
- `handwritten`: just the detected handwritten text will be deidentified
- `both`: both the detected printed entities and the detected handwritten text will be deidentified

Example:

from sparknlp_jsl.utils.ocr_nlp_processor import ocr_entity_processor

# Bounding Box with a text type parameter
# Handling just handwritten text

path='content/*.pdf'
box = "bounding_box"

ocr_entity_processor(spark=spark,
                    file_path = path,
                    ner_pipeline = nlp_model,
                    chunk_col = "merged_chunk",
                    style = box,
                    save_dir = "bounding_box",
                    label= False,
                    display_result = True,
                    outline_width = 4,
                    outline_color = (42, 170, 138),
                    text_type = "handwritten")    ## Default = "printed"

Please check Spark_OCR_Utility_Module Notebook for more examples

Introducing a Customizable Random Seed Algorithm in Spark for Enhanced Data Privacy

In this release, a new configuration parameter has been introduced in Spark NLP for Healthcare - spark.jsl.settings.seed.numberGenerationAlgorithm, allowing users to select a SecureRandom algorithm for random seed generation. The available algorithms include SHA1PRNG when in obfuscation mode, which is used by DeIdentification for generating fake data. Users now have the flexibility to choose the desired generation algorithm, impacting the quality of fake data, system performance, and potential blocking issues.

The alternative Number Generation Algorithms are NativePRNG, NativePRNGBlocking, NativePRNGNonBlocking, PKCS11, SHA1PRNG, and Windows-PRNG, for more information please see this documentation.

Example:

import sparknlp_jsl

params = {"spark.jsl.settings.seed.numberGenerationAlgorithm": "SHA1PRNG"}

spark = sparknlp_jsl.start(license_keys['SECRET'], params=params)

Various Core Improvements; Bug Fixes, Enhanced Overall Robustness and Reliability of SparkNLP for Healthcare

  • Some minor naming changes have been made in Deidentification: useShifDays() -> useShiftDays()
  • Added new unnormalized date formats into the date faker list in Deidentification annotator
  • Default locale operating system language set as English when starting the session
  • License validation process has been fastened now
  • Fixed the issues in ALAB Module; get_conll_data “missing sentence detector” failure and get_assertion_data method for getting more annotations from annotated JSON file

Updated Notebooks And Demonstrations For making Spark NLP For Healthcare Easier To Navigate And Understand

We Have Added And Updated A Substantial Number Of New Clinical Models And Pipelines, Further Solidifying Our Offering In The Healthcare Domain.

  • ner_clinical -> he
  • date_of_death_parser
  • date_of_birth_parser
  • bert_sequence_classifier_sdoh_frailty
  • ner_human_phenotype_go_clinical_langtest
  • ner_human_phenotype_gene_clinical_langtest
  • ner_risk_factors_langtest

For all Spark NLP for Healthcare models, please check: Models Hub Page

Versions

Last updated