Healthcare NLP v6.2.1 Release Notes

 

6.2.1

Highlights

We are delighted to announce remarkable enhancements and updates in our latest release of Healthcare NLP. This release focuses on making NER training and clinical document analysis significantly faster and easier, with a major upgrade to Annotation2Training, reduced memory usage and faster training in MedicalNerApproach, and new one-liner pretrained pipelines and models for end-to-end clinical document understanding. In addition, this version delivers core robustness improvements, refreshed notebooks and demonstrations, and 5 new and updated clinical models and pipelines to further strengthen our Healthcare NLP offering.

  • Major training performance improvements for MedicalNerApproach, including significantly reduced memory usage and faster multi-epoch NER training.
  • Annotation2Training makes it much easier to prepare high-quality NER training datasets with built-in label filtering, relabeling, CoNLL export, and quick label distribution insights
  • Clinical document analysis with one-liner pretrained-pipelines for specific clinical tasks and concepts
  • Introducing a new city TextMatcher model for extracting city names from clinical text
  • New blog post to understand how to deploy Medical LLMs on Databricks
  • Various core improvements and bug fixes, enhancing the overall robustness and reliability of Healthcare NLP
  • Updated notebooks and demonstrations for making Healthcare NLP easier to navigate and understand
  • The addition and update of numerous new clinical models and pipelines continue to reinforce our offering in the healthcare domain

These enhancements will elevate your experience with Healthcare NLP, enabling more efficient, accurate, and streamlined analysis of healthcare-related natural language data.

MedicalNerApproach Training Optimizations: Faster & More Memory-Efficient NER Training

Significant performance upgrades have been introduced to MedicalNerApproach, making NER model training faster, lighter, and more efficient—especially when working with BERT-based embeddings or memory-constrained environments.

  • Reduced Memory Usage with BERT-Based Embeddings
    Optimized internal handling of output embeddings substantially reduces the peak memory footprint during training. Users can expect up to 2× lower RAM consumption, particularly with transformer-based embeddings.

  • Automatic Dataset Caching for Multi-Epoch Training
    When using setEnableMemoryOptimizer(true) and running with maxEpoch > 1, input datasets are now automatically cached, resulting in faster epoch transitions and less data reloading overhead.

  • TensorFlow Graph Metadata Reuse
    The MedicalNerDLGraphChecker now stores TensorFlow graph metadata that can be reused by MedicalNerApproach, reducing redundant initialization work and improving overall training responsiveness.

  • Smarter Graph Selection for NER Models
    We’ve enhanced the graph selection logic in MedicalNerApproach. It now identifies the most efficient (smallest compatible) TensorFlow graph, resulting in improved performance and lower memory consumption.
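
Taken together, these optimizations require no new API; they work through the existing MedicalNerApproach setters. Below is a minimal, illustrative configuration that enables the memory optimizer for multi-epoch training (the column names, hyperparameter values, and the training_df DataFrame are placeholders, not recommendations):

from sparknlp_jsl.annotator import MedicalNerApproach

# Illustrative setup; "sentence", "token", "embeddings", and "label" are the usual
# column names produced by an upstream embeddings pipeline.
ner_approach = MedicalNerApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setMaxEpochs(10) \
    .setBatchSize(64) \
    .setEnableMemoryOptimizer(True)   # caches the dataset across epochs and lowers peak RAM

ner_model = ner_approach.fit(training_df)   # training_df: a CoNLL-style DataFrame with embeddings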

Annotation2Training makes it much easier to prepare high-quality NER training datasets with built-in label filtering, relabeling, CoNLL export, and quick label distribution insights

Annotation2Training now turns raw annotation JSON into training-ready NER datasets much faster and with less manual work. With built-in label filtering, on-the-fly relabeling, CoNLL export, and label distribution inspection, you can clean, reshape, and export your data in a few lines of code—making NER training more efficient, shorter to set up, and easier to iterate on.

  • CoNLL file generation (generateConll)

Generates a CoNLL-2003 formatted file directly from a Spark NLP NER DataFrame, writing TOKEN -X- -X- LABEL lines with -DOCSTART- document headers and sentence boundaries. Supports local paths and DBFS, and enforces unique document IDs to prevent training issues. A sample of the generated format is shown after the examples below.

Example:

annotation2training.generateConll(training_df_json, "main.conll")
  • Label whitelisting & blacklisting (convertJson2NerDF)

Adds white_list and black_list parameters so you can explicitly include or exclude specific entity labels when converting JSON annotations to a NER DataFrame.

Example:

JSON_PATH = "/content/result.json"

import os
from sparknlp_jsl.training import Annotation2Training

annotation2training = Annotation2Training(spark)
training_df_json = annotation2training.convertJson2NerDF(
    json_path = JSON_PATH,                   # Path to the input JSON file.
    pipeline_model = base_pipeline_model,    # A pretrained Spark NLP PipelineModel that includes at least a DocumentAssembler and a Tokenizer.
    repartition = (os.cpu_count() * 4),      # Number of partitions to use when creating the DataFrame (default is 32).
    token_output_col = "token",              # The name of the column containing token annotations (default is "token").
    ner_label_col = "label",                 # The name of the output column for NER labels (default is "label").
    black_list = ["Quit_Attempts"],          # Drop the Quit_Attempts label entirely.
    replace_labels = {"Smoking_Type": "Smoking_Status"})  # Relabel Smoking_Type as Smoking_Status on the fly.
  • Label distribution explorer (showLabelDistributions)

A utility method to quickly display the frequency distribution of entity labels in the training DataFrame. It explodes the label column and prints counts per label, helping you spot class imbalance, missing labels, or noisy annotations before training.

Example:

annotation2training.showLabelDistributions(training_df_json, "label")
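
For reference, the file written by generateConll (first item above) follows the TOKEN -X- -X- LABEL layout with -DOCSTART- headers and blank lines as sentence boundaries. A short, purely illustrative excerpt (tokens and labels are hypothetical):

-DOCSTART- -X- -X- O

The -X- -X- O
patient -X- -X- O
denies -X- -X- O
smoking -X- -X- B-Smoking_Status
. -X- -X- O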

Please check the Generative AI to Ner Training notebook for more details.

Clinical Document Analysis with One-Liner Pretrained-Pipelines for Specific Clinical Tasks and Concepts

We introduce a suite of advanced, hybrid pretrained pipelines, specifically designed to streamline the clinical document analysis process. These pipelines are built upon multiple state-of-the-art (SOTA) pretrained models, delivering a comprehensive solution for quickly extracting vital information.

What sets this release apart is the elimination of complexities typically involved in building and chaining models. Users no longer need to navigate the intricacies of constructing intricate pipelines from scratch or the uncertainty of selecting the most effective model combinations. Our new pretrained pipelines simplify these processes, offering a seamless, user-friendly experience.

Model Name                                           | Description
pp_docwise_benchmark_large_preann                    | Detect PHI entities in medical texts using Named Entity Recognition (NER).
pp_docwise_benchmark_medium_preann                   | Detect PHI entities in medical texts using Named Entity Recognition (NER).
clinical_deidentification_sentwise_benchmark_large   | Deidentify PHI information from medical texts.
clinical_deidentification_sentwise_benchmark_medium  | Deidentify PHI information from medical texts.

Example:

from johnsnowlabs import nlp, medical

ner_docwise = nlp.PretrainedPipeline("ner_docwise_benchmark_large_preann", "en", "clinical/models")

text = """Dr. John Lee, from Royal Medical Clinic in Chicago, attended to the patient on 11/05/2024.
The patient’s medical record number is 56467890.
The patient, Emma Wilson, is 50 years old, her Contact number: 444-456-7890 ."""

result = ner_docwise.fullAnnotate(text)

Result:

chunk                | begin | end | ner_label
John Lee             | 4     | 11  | DOCTOR
Royal Medical Clinic | 19    | 38  | HOSPITAL
Chicago              | 43    | 49  | CITY
11/05/2024           | 79    | 88  | DATE
56467890             | 130   | 137 | IDNUM
Emma Wilson          | 153   | 163 | PATIENT
50 years old         | 169   | 180 | AGE
444-456-7890         | 203   | 214 | PHONE
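
The deidentification pipelines listed in the table above follow the same one-liner pattern. A minimal sketch, reusing the text snippet from the previous example (the pipeline name comes from the table; fullAnnotate is the standard PretrainedPipeline call):

from johnsnowlabs import nlp

deid_pipeline = nlp.PretrainedPipeline(
    "clinical_deidentification_sentwise_benchmark_large", "en", "clinical/models")

deid_result = deid_pipeline.fullAnnotate(text)   # returns annotations including the deidentified text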

Introducing a new city TextMatcher model for extracting city names from clinical text

A new TextMatcherInternal model from John Snow Labs automatically extracts city names from clinical text, enabling structured geo-location information for downstream analytics and reporting.

Example:

text_matcher = TextMatcherInternalModel.pretrained("city_matcher","en","clinical/models") \
    .setInputCols(["sentence", "token"])\
    .setOutputCol("city_name")\
    .setMergeOverlapping(True)

data = spark.createDataFrame([["""Name: Johnson, Alice, Record date: 2093-03-22, MR: 846275.
Dr. Emily Brown, IP 192.168.1.1.
She is a 55-year-old female who was admitted to the Global Hospital in Los Angeles for hip replacement on 03/22/93.
Patient's VIN: 2HGFA165X8H123456, SSN: 444-55-8888, Driver's license no: C789012D.
Phone: (212) 555-7890, 4321 Oak Street, New York City, USA, E-MAIL: alice.johnson@example.com.
Patient has traveled to Tokyo, Paris, and Sydney in the past year."""]]).toDF("text")
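
The matcher consumes sentence and token annotations, so it runs inside a standard Spark NLP pipeline. A minimal, illustrative setup (the DocumentAssembler / SentenceDetector / Tokenizer stages and column names below are the usual defaults, not part of the model itself):

from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer

# Standard upstream stages that produce the "sentence" and "token" columns the matcher expects.
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence_detector = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
tokenizer = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")

pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, text_matcher])
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(city_name) AS city").show(truncate=False)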

Result:

chunk         | begin | end | label
Los Angeles   | 163   | 173 | CITY
New York City | 331   | 343 | CITY
Tokyo         | 410   | 414 | CITY
Paris         | 417   | 421 | CITY
Sydney        | 428   | 433 | CITY

New blog post to understand how to deploy Medical LLMs on Databricks

  • Deploying John Snow Labs Medical LLMs on Databricks: Three Flexible Deployment Options
    This blog post explores how to securely and efficiently deploy healthcare-focused Large Language Models (LLMs) and Vision Language Models (VLMs) on Databricks. It introduces three flexible deployment options developed in collaboration with Databricks, each balancing performance, cost, and operational control while meeting strict healthcare security and compliance requirements. The post walks through these options using a real customer scenario, illustrating how to run state-of-the-art medical LLMs on Databricks infrastructure and helping teams choose the setup that best fits their clinical workflows and IT constraints.

Updated Notebooks and Demonstrations for Making Healthcare NLP Easier to Navigate and Understand

We have added and updated a substantial number of new clinical models and pipelines, further solidifying our offering in the healthcare domain.

  • pp_docwise_benchmark_large_preann
  • pp_docwise_benchmark_medium_preann
  • clinical_deidentification_sentwise_benchmark_large
  • clinical_deidentification_sentwise_benchmark_medium
  • city_matcher

For all Healthcare NLP models, please check: Models Hub Page
