6.2.1
Highlights
We are delighted to announce remarkable enhancements and updates in our latest release of Healthcare NLP. This release makes NER training and clinical document analysis significantly faster and easier, with a major upgrade to Annotation2Training, major improvements and reduced memory usage in MedicalNerApproach training, and new one-liner pretrained pipelines and models for end-to-end clinical document understanding. In addition, this version delivers core robustness improvements, refreshed notebooks and demonstrations, and 5 new and updated clinical models and pipelines to further strengthen Healthcare NLP.
- Major training performance improvements for MedicalNerApproach, including significantly reduced memory usage and faster multi-epoch NER training
- Annotation2Training makes it much easier to prepare high-quality NER training datasets with built-in label filtering, relabeling, CoNLL export, and quick label distribution insights
- Clinical document analysis with one-liner pretrained pipelines for specific clinical tasks and concepts
- Introducing a new city TextMatcher model for extracting city names from clinical text
- New blog post to understand how to deploy Medical LLMs on Databricks
- Various core improvements and bug fixes, enhancing the overall robustness and reliability of Healthcare NLP
- Updated notebooks and demonstrations to make Healthcare NLP easier to navigate and understand
- New Databricks Generative AI to Ner Training Notebook
- Updated Generative AI to Ner Training Notebook
- The addition and update of numerous new clinical models and pipelines continue to reinforce our offering in the healthcare domain
These enhancements will elevate your experience with Healthcare NLP, enabling more efficient, accurate, and streamlined analysis of healthcare-related natural language data.
MedicalNerApproach Training Optimizations: Faster & More Memory-Efficient NER Training
Significant performance upgrades have been introduced to MedicalNerApproach, making NER model training faster, lighter, and more efficient—especially when working with BERT-based embeddings or memory-constrained environments.
- Reduced Memory Usage with BERT-Based Embeddings
Optimized internal handling of output embeddings substantially reduces the peak memory footprint during training. Users can expect up to 2× lower RAM consumption, particularly with transformer-based embeddings.
- Automatic Dataset Caching for Multi-Epoch Training
When using setEnableMemoryOptimizer(true) and running with maxEpoch > 1, input datasets are now automatically cached, resulting in faster epoch transitions and less data reloading overhead.
- TensorFlow Graph Metadata Reuse
The MedicalNerDLGraphChecker now stores TensorFlow graph metadata that can be reused by MedicalNerApproach, reducing redundant initialization work and improving overall training responsiveness.
- Smarter Graph Selection for NER Models
We’ve enhanced the graph selection logic in MedicalNerApproach. It now identifies the most efficient (smallest compatible) TensorFlow graph, resulting in improved performance and lower memory consumption.
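These optimizations plug into the existing training API. A minimal sketch of a training setup that opts into the memory optimizer; column names, epoch count, and the upstream pipeline producing the sentence/token/embeddings columns are placeholders for your own setup:

```python
from sparknlp_jsl.annotator import MedicalNerApproach

# Assumes an upstream pipeline producing "sentence", "token", and
# "embeddings" columns; these names and the epoch count are placeholders.
ner_approach = MedicalNerApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setMaxEpochs(10) \
    .setEnableMemoryOptimizer(True)  # with maxEpochs > 1, input datasets are cached across epochs
```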
Annotation2Training makes it much easier to prepare high-quality NER training datasets with built-in label filtering, relabeling, CoNLL export, and quick label distribution insights
Annotation2Training now turns raw annotation JSON into training-ready NER datasets much faster and with less manual work. With built-in label filtering, on-the-fly relabeling, CoNLL export, and label distribution inspection, you can clean, reshape, and export your data in a few lines of code—making NER training more efficient, shorter to set up, and easier to iterate on.
- CoNLL file generation (generateConll)
Generates a CoNLL-2003 formatted file directly from a Spark NLP NER DataFrame, writing TOKEN -X- -X- LABEL lines with -DOCSTART- document headers and sentence boundaries. Supports local paths and DBFS, and enforces unique document IDs to prevent training issues.
Example:
annotation2training.generateConll(training_df_json, "main.conll")
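For reference, the CoNLL-2003 layout that generateConll emits can be sketched in plain Python; the helper below is illustrative, not part of the Annotation2Training API:

```python
def to_conll(documents):
    """documents: list of sentences; each sentence is a list of (token, label) pairs."""
    lines = ["-DOCSTART- -X- -X- O", ""]   # document header
    for sentence in documents:
        for token, label in sentence:
            lines.append(f"{token} -X- -X- {label}")
        lines.append("")                   # blank line marks a sentence boundary
    return "\n".join(lines)

print(to_conll([[("Emma", "B-PATIENT"), ("Wilson", "I-PATIENT"), ("is", "O")]]))
```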
- Label whitelisting & blacklisting (convertJson2NerDF)
Adds white_list and black_list parameters so you can explicitly include or exclude specific entity labels when converting JSON annotations to a NER DataFrame.
Example:
import os
from sparknlp_jsl.training import Annotation2Training

JSON_PATH = "/content/result.json"

annotation2training = Annotation2Training(spark)
training_df_json = annotation2training.convertJson2NerDF(
    json_path = JSON_PATH,                 # Path to the input JSON file.
    pipeline_model = base_pipeline_model,  # A pretrained Spark NLP PipelineModel that includes at least a DocumentAssembler and a Tokenizer.
    repartition = (os.cpu_count() * 4),    # Number of partitions to use when creating the DataFrame (default is 32).
    token_output_col = "token",            # The name of the column containing token annotations (default is "token").
    ner_label_col = "label",               # The name of the output column for NER labels (default is "label").
    black_list = ["Quit_Attempts"],        # Entity labels to exclude.
    replace_labels = {"Smoking_Type": "Smoking_Status"})  # Relabel on the fly.
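To make the semantics of these parameters concrete, here is an illustrative pure-Python version of the per-token logic. In Spark NLP this runs over DataFrame rows; the function name is hypothetical, and the exact order in which relabeling and filtering are applied may differ in the actual implementation:

```python
def filter_labels(labels, white_list=None, black_list=None, replace_labels=None):
    """Hypothetical mirror of the white_list/black_list/replace_labels parameters."""
    replace_labels = replace_labels or {}
    out = []
    for label in labels:
        label = replace_labels.get(label, label)     # on-the-fly relabeling
        if black_list and label in black_list:
            out.append("O")                          # excluded labels fall back to O
        elif white_list and label not in white_list:
            out.append("O")                          # keep only whitelisted labels
        else:
            out.append(label)
    return out

print(filter_labels(
    ["Smoking_Type", "Quit_Attempts", "Drug"],
    black_list=["Quit_Attempts"],
    replace_labels={"Smoking_Type": "Smoking_Status"}))
# → ['Smoking_Status', 'O', 'Drug']
```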
- Label distribution explorer (showLabelDistributions)
A utility method to quickly display the frequency distribution of entity labels in the training DataFrame. It explodes the label column and prints counts per label, helping you spot class imbalance, missing labels, or noisy annotations before training.
Example:
annotation2training.showLabelDistributions(training_df_json, "label")
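Conceptually, this is an explode-and-count over the label column. A plain-Python equivalent (the function name is illustrative, not part of the API):

```python
from collections import Counter

def label_distribution(label_rows):
    """Explode per-row label lists and count occurrences per label."""
    counts = Counter(label for row in label_rows for label in row)
    for label, count in counts.most_common():
        print(f"{label}\t{count}")
    return counts

rows = [["B-AGE", "O", "B-DATE"], ["O", "B-DATE"]]
label_distribution(rows)  # O and B-DATE appear twice each; B-AGE once
```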
Please check the Generative AI to Ner Training notebook for more details.
Clinical Document Analysis with One-Liner Pretrained-Pipelines for Specific Clinical Tasks and Concepts
We introduce a suite of advanced, hybrid pretrained pipelines, specifically designed to streamline the clinical document analysis process. These pipelines are built upon multiple state-of-the-art (SOTA) pretrained models, delivering a comprehensive solution for quickly extracting vital information.
What sets this release apart is the elimination of complexities typically involved in building and chaining models. Users no longer need to navigate the intricacies of constructing intricate pipelines from scratch or the uncertainty of selecting the most effective model combinations. Our new pretrained pipelines simplify these processes, offering a seamless, user-friendly experience.
| Model Name | Description |
|---|---|
| pp_docwise_benchmark_large_preann | Detect PHI entities in medical texts using Named Entity Recognition (NER). |
| pp_docwise_benchmark_medium_preann | Detect PHI entities in medical texts using Named Entity Recognition (NER). |
| clinical_deidentification_sentwise_benchmark_large | Deidentify PHI information from medical texts. |
| clinical_deidentification_sentwise_benchmark_medium | Deidentify PHI information from medical texts. |
Example:
from johnsnowlabs import nlp, medical
ner_docwise = nlp.PretrainedPipeline("ner_docwise_benchmark_large_preann", "en", "clinical/models")
text = """Dr. John Lee, from Royal Medical Clinic in Chicago, attended to the patient on 11/05/2024.
The patient’s medical record number is 56467890.
The patient, Emma Wilson, is 50 years old, her Contact number: 444-456-7890 ."""
Result:
| chunk | begin | end | ner_label |
|---|---|---|---|
| John Lee | 4 | 11 | DOCTOR |
| Royal Medical Clinic | 19 | 38 | HOSPITAL |
| Chicago | 43 | 49 | CITY |
| 11/05/2024 | 79 | 88 | DATE |
| 56467890 | 130 | 137 | IDNUM |
| Emma Wilson | 153 | 163 | PATIENT |
| 50 years old | 169 | 180 | AGE |
| 444-456-7890 | 203 | 214 | PHONE |
Introducing a new city TextMatcher model for extracting city names from clinical text
A new TextMatcherInternal model from John Snow Labs automatically extracts city names from clinical text, enabling structured geo-location information for downstream analytics and reporting.
Example:
from sparknlp_jsl.annotator import TextMatcherInternalModel

text_matcher = TextMatcherInternalModel.pretrained("city_matcher", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("city_name") \
    .setMergeOverlapping(True)
data = spark.createDataFrame([["""Name: Johnson, Alice, Record date: 2093-03-22, MR: 846275.
Dr. Emily Brown, IP 192.168.1.1.
She is a 55-year-old female who was admitted to the Global Hospital in Los Angeles for hip replacement on 03/22/93.
Patient's VIN: 2HGFA165X8H123456, SSN: 444-55-8888, Driver's license no: C789012D.
Phone: (212) 555-7890, 4321 Oak Street, New York City, USA, E-MAIL: alice.johnson@example.com.
Patient has traveled to Tokyo, Paris, and Sydney in the past year."""]]).toDF("text")
Result:
| chunk | begin | end | label |
|---|---|---|---|
| Los Angeles | 163 | 173 | CITY |
| New York City | 331 | 343 | CITY |
| Tokyo | 410 | 414 | CITY |
| Paris | 417 | 421 | CITY |
| Sydney | 428 | 433 | CITY |
New blog post to understand how to deploy Medical LLMs on Databricks
- Deploying John Snow Labs Medical LLMs on Databricks: Three Flexible Deployment Options
This blog post explores how to securely and efficiently deploy healthcare-focused Large Language Models (LLMs) and Vision Language Models (VLMs) on Databricks. It introduces three flexible deployment options developed in collaboration with Databricks, each balancing performance, cost, and operational control while meeting strict healthcare security and compliance requirements. The post walks through these options using a real customer scenario, illustrating how to run state-of-the-art medical LLMs on Databricks infrastructure and helping teams choose the setup that best fits their clinical workflows and IT constraints.
Updated Notebooks and Demonstrations for Making Healthcare NLP Easier to Navigate and Understand
- New Databricks Generative AI to Ner Training Notebook
- Updated Generative AI to Ner Training Notebook
We Have Added and Updated a Substantial Number of New Clinical Models and Pipelines, Further Solidifying Our Offering in the Healthcare Domain
- pp_docwise_benchmark_large_preann
- pp_docwise_benchmark_medium_preann
- clinical_deidentification_sentwise_benchmark_large
- clinical_deidentification_sentwise_benchmark_medium
- city_matcher
For all Healthcare NLP models, please check: Models Hub Page