Healthcare NLP v6.4.1 Release Notes

 

6.4.1

Highlights

We are delighted to announce notable enhancements and updates in Healthcare NLP 6.4.1. The headline addition is LLM-Based Structured Clinical Entity Extraction module, an end-to-end LLM-powered entity extraction annotator for healthcare and clinical text, built on top of the MedicalLLM models. This module enables unified extraction of structured clinical entities directly from unstructured medical text, streamlining clinical NLP workflows by removing the need for task-specific pipelines.

This release also introduces important improvements in CDA de-identification through enhancements to the CDA DeIdentification module, featuring a table-aware, dual-model de-identification framework designed to jointly handle structured tabular data and unstructured clinical text, improving PHI detection accuracy in complex real-world CDA documents.

In addition, StructuredDeIdentification has been extended with native support for TIMESTAMP and TIMESTAMP_WITH_TIMEZONE entities, enabling deterministic time shifting in structured medical tables with configurable seconds-based offsets and flexible timestamp format parsing.

This release also delivers a comprehensive UMLS 2026AA Metathesaurus refresh, encompassing 7 updated Entity Resolver models and an expanded ChunkMapper suite of 36 models — including 17 new mappers extending coverage to 8 additional medical coding systems — enabling broader and more current clinical concept mapping across standardized terminologies.

This release is further complemented by new benchmarking results and applied research contributions across clinical NLP and LLM-based healthcare extraction. Internal benchmarks demonstrate improved domain-specific PHI detection performance and CPU efficiency compared to general-purpose privacy filters. New notebooks introduce end-to-end workflows for LLM-based structured extraction, oncology entity recognition, and large-scale Spark NLP benchmarking on Databricks, leveraging MedicalLLM models and Medallion architecture (a bronze/silver/gold lakehouse pattern) for production-grade Healthcare NLP pipelines.

  • LLM-Based Structured Clinical Entity Extraction with Constrained Decoding
  • Table-Aware, Free-Text Extended CDA De-Identification with Improved PHI Accuracy in CDA Tables
  • Pretrained Zero-Shot Multi-Task Named Entity Recognition (NER) Speed Comparison on GPU vs CPU Benchmark
  • StructuredDeIdentification – Timestamp Support & Time Shift Enhancement
  • Updated 7 UMLS Entity Resolver Models and 19 ChunkMapper Models to the UMLS 2026AA Metathesaurus and Introduced 17 New ChunkMapper Models Covering 8 Additional Medical Coding Systems
  • New Blog Posts & Technical Deep Dives
  • Updated notebooks and demonstrations for making Healthcare NLP easier to navigate and understand
  • The addition and update of numerous new clinical models and pipelines continue to reinforce our offering in the healthcare domain

These enhancements will elevate your experience with Healthcare NLP, enabling more efficient, accurate, and streamlined analysis of healthcare-related natural language data.

LLM-Based Structured Clinical Entity Extraction with Constrained Decoding

MedicalLLMEntityExtractor is an end-to-end LLM-powered entity extraction annotator for healthcare and clinical text, built on top of the MedicalLLM models. It leverages GGUF-format Large Language Models (via llama.cpp backend) and enforces strict structured Spark NLP outputs using BNF grammars, ensuring deterministic and schema-compliant extraction results.

The annotator follows:

  • Few-shot prompting for domain adaptation
  • Constrained decoding via grammar rules (BNF)
  • Post-processing alignment using string matching for accurate character offsets (begin, end)

This enables reliable extraction of clinically relevant entities such as DRUG, DOSAGE, ROUTE, FREQUENCY, DURATION, and other configurable entity types, with precise span alignment back to the source document.

Unlike traditional NER models, MedicalLLMEntityExtractor is model-agnostic within the Medical LLM ecosystem, meaning it can load and run any compatible Medical LLM annotator model (GGUF-based) from the Healthcare NLP model hub. Users can simply select a pretrained model, configure inference parameters, and define extraction schema dynamically.

This makes it suitable for:

  • Rapid prototyping of clinical extraction pipelines
  • Custom entity schema design without retraining
  • Replacement or augmentation of traditional NER pipelines

Pretrained Model Example

MedicalLLMEntityExtractor can be used with any compatible MedicalLLM model.

Model Name Description
jsl_meds_4b_q16_v5 General-purpose medical LLM (4B quantized) optimized for clinical entity extraction, summarization, and structured information retrieval tasks.

Example Pipeline

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

entity_extractor = MedicalLLMEntityExtractor.pretrained(
    "jsl_meds_4b_q16_v5", "en", "clinical/models"
) \
    .setInputCols(["document"]) \
    .setOutputCol("entities") \
    .setNGpuLayers(99) \
    .setNCtx(4096) \
    .setNPredict(500) \
    .setTemperature(0.1) \
    .setBatchSize(4) \
    .setTopK(40) \
    .setTopP(0.9)

Few-Shot Configuration Example

medication_few_shot = [
    (
        "Patient prescribed amoxicillin 500mg PO TID for 10 days.",
        '{"extractions": [{"entity": "DRUG", "text": "amoxicillin"}, '
        '{"entity": "DOSAGE", "text": "500mg"}, '
        '{"entity": "ROUTE", "text": "PO"}, '
        '{"entity": "FREQUENCY", "text": "TID"}, '
        '{"entity": "DURATION", "text": "10 days"}]}'
    ),
    (
        "Vancomycin 1.25g IV Q12H x 7 days for MRSA bacteremia.",
        '{"extractions": [{"entity": "DRUG", "text": "Vancomycin"}, '
        '{"entity": "DOSAGE", "text": "1.25g"}, '
        '{"entity": "ROUTE", "text": "IV"}, '
        '{"entity": "FREQUENCY", "text": "Q12H"}, '
        '{"entity": "DURATION", "text": "7 days"}]}'
    ),
    (
        "Metformin 1000mg PO twice daily with meals and lisinopril 10mg PO once daily.",
        '{"extractions": [{"entity": "DRUG", "text": "Metformin"}, '
        '{"entity": "DOSAGE", "text": "1000mg"}, '
        '{"entity": "ROUTE", "text": "PO"}, '
        '{"entity": "FREQUENCY", "text": "twice daily"}, '
        '{"entity": "DRUG", "text": "lisinopril"}, '
        '{"entity": "DOSAGE", "text": "10mg"}, '
        '{"entity": "ROUTE", "text": "PO"}, '
        '{"entity": "FREQUENCY", "text": "once daily"}]}'
    ),
]

entity_extractor.setFewShotExamples(medication_few_shot)

Entity Schema Definition

entity_extractor.setEntityTypes([
  "DRUG::The exact medication or drug name as written in the text. Examples: aspirin, metformin, vancomycin.",
  "DOSAGE::The exact dose amount including units. Examples: 500mg, 1.25g, 40mg, 0.1 mcg/kg/min.",
  "ROUTE::The administration route. Examples: PO, IV, SQ, IM, inhaled, sublingual, topical.",
  "FREQUENCY::The dosing schedule or timing. Examples: BID, TID, Q12H, daily, PRN, twice daily.",
  "DURATION::The treatment duration. Examples: 7 days, 6 weeks, 3 months, until follow-up.",
]) 

Result:

+-----+-------------+---------+
|begin|chunk        |label    |
+-----+-------------+---------+
|26   |Metformin    |DRUG     |
|36   |1000mg       |DOSAGE   |
|43   |PO           |ROUTE    |
|46   |BID          |FREQUENCY|
|65   |Lisinopril   |DRUG     |
|76   |10mg         |DOSAGE   |
|81   |PO           |ROUTE    |
|84   |daily        |FREQUENCY|
|94   |Atorvastatin |DRUG     |
|107  |40mg         |DOSAGE   |
|112  |PO           |ROUTE    |
|115  |at bedtime   |FREQUENCY|
|130  |Aspirin      |DRUG     |
|138  |81mg         |DOSAGE   |
|143  |PO           |ROUTE    |
|146  |daily        |FREQUENCY|
|156  |Empagliflozin|DRUG     |
|170  |10mg         |DOSAGE   |
+-----+-------------+---------+

Table-Aware, Free-Text Extended CDA De-Identification with Improved PHI Accuracy in CDA Tables

CDADeIdentification is a CDA (Clinical Document Architecture) de-identification transformer for Healthcare NLP pipelines, designed to anonymize both structured XML nodes and embedded free-text narratives such as section.text using Healthcare NLP models and pretrained de-identification pipelines.

The new generation of CDADeIdentification introduces a context-aware table processing mechanism with header-aware semantics, where structured elements such as HTML tables and definition lists are no longer treated as isolated text nodes. Instead, cell values are enriched with their corresponding headers (e.g., "header : value") before being sent to the de-identification pipeline, enabling significantly better entity recognition in compact clinical fields.

This approach substantially improves PHI detection accuracy in structured clinical tables, particularly in cases where traditional de-identification methods fail due to the absence of contextual signals in short text cells (e.g., lab values, patient attributes, billing records).

By combining:

  • header-aware table parsing
  • structured XML path de-identification
  • free-text NLP pipeline integration

CDADeIdentification achieves higher recall and more consistent PHI masking across mixed structured and unstructured CDA documents, while preserving the original clinical document structure and readability.

New Pretrained Models

Model Name Description
cda_deidentification_patient De-identifies patient-related CDA XML paths including identifiers, demographics, and structured patient metadata while preserving CDA schema integrity.
cda_deidentification_extend_free_text Extends de-identification to CDA free-text sections (e.g., section.text, narrative blocks) by applying a full Healthcare NLP de-identification pipeline to unstructured clinical content.

Table-Aware De-Identification (Major Enhancement)

Table Handling Control

With setTableHandling(True) (default behavior), CDA tables are processed using semantic header-value pairing instead of independent node processing.

  • How it works:

Instead of treating each <td> independently, the system constructs:

"<header> : <cell value>"

and sends it through the Healthcare NLP de-identification pipeline.

Free-Text Exclusion Control

Parameter Description
setExcludeFreeTextTags(list) Defines XML tags whose subtree is fully excluded from NLP processing (no tokenization or inference). Useful for preserving footnotes, superscripts, or custom clinical metadata blocks.

Backward Compatibility

  • Disable table-aware processing:
    .setTableHandling(False)
    

Pipeline Integration Example

deid_pipeline = PretrainedPipeline(
    "clinical_deidentification_docwise_benchmark_medium_v2",
    "en",
    "clinical/models"
)

cda_deidentification = (
    CdaDeIdentification
    .pretrained("cda_deidentification_extend_free_text", "en", "clinical/models")
    .setSeed(42)
    .setDays(2)
    .setTableHandling(True)
    .setPipeline(spark, deid_pipeline, "masked")
)

Example Input

<?xml version="1.0" encoding="UTF-8"?><ClinicalDocument xmlns="urn:hl7-org:v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:hl7-org:v3 CDA.xsd">
    <!-- ===================== BODY ===================== -->
    <component>
        <structuredBody>
            <component>
                <section>
                    <code code="10164-2" codeSystem="2.16.840.1.113883.6.1" displayName="History of Present Illness"/>
                    <title>History of Present Illness</title>
                    <text>
                      <table>
                        <thead>
                          <tr>
                            <th>Patient Name</th>
                            <th>Patient ID</th>
                            <th>Primary Diagnosis</th>
                            <th>Date of Birth</th>
                            <th>Contact Number</th>
                            <th>Home Address</th>
                          </tr>
                        </thead>
                        <tbody>
                          <tr>
                            <td>Michael Thompson</td>
                            <td>458921</td>
                            <td>Type 2 Diabetes Mellitus</td>
                            <td>08/14/1975</td>
                            <td>+1 212-555-0198</td>
                            <td>45 Park Avenue, New York, NY 10016</td>
                          </tr>
                          <tr>
                            <td>Emily Rodriguez</td>
                            <td>771204</td>
                            <td>Hypertension</td>
                            <td>11/22/1980</td>
                            <td>+1 415-555-0133</td>
                            <td>78 Sunset Blvd, Los Angeles, CA 90028</td>
                          </tr>
                          <tr>
                            <td>David Chen</td>
                            <td>993817</td>
                            <td>Chronic Kidney Disease</td>
                            <td>03/09/1972</td>
                            <td>+1 646-555-0177</td>
                            <td>1200 Market Street, San Francisco, CA 94102</td>
                          </tr>
                          <tr>
                            <td>Sophia Patel</td>
                            <td>224581</td>
                            <td>Asthma</td>
                            <td>07/30/1990</td>
                            <td>+1 770-900-0455</td>
                            <td>22 Baker Street, London NW1 6XE</td>
                          </tr>
                          <tr>
                            <td>James Wilson</td>
                            <td>665432</td>
                            <td>Coronary Artery Disease</td>
                            <td>01/18/1965</td>
                            <td>+61 412 345 678</td>
                            <td>89 George Street, Sydney NSW 2000</td>
                          </tr>
                        </tbody>
                      </table>
                    </text>
                </section>
            </component>
        </structuredBody>
    </component>
</ClinicalDocument>"""

Result:

<?xml version="1.0" encoding="UTF-8" standalone="no"?><ClinicalDocument xmlns="urn:hl7-org:v3" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:hl7-org:v3 CDA.xsd">
    <!-- ===================== BODY ===================== -->
    <component>
        <structuredBody>
            <component>
                <section>
                    <code code="10164-2" codeSystem="2.16.840.1.113883.6.1" displayName="History of Present Illness"/>
                    <title>History of Present Illness</title>
                    <text>
                      <table>
                        <thead>
                          <tr>
                            <th>Patient Name</th>
                            <th>Patient ID</th>
                            <th>Primary Diagnosis</th>
                            <th>Date of Birth</th>
                            <th>Contact Number</th>
                            <th>Home Address</th>
                          </tr>
                        </thead>
                        <tbody>
                          <tr>
                            <td>PATIENT</td>
                            <td>IDNUM</td>
                            <td>Type 2 Diabetes Mellitus</td>
                            <td>DATE</td>
                            <td>PHONE</td>
                            <td>STREET, STATE, STATE ZIP</td>
                          </tr>
                          <tr>
                            <td>PATIENT</td>
                            <td>IDNUM</td>
                            <td>Hypertension</td>
                            <td>DATE</td>
                            <td>PHONE</td>
                            <td>STREET, CITY, STATE ZIP</td>
                          </tr>
                          <tr>
                            <td>PATIENT</td>
                            <td>IDNUM</td>
                            <td>Chronic Kidney Disease</td>
                            <td>DATE</td>
                            <td>PHONE</td>
                            <td>STREET, CITY, STATE ZIP</td>
                          </tr>
                          <tr>
                            <td>PATIENT</td>
                            <td>IDNUM</td>
                            <td>Asthma</td>
                            <td>DATE</td>
                            <td>PHONE</td>
                            <td>STREET, CITY ZIP</td>
                          </tr>
                          <tr>
                            <td>PATIENT</td>
                            <td>IDNUM</td>
                            <td>Coronary Artery Disease</td>
                            <td>DATE</td>
                            <td>PHONE</td>
                            <td>STREET, CITY STATE ZIP</td>
                          </tr>
                        </tbody>
                      </table>
                    </text>
                </section>
            </component>
        </structuredBody>
    </component>
</ClinicalDocument>

Pretrained Zero-Shot Multi-Task Named Entity Recognition (NER) Speed Comparison on GPU vs CPU Benchmark

We benchmarked the PretrainedZeroShotMultiTask architecture using the zeroshot_multitask_base model on a dataset of ~500 tokens per row.

Hardware setup:

  • CPU: 8 cores, 52 GB System RAM
  • GPU: NVIDIA T4, 24 GB VRAM

The workload was tested on two dataset sizes (1k rows and 100 rows), each with 48 repartitions.

Spark NLP Pipeline

pipeline = Pipeline(
    stages = [
        document_assembler,
        sentence_detector,
        pretrained_zeroshot_multitask
])

Summary

  • The GPU significantly reduced wall time compared to CPU.
  • On the 1k dataset, GPU reduced total runtime from 8h 45m → 42m (~12× faster).
  • On the 100-row dataset, GPU reduced runtime from 51m → 4m (~12× faster).
  • GPU acceleration scales consistently across dataset sizes.

Benchmark Results

Hardware Dataset Size Repartition CPU Time (user+sys) Wall Time
CPU (8 core, 52GB RAM) 1k rows (~500 tokens) 48 5.93 s 8h 45m 46s
CPU (8 core, 52GB RAM) 100 rows (~500 tokens) 48 625 ms 51m 43s
GPU (NVIDIA T4, 24GB) 1k rows (~500 tokens) 48 400 ms 42m 35s
GPU (NVIDIA T4, 24GB) 100 rows (~500 tokens) 48 52 ms 4m 12s

StructuredDeIdentification – Timestamp Support & Time Shift Enhancement

Overview

The StructuredDeIdentification module has been enhanced to support timestamp-based de-identification for structured medical tables, enabling more realistic temporal obfuscation in clinical datasets.

This improvement allows clients to safely shift temporal information in structured healthcare data while preserving schema consistency and analytical usability.

Timestamp Entity Support

StructuredDeIdentification now supports the following timestamp entity types:

  • TIMESTAMP
  • TIMESTAMP_WITH_TIMEZONE

This enables time-based obfuscation for structured medical records such as admissions, lab events, prescriptions, and encounter logs.

Time Shifting Capability

Timestamp values can now be shifted forward or backward in time using a configurable offset.

This is particularly useful for:

  • De-identifying medical timelines
  • Preserving temporal relationships without exposing real dates
  • Enabling synthetic cohort generation with realistic time progression

New Parameters

timeStampFormats: list[str]

Optional list of timestamp patterns used to parse string-based TIMESTAMP or TIMESTAMP_WITH_TIMEZONE entities.

  • If provided, formats are evaluated sequentially in order
  • The first matching format is used for both parsing and formatting
  • If empty, a predefined set of common timestamp formats is applied automatically

Example formats:

  • "yyyy-MM-dd HH:mm:ss"
  • "yyyy-MM-dd'T'HH:mm:ss"
  • "yyyy-MM-dd HH:mm:ss.SSS"

seconds: int

Defines the number of seconds used to shift timestamp entities during obfuscation. Seconds-based offset provides the finest granularity while supporting all larger units via conversion.

  • Positive values shift timestamps forward in time
  • Negative values shift timestamps backward in time
  • Default value: 0 (no shift applied)

Medical Use Case Enhancement

This feature significantly improves support for clinical structured tables, where timestamp fields such as:

  • Admission time
  • Discharge time
  • Medication administration time
  • Lab sampling time

must be preserved structurally while being anonymized.

Example Usage

from sparknlp_jsl.structured_deidentification import StructuredDeIdentification

data = [
    (1001, "2026-06-01 08:15:22"),
    (1002, "2026-06-01 14:42:10"),
    (1003, "N/A"),
    (1004, "2026-06-02 18:27:31"),
    (1005, "2026-06-03 11:58:04"),
]

df = spark.createDataFrame(data, ["patient_id", "admission_timestamp"])

obfuscator = StructuredDeIdentification(
    spark=spark,
    columns={"admission_timestamp": "TIMESTAMP"},
    obfuscateRefSource="faker",
    timeStampFormats=["yyyy-MM-dd HH:mm:ss"],
    seconds=43200
)

result = obfuscator.obfuscateColumns(
    df,
    outputAsArray=False,
    overwrite=False,
    suffix="_deid"
)

result.show(truncate=False)

Result:

+----------+-------------------+------------------------+
|patient_id|admission_timestamp|admission_timestamp_deid|
+----------+-------------------+------------------------+
|1001      |2026-06-01 08:15:22|2026-06-01 20:15:22     |
|1002      |2026-06-01 14:42:10|2026-06-02 02:42:10     |
|1003      |N/A                |N/A                     |
|1004      |2026-06-02 18:27:31|2026-06-03 06:27:31     |
|1005      |2026-06-03 11:58:04|2026-06-03 23:58:04     |
+----------+-------------------+------------------------+

Updated 7 UMLS Entity Resolver Models and 19 ChunkMapper Models to the UMLS 2026AA Metathesaurus and Introduced 17 New ChunkMapper Models Covering 8 Additional Medical Coding Systems

We are delivering a full refresh of our UMLS model suite, now trained on the UMLS 2026AA Metathesaurus. This includes updated versions of all 7 UMLS Entity Resolver models and a significantly expanded ChunkMapper suite — 19 updated models retrained on 2026AA data and 17 new models covering 8 additional medical coding systems.

Updated UMLS 2026AA Entity Resolver Models

These models map clinical entities to UMLS Concept Unique Identifiers (CUI) using sbiobert_base_cased_mli_onnx sentence embeddings (ONNX, CPU-compatible), retrained on the UMLS 2026AA Metathesaurus. The biolordresolve_umls_general_concepts model uses mpnet_embeddings_biolord_2023_c embeddings.

Model Name Description
sbiobertresolve_umls_findings Maps clinical finding entities to their corresponding UMLS CUI codes
sbiobertresolve_umls_clinical_drugs Maps drug entities to UMLS CUI codes
sbiobertresolve_umls_disease_syndrome Maps clinical entities (“Disease or Syndrome”) to UMLS CUI codes
sbiobertresolve_umls_drug_substance Maps drug and substance entities to UMLS CUI codes
sbiobertresolve_umls_general_concepts Maps clinical entities and concepts to the following 4 UMLS CUI code categories: Disease, Symptom, Medication and Procedure
sbiobertresolve_umls_major_concepts Maps clinical entities and concepts to 4 major categories of UMLS CUI codes: Clinical Findings, Medical Devices, Anatomical Structures, Injuries & Poisoning terms
biolordresolve_umls_general_concepts Maps clinical entities to 4 UMLS CUI code categories using mpnet_embeddings_biolord_2023_c embeddings: Disease, Symptom, Medication, and Procedure

Updated UMLS 2026AA ChunkMapper Models (19 Models)

These models are retrained on UMLS 2026AA data. The bidirectional code mappers use a DocumentAssembler → Doc2Chunk → ChunkMapper pipeline. The NER-based mappers apply a full NER pipeline before mapping.

⚠️ CPT mapper models are available only to users with a valid AMA license. Contact support@johnsnowlabs.com for access.

UMLS → Code Code → UMLS Coding System
umls_rxnorm_mapper rxnorm_umls_mapper RxNorm
umls_snomed_mapper snomed_umls_mapper SNOMED CT (US Edition)
umls_loinc_mapper loinc_umls_mapper LOINC
umls_mesh_mapper mesh_umls_mapper MeSH
umls_icd10cm_mapper icd10cm_umls_mapper ICD-10-CM
umls_hpo_mapper hpo_umls_mapper Human Phenotype Ontology
umls_cpt_mapper ⚠️ cpt_umls_mapper ⚠️ CPT
Model Name NER Model Description
umls_clinical_findings_mapper ner_clinical_large Maps clinical findings (PROBLEM, TEST, TREATMENT) to UMLS CUI codes
umls_disease_syndrome_mapper ner_clinical_large Maps disease and syndrome entities to UMLS CUI codes
umls_drug_substance_mapper ner_posology_greedy Maps drug substance entities to UMLS CUI codes
umls_clinical_drugs_mapper ner_posology_greedy Maps clinical drug entities to UMLS CUI codes
umls_major_concepts_mapper ner_medmentions_coarse Maps body parts, devices, injuries, and findings to UMLS CUI codes

New UMLS 2026AA ChunkMapper Models (17 Models)

These models are introduced for the first time, covering 8 additional medical coding systems and one new NER-based domain.

⚠️ MedDRA mapper models are available only to users with a valid license. Contact support@johnsnowlabs.com for access.

UMLS → Code Code → UMLS Coding System
umls_icd10pcs_mapper icd10pcs_umls_mapper ICD-10-PCS
umls_nci_mapper nci_umls_mapper NCI Thesaurus
umls_icd9cm_mapper icd9cm_umls_mapper ICD-9-CM
umls_hgnc_mapper hgnc_umls_mapper HGNC Gene Nomenclature
umls_atc_mapper atc_umls_mapper WHO ATC
umls_hcpcs_mapper hcpcs_umls_mapper HCPCS
umls_snomedvet_mapper snomedvet_umls_mapper SNOMED CT Veterinary
umls_meddra_mapper ⚠️ meddra_umls_mapper ⚠️ MedDRA
Model Name NER Model Description
umls_general_concepts_mapper ner_clinical Maps general clinical concepts (Disease, Symptom, Device, Procedure) to UMLS CUI codes

Example (entity resolution — sbiobertresolve_umls_disease_syndrome):

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner_jsl")

ner_converter = NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner_jsl"]) \
    .setOutputCol("ner_chunk") \
    .setWhiteList(["Disease_Syndrome_Disorder", "Symptom"])

chunk2doc = Chunk2Doc() \
    .setInputCols("ner_chunk") \
    .setOutputCol("ner_chunk_doc")

sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli_onnx", "en", "clinical/models") \
    .setInputCols(["ner_chunk_doc"]) \
    .setOutputCol("sbert_embeddings") \
    .setCaseSensitive(False)

resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_umls_disease_syndrome", "en", "clinical/models") \
    .setInputCols(["sbert_embeddings"]) \
    .setOutputCol("resolution") \
    .setDistanceFunction("EUCLIDEAN")

pipeline = Pipeline(stages=[
    documentAssembler, sentenceDetector, tokenizer, word_embeddings,
    ner_model, ner_converter, chunk2doc, sbert_embedder, resolver
])

data = spark.createDataFrame([[
    "The patient has a history of systemic lupus erythematosus, multiple sclerosis, and fibromyalgia. "
    "She was admitted with sepsis secondary to bacterial pneumonia and developed acute respiratory distress syndrome. "
    "Imaging showed findings consistent with pulmonary sarcoidosis and Crohn's disease."
]]).toDF("text")

result = pipeline.fit(data).transform(data)

Result

| ner_chunk | entity | umls_code | resolution | all_k_results | all_k_distances | all_k_cosine_distances | all_k_resolutions | |—|—|—|—|—|—|—|—| | systemic lupus erythematosus | Disease_Syndrome_Disorder | C0024141 | systemic lupus erythematosus | C0024141:::C0409974:::C0024137:::C1274838:::C6022675:::C0409977:::C0409976:::C07… | 0.0067:::3.4325:::4.0055:::4.4309:::4.4907:::4.5642:::4.5922:::4.6192:::4.6675::… | 0.0000:::0.0184:::0.0251:::0.0308:::0.0317:::0.0322:::0.0328:::0.0336:::0.0334::… | systemic lupus erythematosus:::lupus erythematosus:::cutaneous lupus erythematos… | | sclerosis | Disease_Syndrome_Disorder | C0036412 | sclera disease | C0036412:::C0263009:::C0036421:::C0007795:::C0237854:::C0004712:::C0036416:::C00… | 6.4563:::6.5690:::6.7129:::6.8109:::6.8550:::6.9680:::7.2263:::7.5483:::7.5862::… | 0.0689:::0.0695:::0.0752:::0.0738:::0.0762:::0.0792:::0.0860:::0.0915:::0.0957::… | sclera disease:::sclerosis skin:::system; sclerosis:::diffuse sclerosis:::sclero… | | fibromyalgia | Disease_Syndrome_Disorder | C0016053 | fibromyalgia | C0016053:::C0751153:::C0751152:::C0015674:::C4703320 | 0.0070:::3.9297:::4.7923:::6.1085:::7.4228 | 0.0000:::0.0242:::0.0364:::0.0572:::0.0843 | fibromyalgia:::secondary fibromyalgia:::fibromyalgia primary:::chronic fatigue-f… | | sepsis | Disease_Syndrome_Disorder | C0036690 | sepsis | C0036690:::C3164780:::C0242966:::C0152965:::C1141927:::C0684256:::C1141926:::C17… | 0.0084:::4.0589:::4.4404:::4.9625:::5.2109:::5.7252:::5.8590:::5.9812:::5.9983::… | 0.0000:::0.0260:::0.0316:::0.0396:::0.0428:::0.0522:::0.0539:::0.0546:::0.0564::… | sepsis:::clinical sepsis:::syndrome sepsis:::staph sepsis:::wound sepsis:::sepsi… | | bacterial pneumonia | Disease_Syndrome_Disorder | C0004626 | bacterial pneumonia | C0004626:::C0339952:::C0276523:::C0339951:::C1443238:::C0264386:::C0155860:::C05… | 0.0078:::4.8777:::6.3237:::6.3566:::6.6580:::6.8243:::6.8403:::6.8537:::6.8800::… | 0.0000:::0.0377:::0.0640:::0.0656:::0.0706:::0.0734:::0.0762:::0.0770:::0.0777::… | bacterial pneumonia:::bacterial pneumonia secondary:::aids with bacterial pneumo… | | respiratory distress syndrome | Disease_Syndrome_Disorder | C0035220 | respiratory distress syndrome | C0035220:::C0852283:::C0035222:::C0158940:::C0877339:::C5420230:::C3810183:::C54… | 0.0062:::4.0045:::4.1765:::5.1745:::6.0413:::6.2379:::6.2459:::6.4251:::6.4578::… | 0.0000:::0.0239:::0.0259:::0.0397:::0.0535:::0.0576:::0.0574:::0.0608:::0.0607::… | respiratory distress syndrome:::respiratory distress syndromes:::acquired respir… | | pulmonary sarcoidosis | Disease_Syndrome_Disorder | C0036205 | pulmonary sarcoidosis | C0036205:::C0036202:::C0406396:::C1302844:::C0396073:::C0036206:::C0340201:::C13… | 0.0074:::4.8495:::5.0965:::5.2098:::5.3056:::5.3212:::5.4128:::5.4719:::5.5281::… | 0.0000:::0.0373:::0.0412:::0.0433:::0.0452:::0.0452:::0.0470:::0.0479:::0.0488::… | pulmonary sarcoidosis:::sarcoidosis:::nodular sarcoidosis:::skin sarcoidosis:::l… | | Crohn’s disease | Disease_Syndrome_Disorder | C0010346 | crohn’s disease | C0010346:::C0399497:::C0156147:::C1301260:::C0941042:::C1301261:::C5686651:::C60… | 0.0071:::4.6627:::5.2420:::5.2609:::5.6188:::5.6324:::5.6705:::5.6709:::6.0026 | 0.0000:::0.0340:::0.0424:::0.0428:::0.0487:::0.0486:::0.0508:::0.0501:::0.0560 | crohn’s disease:::orofacial crohn’s disease:::crohn’s colitis:::gastrointestinal… |

Example (code-level mapping — umls_rxnorm_mapper):

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("doc")

doc2chunk = Doc2Chunk() \
    .setInputCols(["doc"]) \
    .setOutputCol("ner_chunk")

mapper = ChunkMapperModel.pretrained("umls_rxnorm_mapper", "en", "clinical/models") \
    .setInputCols(["ner_chunk"]) \
    .setOutputCol("mappings")

pipeline = Pipeline(stages=[document_assembler, doc2chunk, mapper])

data = spark.createDataFrame([["C1126248"], ["C0978482"], ["C0691677"]]).toDF("text")

result = pipeline.fit(data).transform(data)

Result

| umls_code | rxnorm_code | |—|—| | C1126248 | 330565 | | C0978482 | 861004 | | C0691677 | 198776 |

Example (NER-based mapping — umls_general_concepts_mapper):

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("clinical_ner")

ner_converter = NerConverterInternal() \
    .setInputCols(["sentence", "token", "clinical_ner"]) \
    .setOutputCol("ner_chunk")

mapper = ChunkMapperModel.pretrained("umls_general_concepts_mapper", "en", "clinical/models") \
    .setInputCols(["ner_chunk"]) \
    .setOutputCol("mappings") \
    .setRels(["umls_code"]) \
    .setLowerCase(True)

pipeline = Pipeline(stages=[
    document_assembler, sentence_detector, tokenizer,
    word_embeddings, ner_model, ner_converter, mapper
])

data = spark.createDataFrame([[
    "The patient presents with dyspnea and fever due to pneumonia. "
    "Treatment includes bronchoscopy, catheter placement, and chemotherapy."
]]).toDF("text")

result = pipeline.fit(data).transform(data)

Result

| ner_chunk | umls_code | |—|—| | dyspnea | C0013404 | | fever | C0015967 | | pneumonia | C0032285 | | bronchoscopy | C5979970 | | catheter placement | C0883301 | | chemotherapy | C0013216 |

New Blog Posts & Technical Deep Dives

  • John Snow Labs detects 54% more clinical PHI than OpenAI’s Privacy Filter, at 5.8× the speed on CPU : This blog post benchmarks the OpenAI Privacy Filter against a healthcare-specific de-identification pipeline from John Snow Labs on nearly 382K tokens of real clinical text. It explains how the John Snow Labs pipeline achieved substantially higher PHI detection accuracy (0.95 F1 vs. 0.55) while also running 5.8× faster on CPU, highlighting the importance of domain-specific clinical NLP over general-purpose PII detection. The article also covers strict label-mapping methodology, benchmark design, CPU-optimized deployment pipelines, and practical challenges in healthcare de-identification such as identifying hospital names, medical IDs, and clinical abbreviations in real-world notes.

Updated Notebooks And Demonstrations For Making Healthcare NLP Easier To Navigate And Understand

  • New MedicalLLMEntityExtractor Notebook This notebook introduces JSL’s LLM-based clinical NER annotator with grammar-enforced JSON, runtime-defined entity types, and CHUNK outputs with character offsets. Five end-to-end examples: default clinical NER, medication fields, PHI de-identification, oncology few-shot, and custom ADR prompts. Requires Spark NLP Healthcare license and supported JSL MedS/MedM GGUF models.

  • New Benchmarking John Snow Labs Healthcare NLP Pipelines for Optimal Spark Config at Million-Doc Scale Databricks Solution Accelerator Notebook
    • This accelerator benchmarks five pretrained John Snow Labs Healthcare NLP pipelines to identify the most efficient Apache Spark configurations for large-scale document processing.
    • The notebook performs million-document throughput testing across multiple Healthcare NLP pipelines, evaluating how different Spark settings impact performance, scalability, and resource utilization.
    • By comparing execution times, parallelism strategies, and cluster tuning parameters, this benchmark helps users determine the optimal Spark configuration for maximum NLP throughput in enterprise-scale healthcare and clinical text processing workloads.
    • Utilizing a Medallion architecture (Bronze, Silver, Gold layers) and crash-safe checkpoints, the accelerator demonstrates how to efficiently fine-tune key Spark parameters across various complex clinical NLP tasks. Its main benefit is enabling data teams to significantly reduce processing times and compute costs when deploying healthcare NLP models on Databricks clusters.
    • Use cases
      • Optimizing Apache Spark configurations (shuffle_partitions, default_parallelism, and repartition_count) for high-throughput clinical NLP processing.
      • Benchmarking the performance, execution time, and stability of John Snow Labs Healthcare NLP pipelines at varying data scales (10 to 1,000,000 rows).
      • Fast and balanced Protected Health Information (PHI) de-identification for massive volumes of medical documents.
      • Granular clinical Named Entity Recognition (NER) and specialized oncology entity extraction from clinical text.
      • High-volume, large-scale ICD-10-CM code resolution from raw medical transcriptions.
  • New LLM-Based Oncology Entity Extraction Databricks Solution Accelerator Notebook
    • This solution accelerator demonstrates an end-to-end LLM-based oncology entity extraction pipeline built on the Databricks Lakehouse using Spark NLP Healthcare by John Snow Labs.
    • It ingests raw clinical oncology notes, extracts 43 structured entity types using a locally running LLM (qwen3_4b), resolves extracted entities to ICD-O codes via biomedical sentence embeddings, and publishes curated Delta tables following the Medallion architecture (Bronze → Silver → Gold).
    • All inference runs on-cluster with no external API calls, ensuring full data privacy and HIPAA-aligned processing.
    • Use cases
      • Automated cancer registry abstraction from unstructured clinical notes
      • Structured oncology entity extraction (diagnoses, treatments, biomarkers, staging, tumor findings) for downstream analytics
      • ICD-O code assignment to free-text oncology concepts using semantic similarity
      • PHI-safe clinical NLP pipeline with optional de-identification before LLM processing
      • Foundation for real-world evidence (RWE) generation from electronic health records

We Have Added And Updated A Substantial Number Of New Clinical Models And Pipelines, Further Solidifying Our Offering In The Healthcare Domain.

  • clinical_deidentification_docwise_benchmark_multitask
  • clinical_deidentification_subentity_optimized_scala3
  • zeroshot_multitask_oncology
  • zeroshot_multitask_oncology_generic
  • cda_deidentification_extend_free_text
  • cda_deidentification_patient
  • sbiobertresolve_umls_findings
  • sbiobertresolve_umls_clinical_drugs
  • sbiobertresolve_umls_disease_syndrome
  • sbiobertresolve_umls_drug_substance
  • sbiobertresolve_umls_general_concepts
  • sbiobertresolve_umls_major_concepts
  • biolordresolve_umls_general_concepts
  • umls_rxnorm_mapper
  • rxnorm_umls_mapper
  • umls_snomed_mapper
  • snomed_umls_mapper
  • umls_snomedvet_mapper
  • snomedvet_umls_mapper
  • umls_loinc_mapper
  • loinc_umls_mapper
  • umls_mesh_mapper
  • mesh_umls_mapper
  • umls_icd10cm_mapper
  • icd10cm_umls_mapper
  • umls_icd10pcs_mapper
  • icd10pcs_umls_mapper
  • umls_icd9cm_mapper
  • icd9cm_umls_mapper
  • umls_nci_mapper
  • nci_umls_mapper
  • umls_hpo_mapper
  • hpo_umls_mapper
  • umls_hgnc_mapper
  • hgnc_umls_mapper
  • umls_atc_mapper
  • atc_umls_mapper
  • umls_hcpcs_mapper
  • hcpcs_umls_mapper
  • umls_cpt_mapper
  • cpt_umls_mapper
  • umls_meddra_mapper
  • meddra_umls_mapper
  • umls_clinical_findings_mapper
  • umls_disease_syndrome_mapper
  • umls_drug_substance_mapper
  • umls_clinical_drugs_mapper
  • umls_major_concepts_mapper
  • umls_general_concepts_mapper

Versions

Last updated