Healthcare NLP v6.2.2 Release Notes

6.2.2

Highlights

We are delighted to announce remarkable enhancements and updates in our latest release of Spark NLP for Healthcare. This release introduces a new zero-shot NER chunking annotator, PretrainedZeroShotNERChunker, and advanced ZIP obfuscation controls for Deidentification, StructuredDeidentification, and LightDeidentification workflows.

Introducing a new zero-shot NER chunking annotator, PretrainedZeroShotNERChunker, enabling direct span-level entity extraction without token alignment
Advanced ZIP obfuscation controls with the new setObfuscateZipKeepDigits parameter, now supported across Deidentification, StructuredDeidentification, and LightDeidentification workflows
7 new TextMatcher and parser models for enhanced clinical entity recognition, including specialized matchers for drug forms, medical devices, and other clinical concepts
A new clinical classification model for identifying sentences containing quantitative measurements
4 new entity resolver models using bge_base_en_v1_5_onnx embeddings and 4 new mapper models for improved accuracy in medical code mapping
The addition and update of numerous new clinical models and pipelines continue to reinforce our offering in the healthcare domain
Various core improvements, bug fixes, enhanced overall robustness, and reliability of Spark NLP for Healthcare
- In pipelines that use the splitter component, NerConverterInternal could, in rare cases, miss tokens when overlapping spans occurred. This issue has been fixed, resulting in more robust handling of overlaps and improved consistency in entity conversion.

These enhancements will elevate your experience with Healthcare NLP, enabling more efficient, accurate, and streamlined analysis of healthcare-related natural language data.

New Zero-Shot NER Chunking Annotator: `PretrainedZeroShotNERChunker`

A new zero-shot NER annotator, PretrainedZeroShotNERChunker, has been introduced to simplify entity extraction workflows by producing direct span-level entity chunks from raw text. Unlike PretrainedZeroShotNER, this annotator does not require token alignment and enables fast, lightweight zero-shot entity extraction without any task-specific training data.

Key capabilities:

Direct chunk-level entity extraction from DOCUMENT input
No token alignment required, reducing pipeline complexity
Fully zero-shot architecture—no annotated data needed
Ideal for rapid prototyping, weak supervision, and exploratory NER workflows

Example:

PretrainedZeroShot NER (token-aligned, requires NerConverter)

from johnsnowlabs import nlp, medical

text = """
Dr. Eleanor Vance, a neurologist at MountSinaiHospital,NewYorkCity recently consulted with patient JohnDoe.
Mr.Doe,55yearsold, has an appointment scheduled for October262025, regarding his condition.
"""

labels = ['DOCTOR', 'PATIENT', 'AGE', 'DATE', 'HOSPITAL', 'CITY', 'STREET', 'STATE', 'COUNTRY', 'PHONE', 'IDNUM', 'EMAIL','ZIP', 'ORGANIZATION', 'PROFESSION', 'USERNAME']

zeroshot_ner = medical.PretrainedZeroShotNER() \
    .pretrained("zeroshot_ner_deid_subentity_merged_medium", "en", "clinical/models") \
    .setInputCols("document", "token") \
    .setOutputCol("ner_zero_shot") \
    .setPredictionThreshold(0.5) \
    .setLabels(labels)

ner_converter = medical.NerConverter() \
    .setInputCols("document", "token", "ner_zero_shot") \
    .setOutputCol("ner_chunk_zero_shot")

Results

begin	end	chunk	entity	confidence
4	16	Eleanor Vance	DOCTOR	1.0
99	105	JohnDoe	PATIENT	1.0
160	172	October262025	DATE	1.0

PretrainedZeroShotNERChunker (direct chunks)

labels = ['DOCTOR', 'PATIENT', 'AGE', 'DATE', 'HOSPITAL', 'CITY', 'STREET', 'STATE', 'COUNTRY', 'PHONE', 'IDNUM', 'EMAIL','ZIP', 'ORGANIZATION', 'PROFESSION', 'USERNAME']

zeroshot_chunker = medical.PretrainedZeroShotNERChunker() \
    .pretrained("zeroshot_ner_deid_subentity_merged_medium", "en", "clinical/models") \
    .setInputCols("document") \
    .setOutputCol("ner_chunk_zero_shot") \
    .setPredictionThreshold(0.5) \
    .setLabels(labels)

Results

begin	end	chunk	entity	confidence
4	16	Eleanor Vance	DOCTOR	0.99998724
36	53	MountSinaiHospital	HOSPITAL	0.99986887
55	65	NewYorkCity	CITY	0.99928530
99	105	JohnDoe	PATIENT	0.99999845
111	113	Doe	PATIENT	0.99999870
115	124	55yearsold	AGE	0.99999976
160	172	October262025	DATE	1.0

Why does PretrainedZeroShotNERChunker find more entities?

Works directly at the span level, not token-by-token
Not limited by tokenization (e.g., “55yearsold”, “NewYorkCity”, “MountSinaiHospital”)
Detects nested or overlapping mentions (e.g., “JohnDoe” + “Doe”)
Avoids IOB conversion issues caused by Spark NLP Tokenizer
Produces more complete chunks because it evaluates the full text span rather than individual tokens

Advanced ZIP Code Obfuscation Controls Across All Deidentification Workflows

A new parameter, setObfuscateZipKeepDigits, has been added to provide finer control over ZIP code masking when applying HIPAA-compliant geographic de-identification. This enhancement allows users to customize how many leading ZIP digits are preserved before the remaining digits are masked.
The parameter is now supported across Deidentification, StructuredDeidentification, and LightDeidentification workflows, ensuring consistent behavior in all de-identification pipelines.

Key capabilities:

Preserves the first N digits of a ZIP code while masking the remainder
Masks all trailing digits—including ZIP+4 extensions—with *
Accepts values from 0 to 5 (default: 3)
Overrides the default HIPAA Safe Harbor ZIP generalization pattern
Enables customizable expert-determination–aligned obfuscation strategies

Examples

Input ZIP	Keep = 3 (default)	Keep = 2
`12345`	`123**`	`12***`

This new parameter provides significantly more flexibility for organizations seeking to balance privacy protection with analytical utility in their de-identification workflows.

7 New TextMatcher and Parser Models for Enhanced Clinical Entity Recognition, Including Specialized Matchers for Drug Forms, Medical Devices, and Other Clinical Concepts

We have added 7 new TextMatcher and ContextualParser models that expand our clinical entity recognition capabilities with specialized dictionaries for various healthcare domains. These lightweight, dictionary-based models enable fast and accurate identification of domain-specific entities in clinical text.

Model Name	Description
`drug_form_matcher`	Identify drug form entities in clinical text
`drug_route_matcher`	Identifies drug administration route entities in clinical text
`drug_strength_parser`	Identifies drug strength entities in clinical text
`medical_device_matcher`	Identifies medical device entities in clinical text
`procedure_matcher`	Identifies medical procedure entities in clinical text
`symptom_matcher`	Identify symptom entities in clinical text
`test_result_parser`	Identifies test result entities in clinical text

Example:

drug_form_matcher = TextMatcherInternalModel.pretrained("drug_form_matcher", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("matched_drug_form")


text = """The patient was started on Metformin 500mg tablet twice daily with meals. For pain, Oxycodone 5mg capsule every 6 hours PRN was prescribed. Insulin glargine injection 20 units subcutaneously at bedtime. Apply Hydrocortisone 1% cream to affected areas twice daily. Fentanyl 25mcg/hr transdermal patch applied every 72 hours. Albuterol solution via nebulizer every 4 hours PRN."""

data = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(data).transform(data)

Result:

chunk	begin	end	label
tablet	44	49	DRUG_FORM
capsule	97	103	DRUG_FORM
injection	149	157	DRUG_FORM
cream	220	224	DRUG_FORM
transdermal patch	270	286	DRUG_FORM
solution	330	337	DRUG_FORM
nebulizer	343	351	DRUG_FORM

A New Clinical Classification Model for Identifying Sentences Containing Quantitative Measurements

A new binary classification model, generic_classifier_measurement, has been introduced to identify clinical sentences containing quantitative measurement information. This classifier streamlines the extraction and filtering of measurement-related content from clinical documentation, making it easier to isolate vital signs, lab results, and other numerical health data.

Classes:

MEASUREMENT: Contains numerical health measurements, vital signs, lab results, or test values
OTHER: Doesn’t contain quantitative measurement information

Example:

generic_classifier = GenericClassifierModel.pretrained("generic_classifier_measurement", "en", "clinical/models")\
    .setInputCols(["features"])\
    .setOutputCol("prediction")


sample_texts = [
    ["This is a 58-year-old male who started out having a toothache in the left lower side of the mouth that is now radiating into his jaw and towards his left ear."],
    ["The database was available at this point of time. WBC count is elevated at 19,000 with the left shift, hemoglobin of 17.7, and hematocrit of 55.8 consistent with severe dehydration."],
    ["Temperature max is 99, heart rate was 133 to 177, blood pressure is 114/43 (while moving), respiratory rate was 28 to 56 with O2 saturations 97 to 100% on room air."],
    ["I will see her back in followup in 3 months, at which time she will be recovering from a shoulder surgery."]
]

data = spark.createDataFrame(sample_texts).toDF("text")
result = clf_pipeline.fit(data).transform(data)

Result:

text	prediction
This is a 58-year-old male who started out having a toothache…	OTHER
The database was available at this point of time. WBC count is elevated at 19,000…	MEASUREMENT
Temperature max is 99, heart rate was 133 to 177, blood pressure is 114/43…	MEASUREMENT
I will see her back in followup in 3 months…	OTHER

3 New Entity Resolvers and 1 Mapper Using by `bge_base_en_v1_5_onnx` Embeddings for Improved Accuracy in Medical Code Mapping

We are introducing 3 new entity resolvers and 1 mapper built on the state-of-the-art bge_base_en_v1_5_onnx embeddings. These models leverage BGE (BAAI General Embeddings) for superior semantic understanding, resulting in more accurate medical code resolution and entity mapping across various healthcare terminologies.

Model Name	Description
`bgeresolve_icd10cm`	Maps clinical entities to ICD-10-CM codes using `bge_base_en_v1_5_onnx` embeddings
`bgeresolve_rxnorm`	Maps drug entities to RxNorm codes using `bge_base_en_v1_5_onnx` embeddings
`bgeresolve_snomed`	Maps clinical entities to SNOMED codes using `bge_base_en_v1_5_onnx` embeddings
`bgeresolve_cpt`	Maps clinical entities to CPT codes using `bge_base_en_v1_5_onnx` embeddings
`rxnorm_mapper`	Maps drug entities to their corresponding RxNorm codes
`snomed_mapper`	Maps clinical entities to their corresponding SNOMED codes
`icd10cm_mapper`	Maps clinical entities to their corresponding ICD10CM codes
`cpt_mapper`	Maps clinical entities to their corresponding CPT codes

Example:

bge_embeddings = BGEEmbeddings.pretrained("bge_base_en_v1_5_onnx", "en")\
    .setInputCols(["ner_chunk_doc"])\
    .setOutputCol("bge_embeddings")

icd10cm_resolver = SentenceEntityResolverModel.pretrained("bgeresolve_icd10cm", "en", "clinical/models")\
    .setInputCols(["bge_embeddings"])\
    .setOutputCol("icd10cm_code")\
    .setDistanceFunction("EUCLIDEAN")

text = """A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with acute hepatitis and obesity, presented with a one-week history of polyuria, polydipsia, and vomiting."""

data = spark.createDataFrame([[text]]).toDF("text")
result = pipeline.fit(data).transform(data)

Result:

ner_chunk	entity	icd10cm_code	resolutions	all_codes	all_resolutions
gestational diabetes mellitus	PROBLEM	O24.4	gestational diabetes mellitus	[O24.4, O24.41, O24.43, O24.42, O24.414, …]	[gestational diabetes mellitus, gestational di…]
subsequent type two diabetes mellitus	PROBLEM	E11	type 2 diabetes mellitus	[E11, E11.69, E11.6, E11.64, E13, E11.65, …]	[type 2 diabetes mellitus, type 2 diabetes mel…]
T2DM	PROBLEM	E11	type 2 diabetes mellitus	[E11, E11.65, E11.64, E11.9, E11.44, E11.5, …]	[type 2 diabetes mellitus, type 2 diabetes mel…]
HTG-induced pancreatitis	PROBLEM	K85.3	drug induced acute pancreatitis	[K85.3, K86.0, K85.2, K85.31, K85.21, …]	[drug induced acute pancreatitis, alcohol-indu…]
acute hepatitis	PROBLEM	B15	acute hepatitis a	[B15, B16, B17.1, B17, B17.2, B17.9, …]	[acute hepatitis a, acute hepatitis b, acute h…]
obesity	PROBLEM	E66	overweight and obesity	[E66, E66.9, E66.0, E66.8, E66.3, O99.21, …]	[overweight and obesity, obesity, unspecified,…]
polyuria	PROBLEM	R35	polyuria	[R35, R35.89, R35.8, R35.81, R80, R80.8, …]	[polyuria, other polyuria, other polyuria, noc…]
polydipsia	PROBLEM	R63.1	polydipsia	[R63.1, O40, R35, R35.89, R35.8, R63.2, …]	[polydipsia, polyhydramnios, polyuria, other p…]
vomiting	PROBLEM	R11.1	vomiting	[R11.1, R11, R11.12, R11.10, R11.11, R11.13, …]	[vomiting, nausea and vomiting, projectile vom…]

New blog post ?

Updated Notebooks And Demonstrations For making Healthcare NLP Easier To Navigate And Understand ?

We Have Added And Updated A Substantial Number Of New Clinical Models And Pipelines, Further Solidifying Our Offering In The Healthcare Domain.

drug_form_matcher
drug_route_matcher
drug_strength_parser
medical_device_matcher
procedure_matcher
symptom_matcher
test_result_parser
generic_classifier_measurement
bgeresolve_icd10cm
bgeresolve_rxnorm
bgeresolve_snomed
bgeresolve_cpt
rxnorm_mapper
snomed_mapper
icd10cm_mapper
cpt_mapper

For all Healthcare NLP models, please check: Models Hub Page

Versions

Version
Version
Version

PREVIOUSVersion Compatibility

6.2.2

Highlights

New Zero-Shot NER Chunking Annotator: PretrainedZeroShotNERChunker

Advanced ZIP Code Obfuscation Controls Across All Deidentification Workflows

7 New TextMatcher and Parser Models for Enhanced Clinical Entity Recognition, Including Specialized Matchers for Drug Forms, Medical Devices, and Other Clinical Concepts

A New Clinical Classification Model for Identifying Sentences Containing Quantitative Measurements

3 New Entity Resolvers and 1 Mapper Using by bge_base_en_v1_5_onnx Embeddings for Improved Accuracy in Medical Code Mapping

New blog post ?

Updated Notebooks And Demonstrations For making Healthcare NLP Easier To Navigate And Understand ?

We Have Added And Updated A Substantial Number Of New Clinical Models And Pipelines, Further Solidifying Our Offering In The Healthcare Domain.

Versions

New Zero-Shot NER Chunking Annotator: `PretrainedZeroShotNERChunker`

3 New Entity Resolvers and 1 Mapper Using by `bge_base_en_v1_5_onnx` Embeddings for Improved Accuracy in Medical Code Mapping