6.2.2
Highlights
We are delighted to announce remarkable enhancements and updates in our latest release of Spark NLP for Healthcare. This release introduces a new zero-shot NER chunking annotator, PretrainedZeroShotNERChunker, and advanced ZIP obfuscation controls for Deidentification, StructuredDeidentification, and LightDeidentification workflows.
- Introducing a new zero-shot NER chunking annotator, `PretrainedZeroShotNERChunker`, enabling direct span-level entity extraction without token alignment
- Advanced ZIP obfuscation controls with the new `setObfuscateZipKeepDigits` parameter, now supported across Deidentification, StructuredDeidentification, and LightDeidentification workflows
- 7 new TextMatcher and parser models for enhanced clinical entity recognition, including specialized matchers for drug forms, medical devices, and other clinical concepts
- A new clinical classification model for identifying sentences containing quantitative measurements
- 4 new entity resolver models using `bge_base_en_v1_5_onnx` embeddings and 4 new mapper models for improved accuracy in medical code mapping
- The addition and update of numerous new clinical models and pipelines continue to reinforce our offering in the healthcare domain
- Various core improvements, bug fixes, enhanced overall robustness, and reliability of Spark NLP for Healthcare
    - In pipelines that use the splitter component, `NerConverterInternal` could, in rare cases, miss tokens when overlapping spans occurred. This issue has been fixed, resulting in more robust handling of overlaps and improved consistency in entity conversion.
These enhancements will elevate your experience with Healthcare NLP, enabling more efficient, accurate, and streamlined analysis of healthcare-related natural language data.
New Zero-Shot NER Chunking Annotator: PretrainedZeroShotNERChunker
A new zero-shot NER annotator, PretrainedZeroShotNERChunker, has been introduced to simplify entity extraction workflows by producing direct span-level entity chunks from raw text. Unlike PretrainedZeroShotNER, this annotator does not require token alignment and enables fast, lightweight zero-shot entity extraction without any task-specific training data.
Key capabilities:
- Direct chunk-level entity extraction from `DOCUMENT` input
- No token alignment required, reducing pipeline complexity
- Fully zero-shot architecture—no annotated data needed
- Ideal for rapid prototyping, weak supervision, and exploratory NER workflows
Example:
PretrainedZeroShotNER (token-aligned, requires NerConverter)
from johnsnowlabs import nlp, medical
text = """
Dr. Eleanor Vance, a neurologist at MountSinaiHospital,NewYorkCity recently consulted with patient JohnDoe.
Mr.Doe,55yearsold, has an appointment scheduled for October262025, regarding his condition.
"""
labels = ['DOCTOR', 'PATIENT', 'AGE', 'DATE', 'HOSPITAL', 'CITY', 'STREET', 'STATE', 'COUNTRY', 'PHONE', 'IDNUM', 'EMAIL','ZIP', 'ORGANIZATION', 'PROFESSION', 'USERNAME']
zeroshot_ner = medical.PretrainedZeroShotNER() \
.pretrained("zeroshot_ner_deid_subentity_merged_medium", "en", "clinical/models") \
.setInputCols("document", "token") \
.setOutputCol("ner_zero_shot") \
.setPredictionThreshold(0.5) \
.setLabels(labels)
ner_converter = medical.NerConverter() \
.setInputCols("document", "token", "ner_zero_shot") \
.setOutputCol("ner_chunk_zero_shot")
Results
| begin | end | chunk | entity | confidence |
|---|---|---|---|---|
| 4 | 16 | Eleanor Vance | DOCTOR | 1.0 |
| 99 | 105 | JohnDoe | PATIENT | 1.0 |
| 160 | 172 | October262025 | DATE | 1.0 |
PretrainedZeroShotNERChunker (direct chunks)
labels = ['DOCTOR', 'PATIENT', 'AGE', 'DATE', 'HOSPITAL', 'CITY', 'STREET', 'STATE', 'COUNTRY', 'PHONE', 'IDNUM', 'EMAIL','ZIP', 'ORGANIZATION', 'PROFESSION', 'USERNAME']
zeroshot_chunker = medical.PretrainedZeroShotNERChunker() \
.pretrained("zeroshot_ner_deid_subentity_merged_medium", "en", "clinical/models") \
.setInputCols("document") \
.setOutputCol("ner_chunk_zero_shot") \
.setPredictionThreshold(0.5) \
.setLabels(labels)
Results

| begin | end | chunk | entity | confidence |
|---|---|---|---|---|
| 4 | 16 | Eleanor Vance | DOCTOR | 0.99998724 |
| 36 | 53 | MountSinaiHospital | HOSPITAL | 0.99986887 |
| 55 | 65 | NewYorkCity | CITY | 0.99928530 |
| 99 | 105 | JohnDoe | PATIENT | 0.99999845 |
| 111 | 113 | Doe | PATIENT | 0.99999870 |
| 115 | 124 | 55yearsold | AGE | 0.99999976 |
| 160 | 172 | October262025 | DATE | 1.0 |
Why does PretrainedZeroShotNERChunker find more entities?
- Works directly at the span level, not token-by-token
- Not limited by tokenization (e.g., “55yearsold”, “NewYorkCity”, “MountSinaiHospital”)
- Picks up additional mentions embedded in unsegmented text (e.g., “Doe” in “Mr.Doe”, alongside “JohnDoe”)
- Avoids IOB conversion issues caused by the Spark NLP Tokenizer
- Produces more complete chunks because it evaluates the full text span rather than individual tokens
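To read the result tables above, it helps to see how `begin` and `end` relate to the input. The following is a minimal sketch in plain Python (no Spark session required). It assumes the offsets are inclusive character positions into the example text with its surrounding newlines stripped; that assumption is inferred from the table values, not stated in these notes. The helper name `chunk_at` is hypothetical.

```python
# Example text from above, with the leading/trailing newlines stripped (assumption).
text = (
    "Dr. Eleanor Vance, a neurologist at MountSinaiHospital,NewYorkCity "
    "recently consulted with patient JohnDoe.\n"
    "Mr.Doe,55yearsold, has an appointment scheduled for October262025, "
    "regarding his condition."
)

# (begin, end, chunk, entity) rows copied from the PretrainedZeroShotNERChunker results.
rows = [
    (4, 16, "Eleanor Vance", "DOCTOR"),
    (36, 53, "MountSinaiHospital", "HOSPITAL"),
    (55, 65, "NewYorkCity", "CITY"),
    (99, 105, "JohnDoe", "PATIENT"),
    (111, 113, "Doe", "PATIENT"),
    (115, 124, "55yearsold", "AGE"),
    (160, 172, "October262025", "DATE"),
]


def chunk_at(text, begin, end):
    """Return the substring for inclusive [begin, end] character offsets."""
    return text[begin:end + 1]


# Collect any row whose offsets do not reproduce its chunk text.
mismatches = [r for r in rows if chunk_at(text, r[0], r[1]) != r[2]]
print(mismatches)  # an empty list means every offset pair matches its chunk
```

Under that assumption, each chunk can be recovered by slicing the text directly, which is what makes span-level output convenient for downstream masking or highlighting.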
Advanced ZIP Code Obfuscation Controls Across All Deidentification Workflows
A new parameter, setObfuscateZipKeepDigits, has been added to provide finer control over ZIP code masking when applying HIPAA-compliant geographic de-identification. This enhancement allows users to customize how many leading ZIP digits are preserved before the remaining digits are masked.
The parameter is now supported across Deidentification, StructuredDeidentification, and LightDeidentification workflows, ensuring consistent behavior in all de-identification pipelines.
Key capabilities:
- Preserves the first N digits of a ZIP code while masking the remainder
- Masks all trailing digits, including ZIP+4 extensions, with `*`
- Accepts values from 0 to 5 (default: 3)
- Overrides the default HIPAA Safe Harbor ZIP generalization pattern
- Enables customizable expert-determination–aligned obfuscation strategies
Examples
| Input ZIP | Keep = 3 (default) | Keep = 2 |
|---|---|---|
| `12345` | `123**` | `12***` |
This new parameter provides significantly more flexibility for organizations seeking to balance privacy protection with analytical utility in their de-identification workflows.
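The masking rule described above can be illustrated with a few lines of plain Python. This is a sketch of the documented behavior only, not the library's implementation: the function name `mask_zip` is hypothetical, and leaving the hyphen of a ZIP+4 code untouched is an assumption.

```python
def mask_zip(zip_code: str, keep_digits: int = 3, mask_char: str = "*") -> str:
    """Keep the first `keep_digits` digits and mask every later digit.

    Non-digit characters (such as the hyphen in ZIP+4) are left untouched;
    that choice is an assumption of this sketch.
    """
    if not 0 <= keep_digits <= 5:
        raise ValueError("keep_digits must be between 0 and 5")
    out = []
    seen = 0  # number of digits encountered so far
    for ch in zip_code:
        if ch.isdigit():
            out.append(ch if seen < keep_digits else mask_char)
            seen += 1
        else:
            out.append(ch)
    return "".join(out)


print(mask_zip("12345"))       # 123**
print(mask_zip("12345", 2))    # 12***
print(mask_zip("12345-6789"))  # 123**-****
```

In an actual pipeline, the equivalent setting is passed through `setObfuscateZipKeepDigits` on the de-identification stage.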
7 New TextMatcher and Parser Models for Enhanced Clinical Entity Recognition, Including Specialized Matchers for Drug Forms, Medical Devices, and Other Clinical Concepts
We have added 7 new TextMatcher and ContextualParser models that expand our clinical entity recognition capabilities with specialized dictionaries for various healthcare domains. These lightweight, dictionary-based models enable fast and accurate identification of domain-specific entities in clinical text.
| Model Name | Description |
|---|---|
| `drug_form_matcher` | Identifies drug form entities in clinical text |
| `drug_route_matcher` | Identifies drug administration route entities in clinical text |
| `drug_strength_parser` | Identifies drug strength entities in clinical text |
| `medical_device_matcher` | Identifies medical device entities in clinical text |
| `procedure_matcher` | Identifies medical procedure entities in clinical text |
| `symptom_matcher` | Identifies symptom entities in clinical text |
| `test_result_parser` | Identifies test result entities in clinical text |
Example:
drug_form_matcher = TextMatcherInternalModel.pretrained("drug_form_matcher", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("matched_drug_form")
text = """The patient was started on Metformin 500mg tablet twice daily with meals. For pain, Oxycodone 5mg capsule every 6 hours PRN was prescribed. Insulin glargine injection 20 units subcutaneously at bedtime. Apply Hydrocortisone 1% cream to affected areas twice daily. Fentanyl 25mcg/hr transdermal patch applied every 72 hours. Albuterol solution via nebulizer every 4 hours PRN."""
data = spark.createDataFrame([[text]]).toDF("text")
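# `pipeline` is not defined in this snippet; it is assumed to be a Pipeline that
# produces the "sentence" and "token" columns and ends with drug_form_matcher.
result = pipeline.fit(data).transform(data)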
Result:
| chunk | begin | end | label |
|---|---|---|---|
| tablet | 44 | 49 | DRUG_FORM |
| capsule | 97 | 103 | DRUG_FORM |
| injection | 149 | 157 | DRUG_FORM |
| cream | 220 | 224 | DRUG_FORM |
| transdermal patch | 270 | 286 | DRUG_FORM |
| solution | 330 | 337 | DRUG_FORM |
| nebulizer | 343 | 351 | DRUG_FORM |
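For readers unfamiliar with dictionary-based matching, the idea can be sketched in plain Python. This is an illustration of the general technique under stated assumptions, not how `TextMatcherInternalModel` is implemented: the function `match_terms`, the small term list, and the sample sentence are all hypothetical.

```python
import re


def match_terms(text, terms, label):
    """Find case-insensitive whole-word occurrences of dictionary terms.

    Returns (chunk, begin, end, label) tuples in order of appearance, with
    inclusive end offsets. Longer terms are listed first in the pattern so
    that a longer term wins over a shorter one starting at the same position.
    """
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    return [(m.group(0), m.start(), m.end() - 1, label) for m in pattern.finditer(text)]


# Hypothetical mini-dictionary; the pretrained models ship their own vocabularies.
drug_forms = ["tablet", "capsule", "cream", "patch", "transdermal patch"]
sample = "Metformin 500mg tablet twice daily. Fentanyl 25mcg/hr transdermal patch every 72 hours."

hits = match_terms(sample, drug_forms, "DRUG_FORM")
for hit in hits:
    print(hit)
```

Because matching is a lookup rather than model inference, this style of annotator stays fast and predictable, at the cost of only finding terms that are present in the dictionary.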
A New Clinical Classification Model for Identifying Sentences Containing Quantitative Measurements
A new binary classification model, generic_classifier_measurement, has been introduced to identify clinical sentences containing quantitative measurement information. This classifier streamlines the extraction and filtering of measurement-related content from clinical documentation, making it easier to isolate vital signs, lab results, and other numerical health data.
Classes:
- `MEASUREMENT`: Contains numerical health measurements, vital signs, lab results, or test values
- `OTHER`: Does not contain quantitative measurement information
Example:
generic_classifier = GenericClassifierModel.pretrained("generic_classifier_measurement", "en", "clinical/models")\
.setInputCols(["features"])\
.setOutputCol("prediction")
sample_texts = [
["This is a 58-year-old male who started out having a toothache in the left lower side of the mouth that is now radiating into his jaw and towards his left ear."],
["The database was available at this point of time. WBC count is elevated at 19,000 with the left shift, hemoglobin of 17.7, and hematocrit of 55.8 consistent with severe dehydration."],
["Temperature max is 99, heart rate was 133 to 177, blood pressure is 114/43 (while moving), respiratory rate was 28 to 56 with O2 saturations 97 to 100% on room air."],
["I will see her back in followup in 3 months, at which time she will be recovering from a shoulder surgery."]
]
data = spark.createDataFrame(sample_texts).toDF("text")
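# `clf_pipeline` is not defined in this snippet; it is assumed to build the
# "features" column upstream and end with generic_classifier.
result = clf_pipeline.fit(data).transform(data)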
Result:
| text | prediction |
|---|---|
| This is a 58-year-old male who started out having a toothache… | OTHER |
| The database was available at this point of time. WBC count is elevated at 19,000… | MEASUREMENT |
| Temperature max is 99, heart rate was 133 to 177, blood pressure is 114/43… | MEASUREMENT |
| I will see her back in followup in 3 months… | OTHER |
4 New Entity Resolvers Using bge_base_en_v1_5_onnx Embeddings and 4 New Mappers for Improved Accuracy in Medical Code Mapping
We are introducing 4 new entity resolvers built on bge_base_en_v1_5_onnx embeddings, along with 4 new mappers. The resolvers use BGE (BAAI General Embedding) sentence embeddings to compare entity mentions with terminology entries semantically, which improves the accuracy of medical code resolution across ICD-10-CM, RxNorm, SNOMED, and CPT.
| Model Name | Description |
|---|---|
| `bgeresolve_icd10cm` | Maps clinical entities to ICD-10-CM codes using bge_base_en_v1_5_onnx embeddings |
| `bgeresolve_rxnorm` | Maps drug entities to RxNorm codes using bge_base_en_v1_5_onnx embeddings |
| `bgeresolve_snomed` | Maps clinical entities to SNOMED codes using bge_base_en_v1_5_onnx embeddings |
| `bgeresolve_cpt` | Maps clinical entities to CPT codes using bge_base_en_v1_5_onnx embeddings |
| `rxnorm_mapper` | Maps drug entities to their corresponding RxNorm codes |
| `snomed_mapper` | Maps clinical entities to their corresponding SNOMED codes |
| `icd10cm_mapper` | Maps clinical entities to their corresponding ICD10CM codes |
| `cpt_mapper` | Maps clinical entities to their corresponding CPT codes |
Example:
bge_embeddings = BGEEmbeddings.pretrained("bge_base_en_v1_5_onnx", "en")\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("bge_embeddings")
icd10cm_resolver = SentenceEntityResolverModel.pretrained("bgeresolve_icd10cm", "en", "clinical/models")\
.setInputCols(["bge_embeddings"])\
.setOutputCol("icd10cm_code")\
.setDistanceFunction("EUCLIDEAN")
text = """A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus (T2DM), one prior episode of HTG-induced pancreatitis three years prior to presentation, associated with acute hepatitis and obesity, presented with a one-week history of polyuria, polydipsia, and vomiting."""
data = spark.createDataFrame([[text]]).toDF("text")
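# `pipeline` is not defined in this snippet; it is assumed to run NER and
# chunk-to-document conversion upstream (producing "ner_chunk_doc"), followed by
# bge_embeddings and icd10cm_resolver.
result = pipeline.fit(data).transform(data)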
Result:
| sent_id | ner_chunk | entity | icd10cm_code | resolutions | all_codes | all_resolutions |
|---|---|---|---|---|---|---|
| 0 | gestational diabetes mellitus | PROBLEM | O24.4 | gestational diabetes mellitus | [O24.4, O24.41, O24.43, O24.42, O24.414, …] | [gestational diabetes mellitus, gestational di…] |
| 0 | subsequent type two diabetes mellitus | PROBLEM | E11 | type 2 diabetes mellitus | [E11, E11.69, E11.6, E11.64, E13, E11.65, …] | [type 2 diabetes mellitus, type 2 diabetes mel…] |
| 0 | T2DM | PROBLEM | E11 | type 2 diabetes mellitus | [E11, E11.65, E11.64, E11.9, E11.44, E11.5, …] | [type 2 diabetes mellitus, type 2 diabetes mel…] |
| 0 | HTG-induced pancreatitis | PROBLEM | K85.3 | drug induced acute pancreatitis | [K85.3, K86.0, K85.2, K85.31, K85.21, …] | [drug induced acute pancreatitis, alcohol-indu…] |
| 0 | acute hepatitis | PROBLEM | B15 | acute hepatitis a | [B15, B16, B17.1, B17, B17.2, B17.9, …] | [acute hepatitis a, acute hepatitis b, acute h…] |
| 0 | obesity | PROBLEM | E66 | overweight and obesity | [E66, E66.9, E66.0, E66.8, E66.3, O99.21, …] | [overweight and obesity, obesity, unspecified,…] |
| 0 | polyuria | PROBLEM | R35 | polyuria | [R35, R35.89, R35.8, R35.81, R80, R80.8, …] | [polyuria, other polyuria, other polyuria, noc…] |
| 0 | polydipsia | PROBLEM | R63.1 | polydipsia | [R63.1, O40, R35, R35.89, R35.8, R63.2, …] | [polydipsia, polyhydramnios, polyuria, other p…] |
| 0 | vomiting | PROBLEM | R11.1 | vomiting | [R11.1, R11, R11.12, R11.10, R11.11, R11.13, …] | [vomiting, nausea and vomiting, projectile vom…] |
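The `icd10cm_code` and `all_codes` columns can be read as a nearest-neighbor ranking: the top code is the closest terminology entry to the chunk embedding, and the remaining codes follow in order of distance. The sketch below illustrates that idea with Euclidean distance in plain Python. The 3-dimensional vectors are invented for illustration (real BGE embeddings are much larger), `rank_codes` is a hypothetical helper, and the resolver's internals may differ.

```python
import math


def rank_codes(query_vec, candidates):
    """Rank (code, description, vector) candidates by Euclidean distance to the query."""
    scored = [(math.dist(query_vec, vec), code, desc) for code, desc, vec in candidates]
    return [(code, desc, round(d, 4)) for d, code, desc in sorted(scored)]


# Toy 3-dimensional vectors, invented for illustration only.
candidates = [
    ("R35", "polyuria", (0.9, 0.1, 0.0)),
    ("R63.1", "polydipsia", (0.1, 0.9, 0.0)),
    ("R11.1", "vomiting", (0.0, 0.1, 0.9)),
]
query = (0.2, 0.8, 0.1)  # stand-in for the embedding of the chunk "polydipsia"

ranked = rank_codes(query, candidates)
best_code = ranked[0][0]                       # analogous to the icd10cm_code column
all_codes = [code for code, _, _ in ranked]    # analogous to the all_codes column
print(best_code, all_codes)
```

This also explains why nearby but incorrect concepts (such as polyuria for polydipsia) can appear further down the `all_codes` list: they are simply the next-closest entries.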
We Have Added And Updated A Substantial Number Of New Clinical Models And Pipelines, Further Solidifying Our Offering In The Healthcare Domain.
- `drug_form_matcher`
- `drug_route_matcher`
- `drug_strength_parser`
- `medical_device_matcher`
- `procedure_matcher`
- `symptom_matcher`
- `test_result_parser`
- `generic_classifier_measurement`
- `bgeresolve_icd10cm`
- `bgeresolve_rxnorm`
- `bgeresolve_snomed`
- `bgeresolve_cpt`
- `rxnorm_mapper`
- `snomed_mapper`
- `icd10cm_mapper`
- `cpt_mapper`
For all Healthcare NLP models, please check: Models Hub Page