6.4.0
Highlights
We are delighted to announce notable enhancements and updates in Healthcare NLP 6.4.0. The headline addition is the Pretrained Zero Shot Multi Task module, a unified annotator that enables Named Entity Recognition, Text Classification, Relation Extraction, and Structured Extraction in a single forward pass, eliminating the need for task-specific fine-tuning or labeled datasets while providing flexible, setter-based configuration. This release also introduces an optional Medical NER model training setting that splits multi-sentence rows before training to reduce peak memory usage on long documents, and enables configurable de-identification output wrapping via markers (prefix/suffix markers around target PHI entities). In parallel, the release delivers 78 new pretrained models and pipelines, further strengthening coverage across clinical entity recognition, structured extraction, and domain-specific use cases. Notably, this includes JSL_MedS NER (LLM — q16 — v5), a Medical LLM–based named entity recognition model for structured clinical trial eligibility parsing that returns results in a standardized JSON format. In addition, we are happy to share new research publications spanning medical LLM safety evaluation and dual-level DICOM de-identification with Visual NLP.
- Pretrained Zero Shot Multi Task module: unified zero-shot NER, classification, relation, and structured extraction in one forward pass, with flexible configuration and a single output column for all tasks.
- MedicalNerApproach: optional explodeSentencesForTraining parameter to expand multi-sentence rows into sentence-level training examples before training, aimed at lower peak memory on long documents.
- JSL_MedS NER (LLM — q16 — v5) for Structured Clinical Trial Eligibility Parsing.
- DocumentHashCoder integration with LightPipeline to enable consistent hashing in both Spark Pipeline transforms and LightPipeline fullAnnotate-style inference.
- De-identification: setDeidMarkers to wrap de-identified spans with configurable prefix/suffix markers in the de-identified text.
- Obfuscation evaluation: a new utility function to report leakage and success rates when comparing original vs. obfuscated columns after structured data obfuscation.
- 14 New Entity Resolver Models for Mapping Clinical Terms to Standardized SNOMED CT and LOINC Codes
- New research publications, including a Text2Story 2026 paper on multi-domain red teaming for medical LLM safety evaluation and accepted work on dual-level DICOM de-identification using Visual NLP at CVC 2026 and ICHI 2026 (ICHI: Beyond Metadata Scrubbing: Production-Scale Pixel-Level and Metadata-Level DICOM Anonymisation for Healthcare Workflows).
- New Blog Posts & Technical Deep Dives
- Updated notebooks and demonstrations to make Healthcare NLP easier to navigate and understand
- New Pretrained Zero Shot Multi Task Notebook
- Numerous new and updated clinical models and pipelines that continue to reinforce our offering in the healthcare domain
These enhancements will elevate your experience with Healthcare NLP, enabling more efficient, accurate, and streamlined analysis of healthcare-related natural language data.
PretrainedZeroShotMultiTask: Unified Zero-Shot Multi-Task Inference
PretrainedZeroShotMultiTask is a unified Spark NLP for Healthcare annotator that runs Named Entity Recognition, Text Classification, Relation Extraction, and Structured Extraction in a single forward pass, without task-specific fine-tuning or labeled training data. Configure tasks with setEntities, setClassifications, setRelations, and setStructures; all outputs are written to one output column (by default extractions) and are distinguished by annotatorType (chunk, category, struct).
Because outputs are mixed in one column, use MultiAnnotationSplitter (for example with setSplitType("chunk")) before downstream annotators that expect dedicated columns—such as DeIdentification for de-identification workflows.
Pretrained models
| Model Name | Description |
|---|---|
| zeroshot_multitask_generic | Zero-shot multitask model for customizable entity extraction, structure filling, classification, and relation prediction over clinical documents. |
| zeroshot_multitask_base | Base variant of the zero-shot multitask model with the same task surface (entities, structures, classifications, relations). |
Key parameters
| Parameter | Description |
|---|---|
| setEntities(list) | NER labels; optional LABEL::description for guided extraction (e.g., "DRUG::medication name like aspirin"). |
| setClassifications(list) | Tasks as (task_name, [labels]) tuples; one label per document per task. |
| setRelations(list) | Relation types between entities (e.g., "works_for", "treatment_administered_for_problem"). |
| setStructures(list) | Templates as (name, [field_definitions]) with field_name::type::description. |
| setEntityThreshold(float) | Minimum entity confidence (default 0.4). |
Output annotation types
| annotatorType | Task | Description |
|---|---|---|
| chunk | NER | Spans with entity, confidence, offsets |
| category | Classification / relation | Classification metadata includes task where applicable |
| struct | Structured extraction | Key-value style fields in metadata |
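Because all three annotation types land in one output column, downstream code often needs to separate them. In a real pipeline this is MultiAnnotationSplitter's job; the following is a minimal pure-Python sketch of the grouping idea, where the dicts are illustrative stand-ins for Spark NLP annotation rows, not real annotation objects:

```python
from collections import defaultdict

def split_by_annotator_type(annotations):
    """Group mixed annotations into one list per annotatorType (chunk/category/struct)."""
    grouped = defaultdict(list)
    for ann in annotations:
        grouped[ann["annotatorType"]].append(ann)
    return dict(grouped)

# Illustrative stand-ins for rows in the "extractions" column
mixed = [
    {"annotatorType": "chunk", "result": "left breast cancer", "metadata": {"entity": "Cancer_Dx"}},
    {"annotatorType": "category", "result": "positive", "metadata": {"task": "sentiment"}},
    {"annotatorType": "struct", "result": "cancer_case", "metadata": {"diagnosis": "left breast cancer"}},
]

by_type = split_by_annotator_type(mixed)
print(sorted(by_type))  # ['category', 'chunk', 'struct']
```

In the actual pipeline, MultiAnnotationSplitter with setSplitType("chunk") performs the equivalent selection so that annotators expecting a dedicated chunk column can consume the output.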
Example: basic four-task configuration (zeroshot_multitask_base)
cancer_text = """The patient had previously undergone a left mastectomy and an axillary lymph node dissection for a left breast cancer twenty years ago.
The tumor was positive for ER and PR. Postoperatively, radiotherapy was administered to her breast. The cancer recurred as a right lung metastasis 13 years later.
The patient underwent a regimen consisting of adriamycin (60 mg/m2) and cyclophosphamide (600 mg/m2) over six courses, as first line therapy."""
data = spark.createDataFrame([[cancer_text]]).toDF("text")
document_assembler = (
DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
)
model = PretrainedZeroShotMultiTask.pretrained("zeroshot_multitask_base", "en", "clinical/models")
zero_shot = (
model
.setInputCols(["document"])
.setOutputCol("extractions")
.setEntities([
"Cancer_Dx::The exact cancer diagnosis phrase, such as 'left breast cancer' or 'lung cancer'. Extract only the diagnosis phrase itself.",
"Site::The exact anatomical site phrase, including laterality when present, such as 'left breast' or 'right lung'. Extract only the site phrase itself.",
"Biomarker::The exact biomarker name, such as 'ER', 'PR', 'HER2', or 'PD-L1'. Extract only the marker name itself.",
"Biomarker_Result::The exact biomarker result term, such as 'positive', 'negative', 'amplified', 'high', or 'low'. Extract only the result term itself.",
"Metastasis::The exact metastatic disease phrase including site when present, such as 'right lung metastasis' or 'bone metastasis'. Extract only the metastasis phrase itself.",
"Cancer_Surgery::The exact cancer surgery phrase, including laterality when present, such as 'left mastectomy' or 'axillary lymph node dissection'. Extract only the procedure phrase itself.",
"Chemotherapy::The exact chemotherapy drug name or treatment term, such as 'adriamycin', 'cyclophosphamide', or 'chemotherapy'. Extract only the drug or treatment term itself.",
"Radiotherapy::The exact radiation treatment term, such as 'radiotherapy' or 'radiation therapy'. Extract only the treatment term itself, not surrounding action phrases, recipients, or body sites.",
"Line_Of_Therapy::The exact treatment-line phrase, such as 'first line therapy' or 'second line treatment'. Extract only the treatment-line phrase itself.",
"Response_To_Treatment::The exact response or progression term, such as 'recurred', 'progressed', 'improved', or 'stable disease'. Extract only the response term itself.",
"Dosage::The exact dose expression, such as '60 mg/m2' or '400 mg'. Extract only the dose expression itself.",
"Relative_Date::The exact relative time phrase, such as 'twenty years ago' or '13 years later'. Extract only the temporal phrase itself."
]
)
.setStructures([
("cancer_case", [
"diagnosis::str::The exact cancer diagnosis phrase, such as 'left breast cancer' or 'lung cancer'. Extract only the diagnosis phrase itself.",
"site::list::The exact anatomical site phrase related to the cancer, including laterality when present, such as 'left breast' or 'right lung'. Extract only the site phrase itself, not the diagnosis phrase.",
"metastasis::list::The exact metastatic disease phrase, including site when present, such as 'right lung metastasis' or 'bone metastasis'. Extract only the metastasis phrase itself.",
"surgery::list::The exact cancer-related surgery phrase, including laterality when present, such as 'left mastectomy' or 'axillary lymph node dissection'. Extract only the procedure phrase itself, not surrounding action phrases.",
"radiotherapy::list::The exact radiation treatment term, such as 'radiotherapy' or 'radiation therapy'. Extract only the treatment term itself, not surrounding action phrases, recipients, or body sites.",
"line_of_therapy::str::The exact treatment-line phrase, such as 'first line therapy' or 'second line therapy'. Extract only the treatment-line phrase itself.",
"response::str::The exact response or progression term, such as 'recurred', 'progressed', 'improved', or 'stable disease'. Extract only the response term itself.",
"relative_date::list::The exact relative time phrase, such as 'twenty years ago' or '13 years later'. Extract only the temporal phrase itself."
]),
("biomarker_item", [
"name::str::The exact biomarker name, such as 'ER', 'PR', 'HER2', or 'PD-L1'. Extract only the biomarker name itself.",
"result::str::The exact biomarker result term associated with that biomarker, such as 'positive', 'negative', 'amplified', 'high', or 'low'. Extract only the result term itself."
]),
("chemotherapy_item", [
"drug::str::The exact chemotherapy drug name, such as 'adriamycin' or 'cyclophosphamide'. Extract only the drug name itself.",
"dosage::str::The exact dose expression associated with the drug, such as '60 mg/m2' or '600 mg/m2'. Extract only the dosage expression itself."
])
])
.setEntityThreshold(0.4)
)
results = Pipeline().setStages([document_assembler, zero_shot]).fit(data).transform(data)
Result:
{
"cancer_case": [
{
"diagnosis": "left breast cancer",
"site": [
"left breast"
],
"metastasis": [
"right lung metastasis"
],
"surgery": [
"left mastectomy",
"axillary lymph node dissection"
],
"radiotherapy": [
"radiotherapy"
],
"line_of_therapy": "first line therapy",
"response": "recurred",
"relative_date": [
"13 years later",
"twenty years ago"
]
}
],
"biomarker_item": [
{
"name": "ER",
"result": "positive"
},
{
"name": "PR",
"result": "positive"
}
],
"chemotherapy_item": [
{
"drug": "adriamycin",
"dosage": "60 mg/m2"
},
{
"drug": "cyclophosphamide",
"dosage": "600 mg/m2"
}
]
}
PretrainedZeroShotMultiTask models are designed to perform multiple tasks in one pass, but if you only want to run NER, for example, you can simply configure the model with setEntities and leave the other task setters empty. The output will still be written to the same extractions column, but it will contain only chunk annotations for the NER task. You can then use MultiAnnotationSplitter to split the chunk annotations into a separate column for downstream processing.
This flexibility allows you to create De-Identification pipelines that leverage the zero-shot NER capabilities of PretrainedZeroShotMultiTask without needing to configure classification, relation, or structured extraction tasks if they are not needed for your use case.
data = spark.createDataFrame(
[["Dr. John Smith works at Google in New York. The product costs $99. Sentiment: very positive."]]
).toDF("text")
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
zero_shot_multi_task = (
PretrainedZeroShotMultiTask.pretrained("zeroshot_multitask_base", "en", "clinical/models")
.setInputCols(["document"])
.setOutputCol("extractions")
.setEntities(["NAME::name of person", "LOCATION::city or place name", "ORGANIZATION::company name"])
.setEntityThreshold(0.7)
)
multi_annotation_splitter = (
MultiAnnotationSplitter()
.setInputCols("extractions")
.setOutputCol("ner_chunk")
.setSplitType("chunk")
)
light_deid = LightDeIdentification().setInputCols("document", "ner_chunk").setOutputCol("deid")
full_pipe = Pipeline(stages=[document_assembler, zero_shot_multi_task, multi_annotation_splitter, light_deid]).fit(data)
full_pipe.transform(data).select("deid").show(truncate=False)
Result:
+------------------------------------------------------------------------------------------------+
|result |
+------------------------------------------------------------------------------------------------+
|[<NAME> works at <ORGANIZATION> in <LOCATION>. The product costs $99. Sentiment: very positive.]|
+------------------------------------------------------------------------------------------------+
Please see more detailed explanations in the Workshop notebook
JSL_MedS NER (LLM — q16 — v5) for Structured Clinical Trial Eligibility Parsing
This release adds JSL_MedS NER (LLM — q16 — v5), a Medical LLM–based named entity recognition model that extracts structured entities from clinical trial eligibility criteria. Given unstructured inclusion and exclusion text, it returns JSON with inclusion and exclusion entity lists. The model supports four entity types — age, lab, condition, and pregnancy — each following a fixed schema with normalized values and numeric thresholds where applicable.
The model is intended for clinical trial eligibility parsing, patient matching, and structured medical information extraction workflows, and runs on the llama.cpp engine via the MedicalLLM family in Healthcare NLP.
| Model Name | Description |
|---|---|
| jsl_meds_ner_q16_v5 | This LLM model extracts structured entities from clinical trial eligibility criteria. Given unstructured text, it returns JSON with inclusion and exclusion entities. The model supports four entity types: age, lab, condition, and pregnancy. Each entity follows a fixed schema with normalized values and numeric thresholds where applicable. This model is intended for clinical trial eligibility parsing, patient matching, and structured medical information extraction workflows. |
Example: jsl_meds_ner_q16_v5
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
llm = MedicalLLM.pretrained("jsl_meds_ner_q16_v5", "en", "clinical/models")\
.setInputCols(["document"])\
.setOutputCol("llm_output")
pipeline = Pipeline(stages=[documentAssembler, llm])
text = """
Inclusion Criteria:
Adults aged 18–75 with confirmed type 2 diabetes mellitus. HbA1c 7.0–10.5%.
Fasting plasma glucose ≥126 mg/dL. Negative pregnancy test required.
Exclusion Criteria:
Type 1 diabetes mellitus. HbA1c >10.5%. Fasting plasma glucose >300 mg/dL.
Pregnant or breastfeeding. Diabetic ketoacidosis. Uncontrolled thyroid disease.
"""
data = spark.createDataFrame([[text]]).toDF("text")
model = pipeline.fit(data)
result = model.transform(data)
Result:
{
"inclusion": [
{"type":"age","min_years":18,"max_years":75},
{"type":"condition","name":"type 2 diabetes mellitus"},
{"type":"lab","name":"HbA1c","min_percent":7.0,"max_percent":10.5},
{"type":"lab","name":"fasting plasma glucose","min_value":126},
{"type":"pregnancy","status":"negative_pregnancy_test"}
],
"exclusion": [
{"type":"condition","name":"type 1 diabetes mellitus"},
{"type":"lab","name":"HbA1c","max_percent":10.5},
{"type":"lab","name":"fasting plasma glucose","max_value":300},
{"type":"pregnancy","status":"pregnant_or_lactating"},
{"type":"condition","name":"diabetic ketoacidosis"},
{"type":"condition","name":"uncontrolled thyroid disease"}
]
}
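Since the model's completion is a JSON string (as in the Result above), downstream code can parse it directly. A minimal sketch, assuming the JSON text has already been extracted from the llm_output annotation (the raw string below is abbreviated illustrative output, not a full model response):

```python
import json

# Abbreviated illustrative model output; real completions follow the same schema.
raw = '''{"inclusion": [{"type": "age", "min_years": 18, "max_years": 75},
{"type": "condition", "name": "type 2 diabetes mellitus"}],
"exclusion": [{"type": "pregnancy", "status": "pregnant_or_lactating"}]}'''

parsed = json.loads(raw)

def entities_of_type(entities, entity_type):
    """Filter an inclusion/exclusion list down to one entity type."""
    return [e for e in entities if e["type"] == entity_type]

age_rules = entities_of_type(parsed["inclusion"], "age")
print(age_rules)  # [{'type': 'age', 'min_years': 18, 'max_years': 75}]
```

Because each entity follows a fixed schema, this kind of type-keyed filtering is enough to feed the criteria into patient-matching logic.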
Overall, this release extends our Medical LLM portfolio with a production-oriented model for structured eligibility parsing, helping teams move from free-text trial criteria to machine-readable JSON for downstream matching and analytics.
Lower-Memory NER Training via explodeSentencesForTraining Parameter
MedicalNerApproach now supports an optional training-time preprocessing step that splits multi-sentence rows into one row per sentence before training. This can reduce peak memory usage and improve training efficiency on long documents.
The new behavior is controlled by explodeSentencesForTraining and is disabled by default (False).
Performance (example benchmark)
Dataset: 30 documents, ~30,000 characters.
| explodeSentencesForTraining | Approx. training time | Approx. peak RAM |
|---|---:|---:|
| True | ~11 minutes | ~9.3 GB |
| False | ~24 minutes | ~30 GB |
Note: Results may vary depending on hardware, Spark configuration, dataset characteristics, and model hyperparameters.
What changed
- Added explodeSentencesForTraining to MedicalNerApproach.
- Kept validation splitting at the document level first, then optionally exploded each split into sentence rows for training.

Example:
nerTagger = (
MedicalNerApproach()
.setInputCols(["splitter", "token", "embeddings"])
.setLabelColumn(LABEL_COLUMN)
.setOutputCol("ner")
.setMaxEpochs(30)
.setBatchSize(8)
.setRandomSeed(0)
.setVerbose(1)
.setValidationSplit(0.2)
.setEvaluationLogExtended(True)
.setEnableOutputLogs(True)
.setIncludeConfidence(True)
.setOutputLogsPath("ner_logs")
.setEarlyStoppingCriterion(0.01)
.setEarlyStoppingPatience(5)
.setUseBestModel(False)
.setExplodeSentencesForTraining(True) # default: False; set to True to enable sentence explosion
)
training_pipeline = Pipeline(stages=[nerDLGraphChecker, nerTagger])
ner_model_nosplit = training_pipeline.fit(train_df)
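Conceptually, the explosion step turns each multi-sentence row into one row per sentence before batching, so training batches hold shorter sequences and peak memory drops. An illustrative pure-Python sketch of that transformation (not the annotator's internal Spark implementation):

```python
def explode_rows(rows):
    """rows: list of (doc_id, [sentences]) -> one (doc_id, sentence) pair per sentence."""
    return [(doc_id, sent) for doc_id, sents in rows for sent in sents]

# Two documents, one with two sentences and one with a single sentence
docs = [
    (0, ["Patient has diabetes.", "BP is 140/90."]),
    (1, ["No acute findings."]),
]

print(explode_rows(docs))
# [(0, 'Patient has diabetes.'), (0, 'BP is 140/90.'), (1, 'No acute findings.')]
```

Because validation splitting still happens at the document level first, sentences from one document never leak across the train/validation boundary.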
DocumentHashCoder Integration with Light Pipeline to Enable Consistent Hashing in Both Spark Pipeline Transforms and LightPipeline fullAnnotate-Style Inference
DocumentHashCoder has been updated to be compatible with LightPipeline. With this enhancement, the same hashing behavior is preserved across both standard Spark Pipeline flows and low-latency LightPipeline inference.
Example: DocumentHashCoder with transform()
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler, LightPipeline
from sparknlp_jsl.annotator import DocumentHashCoder
data = spark.createDataFrame(
[
(1, "Patient has diabetes and hypertension."),
(2, "No acute cardiopulmonary process detected."),
],
["id", "text"],
)
document_assembler = (
DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
)
document_hasher = (
DocumentHashCoder()
.setInputCols("document")
.setOutputCol("hashed_document")
.setPatientIdColumn("id")
.setRangeDays(100)
.setNewDateShift("shift_days")
.setSeed(42)
)
pipeline = Pipeline(stages=[document_assembler, document_hasher])
model = pipeline.fit(data)
# Spark pipeline output
transformed = model.transform(data)
transformed.select("id", "hashed_document").show(truncate=False)
Result:
+---+----------------------------------------------------------------------------------------------------------------------+
|id |hashed_document |
+---+----------------------------------------------------------------------------------------------------------------------+
|1 |[{document, 0, 37, Patient has diabetes and hypertension., {sentence -> 0, patientId -> 1, dateshift -> 47}, []}] |
|2 |[{document, 0, 41, No acute cardiopulmonary process detected., {sentence -> 0, patientId -> 2, dateshift -> -26}, []}]|
+---+----------------------------------------------------------------------------------------------------------------------+
Example: DocumentHashCoder with fullAnnotate()
# LightPipeline output
texts = [
"Patient has diabetes and hypertension.",
"No acute cardiopulmonary process detected.",
]
light = LightPipeline(model)
light_out = light.fullAnnotate(texts, metadata={"id": ["1", "2"]})
print(light_out[0]["hashed_document"])
print(light_out[1]["hashed_document"])
Result:
[Annotation(document, 0, 37, Patient has diabetes and hypertension., {'sentence': '0', 'patientId': '1', 'dateshift': '47'}, [])]
[Annotation(document, 0, 41, No acute cardiopulmonary process detected., {'sentence': '0', 'patientId': '2', 'dateshift': '-26'}, [])]
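The consistency property can be illustrated with a toy seeded hash. This is not DocumentHashCoder's internal algorithm, just a sketch of why a deterministic, patient-id-keyed hash produces the same date shift whether it is computed in a Spark transform or a LightPipeline call:

```python
import hashlib

def date_shift(patient_id: str, seed: int, range_days: int) -> int:
    """Deterministically map (seed, patient_id) into [-range_days, range_days]."""
    digest = hashlib.sha256(f"{seed}:{patient_id}".encode()).hexdigest()
    return int(digest, 16) % (2 * range_days + 1) - range_days

# Same id + seed always yields the same shift, regardless of execution path
shift_a = date_shift("1", seed=42, range_days=100)
shift_b = date_shift("1", seed=42, range_days=100)
print(shift_a == shift_b)  # True
```

The values 42 and 100 mirror the setSeed and setRangeDays settings above; the actual shift values (47, -26 in the example output) come from the library's own hashing scheme.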
De-identification: setDeidMarkers to wrap de-identified spans with configurable prefix/suffix markers in the de-identified text
DeIdentification now supports setDeidMarkers, which defines the prefix and suffix markers used to wrap de-identified entity spans in the output text.
- The first value is the prefix marker and the second is the suffix marker.
- Example: ("[START]", "[END]") wraps sensitive spans so the output reads like [START]...entity text...[END] (depending on mode and pipeline configuration).
Parameter formats
- List or tuple: (prefix, suffix), e.g. ("<DEID>", "</DEID>")
- Dict: keys start (prefix) and/or end (suffix). If only one key is provided, the other defaults to "". Valid keys are start and end only.
Example:
light_de_identification = (
LightDeIdentification()
.setInputCols(["ner_chunk", "document"])
.setOutputCol("dei")
.setMode("obfuscate")
.setObfuscateRefSource("faker")
.setSeed(10)
.setDeidMarkers({"start": "[S]", "end": "[E]"})
)
Result:
Original document text:
Patient John Doe was seen in the clinic.
De-identified text (obfuscated), with markers:
Patient [S]Valerie Aho[E] was seen in the clinic.
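A minimal stand-alone sketch of the wrapping behavior (wrap_spans is a hypothetical helper written for illustration, not the library implementation):

```python
def wrap_spans(text, spans, markers=("[S]", "[E]")):
    """Wrap each (begin, end) character span (end inclusive) with prefix/suffix markers."""
    prefix, suffix = markers
    out, cursor = [], 0
    for begin, end in sorted(spans):
        out.append(text[cursor:begin])
        out.append(prefix + text[begin:end + 1] + suffix)
        cursor = end + 1
    out.append(text[cursor:])
    return "".join(out)

text = "Patient John Doe was seen in the clinic."
print(wrap_spans(text, [(8, 15)]))  # Patient [S]John Doe[E] was seen in the clinic.
```

In the real pipeline the spans come from the NER chunk column and, in obfuscate mode, the span text is first replaced by the faker value before the markers are applied.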
Obfuscation evaluation: a new utility function to report leakage and success rates when comparing original vs. obfuscated columns after structured data obfuscation
A new helper, evaluate_obfuscation, is available to measure obfuscation quality on a Spark DataFrame by comparing original and obfuscated column values. It reports leakage (rows where sensitive values were unchanged) and success rate (complement of leakage rate among evaluated rows).
Behavior
- Compares original_col vs obfuscated_col row-wise.
- null_handling: ignore (skip rows where both sides are null/empty), equal (treat null as the empty string before comparing), or strict (no null normalization). Default: ignore.
- normalize_strings: optional whitespace trim before comparison; comparison remains case-sensitive. Default: True.
Returns
A dict with:
- total_rows: rows included after filtering
- leakage_count: rows where original equals obfuscated
- leakage_rate: leakage_count / total_rows
- success_rate: 1 - leakage_rate (for evaluated rows)
Example (with StructuredDeidentification):
from sparknlp_jsl.utils import evaluate_obfuscation
from sparknlp_jsl.structured_deidentification import StructuredDeidentification
obfuscator = StructuredDeidentification(
spark=spark,
columns={"NAME": "NAME"},
obfuscateRefSource="faker",
)
obfuscated_df = obfuscator.obfuscateColumns(
df,
outputAsArray=False,
overwrite=False,
suffix="_OBFUSCATED",
)
eval_result = evaluate_obfuscation(
obfuscated_df,
original_col="NAME",
obfuscated_col="NAME_OBFUSCATED",
null_handling="strict",
normalize_strings=True,
)
print(eval_result)
Result:
{
"total_rows": 1000,
"leakage_count": 0,
"leakage_rate": 0.0,
"success_rate": 1.0,
}
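The metric semantics can be sketched in plain Python (illustrative only; evaluate_obfuscation itself operates on Spark DataFrame columns):

```python
def leakage_report(pairs, null_handling="ignore", normalize_strings=True):
    """Compare (original, obfuscated) value pairs row-wise and report leakage."""
    total = leaked = 0
    for orig, obf in pairs:
        if null_handling == "ignore" and not orig and not obf:
            continue  # skip rows where both sides are null/empty
        if null_handling == "equal":
            orig, obf = orig or "", obf or ""
        if normalize_strings:
            orig = (orig or "").strip()
            obf = (obf or "").strip()
        total += 1
        leaked += orig == obf  # leakage: original value survived obfuscation
    rate = leaked / total if total else 0.0
    return {"total_rows": total, "leakage_count": leaked,
            "leakage_rate": rate, "success_rate": 1 - rate}

report = leakage_report([("John Doe", "Valerie Aho"), ("Mary", "Mary")])
print(report)
# {'total_rows': 2, 'leakage_count': 1, 'leakage_rate': 0.5, 'success_rate': 0.5}
```

The second pair above is a leaked row: the "obfuscated" value is identical to the original, which is exactly what the leakage_count metric surfaces.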
14 New Entity Resolver Models for Mapping Clinical Terms to Standardized SNOMED CT and LOINC Codes
We are expanding our SentenceEntityResolver (resolving named entities to medical terminologies) lineup with **four LOINC** and **ten SNOMED CT** models, covering laboratory and clinical observation coding (LOINC) and SNOMED CT concept resolution across multiple semantic slices.
LOINC resolvers map extracted clinical entities to Logical Observation Identifiers Names and Codes (LOINC) using sentence embeddings; resolutions include the official LOINC description in brackets where applicable. The family is trained against LOINC v2.81 and follows the same NER → chunk → BertSentenceEmbeddings → resolver pattern as our existing LOINC resolvers, with variants for augmented and numeric-augmented training, a base SBioBERT-cased model, and an uncased BlueBERT (sbluebertresolve_loinc_uncased) option for case-insensitive matching.
SNOMED resolvers map clinical entities and concepts to SNOMED CT codes using sbiobert_base_cased_mli sentence embeddings (and mpnet_embeddings_biolord_2023_c for biolordresolve_snomed_findings_aux_concepts), with models specialized by domain (e.g. conditions, drugs, body structure, findings, procedures and measurements, veterinary, auxiliary concepts).
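At its core, the resolution step is a nearest-neighbor search in sentence-embedding space. A toy sketch with made-up 3-dimensional vectors (real models use sbiobert or Biolord sentence embeddings with many more dimensions), matching the Euclidean distance configured via setDistanceFunction("EUCLIDEAN") in the examples below:

```python
import math

# Toy code-to-embedding index; illustrative vectors only, not real embeddings.
code_embeddings = {
    "73949004": [0.9, 0.1, 0.0],  # atropine
    "59187003": [0.1, 0.9, 0.0],  # dopamine
}

def resolve(chunk_embedding, candidates):
    """Return the candidate code whose embedding is nearest by Euclidean distance."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(candidates, key=lambda code: dist(candidates[code], chunk_embedding))

print(resolve([0.85, 0.15, 0.02], code_embeddings))  # 73949004
```

The pretrained resolvers additionally return ranked alternatives (all_codes / all_resolutions) and auxiliary metadata per candidate, as shown in the result tables below.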
LOINC models
| Model Name | Description |
|---|---|
| sbiobertresolve_loinc_augmented | Maps medical entities to LOINC codes (augmented variant) |
| sbiobertresolve_loinc | Maps extracted medical entities to LOINC codes using sbiobert_base_cased_mli |
| sbiobertresolve_loinc_numeric_augmented | Maps medical entities to LOINC codes (numeric-augmented variant) |
| sbluebertresolve_loinc_uncased | Uncased BlueBERT-based LOINC sentence resolver |
SNOMED models
| Model Name | Description |
|---|---|
| sbiobertresolve_snomed_auxConcepts | Maps clinical entities and concepts to SNOMED CT (auxiliary concepts scope) |
| sbiobertresolve_snomed_bodyStructure | Maps entities to SNOMED CT for body structure concepts |
| sbiobertresolve_snomed_conditions | Maps entities to SNOMED CT for clinical conditions |
| sbiobertresolve_snomed_drug | Maps entities to SNOMED CT for drug-related concepts |
| biolordresolve_snomed_findings_aux_concepts | SNOMED findings / auxiliary concepts using Biolord sentence embeddings |
| sbiobertresolve_snomed_findings_aux_concepts | SNOMED findings with auxiliary concepts (SBioBERT) |
| sbiobertresolve_snomed_findings | Maps clinical findings to SNOMED CT |
| sbiobertresolve_snomed_no_class | SNOMED resolution without a fixed semantic-class constraint |
| sbiobertresolve_snomed_procedures_measurements | Maps procedures and measurements to SNOMED CT |
| sbiobertresolve_snomed_veterinary | Maps entities to SNOMED CT in a veterinary-focused setting |
Example (LOINC — sbiobertresolve_loinc):
document_assembler = nlp.DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models") \
.setInputCols(["document"]) \
.setOutputCol("sentence")
tokenizer = nlp.Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
ner_model = medical.NerModel.pretrained("ner_jsl", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = medical.NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")\
.setWhiteList(["Test"])
chunk2doc = medical.Chunk2Doc()\
.setInputCols("ner_chunk")\
.setOutputCol("ner_chunk_doc")
sbert_embedder = nlp.BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("sbert_embeddings")\
.setCaseSensitive(False)
resolver = medical.SentenceEntityResolverModel.pretrained("sbiobertresolve_loinc","en", "clinical/models") \
.setInputCols(["sbert_embeddings"]) \
.setOutputCol("resolution")\
.setDistanceFunction("EUCLIDEAN")
nlpPipeline = nlp.Pipeline(stages=[document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
ner_model,
ner_converter,
chunk2doc,
sbert_embedder,
resolver])
data = spark.createDataFrame([["""A 65-year-old woman presents to the office with generalized fatigue for the last 4 months.
She used to walk 1 mile each evening but now gets tired after 1-2 blocks. She has a history of Crohn disease and hypertension
for which she receives appropriate medications. She is married and lives with her husband. She eats a balanced diet that
includes chicken, fish, pork, fruits, and vegetables. She rarely drinks alcohol and denies tobacco use. A physical examination
is unremarkable. Laboratory studies show the following: Hemoglobin: 9.8g/dL, Hematocrit: 32%, Mean Corpuscular Volume: 110 μm3"""]]).toDF("text")
result = nlpPipeline.fit(data).transform(data)
Result (LOINC):
| chunk | begin | end | ner_label | loinc_code | description | resolutions | all_codes | aux_labels |
|---|---|---|---|---|---|---|---|---|
| physical examination | 450 | 469 | Test | 29544-4 | Physical findings [Physical findings] | Physical findings [Physical findings]:::Physical findings… | 29544-4:::29545-1:::55286-9:::11435-5:::11384-5:::8709-8:… | ACTIVE:::ACTIVE:::ACTIVE:::ACTIVE:::ACTIVE:::ACTIVE:::ACT… |
| Laboratory studies | 490 | 507 | Test | 26436-6 | Laboratory studies (set) [Laboratory studies (set)] | Laboratory studies (set) [Laboratory studies (set)]:::Lab… | 26436-6:::52482-7:::11502-2:::34075-2:::100455-5:::85069-… | ACTIVE:::DISCOURAGED:::ACTIVE:::ACTIVE:::ACTIVE:::ACTIVE:… |
| Hemoglobin | 529 | 538 | Test | 10346-5 | Hemoglobin [Hemoglobin A [Units/volume] in Blood by Elect…] | Hemoglobin [Hemoglobin A [Units/volume] in Blood by Elect…] | 10346-5:::109592-6:::11559-2:::2030-5:::34618-9:::38896-7… | ACTIVE:::TRIAL:::ACTIVE:::ACTIVE:::ACTIVE:::ACTIVE:::ACTI… |
| Hematocrit | 550 | 559 | Test | 11559-2 | Fractional hemoglobin [Fractional oxyhemoglobin in Blood] | Fractional hemoglobin [Fractional oxyhemoglobin in Blood]… | 11559-2:::10346-5:::41986-1:::48703-3:::55103-6:::8478-0:… | ACTIVE:::ACTIVE:::ACTIVE:::ACTIVE:::ACTIVE:::ACTIVE:::ACT… |
| Mean Corpuscular Volume | 567 | 589 | Test | 30386-7 | Erythrocyte mean corpuscular diameter [Length] [Erythrocy…] | Erythrocyte mean corpuscular diameter [Length] [Erythrocy…] | 30386-7:::101864-7:::20161-6:::18033-1:::19853-1:::101150… | ACTIVE:::ACTIVE:::ACTIVE:::ACTIVE:::ACTIVE:::ACTIVE:::ACT… |
Example (SNOMED — sbiobertresolve_snomed_auxConcepts):
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
ner_jsl = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models")\
.setInputCols(["sentence", "token", "embeddings"])\
.setOutputCol("ner_jsl")
ner_jsl_converter = NerConverterInternal()\
.setInputCols(["sentence", "token", "ner_jsl"])\
.setOutputCol("ner_jsl_chunk")\
.setWhiteList(["Procedure","Substance","Drug_Ingredient","Internal_organ_or_component","Modifier","BMI","LDL","External_body_part_or_region","Alcohol","Treatment","Test","Smoking"])
chunk2doc = Chunk2Doc()\
.setInputCols("ner_jsl_chunk")\
.setOutputCol("ner_chunk_doc")
sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("sbert_embeddings")\
.setCaseSensitive(False)
snomed_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_snomed_auxConcepts","en","clinical/models")\
.setInputCols(["sbert_embeddings"])\
.setOutputCol("snomed_code")\
.setDistanceFunction("EUCLIDEAN")
nlpPipeline= Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
word_embeddings,
ner_jsl,
ner_jsl_converter,
chunk2doc,
sbert_embedder,
snomed_resolver
])
sample_text = """This is an 82-year-old male with a history of prior tobacco use, hypertension, chronic renal insufficiency, COPD, gastritis, and TIA. He initially presented to Braintree with a nonspecific ST-T abnormality and was transferred to St. Margaret’s Center. He underwent cardiac catheterization because of occlusion of the mid left anterior descending coronary artery lesion, which was complicated by hypotension and bradycardia. He required atropine, IV fluids, and dopamine, possibly secondary to a vagal reaction. He was subsequently transferred to the CCU for close monitoring. He was hemodynamically stable at the time of admission to the CCU."""
df= spark.createDataFrame([[sample_text]]).toDF("text")
result= nlpPipeline.fit(df).transform(df)
Result (SNOMED):
| sent_id | ner_chunk | entity | snomed_code | resolution | all_codes | all_resolutions | aux_list |
|---|---|---|---|---|---|---|---|
| 0 | tobacco | Smoking | 57264008 | tobacco | [57264008, 102407002, 39953003, 159882006, 230491000087101, 102408007, 722496004, 38402001, 7110…] | [tobacco, tobacco smoke, tobacco - substance, tobacco processor, shisha tobacco, cigarette smoke…] | [Organism, Substance, Substance, Social Context, No_Concept_Class, Substance, Physical Object, …] |
| 1 | nonspecific | Modifier | 10003008 | non-specific | [10003008, 261992003, 863956004, 300844001, 278001007, 863967003, 236821005, 276119007, 12170110…] | [non-specific, non-biological, non-sterile, non-resonant, nonspecific site, non-absorbable, …] | [Qualifier Value, Qualifier Value, Qualifier Value, Qualifier Value, Body Structure, Qualifier Value…] |
| 2 | cardiac catheterization | Procedure | 41976001 | cardiac catheterization | [41976001, 705923009, 721968000, 467735004, 129085009, 425315000, 128956009, 467525008, 12895700…] | [cardiac catheterization, cardiac catheter, cardiac catheterization report, …] | [Procedure, Physical Object, Record Artifact, Physical Object, Qualifier Value, Procedure, …] |
| 3 | atropine | Drug_Ingredient | 73949004 | atropine | [73949004, 105075009, 349945006, 410493009, 74237004, 50507004, 109207004, 349946007, 34013000…] | [atropine, atropine measurement, oral atropine, atropinization, atropine sulfate, …] | [Pharma/Biol Product, Procedure, Clinical Drug Form, Procedure, Substance, Substance, …] |
| 3 | fluids | Drug_Ingredient | 255765007 | fluid | [255765007, 246498002, 258442002, 251851008, 32457005, 396276007, 406142009, 251840008, 33463005…] | [fluid, fluid used, fluid sample, fluid input, body fluid, fluid appearance, …] | [Qualifier Value, Attribute, Specimen, Observable Entity, Substance, Observable Entity, …] |
| 3 | dopamine | Drug_Ingredient | 59187003 | dopamine | [59187003, 412383006, 37484001, 32779004, 412845004, 713493000, 418222008, 12307008, 384952006…] | [dopamine, dopamine agent, dopamine receptor, dopamine measurement, serum dopamine level, …] | [Pharma/Biol Product, Substance, Substance, Procedure, Procedure, Substance, …] |
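A flat table like the one above can be assembled from the resolver's annotation metadata. The sketch below is illustrative only: it runs on a mocked, `fullAnnotate`-style annotation rather than a live pipeline, and the metadata keys shown (`resolved_text`, `all_k_results`, `all_k_resolutions`, `all_k_aux_labels`) reflect typical `SentenceEntityResolverModel` output but should be verified against your installed version.

```python
# Illustrative post-processing: flatten resolver annotations into table rows.
# The annotation below is mocked to mimic a fullAnnotate-style result; in a
# real run these dicts would come from the "snomed_code" output column.
mock_annotations = [
    {
        "result": "57264008",  # top-ranked SNOMED code for the chunk
        "metadata": {
            "resolved_text": "tobacco",
            "all_k_results": "57264008:::102407002:::39953003",
            "all_k_resolutions": "tobacco:::tobacco smoke:::tobacco - substance",
            "all_k_aux_labels": "Organism:::Substance:::Substance",
        },
    },
]

def flatten(annotations):
    """Turn resolver annotations into flat rows like the table above."""
    rows = []
    for ann in annotations:
        meta = ann["metadata"]
        rows.append({
            "snomed_code": ann["result"],
            "resolution": meta["resolved_text"],
            # candidate lists are ":::"-delimited strings in the metadata
            "all_codes": meta["all_k_results"].split(":::"),
            "all_resolutions": meta["all_k_resolutions"].split(":::"),
            "aux_list": meta["all_k_aux_labels"].split(":::"),
        })
    return rows

rows = flatten(mock_annotations)
```

The same loop works over each chunk's annotation to reproduce the `all_codes` / `all_resolutions` / `aux_list` columns shown above.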
New Research Publications
- **A Multi-Domain Red Teaming Framework for Safety, Robustness, and Fairness Evaluation of Medical Large Language Models**. Presented at Text2Story 2026, this work introduces a multi-domain red teaming framework that evaluates medical LLMs across hundreds of clinically grounded scenarios spanning multiple domains and subcategories, combining adversarial transformations with a seven-dimension rubric, LLM-assisted scoring, and human-in-the-loop validation. The findings highlight substantial performance variance across models, meaningful worst-case failures even among models with high aggregate scores, equity-related sensitivity under demographic modifications, and the importance of hybrid (automated + clinician) evaluation for credible safety assessment.
- **Comprehensive DICOM Deidentification Using Visual NLP: A Dual-Level Approach to Privacy-Preserving Medical Imaging**. Accepted at the Computer Vision Conference (CVC) 2026, this paper describes a dual-level DICOM de-identification methodology using John Snow Labs Visual NLP to address PHI in both pixel-level image content and DICOM metadata. Evaluated on the MIDI-B dataset, the approach reports strong validation outcomes, including high accuracy for text processing and consistent handling of key metadata operations, while aiming to preserve clinical utility for research-oriented use cases.
- **Beyond Metadata Scrubbing: Production-Scale Pixel-Level and Metadata-Level DICOM Anonymisation for Healthcare Workflows**. Accepted under this title at the 2026 IEEE 14th International Conference on Healthcare Informatics (ICHI), this entry covers production-scale DICOM anonymisation at both the pixel and metadata levels for healthcare workflows, and corresponds to the full contribution summarized above under *Comprehensive DICOM Deidentification Using Visual NLP*.
New Blog Posts & Technical Deep Dives
- **Running the Latest LLMs on Spark: Llama.cpp Integration Gets a Major Upgrade**. This blog post describes how Spark NLP’s Llama.cpp stack was upgraded (including a Spark NLP upgrade and a newer upstream Llama.cpp release) to support a broader set of modern LLM families, including quantized GGUF models and multimodal variants, with faster, more memory-efficient inference on CPUs and typical hardware. It explains how AutoGGUFModel and related annotators plug into Spark ML pipelines, how to use pretrained GGUF checkpoints or load your own `.gguf` files, and why keeping pace with Llama.cpp helps teams run the latest open models at Spark scale with minimal extra infrastructure.
- **Multilingual Clinical NER with ONNX: New Models for Entity Extraction Across Languages**. This blog post introduces new ONNX-based clinical Named Entity Recognition models for English, Italian, and Spanish in John Snow Labs Healthcare NLP, aimed at extracting core medical entities such as diseases, procedures, medications, and symptoms with production-ready performance. It outlines why ONNX is a practical deployment format for clinical NLP, how these models fit into Spark NLP–style pipelines, and how multilingual coverage supports structured extraction from non-English clinical narratives alongside the existing English foundation.
Updated Notebooks and Demonstrations for Making Healthcare NLP Easier to Navigate and Understand
- New Pretrained Zero Shot Multi Task Notebook
We have added and updated a substantial number of new clinical models and pipelines, further solidifying our offering in the healthcare domain.
- `jsl_meds_ner_q16_v5`
- `sbiobertresolve_loinc_augmented`
- `sbiobertresolve_loinc`
- `sbiobertresolve_loinc_numeric_augmented`
- `sbluebertresolve_loinc_uncased`
- `zeroshot_multitask_generic`
- `zeroshot_multitask_base`
- `sbiobertresolve_snomed_auxConcepts`
- `sbiobertresolve_snomed_bodyStructure`
- `sbiobertresolve_snomed_conditions`
- `sbiobertresolve_snomed_drug`
- `biolordresolve_snomed_findings_aux_concepts`
- `sbiobertresolve_snomed_findings_aux_concepts`
- `sbiobertresolve_snomed_findings`
- `sbiobertresolve_snomed_no_class`
- `sbiobertresolve_snomed_procedures_measurements`
- `sbiobertresolve_snomed_veterinary`
- `rxnorm_umls_mapper`
- `umls_rxnorm_mapper`
- `umls_cpt_mapper`
- `cpt_umls_mapper`
- `snomed_umls_mapper`
- `umls_snomed_mapper`
- `account_parser_pipeline`
- `age_parser_pipeline`
- `bgeresolve_cpt_pipeline`
- `bgeresolve_icd10cm_pipeline`
- `bgeresolve_rxnorm_pipeline`
- `bgeresolve_snomed_pipeline`
- `cpt_mapper_pipeline`
- `cpt_parser_pipeline`
- `date_of_birth_parser_pipeline`
- `date_of_death_parser_pipeline`
- `date_regex_matcher_pipeline`
- `dln_parser_pipeline`
- `drug_form_matcher_pipeline`
- `drug_route_matcher_pipeline`
- `drug_strength_parser_pipeline`
- `email_regex_matcher_pipeline`
- `generic_classifier_measurement_pipeline`
- `icd10_parser_pipeline`
- `icd10cm_mapper_pipeline`
- `icd10cm_umls_mapping`
- `ip_regex_matcher_pipeline`
- `license_parser_pipeline`
- `loinc_umls_mapping`
- `med_parser_pipeline`
- `medical_device_matcher_pipeline`
- `mesh_umls_mapping`
- `ner_deid_generic_nonMedical_pipeline`
- `ner_deid_subentity_nonMedical_pipeline`
- `ner_drugs_large_v2_pipeline`
- `phone_parser_pipeline`
- `plate_parser_pipeline`
- `procedure_matcher_pipeline`
- `rxnorm_mapper_pipeline`
- `rxnorm_umls_mapping`
- `snomed_umls_mapping`
- `specimen_parser_pipeline`
- `ssn_parser_pipeline`
- `symptom_matcher_pipeline`
- `umls_clinical_drugs_mapping`
- `umls_clinical_findings_mapping`
- `umls_disease_syndrome_mapping`
- `umls_drug_substance_mapping`
- `umls_icd10cm_mapping`
- `umls_loinc_mapping`
- `umls_major_concepts_mapping`
- `umls_mesh_mapping`
- `umls_rxnorm_mapping`
- `umls_snomed_mapping`
- `url_regex_matcher_pipeline`
- `vin_parser_pipeline`
- `zeroshot_ner_deid_generic_nonMedical_large_pipeline`
- `zeroshot_ner_deid_generic_nonMedical_medium_pipeline`
- `zeroshot_ner_deid_subentity_nonMedical_large_pipeline`
- `zeroshot_ner_deid_subentity_nonMedical_medium_pipeline`
- `zip_regex_matcher_pipeline`
Versions
- 6.4.0
- 6.3.0
- 6.2.2
- 6.2.1
- 6.2.0
- 6.1.1
- 6.1.0
- 6.0.4
- 6.0.3
- 6.0.2
- 6.0.1
- 6.0.0
- 5.5.3
- 5.5.2
- 5.5.1
- 5.5.0
- 5.4.1
- 5.4.0
- 5.3.3
- 5.3.2
- 5.3.1
- 5.3.0
- 5.2.1
- 5.2.0
- 5.1.4
- 5.1.3
- 5.1.2
- 5.1.1
- 5.1.0
- 5.0.2
- 5.0.1
- 5.0.0
- 4.4.4
- 4.4.3
- 4.4.2
- 4.4.1
- 4.4.0
- 4.3.2
- 4.3.1
- 4.3.0
- 4.2.8
- 4.2.4
- 4.2.3
- 4.2.2
- 4.2.1
- 4.2.0
- 4.1.0
- 4.0.2
- 4.0.0
- 3.5.3
- 3.5.2
- 3.5.1
- 3.5.0
- 3.4.2
- 3.4.1
- 3.4.0
- 3.3.4
- 3.3.2
- 3.3.1
- 3.3.0
- 3.2.3
- 3.2.2
- 3.2.1
- 3.2.0
- 3.1.3
- 3.1.2
- 3.1.1
- 3.1.0
- 3.0.3
- 3.0.2
- 3.0.1
- 3.0.0
- 2.7.6
- 2.7.5
- 2.7.4
- 2.7.3
- 2.7.2
- 2.7.1
- 2.7.0
- 2.6.2
- 2.6.0
- 2.5.5
- 2.5.3
- 2.5.2
- 2.5.0
- 2.4.6
- 2.4.5
- 2.4.2
- 2.4.1
- 2.4.0