Healthcare NLP Release Notes

6.1.1

Highlights

We are delighted to announce significant new features and major enhancements in the latest release of Healthcare NLP. This release introduces brand-new Medical Vision language models (VLMs), new capabilities such as the new Annotation2Training module for converting annotations from GenAI Lab into trainable dataframes, the MedicalNerDLGraphChecker for well-informed NER training, domain-specific LLM pipelines, and faster ONNX models—a total of 78 new LLMs, pipelines, and ONNX models available out of the box.

Medical Vision LLM Models extend clinical AI with multimodal text and image understanding
Pretrained Clinical Pipelines for LLMs deliver ready-to-use Q&A, NER, summarization, RAG, and chat
Introducing MedicalNerDLGraphChecker improves graph management during medical NER training
Annotation2Training converts GenAI Lab annotations into NER-ready training datasets
Lightweight Text-to-SQL Model based on small LLMs enables seamless natural language queries on healthcare data
Human phenotype ontology (HPO) Mapping Models and Pipeline standardize phenotype recognition and linking via returning exact, related, and broad synonyms for each term at once
ONNX-Optimized MedicalBERT Models provide faster inference on CPU and GPU for certain tasks (NER, assertion)
Cross-Framework Benchmarking compares TensorFlow, ONNX, and OpenVINO, with ONNX leading on GPU
Additional De-Identification Enhancements strengthen HIPAA compliance and customization
Structured JSON Converter with Mappers streamlines integration of structured data
Various core improvements; bug fixes, enhanced overall robustness and reliability of Spark NLP for Healthcare
- Simplified Spark Session initialization
- ner_source metadata key in IOBTagger
- Consistent results with genderAwareness=True in DeIdentification
Updated notebooks and demonstrations for making Spark NLP for Healthcare easier to navigate and understand
- New MedicalVisionLLM MOOC Notebook
- New AssertionMerger MOOC Notebook
- New End2End Preannotation and Training Pipeline Notebook
- New GenAI Lab to Ner Training Notebook
- New NER ASSERTIONS RULE BASED Demo
- Updated MEDICAL LLM Demo
- Updated ENTITY RESOLUTION CODE MAPPING Demo
- Updated NER GENE PHENOTYPES Demo
The addition and update of numerous new clinical models and pipelines continue to reinforce our offering in the healthcare domain

Medical Vision LLM Models Extend Clinical AI with Multimodal Text and Image Understanding

In this release, we are expanding our Medical Vision LLM (VLM) family with additional models specifically finetuned for medical tasks. These models extend large language model capabilities with integrated visual language understanding, enabling multimodal clinical analysis by combining textual and image inputs.

The new VLMs provide strong performance for tasks such as diagnostic image interpretation, image-to-text summarization, and integrated documentation analysis — continuing our mission to advance clinical AI with robust, domain-specific multimodal solutions.

Model Name	Quantization Options
jsl_meds_ner_vlm_8b_v1	q4, q8, q16
jsl_meds_ner_vlm_7b_v1	q4, q8, q16

Example:

prompt = """
# Template:
{
  "Patient Name": "string",
  "Patient Age": "integer",
  "Patient Gender": "string",
  "Hospital Number": "string",
  "Episode Number": "string",
  "Episode Date": "date-time"
}
# Context:
<image>
"""

input_df = nlp.vision_llm_preprocessor(
    spark=spark,
    images_path="images",
    prompt=prompt,
    output_col_name="prompt"
)

document_assembler = (
    nlp.DocumentAssembler()
    .setInputCol("prompt")
    .setOutputCol("caption_document")
)

image_assembler = (
    nlp.ImageAssembler()
    .setInputCol("image")
    .setOutputCol("image_assembler")
)

medicalVisionLLM = (
    medical.AutoGGUFVisionModel.pretrained("jsl_meds_ner_vlm_8b_q16_v1", "en", "clinical/models")
    .setInputCols(["caption_document", "image_assembler"])
    .setOutputCol("completions")
)

pipeline = nlp.Pipeline().setStages([
    document_assembler,
    image_assembler,
    medicalVisionLLM
])

model = pipeline.fit(input_df)
result = model.transform(input_df)

Result:

{
    "Patient Name": "Ms RUKHSANA SHAHEEN",
    "Patient Age": 56,
    "Patient Gender": "Female",
    "Hospital Number": "MH005990453",
    "Episode Number": "030000528270",
    "Episode Date": "2021-07-02T08:31:00"
}

Pretrained Clinical Pipelines for LLMs Deliver ready-to-use Q&A, NER, summarization, RAG, and chat

This release introduces a new collection of pretrained clinical LLM pipelines, designed to streamline clinical document analysis. Each pipeline is built on top of state-of-the-art small-sized Medical LLMs finetuned by us, providing ready-to-use solutions for Q&A, NER, Summarization, Retrieval-Augmented Generation (RAG), and Chat.

The main advantage of these pipelines is the elimination of manual model chaining. Instead of building and testing complex workflows, users can instantly deploy one-liner pipelines that are efficient, accurate, and purpose-built for clinical tasks — reducing setup time while maintaining high performance.

Model Name	Description
`jsl_meds_4b_q16_v4_pipeline`	Q&A, NER, Summarization, RAG, and Chat.
`jsl_meds_8b_q8_v4_pipeline`	Q&A, NER, Summarization, RAG, and Chat.
`jsl_meds_ner_2b_q16_v2_pipeline`	Q&A, NER, Summarization, RAG, and Chat.
`jsl_meds_ner_q16_v4_pipeline`	Q&A, NER
`jsl_meds_ner_vlm_2b_q16_v2_pipeline`	Q&A, NER

Example:

from johnsnowlabs import nlp, medical

pipeline = nlp.PretrainedPipeline("jsl_meds_ner_2b_q16_v2_pipeline", "en", "clinical/models")

text = """
# Template:
{
  "Patient Name": "string",
  "Patient Age": "integer",
  "Patient Gender": "string",
  "Hospital Number": "string",
  "Episode Number": "string",
  "Episode Date": "date-time"
}
# Context:
The patient, Johnathan Miller, is a 54-year-old male admitted under hospital number HN382914. 
His most recent episode number is EP2024-1178, recorded on 2025-08-10. 
The patient presented with chronic knee pain and swelling. 
Past medical history includes hypertension and type 2 diabetes.
"""

Result:

{
    "Patient Name": "Johnathan Miller",
    "Patient Age": 54,
    "Patient Gender": "male",
    "Hospital Number": "HN382914",
    "Episode Number": "EP2024-1178",
    "Episode Date": "2025-08-10"
}

Please check Loading Medical and Open Source LLMs notebook for more detail

Introducing `MedicalNerDLGraphChecker` Improves Graph Management During Medical NER Training

This release introduces a new annotator, MedicalNerDLGraphChecker, designed to verify whether a suitable MedicalNerApproach TensorFlow graph is available for a given training dataset before computation begins. This prevents runtime errors and ensures that the correct graph is used for custom training workflows.

Along with the new annotator, we have added ~80 new graph files to support a wider range of configurations.

The MedicalNerDLGraphChecker must be placed before embeddings or MedicalNerApproach annotators in the pipeline. It processes the dataset to extract required graph parameters (tokens, labels, embedding dimensions) and raises an exception if the appropriate graph is missing in the JAR. This makes it especially useful for custom datasets and specialized NER training cases.

Example:

nerDLGraphChecker = (MedicalNerDLGraphChecker()
            .setInputCols(["sentence", "token"])
            .setLabelColumn("label")
            .setEmbeddingsModel(embeddings))
            
nerDLGraphChecker.fit(train_df)

Note: The MedicalNerDLGraphChecker automatically validates whether a suitable TensorFlow graph is bundled inside the JAR for the specified embeddings and labels.

If a required graph is not found, it will raise a detailed error message, guiding you on what configuration is missing.

If a suitable graph is found, it runs silently, confirming that the embedded graph is compatible, and you can proceed with training without manually creating graph files.

`Annotation2Training` Converts GenAI Lab Annotations into NER-ready Training Datasets

We’re introducing Annotation2Training, a utility that converts annotation outputs (from JSON or CSV) into a Spark DataFrame ready for NER training. It expects inputs structured like John Snow Labs’ Generative AI annotation tool exports and produces token-level labels aligned with sentences and documents.

🚀 Features & Highlights

Converts GenAI Exports to NER Training DataFrames
- convertJson2NerDF
- convertCsv2NerDF
Practical & Easy-to-Use: Convert JSON or CSV files directly into Spark DataFrames with a single function call.
Fast & Scalable: Optimized partitioning ensures high performance, leveraging system CPU cores for parallel processing.
Base Pipeline Compatibility: Works with your base pipeline (e.g., DocumentAssembler, SentenceDetector, InternalDocumentSplitter, and Tokenizer).
NER Training Ready: Use this to streamline data prep for MedicalNerApproach and other NER trainers—no manual wrangling, consistent columns, and validated alignment out of the box.
Modern Alternative: Eliminates the complexity of traditional CoNLL-based approach, providing a more efficient and reliable workflow.

Example:

JSON_PATH = "/content/result.json"

from sparknlp_jsl.training import Annotation2Training
annotation2training = Annotation2Training(spark)
training_df_json = annotation2training.convertJson2NerDF(
    json_path = JSON_PATH,                   # Path to the input JSON file.
    pipeline_model = base_pipeline_model,    # A pre-trained Spark NLP PipelineModel that includes at least a DocumentAssembler, and Tokenizer.
    repartition = (os.cpu_count() * 4),      # Number of partitions to use when creating the DataFrame (default is 32).
    token_output_col = "token",              # The name of the column containing token annotations (default is "token").
    ner_label_col = "label")                 # The name of the output column for NER labels (default is "label").

Please check the 1.7.GenAI Lab_to_Ner_Training.ipynb Notebook

Lightweight Text-to-SQL Model Enables Seamless Natural Language Queries on Healthcare Data

We are releasing a new Text-to-SQL model fine-tuned by John Snow Labs for the healthcare domain based on small LLMs. The jsl_meds_text2sql_1b_q16_v1 model is lightweight and optimized for transforming natural language queries into SQL, enabling seamless interaction with structured medical and healthcare datasets.

Example:

medical_llm = MedicalLLM.pretrained("jsl_meds_text2sql_1b_q16_v1", "en", "clinical/models")\
    .setInputCols("document")\
    .setOutputCol("completions")\
    .setBatchSize(1)\
    .setNPredict(100)\
    .setUseChatTemplate(True)\
    .setTemperature(0)

medm_prompt = """### Instruction:
### Instruction:
Table: CancerPatients
- patient_id (INT)
- name (VARCHAR)
- age (INT)
- gender (VARCHAR)
- cancer_type (VARCHAR)
- diagnosis_date (DATE)
List the names of patients diagnosed with breast cancer.
### Response:
"""

Results:

SELECT name FROM CancerPatients WHERE cancer_type = 'breast cancer'

HPO Mapping Models and Pipeline Standardize Phenotype Recognition and Linking

We are introducing new resources for phenotype extraction and ontology mapping. Together, the hpo_synonym_mapper model and the hpo_mapper_pipeline_v4 pipeline enable comprehensive recognition and standardization of phenotypic concepts in clinical and biomedical text.

The pretrained model maps Human Phenotype Ontology (HPO) terms to their exact, related, and broad synonyms, ensuring consistent representation of phenotypic concepts.
The pretrained pipeline extracts phenotype-related entities, maps them to HPO codes, determines assertion status (present, absent, suspected), and enriches the results by linking to UMLS CUIs, genes, and associated diseases.

These tools provide a powerful way to achieve deeper phenotypic and genomic insights directly from unstructured text, supporting downstream clinical and biomedical applications.

Model Name	Description
`hpo_synonym_mapper`	Maps HPO terms to their exact, related, and broad synonyms
`hpo_mapper_pipeline_v4`	Designed to extract phenotype-related entities from clinical or biomedical text, map them to their corresponding Human Phenotype Ontology (HPO) codes, and determine their assertion status

Example:

mapperModel = ChunkMapperModel.pretrained("hpo_synonym_mapper", "en", "clinical/models")\
    .setInputCols(["ner_chunk"])\
    .setOutputCol("mappings")\
    .setRels(["synonym"])

result = model.transform(spark.createDataFrame([["""The patient, a 62-year-old male, presented with a neoplasm in the lung. He also reported progressive fatigue over the past three months and episodes of shortness of breath. On examination, hepatomegaly was noted, and laboratory results confirmed anemia."""]]).toDF("text"))

Results:

term	synonym
shortness of breath	{‘exact_synonym’: [‘dyspnea’, ‘abnormal breathing’, ‘breathing difficulty’, ‘difficult to breathe’, ‘difficulty breathing’, ‘dyspnoea’, ‘trouble breathing’], ‘related_synonym’: [‘panting’], ‘broad_synonym’: []}
fatigue	{‘exact_synonym’: [‘fatigue’, ‘tired’, ‘tiredness’], ‘related_synonym’: [], ‘broad_synonym’: []}
neoplasm	{‘exact_synonym’: [‘neoplasia’, ‘oncological abnormality’, ‘tumor’, ‘tumour’], ‘related_synonym’: [‘cancer’, ‘oncology’], ‘broad_synonym’: [‘abnormal tissue mass’]}
anemia	{‘exact_synonym’: [‘anaemia’, ‘low number of red blood cells or haemoglobin’, ‘low number of red blood cells or hemoglobin’], ‘related_synonym’: [‘decreased haemoglobin’, ‘decreased hemoglobin’], ‘broad_synonym’: []}
progressive	{‘exact_synonym’: [‘worsens with time’], ‘related_synonym’: [‘progressive disorder’], ‘broad_synonym’: []}
hepatomegaly	{‘exact_synonym’: [‘enlarged liver’], ‘related_synonym’: [], ‘broad_synonym’: []}

ONNX-Optimized MedicalBERT Models Provide Faster Inference on CPU and GPU

We have converted multiple MedicalBertForSequenceClassification and MedicalBertForTokenClassifier models into ONNX format, enabling significant speed improvements when running on CPU and especially GPU. This conversion allows users to leverage the efficiency of ONNX Runtime, resulting in faster inference times. The new models ends with _onnx in their names to easily identify them.

Example:

sequence_classification = (
    MedicalBertForSequenceClassification.pretrained(
        "bert_sequence_classifier_ade_augmented_onnx",
        "en",
        "clinical/models"
    )
    .setInputCols(["token", "sentence"])
    .setOutputCol("ner")
    .setCaseSensitive(True)
)

data = spark.createDataFrame(["So glad I am off effexor, so sad it ruined my teeth. tip Please be carefull taking antideppresiva and read about it 1st",
                              "Religare Capital Ranbaxy has been accepting approval for Diovan since 2012"], StringType()).toDF("text")

Results:

text	result
So glad I am off effexor, so sad it ruined my teeth. tip Please be carefull taking antideppresiva and read about it 1st	[ADE]
Religare Capital Ranbaxy has been accepting approval for Diovan since 2012	[noADE]

Cross-Framework Benchmarking compares TensorFlow, ONNX, and OpenVINO, with ONNX leading on GPU

This benchmark evaluates the performance of Spark NLP for Healthcare models across three different architectures (TensorFlow, ONNX, OpenVINO) on both CPU and GPU hardware. Key findings show ONNX consistently delivers superior performance on GPU environments.

Datasets:
- MTSamples Dataset: 1,000 clinical texts, ~500 tokens per text
  - Usage: General NER and embedding benchmarks
- Assertion Test Dataset: 7,570 labeled rows
  - Usage: BertForAssertionClassification evaluation
Versions:
- spark-nlp Version: v6.1.1
- spark-nlp-jsl Version : v6.1.0
- Spark Version : v3.5.1
Instance Types:
- CPU Machine: Colab V6e-1, 173.0 GB RAM, 44 vCPUs
- GPU Machine: Colab A100, 83.5 GB RAM, 40.0 GB GPU VRAM, 12 vCPUs
Models Tested:
- BertSentenceEmbeddings → sbiobert_base_cased_mli
- MedicalBertForSequenceClassification → bert_sequence_classifier_ade
- BertForAssertionClassification → assertion_bert_classification_oncology
- MedicalBertForTokenClassifier → bert_token_classifier_ner_clinical
- PretrainedZeroShotNER → zeroshot_ner_deid_subentity_merged_medium
- WordEmbeddings + MedicalNerModel → embeddings_clinical + ner_deid_subentity_augmented
- WordEmbeddings + 2 MedicalNerModel → embeddings_clinical + ner_deid_subentity_augmented + ner_deid_generic_docwise
NOTES:
- This benchmark compares Transformer architectures and ML models across CPU and GPU environments
- Hardware Context: CPU and GPU machines differ in cores and memory; comparisons should consider these hardware variations
- Preprocessing: DocumentAssembler, SentenceDetector, and Tokenizer stages were pre-processed; reported times reflect pure model execution
- Configuration: All models executed with default settings
- Timing Methodology:
```
%%timeit -n 3 -r 1
model.write.mode("overwrite").format("noop").save()
```
- Results: Numbers represent average execution times across runs

Base Pipeline Configuration:

basePipeline = Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer
    ])

CPU Benchmarking

Model	TensorFlow	ONNX	OpenVINO
BertSentenceEmbeddings	8 min 37 sec	4 min 46 sec	3 min 31 sec
MedicalBertForSequenceClassification	3 min 30 sec	2 min 47 sec	N/A
BertForAssertionClassification	57 sec	33 sec	N/A
MedicalBertForTokenClassifier	3 min 29 sec	2 min 46 sec	N/A
PretrainedZeroShotNER	N/A	38 min 10 sec	N/A
WordEmbeddings + MedicalNerModel	25 sec	N/A	N/A
WordEmbeddings + 2 MedicalNerModel	38 sec	N/A	N/A

GPU Benchmarking

Model	TensorFlow	ONNX	OpenVINO
BertSentenceEmbeddings	28 min 50 sec	12 sec	18 min 49 sec
MedicalBertForSequenceClassification	11 min 45 sec	28 sec	N/A
BertForAssertionClassification	3 min 24 sec	8 sec	N/A
MedicalBertForTokenClassifier	11 min 47 sec	26 sec	N/A
PretrainedZeroShotNER	N/A	1 min 1 sec	N/A
WordEmbeddings + MedicalNerModel	2 min 24 sec	N/A	N/A
WordEmbeddings + 2 MedicalNerModel	4 min 8 sec	N/A	N/A

De-Identification Enhancements strengthen HIPAA compliance and customization

This release introduces new parameters that give users more flexibility and control over how sensitive information is obfuscated, while also ensuring alignment with HIPAA Safe Harbor requirements.

ZIP Code Obfuscation (HIPAA Safe Harbor)
A new parameter obfuscateZipByHipaa allows users to enforce HIPAA-compliant ZIP code handling.
When enabled, ZIP and ZIP+4 codes are automatically masked according to HIPAA Safe Harbor rules: restricted prefixes are fully suppressed, and all others are generalized to protect patient privacy.
When disabled, the system falls back to default/custom ZIP obfuscation rules, offering freedom for alternative strategies.

data = [
    ("Patient lives at 123 Main St, ZIP 12345-6789, with mild asthma.",),
    ("The clinic in ZIP 03690 treated the patient for diabetes.",),
    ("Follow-up scheduled at ZIP 90210 for hypertension check.",)
]
df = spark.createDataFrame(data, ["text"])

deidentification = LightDeIdentification()\
    .setInputCols(["document", "chunk"])\
    .setOutputCol("deid")\
    .setMode("obfuscate")\
    .setObfuscateZipByHipaa(True)    

Results:

+---------------------------------------------------------------+
|deid                                                           |
+---------------------------------------------------------------+
|Patient lives at 123 Main St, ZIP 123**-****, with mild asthma.|
|The clinic in ZIP 000** treated the patient for diabetes.      |
|Follow-up scheduled at ZIP 902** for hypertension check.       |
+---------------------------------------------------------------+

Date Obfuscation Flexibility

The maxRandomDisplacementDays parameter provides fine-grained control over the extent to which date values can be randomly shifted when randomization is enabled. This ensures a balanced trade-off between data utility and privacy. When an ID column (e.g., patient identifier) is provided, all dates linked to the same ID will be displaced by the same amount. This preserves the relative temporal relationships within a patient’s timeline, while still protecting sensitive information.

Example:

data = [
    ("PAT-0001", "Patient was admitted on 11/11/2020."),
    ("PAT-0001", "Follow-up scheduled for 13/11/2020."),
    ("PAT-0002", "Discharge planned on 15/11/2020."),
    ("PAT-0002", "Next appointment set for 17/11/2020.")
]
df = spark.createDataFrame(data, ["ID", "text"])

Define DeIdentification and Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document") \
    .setIdCol("ID")

dateMatcher = RegexMatcherInternalModel.pretrained("date_matcher", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("date")

deIdentification = LightDeIdentification() \
    .setInputCols(["date", "document"]) \
    .setOutputCol("dei") \
    .setMode("obfuscate") \
    .setObfuscateDate(True) \
    .setMaxRandomDisplacementDays(60) \
    .setIsRandomDateDisplacement(True) \
    .setSeed(1000)

pipeline = Pipeline(stages=[
    documentAssembler,
    dateMatcher,
    deIdentification
])

result = pipeline.fit(df).transform(df)
result.select("ID", "text", "dei.result").show(truncate=False)

Results:

+--------+------------------------------------+--------------------------------------+
|ID      |text                                |result                                |
+--------+------------------------------------+--------------------------------------+
|PAT-0001|Patient was admitted on 11/11/2020. |[Patient was admitted on 16/11/2020.] |
|PAT-0001|Follow-up scheduled for 13/11/2020. |[Follow-up scheduled for 18/11/2020.] |
|PAT-0002|Discharge planned on 15/11/2020.    |[Discharge planned on 06/12/2020.]    |
|PAT-0002|Next appointment set for 17/11/2020.|[Next appointment set for 08/12/2020.]|
+--------+------------------------------------+--------------------------------------+

These improvements empower users to adopt stricter compliance when required, while also maintaining flexibility for research, testing, or custom obfuscation needs.

`StructuredJsonConverter` with Mappers streamlines integration of structured data

We have added support for mappers in StructuredJsonConverter, making it easier to transform and normalize extracted entity outputs into custom schemas. This enhancement allows developers to map model outputs (e.g., synonyms, IDs, or ontology codes) directly into a structured JSON format that aligns with their downstream applications.

Example:

ner_converter = NerConverterInternal() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

hpo_mapper = ChunkMapperModel().pretrained("hpo_mapper", "en", "clinical/models") \
    .setInputCols(["ner_chunk"]) \
    .setOutputCol("hpo_code") \
    .setLowerCase(False)

from sparknlp_jsl.pipeline_tracer import PipelineTracer
tracer = PipelineTracer(base_model)
columns_schema = tracer.createParserDictionary()
columns_schema

Schema of the StructuredJsonConverter:

{'document_identifier': '',
 'document_text': 'document',
 'entities': ['ner_chunk'],
 'assertions': [],
 'resolutions': [],
 'relations': [],
 'summaries': [],
 'deidentifications': [],
 'classifications': [],
 'mappers': ['hpo_code']}

Define StructuredJsonConverter And Transform

converter = StructuredJsonConverter() \
.setOutputCol("json") \
.setConverterSchema(columns_schema) \
.setOutputAsStr(False) \
.setCleanAnnotations(True)

json_df = converter.transform(base_df)
mappers_json_output = json_df.select("json.mappers").collect()[0][0]
mappers_json_output

Result:

[{'sentence': '0',
  'resolved_text': 'HP:0001249',
  'distance': '0.0',
  'all_relations': '',
  'chunk': 'intellectual disability',
  'ner_source': 'ner_chunk',
  'ner_confidence': '0.99325',
  'chunk_id': '30e08780',
  'relation': 'hpo_code',
  'ner_label': 'HP',
  'all_k_distances': '0.0:::0.0',
  'all_k_resolutions': 'HP:0001249',
  'end': '49',
  'begin': '27'}]

Various Core Improvements: Bug Fixes, Enhanced Overall Robustness, and Reliability of Spark NLP for Healthcare

Simplified Spark Session initialization

Starting a Spark NLP for Healthcare session is now easier and more flexible.
Previously, a secret had to be explicitly passed to the start() function:

spark = sparknlp_jsl.start(secret = "YOUR_SECRET_HERE")

With this release, the library can automatically read the secret from the environment variable SECRET, so you can simply run:

spark = sparknlp_jsl.start()

This improvement makes session startup simpler, more user-friendly, and adaptable across different environments.

Disabled llama.cpp logs in LLM annotators

LLM annotators in Spark NLP for Healthcare now run with llama.cpp logs disabled by default.
This change provides a cleaner and less verbose output, making it easier to focus on the actual results.

ner_source metadata key in IOBTagger

The IOBTagger annotator now enriches its metadata with a new field: ner_source.
This field indicates the originating chunk for each created token. This enhancement provides:

Better monitoring of token generation
Improved traceability between chunks and their derived tokens

With this addition, users gain deeper insights into the tokenization process and can more easily debug or analyze entity extraction workflows.

Consistent Results with genderAwareness=True in DeIdentification Fixed an issue in DeIdentification where enabling genderAwareness=True produced inconsistent results for names with three parts ([first_name, middle_name, last_name]).
Now, both two-part and three-part names are handled consistently.

Updated Notebooks And Demonstrations For Making Spark NLP For Healthcare Easier To Navigate And Understand

New MedicalVisionLLM MOOC Notebook
New AssertionMerger MOOC Notebook
New End2End Preannotation and Training Pipeline Notebook
New Generative AI to Ner Training Notebook
New NER ASSERTIONS RULE BASED Demo
Updated MEDICAL LLM Demo
Updated ENTITY RESOLUTION CODE MAPPING Demo
Updated NER GENE PHENOTYPES Demo

We Have Added And Updated A Substantial Number Of New Clinical Models And Pipelines, Further Solidifying Our Offering In The Healthcare Domain.

jsl_meds_4b_q16_v4_pipeline
jsl_meds_8b_q8_v4_pipeline
jsl_meds_ner_2b_q16_v2_pipeline
jsl_meds_ner_q16_v4_pipeline
jsl_meds_ner_vlm_2b_q16_v2_pipeline
jsl_meds_ner_vlm_8b_q16_v1
jsl_meds_ner_vlm_8b_q8_v1
jsl_meds_ner_vlm_8b_q4_v1
jsl_meds_ner_vlm_7b_q4_v1
jsl_meds_ner_vlm_7b_q8_v1
jsl_meds_ner_vlm_7b_q16_v1
hpo_synonym_mapper
hpo_mapper_pipeline_v4
jsl_meds_text2sql_1b_q16_v1
bert_sequence_classifier_ade_augmented_onnx
bert_sequence_classifier_ade_augmented_v2_onnx
bert_sequence_classifier_age_group_onnx
bert_sequence_classifier_binary_rct_biobert_onnx
bert_sequence_classifier_biomarker_onnx
bert_sequence_classifier_clinical_sections_headless_onnx
bert_sequence_classifier_covid_sentiment_onnx
bert_sequence_classifier_drug_reviews_webmd_onnx
bert_sequence_classifier_exact_age_reddit_onnx
bert_sequence_classifier_gender_biobert_onnx
bert_sequence_classifier_health_mandates_premise_tweet_onnx
bert_sequence_classifier_health_mandates_stance_tweet_onnx
bert_sequence_classifier_health_mentions_bert_onnx
bert_sequence_classifier_health_mentions_medbert_onnx
bert_sequence_classifier_metastasis_onnx
bert_sequence_classifier_patient_complaint_onnx
bert_sequence_classifier_patient_urgency_onnx
bert_sequence_classifier_pico_biobert_onnx
bert_sequence_classifier_rct_biobert_onnx
bert_sequence_classifier_response_to_treatment_onnx
bert_sequence_classifier_sdoh_community_absent_status_onnx
bert_sequence_classifier_sdoh_community_present_status_onnx
bert_sequence_classifier_sdoh_environment_status_onnx
bert_sequence_classifier_sdoh_frailty_onnx
bert_sequence_classifier_sdoh_frailty_vulnerability_onnx
bert_sequence_classifier_sdoh_mental_health_onnx
bert_sequence_classifier_sdoh_violence_abuse_onnx
bert_sequence_classifier_self_reported_age_tweet_onnx
bert_sequence_classifier_self_reported_partner_violence_tweet_onnx
bert_sequence_classifier_self_reported_stress_tweet_onnx
bert_sequence_classifier_self_reported_symptoms_tweet_onnx
bert_sequence_classifier_self_reported_vaccine_status_tweet_onnx
bert_sequence_classifier_stressor_onnx
bert_sequence_classifier_treatment_changes_sentiment_tweet_onnx
bert_sequence_classifier_vaccine_sentiment_onnx
bert_sequence_classifier_vop_adverse_event_onnx
bert_sequence_classifier_vop_drug_side_effect_onnx
bert_sequence_classifier_vop_hcp_consult_onnx
bert_sequence_classifier_vop_self_report_onnx
bert_sequence_classifier_vop_side_effect_onnx
bert_sequence_classifier_vop_sound_medical_onnx
bert_token_classifier_ade_tweet_binary_onnx
bert_token_classifier_drug_development_trials_onnx
bert_token_classifier_ner_ade_binary_onnx
bert_token_classifier_ner_ade_onnx
bert_token_classifier_ner_anatem_onnx
bert_token_classifier_ner_anatomy_onnx
bert_token_classifier_ner_bacteria_onnx
bert_token_classifier_ner_bc2gm_gene_onnx
bert_token_classifier_ner_bc4chemd_chemicals_onnx
bert_token_classifier_ner_bc5cdr_chemicals_onnx
bert_token_classifier_ner_bc5cdr_disease_onnx
bert_token_classifier_ner_bionlp_onnx
bert_token_classifier_ner_cellular_onnx
bert_token_classifier_ner_chemicals_onnx
bert_token_classifier_ner_chemprot_onnx
bert_token_classifier_ner_clinical_onnx
bert_token_classifier_ner_clinical_trials_abstracts_onnx
bert_token_classifier_ner_deid_onnx
bert_token_classifier_ner_drugs_onnx
bert_token_classifier_ner_jnlpba_cellular_onnx
bert_token_classifier_ner_jsl_onnx
bert_token_classifier_ner_jsl_slim_onnx
bert_token_classifier_ner_linnaeus_species_onnx
bert_token_classifier_ner_living_species_onnx
bert_token_classifier_ner_ncbi_disease_onnx
bert_token_classifier_ner_pathogen_onnx
bert_token_classifier_ner_species_onnx

For all Spark NLP for Healthcare models, please check: Models Hub Page

Previous versions

Version
Version
Version

PREVIOUSBest Practices Using Pretrained Models Together

NEXTInteractive Healthcare NLP Wiki

6.1.1

Highlights

Medical Vision LLM Models Extend Clinical AI with Multimodal Text and Image Understanding

Pretrained Clinical Pipelines for LLMs Deliver ready-to-use Q&A, NER, summarization, RAG, and chat

Introducing MedicalNerDLGraphChecker Improves Graph Management During Medical NER Training

Annotation2Training Converts GenAI Lab Annotations into NER-ready Training Datasets

Lightweight Text-to-SQL Model Enables Seamless Natural Language Queries on Healthcare Data

HPO Mapping Models and Pipeline Standardize Phenotype Recognition and Linking

ONNX-Optimized MedicalBERT Models Provide Faster Inference on CPU and GPU

Cross-Framework Benchmarking compares TensorFlow, ONNX, and OpenVINO, with ONNX leading on GPU

CPU Benchmarking

GPU Benchmarking

De-Identification Enhancements strengthen HIPAA compliance and customization

StructuredJsonConverter with Mappers streamlines integration of structured data

Various Core Improvements: Bug Fixes, Enhanced Overall Robustness, and Reliability of Spark NLP for Healthcare

Updated Notebooks And Demonstrations For Making Spark NLP For Healthcare Easier To Navigate And Understand

We Have Added And Updated A Substantial Number Of New Clinical Models And Pipelines, Further Solidifying Our Offering In The Healthcare Domain.

Previous versions

Introducing `MedicalNerDLGraphChecker` Improves Graph Management During Medical NER Training

`Annotation2Training` Converts GenAI Lab Annotations into NER-ready Training Datasets

`StructuredJsonConverter` with Mappers streamlines integration of structured data