6.1.1
Highlights
We are delighted to announce significant new features and major enhancements in the latest release of Healthcare NLP. This release introduces brand-new Medical Vision Language Models (VLMs), the new Annotation2Training module for converting annotations from GenAI Lab into trainable dataframes, the MedicalNerDLGraphChecker for well-informed NER training, domain-specific LLM pipelines, and faster ONNX models: a total of 78 new LLMs, pipelines, and ONNX models available out of the box.
- Medical Vision LLM Models extend clinical AI with multimodal text and image understanding
- Pretrained Clinical Pipelines for LLMs deliver ready-to-use Q&A, NER, summarization, RAG, and chat
- MedicalNerDLGraphChecker improves graph management during medical NER training
- Annotation2Training converts GenAI Lab annotations into NER-ready training datasets
- Lightweight Text-to-SQL Model based on small LLMs enables seamless natural language queries on healthcare data
- Human Phenotype Ontology (HPO) Mapping Models and Pipeline standardize phenotype recognition and linking, returning exact, related, and broad synonyms for each term at once
- ONNX-Optimized MedicalBERT Models provide faster inference on CPU and GPU for certain tasks (NER, assertion)
- Cross-Framework Benchmarking compares TensorFlow, ONNX, and OpenVINO, with ONNX leading on GPU
- Additional De-Identification Enhancements strengthen HIPAA compliance and customization
- Structured JSON Converter with Mappers streamlines integration of structured data
- Various core improvements and bug fixes, enhancing the overall robustness and reliability of Spark NLP for Healthcare
- Simplified Spark Session initialization
- ner_source metadata key in IOBTagger
- Consistent results with genderAwareness=True in DeIdentification
- Updated notebooks and demonstrations to make Spark NLP for Healthcare easier to navigate and understand
- New MedicalVisionLLM MOOC Notebook
- New AssertionMerger MOOC Notebook
- New End2End Preannotation and Training Pipeline Notebook
- New GenAI Lab to Ner Training Notebook
- New NER ASSERTIONS RULE BASED Demo
- Updated MEDICAL LLM Demo
- Updated ENTITY RESOLUTION CODE MAPPING Demo
- Updated NER GENE PHENOTYPES Demo
- The addition and update of numerous new clinical models and pipelines continue to reinforce our offering in the healthcare domain
Medical Vision LLM Models Extend Clinical AI with Multimodal Text and Image Understanding
In this release, we are expanding our Medical Vision LLM (VLM) family with additional models specifically finetuned for medical tasks. These models extend large language model capabilities with integrated visual language understanding, enabling multimodal clinical analysis by combining textual and image inputs.
The new VLMs provide strong performance for tasks such as diagnostic image interpretation, image-to-text summarization, and integrated documentation analysis — continuing our mission to advance clinical AI with robust, domain-specific multimodal solutions.
| Model Name | Quantization Options |
|---|---|
| jsl_meds_ner_vlm_8b_v1 | q4, q8, q16 |
| jsl_meds_ner_vlm_7b_v1 | q4, q8, q16 |
Example:
prompt = """
# Template:
{
"Patient Name": "string",
"Patient Age": "integer",
"Patient Gender": "string",
"Hospital Number": "string",
"Episode Number": "string",
"Episode Date": "date-time"
}
# Context:
<image>
"""
input_df = nlp.vision_llm_preprocessor(
spark=spark,
images_path="images",
prompt=prompt,
output_col_name="prompt"
)
document_assembler = (
nlp.DocumentAssembler()
.setInputCol("prompt")
.setOutputCol("caption_document")
)
image_assembler = (
nlp.ImageAssembler()
.setInputCol("image")
.setOutputCol("image_assembler")
)
medicalVisionLLM = (
medical.AutoGGUFVisionModel.pretrained("jsl_meds_ner_vlm_8b_q16_v1", "en", "clinical/models")
.setInputCols(["caption_document", "image_assembler"])
.setOutputCol("completions")
)
pipeline = nlp.Pipeline().setStages([
document_assembler,
image_assembler,
medicalVisionLLM
])
model = pipeline.fit(input_df)
result = model.transform(input_df)
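The raw completion can then be inspected directly from the transformed DataFrame before any downstream parsing; a minimal sketch using the column names from the pipeline above:

# Show the generated completions (one JSON object per input image/prompt pair)
result.select("completions.result").show(truncate=False)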
Result:
{
"Patient Name": "Ms RUKHSANA SHAHEEN",
"Patient Age": 56,
"Patient Gender": "Female",
"Hospital Number": "MH005990453",
"Episode Number": "030000528270",
"Episode Date": "2021-07-02T08:31:00"
}
Pretrained Clinical Pipelines for LLMs Deliver Ready-to-Use Q&A, NER, Summarization, RAG, and Chat
This release introduces a new collection of pretrained clinical LLM pipelines, designed to streamline clinical document analysis. Each pipeline is built on top of state-of-the-art small-sized Medical LLMs finetuned by us, providing ready-to-use solutions for Q&A, NER, Summarization, Retrieval-Augmented Generation (RAG), and Chat.
The main advantage of these pipelines is the elimination of manual model chaining. Instead of building and testing complex workflows, users can instantly deploy one-liner pipelines that are efficient, accurate, and purpose-built for clinical tasks — reducing setup time while maintaining high performance.
| Model Name | Description |
|---|---|
| jsl_meds_4b_q16_v4_pipeline | Q&A, NER, Summarization, RAG, and Chat |
| jsl_meds_8b_q8_v4_pipeline | Q&A, NER, Summarization, RAG, and Chat |
| jsl_meds_ner_2b_q16_v2_pipeline | Q&A, NER, Summarization, RAG, and Chat |
| jsl_meds_ner_q16_v4_pipeline | Q&A, NER |
| jsl_meds_ner_vlm_2b_q16_v2_pipeline | Q&A, NER |
Example:
from johnsnowlabs import nlp, medical
pipeline = nlp.PretrainedPipeline("jsl_meds_ner_2b_q16_v2_pipeline", "en", "clinical/models")
text = """
# Template:
{
"Patient Name": "string",
"Patient Age": "integer",
"Patient Gender": "string",
"Hospital Number": "string",
"Episode Number": "string",
"Episode Date": "date-time"
}
# Context:
The patient, Johnathan Miller, is a 54-year-old male admitted under hospital number HN382914.
His most recent episode number is EP2024-1178, recorded on 2025-08-10.
The patient presented with chronic knee pain and swelling.
Past medical history includes hypertension and type 2 diabetes.
"""
Result:
{
"Patient Name": "Johnathan Miller",
"Patient Age": 54,
"Patient Gender": "male",
"Hospital Number": "HN382914",
"Episode Number": "EP2024-1178",
"Episode Date": "2025-08-10"
}
Please check the Loading Medical and Open Source LLMs notebook for more details.
MedicalNerDLGraphChecker Improves Graph Management During Medical NER Training
This release introduces a new annotator, MedicalNerDLGraphChecker, designed to verify whether a suitable MedicalNerApproach TensorFlow graph is available for a given training dataset before computation begins. This prevents runtime errors and ensures that the correct graph is used for custom training workflows.
Along with the new annotator, we have added ~80 new graph files to support a wider range of configurations.
The MedicalNerDLGraphChecker must be placed before embeddings or MedicalNerApproach annotators in the pipeline. It processes the dataset to extract required graph parameters (tokens, labels, embedding dimensions) and raises an exception if the appropriate graph is missing in the JAR. This makes it especially useful for custom datasets and specialized NER training cases.
Example:
nerDLGraphChecker = (MedicalNerDLGraphChecker()
.setInputCols(["sentence", "token"])
.setLabelColumn("label")
.setEmbeddingsModel(embeddings))
nerDLGraphChecker.fit(train_df)
Note: The MedicalNerDLGraphChecker automatically validates whether a suitable TensorFlow graph is bundled inside the JAR for the specified embeddings and labels.
- If a required graph is not found, it will raise a detailed error message, guiding you on what configuration is missing.
- If a suitable graph is found, it runs silently, confirming that the embedded graph is compatible, and you can proceed with training without manually creating graph files.
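For reference, a minimal sketch of how the checker slots into a training pipeline ahead of the embeddings and NER stages; here embeddings and nerApproach are placeholders for your own configured embeddings model and MedicalNerApproach:

# The checker runs first, so a missing graph fails fast before any
# embeddings computation or NER training starts.
training_pipeline = Pipeline(stages=[
    nerDLGraphChecker,  # validates that a compatible graph is bundled in the JAR
    embeddings,         # the same embeddings model passed to setEmbeddingsModel
    nerApproach         # your configured MedicalNerApproach
])
ner_model = training_pipeline.fit(train_df)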
Annotation2Training Converts GenAI Lab Annotations into NER-ready Training Datasets
We’re introducing Annotation2Training, a utility that converts annotation outputs (from JSON or CSV) into a Spark DataFrame ready for NER training. It expects inputs structured like John Snow Labs’ Generative AI annotation tool exports and produces token-level labels aligned with sentences and documents.
🚀 Features & Highlights
- Converts GenAI Lab Exports to NER Training DataFrames via convertJson2NerDF (JSON) and convertCsv2NerDF (CSV)
- Practical & Easy-to-Use: Convert JSON or CSV files directly into Spark DataFrames with a single function call.
- Fast & Scalable: Optimized partitioning ensures high performance, leveraging system CPU cores for parallel processing.
- Base Pipeline Compatibility: Works with your base pipeline (e.g., DocumentAssembler, SentenceDetector, InternalDocumentSplitter, and Tokenizer).
- NER Training Ready: Streamlines data prep for MedicalNerApproach and other NER trainers; no manual wrangling, consistent columns, and validated alignment out of the box.
- Modern Alternative: Eliminates the complexity of the traditional CoNLL-based approach, providing a more efficient and reliable workflow.
Example:
import os
from sparknlp_jsl.training import Annotation2Training

JSON_PATH = "/content/result.json"

annotation2training = Annotation2Training(spark)
training_df_json = annotation2training.convertJson2NerDF(
json_path = JSON_PATH, # Path to the input JSON file.
pipeline_model = base_pipeline_model, # A pre-trained Spark NLP PipelineModel that includes at least a DocumentAssembler, and Tokenizer.
repartition = (os.cpu_count() * 4), # Number of partitions to use when creating the DataFrame (default is 32).
token_output_col = "token", # The name of the column containing token annotations (default is "token").
ner_label_col = "label") # The name of the output column for NER labels (default is "label").
Please check the 1.7.GenAI Lab_to_Ner_Training.ipynb notebook for more details.
Lightweight Text-to-SQL Model Enables Seamless Natural Language Queries on Healthcare Data
We are releasing a new Text-to-SQL model fine-tuned by John Snow Labs for the healthcare domain based on small LLMs. The jsl_meds_text2sql_1b_q16_v1 model is lightweight and optimized for transforming natural language queries into SQL, enabling seamless interaction with structured medical and healthcare datasets.
Example:
medical_llm = MedicalLLM.pretrained("jsl_meds_text2sql_1b_q16_v1", "en", "clinical/models")\
.setInputCols("document")\
.setOutputCol("completions")\
.setBatchSize(1)\
.setNPredict(100)\
.setUseChatTemplate(True)\
.setTemperature(0)
medm_prompt = """### Instruction:
### Instruction:
Table: CancerPatients
- patient_id (INT)
- name (VARCHAR)
- age (INT)
- gender (VARCHAR)
- cancer_type (VARCHAR)
- diagnosis_date (DATE)
List the names of patients diagnosed with breast cancer.
### Response:
"""
Results:
SELECT name FROM CancerPatients WHERE cancer_type = 'breast cancer'
HPO Mapping Models and Pipeline Standardize Phenotype Recognition and Linking
We are introducing new resources for phenotype extraction and ontology mapping. Together, the hpo_synonym_mapper model and the hpo_mapper_pipeline_v4 pipeline enable comprehensive recognition and standardization of phenotypic concepts in clinical and biomedical text.
- The pretrained model maps Human Phenotype Ontology (HPO) terms to their exact, related, and broad synonyms, ensuring consistent representation of phenotypic concepts.
- The pretrained pipeline extracts phenotype-related entities, maps them to HPO codes, determines assertion status (present, absent, suspected), and enriches the results by linking to UMLS CUIs, genes, and associated diseases.
These tools provide a powerful way to achieve deeper phenotypic and genomic insights directly from unstructured text, supporting downstream clinical and biomedical applications.
| Model Name | Description |
|---|---|
| hpo_synonym_mapper | Maps HPO terms to their exact, related, and broad synonyms |
| hpo_mapper_pipeline_v4 | Designed to extract phenotype-related entities from clinical or biomedical text, map them to their corresponding Human Phenotype Ontology (HPO) codes, and determine their assertion status |
Example:
mapperModel = ChunkMapperModel.pretrained("hpo_synonym_mapper", "en", "clinical/models")\
.setInputCols(["ner_chunk"])\
.setOutputCol("mappings")\
.setRels(["synonym"])
# `model` is a fitted pipeline whose stages produce `ner_chunk` and end with `mapperModel`
result = model.transform(spark.createDataFrame([["""The patient, a 62-year-old male, presented with a neoplasm in the lung. He also reported progressive fatigue over the past three months and episodes of shortness of breath. On examination, hepatomegaly was noted, and laboratory results confirmed anemia."""]]).toDF("text"))
Results:
| term | synonym |
|---|---|
| shortness of breath | {‘exact_synonym’: [‘dyspnea’, ‘abnormal breathing’, ‘breathing difficulty’, ‘difficult to breathe’, ‘difficulty breathing’, ‘dyspnoea’, ‘trouble breathing’], ‘related_synonym’: [‘panting’], ‘broad_synonym’: []} |
| fatigue | {‘exact_synonym’: [‘fatigue’, ‘tired’, ‘tiredness’], ‘related_synonym’: [], ‘broad_synonym’: []} |
| neoplasm | {‘exact_synonym’: [‘neoplasia’, ‘oncological abnormality’, ‘tumor’, ‘tumour’], ‘related_synonym’: [‘cancer’, ‘oncology’], ‘broad_synonym’: [‘abnormal tissue mass’]} |
| anemia | {‘exact_synonym’: [‘anaemia’, ‘low number of red blood cells or haemoglobin’, ‘low number of red blood cells or hemoglobin’], ‘related_synonym’: [‘decreased haemoglobin’, ‘decreased hemoglobin’], ‘broad_synonym’: []} |
| progressive | {‘exact_synonym’: [‘worsens with time’], ‘related_synonym’: [‘progressive disorder’], ‘broad_synonym’: []} |
| hepatomegaly | {‘exact_synonym’: [‘enlarged liver’], ‘related_synonym’: [], ‘broad_synonym’: []} |
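The end-to-end hpo_mapper_pipeline_v4 can likewise be loaded as a one-liner pretrained pipeline; a minimal sketch:

from johnsnowlabs import nlp

# Loads the full phenotype extraction, HPO mapping, and assertion pipeline
hpo_pipeline = nlp.PretrainedPipeline("hpo_mapper_pipeline_v4", "en", "clinical/models")
hpo_result = hpo_pipeline.fullAnnotate(
    "The patient reported progressive fatigue, shortness of breath, and hepatomegaly.")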
ONNX-Optimized MedicalBERT Models Provide Faster Inference on CPU and GPU
We have converted multiple MedicalBertForSequenceClassification and MedicalBertForTokenClassifier models into ONNX format, enabling significant speed improvements when running on CPU and especially GPU.
This conversion allows users to leverage the efficiency of ONNX Runtime, resulting in faster inference times. The new models end with _onnx in their names, making them easy to identify.
Example:
sequence_classification = (
MedicalBertForSequenceClassification.pretrained(
"bert_sequence_classifier_ade_augmented_onnx",
"en",
"clinical/models"
)
.setInputCols(["token", "sentence"])
.setOutputCol("ner")
.setCaseSensitive(True)
)
from pyspark.sql.types import StringType

data = spark.createDataFrame(["So glad I am off effexor, so sad it ruined my teeth. tip Please be carefull taking antideppresiva and read about it 1st",
"Religare Capital Ranbaxy has been accepting approval for Diovan since 2012"], StringType()).toDF("text")
Results:
| text | result |
|---|---|
| So glad I am off effexor, so sad it ruined my teeth. tip Please be carefull taking antideppresiva and read about it 1st | [ADE] |
| Religare Capital Ranbaxy has been accepting approval for Diovan since 2012 | [noADE] |
Cross-Framework Benchmarking compares TensorFlow, ONNX, and OpenVINO, with ONNX leading on GPU
This benchmark evaluates the performance of Spark NLP for Healthcare models across three different architectures (TensorFlow, ONNX, OpenVINO) on both CPU and GPU hardware. Key findings show that ONNX consistently delivers superior performance in GPU environments.
- Datasets:
- MTSamples Dataset: 1,000 clinical texts, ~500 tokens per text
- Usage: General NER and embedding benchmarks
- Assertion Test Dataset: 7,570 labeled rows
- Usage: BertForAssertionClassification evaluation
- Versions:
- spark-nlp Version: v6.1.1
- spark-nlp-jsl Version: v6.1.0
- Spark Version: v3.5.1
- Instance Types:
- CPU Machine: Colab V6e-1, 173.0 GB RAM, 44 vCPUs
- GPU Machine: Colab A100, 83.5 GB RAM, 40.0 GB GPU VRAM, 12 vCPUs
- Models Tested:
- BertSentenceEmbeddings → sbiobert_base_cased_mli
- MedicalBertForSequenceClassification → bert_sequence_classifier_ade
- BertForAssertionClassification → assertion_bert_classification_oncology
- MedicalBertForTokenClassifier → bert_token_classifier_ner_clinical
- PretrainedZeroShotNER → zeroshot_ner_deid_subentity_merged_medium
- WordEmbeddings + MedicalNerModel → embeddings_clinical + ner_deid_subentity_augmented
- WordEmbeddings + 2 MedicalNerModel → embeddings_clinical + ner_deid_subentity_augmented + ner_deid_generic_docwise
- NOTES:
- This benchmark compares Transformer architectures and ML models across CPU and GPU environments
- Hardware Context: CPU and GPU machines differ in cores and memory; comparisons should consider these hardware variations
- Preprocessing: DocumentAssembler, SentenceDetector, and Tokenizer stages were pre-processed; reported times reflect pure model execution
- Configuration: All models executed with default settings
- Timing Methodology: %%timeit -n 3 -r 1 model.write.mode("overwrite").format("noop").save()
- Results: Numbers represent average execution times across runs
- Base Pipeline Configuration: basePipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer])
CPU Benchmarking
| Model | TensorFlow | ONNX | OpenVINO |
|---|---|---|---|
| BertSentenceEmbeddings | 8 min 37 sec | 4 min 46 sec | 3 min 31 sec |
| MedicalBertForSequenceClassification | 3 min 30 sec | 2 min 47 sec | N/A |
| BertForAssertionClassification | 57 sec | 33 sec | N/A |
| MedicalBertForTokenClassifier | 3 min 29 sec | 2 min 46 sec | N/A |
| PretrainedZeroShotNER | N/A | 38 min 10 sec | N/A |
| WordEmbeddings + MedicalNerModel | 25 sec | N/A | N/A |
| WordEmbeddings + 2 MedicalNerModel | 38 sec | N/A | N/A |
GPU Benchmarking
| Model | TensorFlow | ONNX | OpenVINO |
|---|---|---|---|
| BertSentenceEmbeddings | 28 min 50 sec | 12 sec | 18 min 49 sec |
| MedicalBertForSequenceClassification | 11 min 45 sec | 28 sec | N/A |
| BertForAssertionClassification | 3 min 24 sec | 8 sec | N/A |
| MedicalBertForTokenClassifier | 11 min 47 sec | 26 sec | N/A |
| PretrainedZeroShotNER | N/A | 1 min 1 sec | N/A |
| WordEmbeddings + MedicalNerModel | 2 min 24 sec | N/A | N/A |
| WordEmbeddings + 2 MedicalNerModel | 4 min 8 sec | N/A | N/A |
De-Identification Enhancements strengthen HIPAA compliance and customization
This release introduces new parameters that give users more flexibility and control over how sensitive information is obfuscated, while also ensuring alignment with HIPAA Safe Harbor requirements.
- ZIP Code Obfuscation (HIPAA Safe Harbor)
A new parameter obfuscateZipByHipaa allows users to enforce HIPAA-compliant ZIP code handling.
When enabled, ZIP and ZIP+4 codes are automatically masked according to HIPAA Safe Harbor rules: restricted prefixes are fully suppressed, and all others are generalized to protect patient privacy.
When disabled, the system falls back to default/custom ZIP obfuscation rules, offering freedom for alternative strategies.
data = [
("Patient lives at 123 Main St, ZIP 12345-6789, with mild asthma.",),
("The clinic in ZIP 03690 treated the patient for diabetes.",),
("Follow-up scheduled at ZIP 90210 for hypertension check.",)
]
df = spark.createDataFrame(data, ["text"])
# "document" and "chunk" are produced by an upstream de-identification NER pipeline (not shown)
deidentification = LightDeIdentification()\
    .setInputCols(["document", "chunk"])\
    .setOutputCol("deid")\
    .setMode("obfuscate")\
    .setObfuscateZipByHipaa(True)
Results:
+---------------------------------------------------------------+
|deid |
+---------------------------------------------------------------+
|Patient lives at 123 Main St, ZIP 123**-****, with mild asthma.|
|The clinic in ZIP 000** treated the patient for diabetes. |
|Follow-up scheduled at ZIP 902** for hypertension check. |
+---------------------------------------------------------------+
- Date Obfuscation Flexibility
The maxRandomDisplacementDays parameter provides fine-grained control over the extent to which date values can be randomly shifted when randomization is enabled. This ensures a balanced trade-off between data utility and privacy.
When an ID column (e.g., patient identifier) is provided, all dates linked to the same ID will be displaced by the same amount. This preserves the relative temporal relationships within a patient’s timeline, while still protecting sensitive information.
Example:
data = [
("PAT-0001", "Patient was admitted on 11/11/2020."),
("PAT-0001", "Follow-up scheduled for 13/11/2020."),
("PAT-0002", "Discharge planned on 15/11/2020."),
("PAT-0002", "Next appointment set for 17/11/2020.")
]
df = spark.createDataFrame(data, ["ID", "text"])
Define DeIdentification and Pipeline
documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document") \
.setIdCol("ID")
dateMatcher = RegexMatcherInternalModel.pretrained("date_matcher", "en") \
.setInputCols(["document"]) \
.setOutputCol("date")
deIdentification = LightDeIdentification() \
.setInputCols(["date", "document"]) \
.setOutputCol("dei") \
.setMode("obfuscate") \
.setObfuscateDate(True) \
.setMaxRandomDisplacementDays(60) \
.setIsRandomDateDisplacement(True) \
.setSeed(1000)
pipeline = Pipeline(stages=[
documentAssembler,
dateMatcher,
deIdentification
])
result = pipeline.fit(df).transform(df)
result.select("ID", "text", "dei.result").show(truncate=False)
Results:
+--------+------------------------------------+--------------------------------------+
|ID |text |result |
+--------+------------------------------------+--------------------------------------+
|PAT-0001|Patient was admitted on 11/11/2020. |[Patient was admitted on 16/11/2020.] |
|PAT-0001|Follow-up scheduled for 13/11/2020. |[Follow-up scheduled for 18/11/2020.] |
|PAT-0002|Discharge planned on 15/11/2020. |[Discharge planned on 06/12/2020.] |
|PAT-0002|Next appointment set for 17/11/2020.|[Next appointment set for 08/12/2020.]|
+--------+------------------------------------+--------------------------------------+
These improvements empower users to adopt stricter compliance when required, while also maintaining flexibility for research, testing, or custom obfuscation needs.
StructuredJsonConverter with Mappers streamlines integration of structured data
We have added support for mappers in StructuredJsonConverter, making it easier to transform and normalize extracted entity outputs into custom schemas. This enhancement allows developers to map model outputs (e.g., synonyms, IDs, or ontology codes) directly into a structured JSON format that aligns with their downstream applications.
Example:
ner_converter = NerConverterInternal() \
.setInputCols(["sentence", "token", "ner"]) \
.setOutputCol("ner_chunk")
hpo_mapper = ChunkMapperModel.pretrained("hpo_mapper", "en", "clinical/models") \
.setInputCols(["ner_chunk"]) \
.setOutputCol("hpo_code") \
.setLowerCase(False)
from sparknlp_jsl.pipeline_tracer import PipelineTracer
# base_model is the fitted PipelineModel that includes the stages above
tracer = PipelineTracer(base_model)
columns_schema = tracer.createParserDictionary()
columns_schema
Schema of the StructuredJsonConverter:
{'document_identifier': '',
'document_text': 'document',
'entities': ['ner_chunk'],
'assertions': [],
'resolutions': [],
'relations': [],
'summaries': [],
'deidentifications': [],
'classifications': [],
'mappers': ['hpo_code']}
Define StructuredJsonConverter And Transform
converter = StructuredJsonConverter() \
.setOutputCol("json") \
.setConverterSchema(columns_schema) \
.setOutputAsStr(False) \
.setCleanAnnotations(True)
# base_df is the DataFrame obtained by transforming the input text with base_model
json_df = converter.transform(base_df)
mappers_json_output = json_df.select("json.mappers").collect()[0][0]
mappers_json_output
Result:
[{'sentence': '0',
'resolved_text': 'HP:0001249',
'distance': '0.0',
'all_relations': '',
'chunk': 'intellectual disability',
'ner_source': 'ner_chunk',
'ner_confidence': '0.99325',
'chunk_id': '30e08780',
'relation': 'hpo_code',
'ner_label': 'HP',
'all_k_distances': '0.0:::0.0',
'all_k_resolutions': 'HP:0001249',
'end': '49',
'begin': '27'}]
Various Core Improvements: Bug Fixes, Enhanced Overall Robustness, and Reliability of Spark NLP for Healthcare
- Simplified Spark Session initialization
Starting a Spark NLP for Healthcare session is now easier and more flexible.
Previously, a secret had to be explicitly passed to the start() function:
spark = sparknlp_jsl.start(secret = "YOUR_SECRET_HERE")
With this release, the library can automatically read the secret from the environment variable SECRET, so you can simply run:
spark = sparknlp_jsl.start()
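For example, the secret can be exported to the environment before the session is started; a minimal sketch:

import os
import sparknlp_jsl

os.environ["SECRET"] = "YOUR_SECRET_HERE"  # or export SECRET=... in the shell

spark = sparknlp_jsl.start()  # picks the secret up from the environment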
This improvement makes session startup simpler, more user-friendly, and adaptable across different environments.
- Disabled llama.cpp logs in LLM annotators
LLM annotators in Spark NLP for Healthcare now run with llama.cpp logs disabled by default.
This change provides a cleaner and less verbose output, making it easier to focus on the actual results.
- ner_source metadata key in IOBTagger
The IOBTagger annotator now enriches its metadata with a new field: ner_source.
This field indicates the originating chunk for each created token. This enhancement provides:
- Better monitoring of token generation
- Improved traceability between chunks and their derived tokens
With this addition, users gain deeper insights into the tokenization process and can more easily debug or analyze entity extraction workflows.
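A minimal sketch of how the new field can be inspected, assuming an IOBTagger stage fed by token and NER chunk columns from an upstream pipeline:

iob_tagger = IOBTagger()\
    .setInputCols(["token", "ner_chunk"])\
    .setOutputCol("ner_label")

# result is the DataFrame produced by a pipeline that includes iob_tagger;
# each token-level tag now carries a "ner_source" entry in its metadata.
result.selectExpr("explode(ner_label) AS tag")\
    .selectExpr("tag.result", "tag.metadata['ner_source'] AS ner_source")\
    .show(truncate=False)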
- Consistent Results with genderAwareness=True in DeIdentification
Fixed an issue in DeIdentification where enabling genderAwareness=True produced inconsistent results for names with three parts ([first_name, middle_name, last_name]).
Now, both two-part and three-part names are handled consistently.
Updated Notebooks and Demonstrations to Make Spark NLP for Healthcare Easier to Navigate and Understand
- New MedicalVisionLLM MOOC Notebook
- New AssertionMerger MOOC Notebook
- New End2End Preannotation and Training Pipeline Notebook
- New Generative AI to Ner Training Notebook
- New NER ASSERTIONS RULE BASED Demo
- Updated MEDICAL LLM Demo
- Updated ENTITY RESOLUTION CODE MAPPING Demo
- Updated NER GENE PHENOTYPES Demo
We Have Added and Updated a Substantial Number of New Clinical Models and Pipelines, Further Solidifying Our Offering in the Healthcare Domain.
- jsl_meds_4b_q16_v4_pipeline
- jsl_meds_8b_q8_v4_pipeline
- jsl_meds_ner_2b_q16_v2_pipeline
- jsl_meds_ner_q16_v4_pipeline
- jsl_meds_ner_vlm_2b_q16_v2_pipeline
- jsl_meds_ner_vlm_8b_q16_v1
- jsl_meds_ner_vlm_8b_q8_v1
- jsl_meds_ner_vlm_8b_q4_v1
- jsl_meds_ner_vlm_7b_q4_v1
- jsl_meds_ner_vlm_7b_q8_v1
- jsl_meds_ner_vlm_7b_q16_v1
- hpo_synonym_mapper
- hpo_mapper_pipeline_v4
- jsl_meds_text2sql_1b_q16_v1
- bert_sequence_classifier_ade_augmented_onnx
- bert_sequence_classifier_ade_augmented_v2_onnx
- bert_sequence_classifier_age_group_onnx
- bert_sequence_classifier_binary_rct_biobert_onnx
- bert_sequence_classifier_biomarker_onnx
- bert_sequence_classifier_clinical_sections_headless_onnx
- bert_sequence_classifier_covid_sentiment_onnx
- bert_sequence_classifier_drug_reviews_webmd_onnx
- bert_sequence_classifier_exact_age_reddit_onnx
- bert_sequence_classifier_gender_biobert_onnx
- bert_sequence_classifier_health_mandates_premise_tweet_onnx
- bert_sequence_classifier_health_mandates_stance_tweet_onnx
- bert_sequence_classifier_health_mentions_bert_onnx
- bert_sequence_classifier_health_mentions_medbert_onnx
- bert_sequence_classifier_metastasis_onnx
- bert_sequence_classifier_patient_complaint_onnx
- bert_sequence_classifier_patient_urgency_onnx
- bert_sequence_classifier_pico_biobert_onnx
- bert_sequence_classifier_rct_biobert_onnx
- bert_sequence_classifier_response_to_treatment_onnx
- bert_sequence_classifier_sdoh_community_absent_status_onnx
- bert_sequence_classifier_sdoh_community_present_status_onnx
- bert_sequence_classifier_sdoh_environment_status_onnx
- bert_sequence_classifier_sdoh_frailty_onnx
- bert_sequence_classifier_sdoh_frailty_vulnerability_onnx
- bert_sequence_classifier_sdoh_mental_health_onnx
- bert_sequence_classifier_sdoh_violence_abuse_onnx
- bert_sequence_classifier_self_reported_age_tweet_onnx
- bert_sequence_classifier_self_reported_partner_violence_tweet_onnx
- bert_sequence_classifier_self_reported_stress_tweet_onnx
- bert_sequence_classifier_self_reported_symptoms_tweet_onnx
- bert_sequence_classifier_self_reported_vaccine_status_tweet_onnx
- bert_sequence_classifier_stressor_onnx
- bert_sequence_classifier_treatment_changes_sentiment_tweet_onnx
- bert_sequence_classifier_vaccine_sentiment_onnx
- bert_sequence_classifier_vop_adverse_event_onnx
- bert_sequence_classifier_vop_drug_side_effect_onnx
- bert_sequence_classifier_vop_hcp_consult_onnx
- bert_sequence_classifier_vop_self_report_onnx
- bert_sequence_classifier_vop_side_effect_onnx
- bert_sequence_classifier_vop_sound_medical_onnx
- bert_token_classifier_ade_tweet_binary_onnx
- bert_token_classifier_drug_development_trials_onnx
- bert_token_classifier_ner_ade_binary_onnx
- bert_token_classifier_ner_ade_onnx
- bert_token_classifier_ner_anatem_onnx
- bert_token_classifier_ner_anatomy_onnx
- bert_token_classifier_ner_bacteria_onnx
- bert_token_classifier_ner_bc2gm_gene_onnx
- bert_token_classifier_ner_bc4chemd_chemicals_onnx
- bert_token_classifier_ner_bc5cdr_chemicals_onnx
- bert_token_classifier_ner_bc5cdr_disease_onnx
- bert_token_classifier_ner_bionlp_onnx
- bert_token_classifier_ner_cellular_onnx
- bert_token_classifier_ner_chemicals_onnx
- bert_token_classifier_ner_chemprot_onnx
- bert_token_classifier_ner_clinical_onnx
- bert_token_classifier_ner_clinical_trials_abstracts_onnx
- bert_token_classifier_ner_deid_onnx
- bert_token_classifier_ner_drugs_onnx
- bert_token_classifier_ner_jnlpba_cellular_onnx
- bert_token_classifier_ner_jsl_onnx
- bert_token_classifier_ner_jsl_slim_onnx
- bert_token_classifier_ner_linnaeus_species_onnx
- bert_token_classifier_ner_living_species_onnx
- bert_token_classifier_ner_ncbi_disease_onnx
- bert_token_classifier_ner_pathogen_onnx
- bert_token_classifier_ner_species_onnx
For all Spark NLP for Healthcare models, please check: Models Hub Page