5.5.3
Highlights
We are delighted to announce enhancements and updates in the latest release of Spark NLP for Healthcare. This release includes advanced structured deidentification with new obfuscation parameters, expanded ContextualEntityRuler capabilities with regex, prefix, and suffix support, and updated clinical pretrained models and pipelines.
- Enhanced `StructuredDeidentification` with new obfuscation parameters
- Customizing named entities with contextual rules: enhanced prefix, suffix, and regex support in `ContextualEntityRuler`
- Enhanced flexibility for chunk-based output in `StructuredJsonConverter`
- Supporting overlapping sentences in sentence-aware document splitting to feed a longer and better context to downstream models
- Advanced one-liner clinical NLP pipelines for oncological document analysis
- Introducing the Veterinary MeSH Resolver for accurately mapping veterinary terms to the corresponding MeSH codes
- New Spanish medical entity resolver for SNOMED mapping
- New blog posts on various topics
- Various core improvements: bug fixes, enhanced overall robustness and reliability of Spark NLP for Healthcare
  - Fixed broken `encodeModel` and `setInputSuffix` functions in `LLMLoader`
  - Fixed end index issue in `AssertionChunkConverter`
- Updated notebooks and demonstrations for making Spark NLP for Healthcare easier to navigate and understand
  - New Deidentification_Performance_Comparison_Of_Healthcare_NLP_VS_Cloud_Solutions Notebook
  - Updated Loading Medical and Open-Source LLMs Notebook
  - Updated Contextual_Entity_Ruler Notebook
  - Updated Clinical Deidentification for Structured Data Notebook
  - Updated PipelineTracer and PipelineOutputParser Notebook
- The addition and update of numerous new clinical models and pipelines continue to reinforce our offering in the healthcare domain
These enhancements will elevate your experience with Spark NLP for Healthcare, enabling more efficient, accurate, and streamlined analysis of healthcare-related natural language data.
Enhanced StructuredDeidentification with New Obfuscation Parameters
This update introduces new parameters to the `StructuredDeidentification` tool, improving its flexibility for handling sensitive data obfuscation. Key additions include options for regional date formats (`region`), selective date obfuscation (`keepYear`, `keepMonth`), text length preservation (`keepTextSizeForObfuscation`, `fakerLengthOffset`), gender-aware name obfuscation (`genderAwareness`), and HIPAA-compliant age anonymization (`ageRangesByHipaa`).
Additionally, new parameters have been added to the `obfuscateColumns()` function, allowing control over output formatting (`outputAsArray`), overwriting behavior (`overwrite`), and column suffixing (`suffix`). These enhancements provide greater control over data anonymization while maintaining data integrity and compliance.
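To make the interaction between a day shift and a regional date format concrete, here is a minimal plain-Python sketch. It is not the library's implementation: `REGION_FORMATS` and `shift_date` are hypothetical names, the `"us"` format is an assumption, and parameters such as `keepYear` or `keepMonth` are not modelled.

```python
from datetime import datetime, timedelta

# Hypothetical mapping from a region code to a date layout (illustration only).
REGION_FORMATS = {
    "eu": "%d/%m/%Y",  # day first, as in the release-note examples
    "us": "%m/%d/%Y",  # month first (assumed here, not taken from the release notes)
}


def shift_date(value: str, days: int, region: str = "eu") -> str:
    """Parse a date using the region's layout, shift it, and format it back."""
    fmt = REGION_FORMATS[region]
    shifted = datetime.strptime(value, fmt) + timedelta(days=days)
    return shifted.strftime(fmt)


if __name__ == "__main__":
    # Same string layout in, same layout out; only the day moves.
    print(shift_date("13/02/1977", 5, region="eu"))
    print(shift_date("02/13/1977", 5, region="us"))
```

The point of the sketch is that the same string can only be shifted correctly once the parser knows whether the day or the month comes first, which is the ambiguity a region setting resolves.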
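Example DataFrame:
# Example DataFrame (column names match the result tables below)
data = [
    ("Juan García", "13/02/1977", "711 Nulla St.", 140, "673 431234"),
    ("Will Smith", "23/02/1977", "1 Green Avenue.", 140, "+23 (673) 431234"),
    ("Pedro Ximénez", "11/04/2000", "Calle del Libertador, 7", 100, "912 345623")
]
columns = ["NAME", "DOB", "ADDRESS", "SBP", "TEL"]
df = spark.createDataFrame(data, columns)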
Example Code:
obfuscator = StructuredDeidentification(spark=spark,
columns={"NAME": "NAME", "DOB": "DATE", "TEL": "PHONE"},
columnsSeed={"NAME": 23, "DOB": 23},
obfuscateRefSource="faker",
days=5)
obfuscator_df = obfuscator.obfuscateColumns(df)
obfuscator_df.show(truncate=False)
Result:
NAME | DOB | ADDRESS | SBP | TEL |
---|---|---|---|---|
[Gwynda Leriche] | [18/02/1977] | 711 Nulla St. | 140 | [217 075870] |
[Sharman Debar] | [28/02/1977] | 1 Green Avenue. | 140 | [+76 (106) 964769] |
[Lavera Postal] | [16/04/2000] | Calle del Libertador, 7 | 100 | [358 709287] |
Example Code:
obfuscator = StructuredDeidentification(spark,
columns={"NAME": "NAME", "DOB": "DATE", "TEL": "PHONE"},
columnsSeed={"NAME": 23, "DOB": 23, "TEL": 23},
obfuscateRefSource = "faker",
days=60,
region="eu",
keepYear=True,
keepTextSizeForObfuscation=True
)
obfuscator_df = obfuscator.obfuscateColumns(df, outputAsArray=False, overwrite=False, suffix="_obfuscated")
obfuscator_df.show(truncate=False)
Result:
NAME | DOB | ADDRESS | SBP | TEL | DOB_obfuscated | TEL_obfuscated | NAME_obfuscated |
---|---|---|---|---|---|---|---|
Juan García | 13/02/1977 | 711 Nulla St. | 140 | 673 431234 | 08/02/1977 | 984 742547 | Marc Senior |
Will Smith | 23/02/1977 | 1 Green Avenue. | 140 | +23 (673) 431234 | 18/02/1977 | +54 (984) 742547 | Leora Rand |
Pedro Ximénez | 11/04/2000 | Calle del Libertador, 7 | 100 | 912 345623 | 06/04/2000 | 681 032510 | Temple Feeler |
Please check the Clinical Deidentification for Structured Data Notebook for more information.
Enhanced Flexibility for Chunk-Based Output in StructuredJsonConverter
This update adds support for chunk-based results in `StructuredJsonConverter`, providing greater flexibility in text processing. By using the new `.setParentSource("chunk")` option, users can extract structured chunks instead of base schema results, enabling more precise control over text segmentation. Additionally, the new `sentenceColumn` parameter allows retrieval of sentence-level details. The enhanced output schema includes chunk metadata, NER attributes, assertions, and relations, making it particularly valuable for structured NLP applications like clinical text analysis.
Example:
pipeline = PretrainedPipeline("explain_clinical_doc_oncology", "en", "clinical/models")
pipeline_tracer = PipelineTracer(pipeline)
converter_schema = pipeline_tracer.createParserDictionary()
text = """The Patient underwent a CT scan of the abdomen, which showed a complex mass."""
text_df = spark.createDataFrame([[text]]).toDF("text")
base_df = pipeline.transform(text_df)
structured_json_converter = (
StructuredJsonConverter()
.setConverterSchema(converter_schema)
.setOutputCol("json")
.setCleanAnnotations(True)
.setOutputAsStr(True)
.setParentSource("chunk")
.setSentenceColumn("sentence")
)
result_df = structured_json_converter.transform(base_df)
collected_result = result_df.selectExpr("json").collect()
json_result = eval(collected_result[0]["json"])
Result before setting the parameters:
{
document_identifier='dabbb011-5900-4542-b771-693151850a2d',
document_text=['The Patient underwent a CT scan of the abdomen, which showed a complex mass.\n'],
entities=[
{'ner_label': 'Imaging_Test', 'sentence': '0', 'chunk': 'CT scan of the abdomen', 'end': '45', 'ner_source': 'ner_oncology_chunk', 'ner_confidence': '0.76256', 'begin': '24', 'chunk_id': 'cbe58cc4'},
{'ner_label': 'Tumor_Finding', 'sentence': '0', 'chunk': 'mass', 'end': '74', 'ner_source': 'ner_oncology_chunk', 'ner_confidence': '0.7887', 'begin': '71', 'chunk_id': 'ebfe618e'},
{'ner_label': 'Imaging_Test', 'sentence': '0', 'chunk': 'CT scan of the abdomen', 'end': '45', 'ner_source': 'ner_oncology_chunk', 'ner_confidence': '0.76256', 'begin': '24', 'chunk_id': 'cbe58cc4'},
{'ner_label': 'Tumor_Finding', 'sentence': '0', 'chunk': 'mass', 'end': '74', 'ner_source': 'ner_oncology_chunk', 'ner_confidence': '0.7887', 'begin': '71', 'chunk_id': 'ebfe618e'}],
assertions=[
{'chunk': 'CT scan of the abdomen', 'assertion_source': 'assertion', 'assertion': 'Past', 'assertion_confidence': '1.0', 'chunk_id': 'cbe58cc4'},
{'chunk': 'mass', 'assertion_source': 'assertion', 'assertion': 'Present', 'assertion_confidence': '0.9986', 'chunk_id': 'ebfe618e'}
],
resolutions=[],
relations=[],
summaries=[],
deidentifications=[],
classifications=[]
}
Result after setting the parameters:
[
{
'chunk_id': 'cbe58cc4',
'chunk': 'CT scan of the abdomen',
'begin': 24,
'end': 45,
'sentence_id': 0,
'sentence': 'The Patient underwent a CT scan of the abdomen, which showed a complex mass.',
'ner_label': 'Imaging_Test',
'ner_source': 'ner_oncology_chunk',
'ner_confidence': '0.76256',
'assertion': 'Past',
'assertion_confidence': '1.0',
'relations': []
},
{
'chunk_id': 'ebfe618e',
'chunk': 'mass',
'begin': 71,
'end': 74,
'sentence_id': 0,
'sentence': 'The Patient underwent a CT scan of the abdomen, which showed a complex mass.',
'ner_label': 'Tumor_Finding',
'ner_source': 'ner_oncology_chunk',
'ner_confidence': '0.7887',
'assertion': 'Present',
'assertion_confidence': '0.9986',
'relations': []
},
{
'chunk_id': 'cbe58cc4',
'chunk': 'CT scan of the abdomen',
'begin': 24,
'end': 45,
'sentence_id': 0,
'sentence': 'The Patient underwent a CT scan of the abdomen, which showed a complex mass.',
'ner_label': 'Imaging_Test',
'ner_source': 'ner_oncology_chunk',
'ner_confidence': '0.76256',
'assertion': 'Past',
'assertion_confidence': '1.0',
'relations': []
},
{
'chunk_id': 'ebfe618e',
'chunk': 'mass',
'begin': 71,
'end': 74,
'sentence_id': 0,
'sentence': 'The Patient underwent a CT scan of the abdomen, which showed a complex mass.',
'ner_label': 'Tumor_Finding',
'ner_source': 'ner_oncology_chunk',
'ner_confidence': '0.7887',
'assertion': 'Present',
'assertion_confidence': '0.9986',
'relations': []
}
]
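
The relationship between the two output shapes can be sketched in plain Python: the document-level result keeps entities and assertions in separate lists, while the chunk-level result carries one record per chunk with the matching assertion attached. This is only an illustration of that regrouping, not how `StructuredJsonConverter` is implemented; `to_chunk_records` is a hypothetical helper and the input is a trimmed copy of the values shown above.

```python
def to_chunk_records(doc: dict) -> list:
    """Join document-level entities and assertions into per-chunk records."""
    # Index assertions by chunk_id so each entity can look up its own.
    assertions = {a["chunk_id"]: a for a in doc.get("assertions", [])}
    records = []
    for entity in doc.get("entities", []):
        assertion = assertions.get(entity["chunk_id"], {})
        records.append({
            "chunk_id": entity["chunk_id"],
            "chunk": entity["chunk"],
            "begin": int(entity["begin"]),
            "end": int(entity["end"]),
            "sentence_id": int(entity["sentence"]),
            "ner_label": entity["ner_label"],
            "assertion": assertion.get("assertion"),
        })
    return records


doc = {
    "entities": [
        {"chunk_id": "cbe58cc4", "chunk": "CT scan of the abdomen",
         "begin": "24", "end": "45", "sentence": "0", "ner_label": "Imaging_Test"},
        {"chunk_id": "ebfe618e", "chunk": "mass",
         "begin": "71", "end": "74", "sentence": "0", "ner_label": "Tumor_Finding"},
    ],
    "assertions": [
        {"chunk_id": "cbe58cc4", "assertion": "Past"},
        {"chunk_id": "ebfe618e", "assertion": "Present"},
    ],
}

if __name__ == "__main__":
    for record in to_chunk_records(doc):
        print(record)
```

A chunk-keyed layout like this is convenient when each extracted span is stored or reviewed as its own row.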
Please check the PipelineTracer and PipelineOutputParser Notebook for more information.
Customizing Named Entities with Contextual Rules: Enhanced Prefix, Suffix, and Regex Support in ContextualEntityRuler
The latest update to `ContextualEntityRuler` introduces `prefixEntities` and `suffixEntities`, refining entity recognition by adjusting chunk information based on surrounding context. Additionally, the new `regexInBetween` parameter enables pattern matching between entities, enhancing accuracy and flexibility in contextual constraints. These improvements allow for more precise entity customization, making `ContextualEntityRuler` a powerful tool for domain-specific text processing.
- This update introduces `prefixEntities` and `suffixEntities` to `ContextualEntityRuler`, allowing entity recognition to be refined based on the surrounding context. These parameters adjust chunk information when specified entities appear before or after the target entity:
  - `prefixEntities`: Updates chunk information if the specified entities appear before the target entity.
  - `suffixEntities`: Updates chunk information if the specified entities appear after the target entity.
Example:
rules = [
{
"entity": "CONTACT",
"scopeWindow": [6,6],
"scopeWindowLevel": "token",
"prefixEntities": ["LOCATION"],
"replaceEntity": "ZIP_CODE",
"mode": "replace_label_only"
}
]
contextual_entity_ruler = ContextualEntityRuler()\
.setInputCols("sentence", "token", "ner_chunks")\
.setOutputCol("ruled_ner_chunks")\
.setRules(rules)\
.setCaseSensitive(False)\
.setDropEmptyChunks(True)\
.setAllowPunctuationInBetween(False)\
.setAllowTokensInBetween(True)
text = "Los Angeles, zip code 90001, is located in the South Los Angeles region of the city."
data = spark.createDataFrame([text], StringType()).toDF("text")
Result:
Before
entity | begin | end | ner_chunks_result |
---|---|---|---|
LOCATION | 0 | 10 | Los Angeles |
CONTACT | 22 | 26 | 90001 |
LOCATION | 47 | 63 | South Los Angeles |
After
entity | begin | end | ruled_ner_chunks_result | CHANGES |
---|---|---|---|---|
LOCATION | 0 | 10 | Los Angeles | |
ZIP_CODE | 22 | 26 | 90001 | CONTACT updated as ZIP_CODE |
LOCATION | 47 | 63 | South Los Angeles | |
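
The relabeling in the tables above can be approximated in plain Python. This is a sketch of the idea only, not the `ContextualEntityRuler` implementation: `Chunk`, `tokens_between`, and `apply_prefix_rule` are hypothetical names, whitespace splitting stands in for the pipeline's tokenizer, and only the "relabel when a prefix entity is close enough" behavior is modelled.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Chunk:
    label: str
    begin: int  # inclusive character offset
    end: int    # inclusive character offset
    text: str


def tokens_between(text: str, left: Chunk, right: Chunk) -> int:
    # Assumption: whitespace splitting approximates the real tokenizer.
    return len(text[left.end + 1:right.begin].split())


def apply_prefix_rule(text, chunks, entity, prefix_entities, replace_entity, window):
    """Relabel `entity` chunks preceded by a prefix entity within `window` tokens."""
    updated = []
    for chunk in chunks:
        has_prefix = chunk.label == entity and any(
            other.label in prefix_entities
            and other.end < chunk.begin
            and tokens_between(text, other, chunk) <= window
            for other in chunks
        )
        updated.append(replace(chunk, label=replace_entity) if has_prefix else chunk)
    return updated


text = "Los Angeles, zip code 90001, is located in the South Los Angeles region of the city."
chunks = [
    Chunk("LOCATION", 0, 10, "Los Angeles"),
    Chunk("CONTACT", 22, 26, "90001"),
    Chunk("LOCATION", 47, 63, "South Los Angeles"),
]

if __name__ == "__main__":
    ruled = apply_prefix_rule(text, chunks, "CONTACT", {"LOCATION"}, "ZIP_CODE", window=6)
    print([(c.label, c.text) for c in ruled])
```

Only the label changes here, which mirrors the `replace_label_only` mode used in the rule: offsets and chunk text are left as they were.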
- Additionally, the `regexInBetween` parameter has been added, enabling pattern matching between entities to enforce contextual constraints. It searches for a regex pattern in the text that occurs between two entities; if the pattern matches, the chunk is updated according to the other rule parameters.
Example:
rules = [
{
"entity": "LOCATION",
"scopeWindow": [6,6],
"scopeWindowLevel": "token",
"regexInBetween": "^zip$",
"suffixEntities": ["CONTACT", "IDNUM"],
"replaceEntity": "REPLACED_LOC",
"mode": "include"
}
]
contextual_entity_ruler = ContextualEntityRuler() \
.setInputCols("sentence", "token", "ner_chunks") \
.setOutputCol("ruled_ner_chunks") \
.setRules(rules) \
.setCaseSensitive(False)\
.setDropEmptyChunks(True)\
.setAllowPunctuationInBetween(False)\
.setAllowTokensInBetween(True)
text = "Los Angeles, zip code 90001, is located in the South Los Angeles region of the city."
data = spark.createDataFrame([text], StringType()).toDF("text")
Result:
- Before:
entity | begin | end | ner_chunks_result |
---|---|---|---|
LOCATION | 0 | 10 | Los Angeles |
CONTACT | 22 | 26 | 90001 |
LOCATION | 47 | 63 | South Los Angeles |
- After:
entity | begin | end | ruled_ner_chunks_result | CHANGES |
---|---|---|---|---|
REPLACED_LOC | 0 | 26 | Los Angeles, zip code 90001 | Two entities are merged with the text between them |
LOCATION | 47 | 63 | South Los Angeles | |
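
The merge shown in the tables above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the library's logic: `merge_if_regex_between` is a hypothetical helper, and the pattern is tested against each whitespace-separated token found between the two chunks, which is one plausible reading of a pattern such as `^zip$`.

```python
import re


def merge_if_regex_between(text, left, right, pattern, new_label):
    """Merge two chunks (dicts with inclusive offsets) if a token between them matches."""
    between = text[left["end"] + 1:right["begin"]]
    # Assumption: the regex is applied token by token, so "^zip$" must match a whole token.
    if not any(re.search(pattern, token) for token in between.split()):
        return None
    return {
        "label": new_label,
        "begin": left["begin"],
        "end": right["end"],
        "text": text[left["begin"]:right["end"] + 1],
    }


text = "Los Angeles, zip code 90001, is located in the South Los Angeles region of the city."
location = {"label": "LOCATION", "begin": 0, "end": 10}
contact = {"label": "CONTACT", "begin": 22, "end": 26}

if __name__ == "__main__":
    print(merge_if_regex_between(text, location, contact, r"^zip$", "REPLACED_LOC"))
```

When the pattern does not match any token between the chunks, the helper returns `None` and the original chunks would be left untouched.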
Please check the Contextual Entity Ruler Notebook for more information.
Supporting Overlapping Sentences in Sentence-Aware Document Splitting to Feed a Longer and Better Context to Downstream Models
This feature introduces support for overlapping sentences in sentence-aware document splitting, enabling downstream models to process longer and more contextually rich text segments. The `setChunkOverlap` method allows users to define the overlap length between text chunks, improving coherence and continuity in recursive and sentence-based splitting modes.
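
As a rough illustration of what sentence-aware overlap means, the plain-Python sketch below packs whole sentences into chunks and carries trailing sentences into the next chunk when they fit within an overlap budget. It is not the splitter's actual algorithm: `split_with_overlap` is a hypothetical function, sizes are measured in characters by assumption, and a chunk may exceed `chunk_size` when carried sentences plus the next sentence are too long.

```python
def split_with_overlap(sentences, chunk_size, chunk_overlap):
    """Group whole sentences into chunks, repeating trailing sentences as overlap."""
    chunks, current = [], []
    for sentence in sentences:
        if current and len(" ".join(current + [sentence])) > chunk_size:
            chunks.append(" ".join(current))
            # Carry over as many trailing sentences as fit in the overlap budget.
            carried = []
            for previous in reversed(current):
                if len(" ".join([previous] + carried)) > chunk_overlap:
                    break
                carried.insert(0, previous)
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    sentences = ["A b c.", "D e f.", "G h i.", "J k l."]
    for chunk in split_with_overlap(sentences, chunk_size=14, chunk_overlap=6):
        print(chunk)
```

Because the overlap is made of complete sentences, each chunk starts on a sentence boundary and shares context with the chunk before it.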
Advanced One-Liner Clinical NLP Pipelines for Oncological Document Analysis
We introduce a cutting-edge suite of pretrained NLP pipelines designed to simplify oncological clinical document analysis. Built upon state-of-the-art models, these pipelines efficiently extract oncological entities, determine their assertion status, and establish relationships within clinical texts—all in a seamless, user-friendly manner. By eliminating the complexities of model selection and pipeline construction, this solution enables rapid, accurate insights for oncology research and clinical decision-making.
Model Name | Description |
---|---|
explain_clinical_doc_oncology_slim | This pipeline is designed to extract all oncological entities, assign assertion status to the extracted entities, and establish relations between the extracted entities from the clinical documents. |
Example:
from sparknlp.pretrained import PretrainedPipeline
oncology_pipeline = PretrainedPipeline("explain_clinical_doc_oncology_slim", "en", "clinical/models")
text = """A 56-year-old man presented with a 2-month history of whole-body weakness, double vision, difficulty swallowing, and a 45 mm anterior mediastinal mass detected via chest CT.
Neurological examination and electromyography confirmed a diagnosis of Lambert-Eaton Myasthenic Syndrome (LEMS), associated with anti-P/Q-type VGCC antibodies. The patient was treated with
cisplatin 75 mg/m² on day 1, combined with etoposide 100 mg/m² on days 1-3, repeated every 3 weeks for four cycles. A video-assisted thoracic surgery revealed histopathological features consistent
with small cell lung cancer (SCLC) with lymph node metastases. The immunohistochemical analysis showed positive markers for AE1/AE3, TTF-1, chromogranin A, and synaptophysin. Notably,
a pulmonary nodule in the left upper lobe disappeared, and FDG-PET/CT post-surgery revealed no primary lesions or metastases."""
NER and Assertion Result:
index | chunks | begin | end | entities | assertion |
---|---|---|---|---|---|
0 | mass | 146 | 149 | Tumor_Finding | Present |
1 | VGCC | 317 | 320 | Biomarker | Present |
2 | cisplatin | 363 | 371 | Chemotherapy | Past |
3 | etoposide | 406 | 414 | Chemotherapy | Present |
4 | for four cycles | 462 | 476 | Duration | Present |
5 | video-assisted thoracic surgery | 481 | 511 | Cancer_Surgery | Past |
6 | small cell lung cancer | 565 | 586 | Carcinoma_Type | Present |
7 | SCLC | 589 | 592 | Carcinoma_Type | Present |
8 | metastases | 611 | 620 | Metastasis | Present |
9 | AE1/AE3 | 684 | 690 | Biomarker | Present |
10 | TTF-1 | 693 | 697 | Biomarker | Present |
11 | chromogranin A | 700 | 713 | Biomarker | Present |
12 | synaptophysin | 720 | 732 | Biomarker | Present |
13 | nodule | 756 | 761 | Tumor_Finding | Present |
14 | disappeared | 786 | 796 | Response_To_Treatment | Present |
15 | primary lesions | 839 | 853 | Tumor_Finding | Absent |
16 | metastases | 858 | 867 | Metastasis | Absent |
Relation Extraction Result:
chunk1 | entity1 | chunk2 | entity2 | relation | confidence |
---|---|---|---|---|---|
cisplatin | Chemotherapy | 75 mg/m² | Dosage | Chemotherapy-Dosage | 1 |
cisplatin | Chemotherapy | day 1 | Cycle_Day | Chemotherapy-Cycle_Day | 1 |
cisplatin | Chemotherapy | 100 mg/m² | Dosage | Chemotherapy-Dosage | 1 |
cisplatin | Chemotherapy | days 1-3 | Cycle_Day | Chemotherapy-Cycle_Day | 1 |
75 mg/m² | Dosage | etoposide | Chemotherapy | Dosage-Chemotherapy | 1 |
day 1 | Cycle_Day | etoposide | Chemotherapy | Cycle_Day-Chemotherapy | 1 |
etoposide | Chemotherapy | 100 mg/m² | Dosage | Chemotherapy-Dosage | 1 |
etoposide | Chemotherapy | days 1-3 | Cycle_Day | Chemotherapy-Cycle_Day | 1 |
45 mm | Tumor_Size | mass | Tumor_Finding | is_size_of | 0.97 |
mediastinal | Site_Other_Body_Part | mass | Tumor_Finding | is_location_of | 0.93 |
histopathological | Pathology_Test | SCLC | Cancer_Dx | is_finding_of | 0.74 |
positive | Biomarker_Result | AE1/AE3 | Biomarker | is_finding_of | 0.91 |
positive | Biomarker_Result | TTF-1 | Biomarker | is_finding_of | 0.90 |
positive | Biomarker_Result | chromogranin A | Biomarker | is_finding_of | 0.89 |
positive | Biomarker_Result | synaptophysin | Biomarker | is_finding_of | 0.72 |
pulmonary | Site_Lung | nodule | Tumor_Finding | is_location_of | 0.93 |
nodule | Tumor_Finding | upper lobe | Site_Lung | is_location_of | 0.93 |
Please check the Task Based Clinical Pretrained Pipelines model for more information.
Introducing The Veterinary MeSH Resolver For Accurately Mapping Veterinary Terms To The Corresponding MeSH Codes
The Veterinary MeSH Resolver is designed to accurately map species-specific terms to MeSH codes in veterinary clinical notes, electronic health records (EHRs), and research articles, improving information retrieval and data analysis. It is especially useful for diagnostics and epidemiological studies, enabling more accurate extraction of disease entities, drug interactions, and treatment outcomes. This makes it an invaluable tool for veterinary informatics and biomedical research.
Example:
...
sbert_embedder = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli", "en", "clinical/models")\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("sbert_embeddings")\
.setCaseSensitive(False)
resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_mesh_veterinary", "en", "clinical/models") \
.setInputCols(["sbert_embeddings"]) \
.setOutputCol("mesh_code")\
.setDistanceFunction("EUCLIDEAN")
sample_text = "The dog is a labrador retriever, 4-year-old, it was brought in with vomiting and diarrhea for the past 2 days. A preliminary diagnosis of canine parvovirus infection was made, and supportive care was recommended. The owner was advised on isolation precautions to prevent the spread of the virus."
Result:
chunk | begin | end | ner_label | resolution | description | all_k_results | all_k_resolutions |
---|---|---|---|---|---|---|---|
vomiting | 68 | 75 | PROBLEM | C536228 | periodic vomiting | C536228:::C007262:::C080875… | periodic vomiting:::vomitoxin:::mirage:::propargite:::ena… |
diarrhea | 81 | 88 | PROBLEM | C565627 | diarrhea, syndromic | C565627:::C564019:::C531700… | diarrhea, syndromic:::diarrhea, chronic, with villous atr… |
canine parvovirus infection | 138 | 164 | PROBLEM | D017993 | canine parvovirus | D017993:::D052660:::D028323… | canine parvovirus:::bovine parvovirus:::porcine parvoviru… |
the virus | 285 | 293 | PROBLEM | D014780 | virus | D014780:::D006678:::D006476… | virus:::aids virus:::andes virus:::virus, associated:::pr… |
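
To show how a distance-based resolver arrives at columns such as `resolution` and `all_k_results`, here is a toy nearest-neighbour lookup in plain Python. It is a sketch only: the three-dimensional vectors are made up for illustration (they are not real sentence embeddings), `TOY_INDEX` and `resolve` are hypothetical names, and only the Euclidean ranking and the `:::`-joined output layout mirror the example above.

```python
import math

# Toy index: MeSH codes and descriptions from the table above, with invented vectors.
TOY_INDEX = [
    ("D017993", "canine parvovirus", (0.9, 0.1, 0.0)),
    ("D014780", "virus", (0.5, 0.5, 0.0)),
    ("C565627", "diarrhea, syndromic", (0.0, 0.2, 0.9)),
]


def resolve(query_vector, k=2):
    """Rank index entries by Euclidean distance and keep the k closest."""
    ranked = sorted(TOY_INDEX, key=lambda entry: math.dist(query_vector, entry[2]))
    top = ranked[:k]
    return {
        "resolution": top[0][0],
        "description": top[0][1],
        "all_k_results": ":::".join(code for code, _, _ in top),
        "all_k_resolutions": ":::".join(description for _, description, _ in top),
    }


if __name__ == "__main__":
    # A made-up embedding for a chunk such as "canine parvovirus infection".
    print(resolve((0.8, 0.2, 0.0)))
```

The best match fills `resolution` and `description`, while the remaining candidates are kept in ranked order so that a reviewer can pick an alternative code.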
New Spanish Medical Entity Resolver for SNOMED Mapping
This model maps Spanish medical entities and concepts to SNOMED codes using the `sent_xlm_roberta_biolord_2023_m` sentence embeddings. It leverages a specialized resolver to accurately link medical terms to SNOMED terminologies, ensuring precise medical coding.
Example:
biolord_embeddings = XlmRoBertaSentenceEmbeddings.pretrained("sent_xlm_roberta_biolord_2023_m", "xx")\
.setInputCols(["ner_chunk_doc"])\
.setOutputCol("biolord_embeddings")
snomed_resolver = SentenceEntityResolverModel\
.pretrained("biolordresolve_snomed_augmented", "es", "clinical/models") \
.setInputCols(["biolord_embeddings"]) \
.setOutputCol("snomed_code")\
.setDistanceFunction("EUCLIDEAN")
clinical_note = ("La paciente, con antecedente de diabetes mellitus gestacional evolucionada a tipo 2 y obesidad, presenta vómitos de una semana de evolución junto con dolorosa inflamación de sínfisis de pubis que dificulta la deambulación.")
Result:
ner_chunk | entity | snomed_code | resolutions | all_codes | all_resolutions |
---|---|---|---|---|---|
diabetes mellitus gestacional | clinical_condition | 11687002 | diabetes mellitus gestacional [diabetes mellitus gestacional] | ['11687002', '40801000119106', '168964… | ['diabetes mellitus gestacional [diabetes mellitus gestacional]… |
obesidad | clinical_condition | 414916001 | obesidad [obesidad] | ['414916001', '414915002', '271590003'… | ['obesidad [obesidad]', 'obeso [obeso]', 'constitución obesa [c… |
vómitos | clinical_condition | 422400008 | vómitos [vómitos] | ['422400008', '249497008', '23971007',… | ['vómitos [vómitos]', 'síntoma de vómito [síntoma de vómito]', … |
dolorosa | clinical_condition | 71393004 | dolorimiento [dolorimiento] | ['71393004', '22253000', '301371003', … | ['dolorimiento [dolorimiento]', 'dolor [dolor]', 'dolor que cor… |
inflamación | clinical_condition | 128139000 | enfermedad inflamatoria [enfermedad inflamatoria] | ['128139000', '409774005', '4532008', … | ['enfermedad inflamatoria [enfermedad inflamatoria]', 'morfolog… |
New Blog Posts On Various Topics
Dive into our latest blog series exploring cutting-edge advancements in healthcare NLP. The integration of Natural Language Processing (NLP) and Large Language Models (LLMs) is transforming healthcare by extracting critical insights from unstructured clinical text. From enhancing genomic research and precision medicine to revolutionizing oncology case analysis, these AI-driven tools enable faster, more accurate decision-making. Additionally, ensuring GDPR-compliant de-identification and detecting dataset shifts in PHI data are crucial for maintaining data security and model performance over time.
- Extracting Key Entities in Clinical Text for Enhanced Genomic Research and Precision Medicine: This blog post explores how John Snow Labs’ Healthcare NLP & LLM library can be used to extract genes and phenotypes from clinical text. By leveraging NLP techniques, we can transform unstructured medical data into actionable insights, enabling more efficient genetic research, clinical diagnostics, and personalized medicine. The blog covers the key steps in training NER and assertion status detection models for this task, including data preparation, annotation, and evaluation, and highlights real-world use cases where extracting genetic and phenotypic entities enhances precision in clinical decision-making.
- AI-Powered Oncology: Healthcare NLP’s Role in Cancer Research and Treatment: This blog post explores how John Snow Labs’ Healthcare NLP & LLM library revolutionizes oncology case analysis by extracting actionable insights from clinical text. Key use cases include detecting valuable information using NER, assertion status, relation extraction, and ICD-10 mapping models; summarizing reports and enabling Q&A with LLMs; and leveraging zero-shot NER for identifying new entities with minimal effort. These approaches streamline oncology data analysis, enhance decision-making, and improve patient outcomes.
- De-Identification of German Medical Text for GDPR Compliance: This blog post explores how specialized de-identification pipelines ensure GDPR compliance and protect patient privacy in German clinical texts. By leveraging advanced NLP techniques such as masking, entity replacement, and context-aware obfuscation, these pipelines anonymize names, dates, and locations while preserving data usability. The blog highlights key challenges posed by complex linguistic structures and high-density personal data.
- Detecting a Dataset Shift in PHI Data to Ensure De-Identification Model Performance on Future Data: This blog post explores how PHI detection models must adapt to evolving clinical data, as dataset shifts (such as changes in document types, patient populations, or medical contexts) can impact performance.
Various Core Improvements: Bug Fixes, Enhanced Overall Robustness, and Reliability of Spark NLP for Healthcare
- Fixed broken `encodeModel` and `setInputSuffix` functions in `LLMLoader`
- Fixed end index issue in `AssertionChunkConverter`
Updated Notebooks And Demonstrations For Making Spark NLP For Healthcare Easier To Navigate And Understand
- New Deidentification_Performance_Comparison_Of_Healthcare_NLP_VS_Cloud_Solutions Notebook
- Updated Loading Medical and Open-Source LLMs Notebook
- Updated Contextual_Entity_Ruler Notebook
- Updated Clinical Deidentification for Structured Data Notebook
- Updated PipelineTracer and PipelineOutputParser Notebook
We Have Added And Updated A Substantial Number Of New Clinical Models And Pipelines, Further Solidifying Our Offering In The Healthcare Domain.
explain_clinical_doc_oncology_slim
biolordresolve_snomed_augmented
sbiobertresolve_mesh
sbiobertresolve_mesh_augmented
sbiobertresolve_mesh_veterinary
For all Spark NLP for Healthcare models, please check: Models Hub Page