5.1.4
Highlights
We are delighted to announce a suite of remarkable enhancements and updates in our latest release of Spark NLP for Healthcare. This release comes with the advanced Document Splitter annotator specifically designed for medical context and more flexibility in Deidentification as well as robust exception handling in MedicalNerModel
for corrupted inputs.
- Introducing the Advanced
Medical Document Splitter
annotator with more flexibility and customization for RAG pipelines - Enhancing
ChunkFiltererApproach
by introducing json-based entity confidence configuration - Advanced data privacy with
DeIdentification
unleashing custom regex patterns - Robust exception handling in
MedicalNerModel
for corrupted inputs - Various core improvements; bug fixes, enhanced overall robustness and reliability of Spark NLP for Healthcare
- Enhanced the
ChunkSentenceSplitter
annotator and revised documentation - Updated the
training_log_parser
and its utility script to align with the latest NumPy version - Transitioned the SecureRandom algorithm from Spark configuration to the system environment. Please check the previous Random Seed Algorithm implementation for details.
- Revised some imports for improved functionality in Deidentification
- Enhanced the
- New and updated demos
- New Drug Resolver Demo
- Updated Multi Language Clinical NER Demo
- New Medical Document Splitter Notebook
- Updated Clinical Named Entity Recognition Notebook with
setDoExceptionHandling
information - Updated Clinical DeIdentification Notebook with
setRegexPatternsDictionaryAsJsonString
andsetCombineRegexPatterns
examples - Updated Calculate Medicare Risk Adjustment Score Notebook with the latest improvement
These enhancements will elevate your experience with Spark NLP for Healthcare, enabling more efficient, accurate, and streamlined healthcare-related natural language data analysis.
Introducing the Advanced Medical Document Splitter
Annotator with More Flexibility and Customization for RAG Pipelines
Discover our cutting-edge Internal Document Splitter—an innovative annotator designed to effortlessly break down extensive documents into manageable segments. Empowering users with the ability to define custom separators, this tool seamlessly divides texts, ensuring each chunk adheres to specified length criteria.
InternalDocumentSplitter has a setSplitMode method to decide how to split documents. Default: ‘regex’. It should be one of the following values:
char
: Split text based on individual characters.token
: Split text based on tokens. You should supply tokens from inputCols.sentence
: Split text based on sentences. You should supply sentences from inputCols.recursive
: Split text recursively using a specific algorithm.regex
: Split text based on a regular expression pattern.
Example:
document_splitter = InternalDocumentSplitter()\
.setInputCols("document")\
.setOutputCol("splits")\
.setSplitMode("recursive")\
.setChunkSize(100)\
.setChunkOverlap(3)\
.setExplodeSplits(True)\
.setPatternsAreRegex(False)\
.setSplitPatterns(["\n\n", "\n", " "])\
.setKeepSeparators(False)\
.setTrimWhitespace(True)
text = [(
"The patient is a 28-year-old, who is status post gastric bypass surgery"
" nearly one year ago. \nHe has lost about 200 pounds and was otherwise doing well"
" until yesterday evening around 7:00-8:00 when he developed nausea and right upper quadrant pain,"
" which apparently wrapped around toward his right side and back. He feels like he was on it"
" but has not done so. He has overall malaise and a low-grade temperature of 100.3."
" \n\nHe denies any prior similar or lesser symptoms. His last normal bowel movement was yesterday."
" He denies any outright chills or blood per rectum."
)]
Result:
sentence | doc_id |
---|---|
The patient is a 28-year-old, who is status post gastric bypass surgery nearly one year ago. | 0 |
He has lost about 200 pounds and was otherwise doing well until yesterday evening around 7:00-8:00 | 1 |
when he developed nausea and right upper quadrant pain, which apparently wrapped around toward his | 2 |
his right side and back. He feels like he was on it but has not done so. He has overall malaise and | 3 |
and a low-grade temperature of 100.3. | 4 |
He denies any prior similar or lesser symptoms. His last normal bowel movement was yesterday. He | 5 |
He denies any outright chills or blood per rectum. | 6 |
Please check Medical Document Splitter Notebook for more information
Enhancing ChunkFiltererApproach
by Introducing JSON-Based Entity Confidence Configuration
The new setEntitiesConfidenceResourceAsJsonString
method allows users to finely tune entity confidence levels using a JSON configuration.
Example:
chunk_filterer = ChunkFiltererApproach()\
.setInputCols("sentence","ner_chunk")\
.setOutputCol("chunk_filtered")\
.setFilterEntity("entity")\
.setEntitiesConfidenceResourceAsJsonString("""{'DURATION':'0.9',
'DOSAGE':'0.9',
'FREQUENCY':'0.9',
'STRENGTH':'0.9',
'DRUG':'0.9'}""")
text ='The patient was prescribed 1 capsule of Advil for 5 days . He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night , 12 units of insulin lispro with meals , metformin 1000 mg two times a day . It was determined that all SGLT2 inhibitors should be discontinued indefinitely fro 3 months .'
Without Filtering Results:
chunks | begin | end | sentence_id | entities | confidence |
---|---|---|---|---|---|
1 capsule of Advil | 27 | 44 | 0 | DRUG | 0.64 |
for 5 days | 46 | 55 | 0 | DURATION | 0.55 |
40 units of insulin glargine | 126 | 153 | 1 | DRUG | 0.62 |
at night | 155 | 162 | 1 | FREQUENCY | 0.74 |
12 units of insulin lispro | 166 | 191 | 1 | DRUG | 0.67 |
with meals | 193 | 202 | 1 | FREQUENCY | 0.72 |
metformin 1000 mg | 206 | 222 | 1 | DRUG | 0.70 |
two times a day | 224 | 238 | 1 | FREQUENCY | 0.67 |
SGLT2 inhibitors | 269 | 284 | 2 | DRUG | 0.89 |
Filtered Results:
chunks | begin | end | sentence_id | entities | confidence |
---|---|---|---|---|---|
at night | 155 | 162 | 1 | FREQUENCY | 0.74 |
with meals | 193 | 202 | 1 | FREQUENCY | 0.72 |
SGLT2 inhibitors | 269 | 284 | 2 | DRUG | 0.89 |
Advanced Data Privacy with DeIdentification Unleashing Regex Patterns Unleashed
The latest update in the DeIdentification library brings advanced data privacy with the introduction of the setRegexPatternsDictionaryAsJsonString method. This powerful feature empowers users to create custom regular expression patterns for masking specific protected entities.
Example:
deid = DeIdentification()\
.setInputCols(["sentence", "token", "ner_chunk"]) \
.setOutputCol("deidentified") \
.setMode("mask") \
.setRegexPatternsDictionaryAsJsonString("{'NUMBER':'\d+'},"+
"{'NUMBER':'(\d+.?\d+.?\d+)'}")\
.setRegexOverride(True) # Prioritizing regex rules
text ='''
Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson Ora , MR # 7194334 Date : 01/13/93 . PCP : Oliveira , 25 years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital , Keats Street , ZIP 45662, Phone 55-555-5555 .
'''
Without RegexOverride Results (default regex):
[Record date : <DATE> , David Hale , M.D ., , Name : Hendrickson Ora , Date : <DATE> ., PCP : Oliveira , 25 years-old , Record date : <DATE> ., Cocke County Baptist Hospital , Keats Street , ZIP 45662, Phone <PHONE> .]
With RegexOverride Results (custom regex):
[Record date : <NUMBER> , David Hale , M.D ., , Name : Hendrickson Ora , Date : <NUMBER> ., PCP : Oliveira , <NUMBER> years-old , Record date : <NUMBER> ., Cocke County Baptist Hospital , Keats Street , ZIP <NUMBER>, Phone <NUMBER> .]
- Merging default regex rules and custom user-defined regex with
setCombineRegexPatterns
Example:
deid = DeIdentification()\
.setInputCols(["sentence", "token", "ner_chunk"]) \
.setOutputCol("deidentified") \
.setMode("mask") \
.setCombineRegexPatterns(True)\
.setRegexPatternsDictionary("./custom_regex.txt")
Please check: Clinical DeIdentification Notebook for more information
Robust Exception Handling in MedicalNerModel
for Corrupted Inputs
Enhance the resilience of your Medical Named Entity Recognition (NER) model with the ExceptionHandling feature. When the setDoExceptionHandling
is set to true, the model attempts to compute batch-wise as usual. In the event of an exception within a batch, the system switches to row-wise processing. Any exception during row processing results in the emission of an Error Annotation, ensuring that only the problematic rows are lost rather than the entire batch.
Example:
clinical_ner = MedicalNerModel.pretrained("ner_oncology", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")\
.setDoExceptionHandling(True)
Please check: Clinical Named Entity Recognition Notebook for more information
Various Core Improvements: Bug Fixes, Enhanced Overall Robustness, And Reliability Of Spark NLP For Healthcare
- Enhanced the
ChunkSentenceSplitter
annotator and revised documentation - Updated the
training_log_parser
and its utility script to align with the latest NumPy version - Transitioned the SecureRandom algorithm from Spark configuration to the system environment. Please check the previous Random Seed Algorithm implementation for details.
- Revised some imports for improved functionality in Deidentification
Updated Notebooks And Demonstrations For making Spark NLP For Healthcare Easier To Navigate And Understand
- New Drug Resolver Demo
- Updated Multi Language Clinical NER Demo
- New Medical Document Splitter Notebook
- Updated Clinical Named Entity Recognition Notebook with
setDoExceptionHandling
information - Updated Clinical DeIdentification Notebook with
setRegexPatternsDictionaryAsJsonString
andsetCombineRegexPatterns
examples - Updated Calculate Medicare Risk Adjustment Score Notebook with the latest improvement
For all Spark NLP for Healthcare models, please check: Models Hub Page
Versions
- 5.5.0
- 5.4.1
- 5.4.0
- 5.3.3
- 5.3.2
- 5.3.1
- 5.3.0
- 5.2.1
- 5.2.0
- 5.1.4
- 5.1.3
- 5.1.2
- 5.1.1
- 5.1.0
- 5.0.2
- 5.0.1
- 5.0.0
- 4.4.4
- 4.4.3
- 4.4.2
- 4.4.1
- 4.4.0
- 4.3.2
- 4.3.1
- 4.3.0
- 4.2.8
- 4.2.4
- 4.2.3
- 4.2.2
- 4.2.1
- 4.2.0
- 4.1.0
- 4.0.2
- 4.0.0
- 3.5.3
- 3.5.2
- 3.5.1
- 3.5.0
- 3.4.2
- 3.4.1
- 3.4.0
- 3.3.4
- 3.3.2
- 3.3.1
- 3.3.0
- 3.2.3
- 3.2.2
- 3.2.1
- 3.2.0
- 3.1.3
- 3.1.2
- 3.1.1
- 3.1.0
- 3.0.3
- 3.0.2
- 3.0.1
- 3.0.0
- 2.7.6
- 2.7.5
- 2.7.4
- 2.7.3
- 2.7.2
- 2.7.1
- 2.7.0
- 2.6.2
- 2.6.0
- 2.5.5
- 2.5.3
- 2.5.2
- 2.5.0
- 2.4.6
- 2.4.5
- 2.4.2
- 2.4.1
- 2.4.0