Healthcare NLP v2.7.3 Release Notes

2.7.3

We are glad to announce that Spark NLP for Healthcare 2.7.3 has been released!

Highlights:

Introducing a brand-new RelationExtractionDL Annotator – Achieving SOTA results in clinical relation extraction using BioBert.
Massive Improvements & feature enhancements in De-Identification module:
- Introduction of faker augmentation in Spark NLP for Healthcare to generate random data for obfuscation in de-identification module.
- Brand-new annotator for Structured De-Identification.
Drug Normalizer: Normalize medication-related phrases (dosage, form and strength) and abbreviations in text and named entities extracted by NER models.
Confidence scores in assertion output: just like NER output, assertion models now also support confidence scores for each prediction.
Cosine similarity metrics in entity resolvers to get more informative and semantically correct results.
AuxLabel in the metadata of entity resolvers to return additional mappings.
New Relation Extraction models to extract relations between body parts and clinical entities.
New Entity Resolver models to extract billable medical codes.
New Clinical Pretrained NER models.
Bug fixes & general improvements.
Matching the version with Spark NLP open-source v2.7.3.

1. Improvements in De-Identification Module:

Integration of the faker library to automatically generate random data such as names, dates, and addresses, so users don't have to specify dummy data (custom obfuscation files can still be used). It also improves the obfuscation results because values are drawn from a bigger pool of random values.

How to use:

Set the flag setObfuscateRefSource to faker

deidentification = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk"])\
    .setOutputCol("deidentified")\
    .setMode("obfuscate")\
    .setObfuscateRefSource("faker")

For more details: Check out this notebook
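
To illustrate what drawing replacements from a generated pool means, here is a minimal sketch in plain Python. It is not the DeIdentification implementation: the `FAKE_POOL` values, the `obfuscate` helper, the sample note and the character offsets are all hypothetical stand-ins for what a faker-style generator and an NER stage would supply.

```python
import random

# Hypothetical pools standing in for values a faker-style generator would produce.
FAKE_POOL = {
    "NAME": ["Menne Erdôs", "Longin Robinson", "John Wakeland"],
    "DATE": ["2001-03-14", "1997-11-02", "2010-07-29"],
}

def obfuscate(text, chunks, rng):
    """Replace each (start, end, label) chunk with a random value for its label.

    Chunks are processed right to left so earlier offsets stay valid.
    """
    for start, end, label in sorted(chunks, reverse=True):
        pool = FAKE_POOL.get(label)
        if pool is None:
            continue  # no pool for this label: leave the text untouched
        text = text[:start] + rng.choice(pool) + text[end:]
    return text

rng = random.Random(0)  # seeded only to make the sketch reproducible
note = "Cecilia Chapman was admitted on 2020-01-05."
chunks = [(0, 15, "NAME"), (32, 42, "DATE")]  # assumed NER output
masked = obfuscate(note, chunks, rng)
print(masked)
```

A larger pool makes it less likely that the same replacement value repeats across documents, which is the benefit described above.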

2. Structured De-Identification Module:

Introduction of a new annotator to handle de-identification of structured data. It allows users to define a mapping of columns to their obfuscation policy. Users can also provide dummy data and map it to the columns whose values they want to replace.

How to use:

obfuscator = StructuredDeidentification(
    spark,
    {"NAME": "PATIENT", "AGE": "AGE"},
    obfuscateRefSource="faker")

obfuscator_df = obfuscator.obfuscateColumns(df)

obfuscator_df.select("NAME","AGE").show(truncate=False)

Example:

Input Data:

Name	Age
Cecilia Chapman	83
Iris Watson	9
Bryar Pitts	98
Theodore Lowe	16
Calista Wise	76

Deidentified:

Name	Age
Menne Erdôs	20
Longin Robinson	31
Flynn Fiedlerová	50
John Wakeland	21
Vanessa Andersson	12

For more details: Check out this notebook.
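
The column-to-policy mapping can be pictured with a small sketch in plain Python. This is an illustration under assumptions, not how StructuredDeidentification works internally: the `GENERATORS` table, the `obfuscate_rows` helper and the extra `WARD` column are hypothetical.

```python
import random

# Column -> entity type, mirroring the {"NAME": "PATIENT", "AGE": "AGE"} mapping above.
COLUMN_POLICY = {"NAME": "PATIENT", "AGE": "AGE"}

# Hypothetical generators per entity type (stand-ins for faker-backed ones).
GENERATORS = {
    "PATIENT": lambda rng: rng.choice(["John Wakeland", "Vanessa Andersson"]),
    "AGE": lambda rng: rng.randint(1, 99),
}

def obfuscate_rows(rows, policy, rng):
    """Return new rows where every mapped column gets a generated value."""
    result = []
    for row in rows:
        new_row = dict(row)  # copy so unmapped columns pass through unchanged
        for column, entity in policy.items():
            if column in new_row:
                new_row[column] = GENERATORS[entity](rng)
        result.append(new_row)
    return result

rows = [
    {"NAME": "Cecilia Chapman", "AGE": 83, "WARD": "B2"},
    {"NAME": "Iris Watson", "AGE": 9, "WARD": "A1"},
]
masked_rows = obfuscate_rows(rows, COLUMN_POLICY, random.Random(1))
for row in masked_rows:
    print(row)
```

Columns that are not listed in the mapping (here `WARD`) are returned unchanged, and the input rows are not modified.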

3. Introducing SOTA relation extraction model using BioBert

A brand-new, end-to-end trained BERT model that brings large improvements over the previous models (see the benchmarks below). A new annotator, RENerChunksFilter, was also developed for this model so that syntactic features work well with BioBert when extracting relations.

How to use:

re_ner_chunk_filter = RENerChunksFilter()\
    .setInputCols(["ner_chunks", "dependencies"])\
    .setOutputCol("re_ner_chunks")\
    .setRelationPairs(pairs)\
    .setMaxSyntacticDistance(4)

re_model = RelationExtractionDLModel()\
    .pretrained("redl_temporal_events_biobert", "en", "clinical/models")\
    .setPredictionThreshold(0.9)\
    .setInputCols(["re_ner_chunks", "sentences"])\
    .setOutputCol("relations")

Benchmarks:

on benchmark datasets

model	Spark NLP ML model	Spark NLP DL model	benchmark
re_temporal_events_clinical	68.29	71.0	80.2 [1]
re_clinical	56.45	69.2	68.2 [2]
re_human_phenotype_gene_clinical	-	87.9	67.2 [3]
re_drug_drug_interaction	-	72.1	83.8 [4]
re_chemprot	76.69	94.1	83.64 [5]

on in-house annotations

model	Spark NLP ML model	Spark NLP DL model
re_bodypart_problem	84.58	85.7
re_bodypart_procedure	61.0	63.3
re_date_clinical	83.0	84.0
re_bodypart_direction	93.5	92.5

For more details: Check out the notebook or modelshub.

4. Drug Normalizer:

Standardize units of drugs and handle abbreviations in raw text or drug chunks identified by any NER model. This normalization significantly improves performance of entity resolvers.

How to use:

drug_normalizer = DrugNormalizer()\
    .setInputCols("document")\
    .setOutputCol("document_normalized")\
    .setPolicy("all") #all/abbreviations/dosages

Examples:

drug_normalizer.transform("adalimumab 54.5 + 43.2 gm")

>>> "adalimumab 97700 mg"

Changes: combine 54.5 + 43.2 and normalize gm to mg

drug_normalizer.transform("Agnogenic one half cup")

>>> "Agnogenic 0.5 oral solution"

Changes: replace one half with 0.5, normalize cup to oral solution

drug_normalizer.transform("interferon alfa-2b 10 million unit ( 1 ml ) injec")

>>> "interferon alfa - 2b 10000000 unt ( 1 ml ) injection "

Changes: convert 10 million unit to 10000000 unt, replace injec with injection

For more details: Check out this notebook
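
The first example above (summing dosages and converting gm to mg) can be approximated with a few lines of plain Python. This is only a sketch of the idea, not the DrugNormalizer implementation: the `TO_MG` table, the `DOSE_SUM` pattern and the `normalize_dose` helper are hypothetical, and `Decimal` is used to avoid floating-point rounding noise.

```python
import re
from decimal import Decimal

# Hypothetical unit table: factor to convert each unit to milligrams.
TO_MG = {"gm": Decimal(1000), "g": Decimal(1000), "mg": Decimal(1)}

# Matches "<number> + <number> <unit>", e.g. "54.5 + 43.2 gm".
DOSE_SUM = re.compile(r"(\d+(?:\.\d+)?)\s*\+\s*(\d+(?:\.\d+)?)\s*([a-zA-Z]+)\b")

def normalize_dose(text):
    """Combine summed dosages and express them in mg where the unit is known."""
    def repl(match):
        first, second, unit = match.groups()
        factor = TO_MG.get(unit.lower())
        if factor is None:
            return match.group(0)  # unknown unit: leave the phrase as written
        total = (Decimal(first) + Decimal(second)) * factor
        if total == total.to_integral_value():
            return f"{int(total)} mg"
        return f"{total.normalize()} mg"
    return DOSE_SUM.sub(repl, text)

print(normalize_dose("adalimumab 54.5 + 43.2 gm"))  # adalimumab 97700 mg
```

Phrases with a unit that is missing from the table are returned unchanged rather than guessed at.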

5. Assertion models to support confidence in output:

Just like NER output, assertion models now also provide a confidence score for each prediction.

chunks	entities	assertion	confidence
a headache	PROBLEM	present	0.9992
anxious	PROBLEM	conditional	0.9039
alopecia	PROBLEM	absent	0.9992
pain	PROBLEM	absent	0.9238
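
One way to use these scores downstream is to keep only high-confidence assertions. The sketch below works on the rows of the table above as plain Python tuples; the `confident` helper and the 0.95 cut-off are assumptions chosen for illustration, not part of the library.

```python
# Rows copied from the table above: (chunk, entity, assertion, confidence).
predictions = [
    ("a headache", "PROBLEM", "present", 0.9992),
    ("anxious", "PROBLEM", "conditional", 0.9039),
    ("alopecia", "PROBLEM", "absent", 0.9992),
    ("pain", "PROBLEM", "absent", 0.9238),
]

def confident(rows, threshold):
    """Keep only predictions whose confidence is at or above the threshold."""
    return [row for row in rows if row[3] >= threshold]

# 0.95 is an arbitrary cut-off chosen for illustration.
kept = confident(predictions, 0.95)
for chunk, _entity, assertion, score in kept:
    print(chunk, assertion, score)
```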

The .setClasses() method is deprecated in AssertionDLApproach. Users no longer need to specify the number of classes while training, as it is inferred from the dataset.

6. New Relation Extraction Models:

We are also releasing new relation extraction models to link clinical entities to body parts and dates. These models are trained using a binary relation extraction approach for better accuracy.

- re_bodypart_direction : Relation Extraction between Body Part and Direction entities.

Example:

Text: “MRI demonstrated infarction in the upper brain stem , left cerebellum and right basil ganglia”

relations	entity1	chunk1	entity2	chunk2	confidence
1	Direction	upper	bodyPart	brain stem	0.999
0	Direction	upper	bodyPart	cerebellum	0.999
0	Direction	upper	bodyPart	basil ganglia	0.999
0	bodyPart	brain stem	Direction	left	0.999
0	bodyPart	brain stem	Direction	right	0.999
1	Direction	left	bodyPart	cerebellum	1.0
0	Direction	left	bodyPart	basil ganglia	0.976
0	bodyPart	cerebellum	Direction	right	0.953
1	Direction	right	bodyPart	basil ganglia	1.0

- re_bodypart_problem : Relation Extraction between Body Part and Problem entities.

Example:

Text: “No neurologic deficits other than some numbness in his left hand.”

relation	entity1	chunk1	entity2	chunk2	confidence
0	Symptom	neurologic deficits	bodyPart	hand	1
1	Symptom	numbness	bodyPart	hand	1

- re_bodypart_proceduretest : Relation Extraction between Body Part and Procedure, Test entities.

Example:

Text: “TECHNIQUE IN DETAIL: After informed consent was obtained from the patient and his mother, the chest was scanned with portable ultrasound.”

relation	entity1	chunk1	entity2	chunk2	confidence
1	bodyPart	chest	Test	portable ultrasound	0.999

- re_date_clinical : Relation Extraction between Date and different clinical entities.

Example:

Text: “This 73 y/o patient had CT on 1/12/95, with progressive memory and cognitive decline since 8/11/94.”

relations	entity1	chunk1	entity2	chunk2	confidence
1	Test	CT	Date	1/12/95	1.0
1	Symptom	progressive memory and cognitive decline	Date	8/11/94	1.0

How to use:

re_model = RelationExtractionModel()\
    .pretrained("re_bodypart_direction","en","clinical/models")\
    .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
    .setOutputCol("relations")\
    .setMaxSyntacticDistance(4)\
    .setRelationPairs(['Internal_organ_or_component', 'Direction'])

For more details: Check out the notebook or modelshub.
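
In a binary approach, every candidate pair of entities is scored as related (1) or not related (0), which is why the re_bodypart_direction example above lists nine rows for three directions and three body parts. The following plain-Python sketch only reproduces that candidate generation step; the `candidate_pairs` helper is hypothetical and no model is involved.

```python
from itertools import product

# Entities from the example sentence above, as (chunk, label) in text order.
entities = [
    ("upper", "Direction"),
    ("brain stem", "bodyPart"),
    ("left", "Direction"),
    ("cerebellum", "bodyPart"),
    ("right", "Direction"),
    ("basil ganglia", "bodyPart"),
]

def candidate_pairs(ents, left_label, right_label):
    """Pair every entity of one label with every entity of the other."""
    lefts = [chunk for chunk, label in ents if label == left_label]
    rights = [chunk for chunk, label in ents if label == right_label]
    return list(product(lefts, rights))

pairs = candidate_pairs(entities, "Direction", "bodyPart")
print(len(pairs))  # 9 candidates, each to be scored as related (1) or not (0)
```

Options such as setMaxSyntacticDistance and setRelationPairs in the snippet above narrow down which candidates are scored.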

New matching scheme for entity resolvers - improved accuracy: Adding the option to use cosine similarity to resolve entities and find closest matches, resulting in better, more semantically correct results.
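
As a reminder of what cosine-based matching computes, here is a small self-contained sketch in plain Python. The `cosine_distance` and `closest` helpers and the toy 3-dimensional vectors are made up for illustration; they are not the resolver's code, and real sentence embeddings have hundreds of dimensions.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

def closest(query, candidates, k=2):
    """Rank (label, vector) candidates by cosine distance to the query."""
    scored = [(cosine_distance(query, vec), label) for label, vec in candidates]
    return sorted(scored)[:k]

# Toy vectors, invented for this sketch.
candidates = [
    ("bladder cancer", [0.9, 0.1, 0.0]),
    ("tumor of bladder neck", [0.7, 0.3, 0.1]),
    ("fracture of wrist", [0.0, 0.2, 0.9]),
]
ranking = closest([1.0, 0.1, 0.0], candidates)
for distance, label in ranking:
    print(f"{label}: {distance:.4f}")
```

Because cosine distance depends only on the angle between vectors, candidates are ranked by direction in embedding space rather than by vector magnitude.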

7. New Resolver Models using `JSL SBERT`:

sbiobertresolve_icd10cm_augmented
sbiobertresolve_cpt_augmented
sbiobertresolve_cpt_procedures_augmented
sbiobertresolve_icd10cm_augmented_billable_hcc
sbiobertresolve_hcc_augmented

Returning auxiliary columns mapped to resolutions: The chunk entity resolver and sentence entity resolver now return auxiliary data that was mapped to the resolutions during training. This allows users to get multiple resolutions from a single model without adding any other annotator to the pipeline (previously, getting billable codes required other modules in the same pipeline).

Example:

sbiobertresolve_icd10cm_augmented_billable_hcc
Input Text: “bladder cancer”

idx	chunks	code	resolutions	all_codes	billable	hcc_status	hcc_score	all_distances
0	bladder cancer	C679	[‘bladder cancer’, ‘suspected bladder cancer’, ‘cancer in situ of urinary bladder’, ‘tumor of bladder neck’, ‘malignant tumour of bladder neck’]	[‘C679’, ‘Z126’, ‘D090’, ‘D494’, ‘C7911’]	[‘1’, ‘1’, ‘1’, ‘1’, ‘1’]	[‘1’, ‘0’, ‘0’, ‘0’, ‘1’]	[‘11’, ‘0’, ‘0’, ‘0’, ‘8’]	[‘0.0000’, ‘0.0904’, ‘0.0978’, ‘0.1080’, ‘0.1281’]
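
Since the auxiliary labels are returned as parallel lists, they can be lined up with the candidate codes position by position. The sketch below does this in plain Python with the values from the row above; the `hcc_candidates` helper is hypothetical and assumes the lists have already been parsed out of the resolver output.

```python
# Values copied from the sbiobertresolve_icd10cm_augmented_billable_hcc row above.
all_codes = ["C679", "Z126", "D090", "D494", "C7911"]
billable = ["1", "1", "1", "1", "1"]
hcc_status = ["1", "0", "0", "0", "1"]
hcc_score = ["11", "0", "0", "0", "8"]

def hcc_candidates(codes, billable_flags, hcc_flags, hcc_scores):
    """Pair each code with its auxiliary labels and keep billable HCC codes."""
    rows = zip(codes, billable_flags, hcc_flags, hcc_scores)
    return [
        (code, int(score))
        for code, is_billable, is_hcc, score in rows
        if is_billable == "1" and is_hcc == "1"
    ]

selected = hcc_candidates(all_codes, billable, hcc_status, hcc_score)
print(selected)  # [('C679', 11), ('C7911', 8)]
```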

sbiobertresolve_cpt_augmented
Input Text: “ct abdomen without contrast”

idx	cpt code	distance	resolutions
0	74150	0.0802	Computed tomography, abdomen; without contrast material
1	65091	0.1312	Evisceration of ocular contents; without implant
2	70450	0.1323	Computed tomography, head or brain; without contrast material
3	74176	0.1333	Computed tomography, abdomen and pelvis; without contrast material
4	74185	0.1343	Magnetic resonance imaging without contrast
5	77059	0.1343	Magnetic resonance imaging without contrast

8. New Pretrained Clinical NER Models

NER Radiology
Input Text: “Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.”

idx	chunks	entities
0	Bilateral	Direction
1	breast	BodyPart
2	ultrasound	ImagingTest
3	ovoid mass	ImagingFindings
4	0.5 x 0.5 x 0.4	Measurements
5	cm	Units
6	anteromedial aspect	Direction
7	left	Direction
8	shoulder	BodyPart
9	mass	ImagingFindings
10	isoechoic echotexture	ImagingFindings
11	muscle	BodyPart
12	internal color flow	ImagingFindings
13	benign fibrous tissue	ImagingFindings
14	lipoma	Disease_Syndrome_Disorder
