Spark NLP for Healthcare Release Notes 2.7.3

 

2.7.3

We are glad to announce that Spark NLP for Healthcare 2.7.3 has been released!

Highlights:

  • Introducing a brand-new RelationExtractionDL Annotator – Achieving SOTA results in clinical relation extraction using BioBert.
  • Major improvements and feature enhancements in the De-Identification module:
    • Introduction of faker augmentation in Spark NLP for Healthcare to generate random data for obfuscation in the de-identification module.
    • Brand-new annotator for Structured De-Identification.
  • Drug Normalizer: Normalize medication-related phrases (dosage, form and strength) and abbreviations in text and named entities extracted by NER models.
  • Confidence scores in assertion output: just like NER output, assertion models now also support confidence scores for each prediction.
  • Cosine similarity metrics in entity resolvers to get more informative and semantically correct results.
  • AuxLabel in the metadata of entity resolvers to return additional mappings.
  • New Relation Extraction models to extract relations between body parts and clinical entities.
  • New Entity Resolver models to extract billable medical codes.
  • New Clinical Pretrained NER models.
  • Bug fixes & general improvements.
  • Matching the version with Spark NLP open-source v2.7.3.

1. Improvements in De-Identification Module:

Integration of the faker library to automatically generate random data such as names, dates and addresses, so users don't have to specify dummy data (custom obfuscation files can still be used). It also improves the obfuscation results because values are drawn from a bigger pool of random values.

How to use:

Set the flag setObfuscateRefSource to faker

deidentification = DeIdentification()\
    .setInputCols(["sentence", "token", "ner_chunk"])\
    .setOutputCol("deidentified")\
    .setMode("obfuscate")\
    .setObfuscateRefSource("faker")

For more details: Check out this notebook
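To make the idea of obfuscation with generated values concrete, here is a minimal pure-Python sketch. It is not the library's implementation: the surrogate pools, the `obfuscate` helper and the entity labels are hypothetical, and in Spark NLP for Healthcare the replacement values come from the faker integration instead of a hand-written pool.

```python
import random

# Hypothetical surrogate pools; the faker integration supplies a much larger pool.
SURROGATES = {
    "NAME": ["Alex Morgan", "Jamie Lee", "Sam Carter"],
    "DATE": ["2001-03-14", "1998-11-02", "2010-07-21"],
}

def obfuscate(text, chunks, seed=0):
    """Replace each (chunk_text, entity_label) pair in `text` with a random surrogate.

    The same chunk always maps to the same surrogate within one call.
    """
    rng = random.Random(seed)
    mapping = {}
    for chunk, label in chunks:
        if chunk not in mapping:
            mapping[chunk] = rng.choice(SURROGATES[label])
        text = text.replace(chunk, mapping[chunk])
    return text, mapping

note = "John Smith was admitted on 2020-01-05. John Smith was discharged later."
masked, mapping = obfuscate(note, [("John Smith", "NAME"), ("2020-01-05", "DATE")])
print(masked)
```

A larger pool makes it less likely that two different patients receive the same surrogate, which is the benefit the faker integration is described as providing.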

2. Structured De-Identification Module:

Introduction of a new annotator to handle de-identification of structured data. It allows users to define a mapping between columns and their obfuscation policy. Users can also provide dummy data and map it to the columns whose values they want to replace.

How to use:

obfuscator = StructuredDeidentification(
    spark,
    {"NAME": "PATIENT", "AGE": "AGE"},
    obfuscateRefSource="faker"
)

obfuscator_df = obfuscator.obfuscateColumns(df)

obfuscator_df.select("NAME","AGE").show(truncate=False)

Example:

Input Data:

| Name | Age |
|---|---|
| Cecilia Chapman | 83 |
| Iris Watson | 9 |
| Bryar Pitts | 98 |
| Theodore Lowe | 16 |
| Calista Wise | 76 |

Deidentified:

| Name | Age |
|---|---|
| Menne Erdôs | 20 |
| Longin Robinson | 31 |
| Flynn Fiedlerová | 50 |
| John Wakeland | 21 |
| Vanessa Andersson | 12 |

For more details: Check out this notebook.
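The column-to-policy mapping can be pictured with a small pure-Python sketch over a list of rows. This is only an illustration of the idea, not how `StructuredDeidentification` is implemented: the generator functions, the `obfuscate_columns` helper and the extra `WARD` column are hypothetical.

```python
import random

# Hypothetical generators keyed by entity type; not the library's implementation.
def fake_patient(rng):
    return rng.choice(["Alex Morgan", "Jamie Lee", "Sam Carter", "Robin Diaz"])

def fake_age(rng):
    return rng.randint(1, 99)

GENERATORS = {"PATIENT": fake_patient, "AGE": fake_age}

def obfuscate_columns(rows, column_policy, seed=0):
    """Return copies of `rows` with every mapped column replaced by a generated value."""
    rng = random.Random(seed)
    result = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        for column, entity in column_policy.items():
            new_row[column] = GENERATORS[entity](rng)
        result.append(new_row)
    return result

rows = [
    {"NAME": "Cecilia Chapman", "AGE": 83, "WARD": "A"},
    {"NAME": "Iris Watson", "AGE": 9, "WARD": "B"},
]
masked_rows = obfuscate_columns(rows, {"NAME": "PATIENT", "AGE": "AGE"})
for row in masked_rows:
    print(row)
```

Columns that are not listed in the mapping (here `WARD`) pass through unchanged.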

3. Introducing SOTA relation extraction model using BioBert

A brand-new BERT model trained end to end, resulting in large improvements over the previous models. Another new annotator (RENerChunksFilter) was also developed for this model so that syntactic features work well with BioBert when extracting relations.

How to use:

re_ner_chunk_filter = RENerChunksFilter()\
    .setInputCols(["ner_chunks", "dependencies"])\
    .setOutputCol("re_ner_chunks")\
    .setRelationPairs(pairs)\
    .setMaxSyntacticDistance(4)

re_model = RelationExtractionDLModel()\
    .pretrained("redl_temporal_events_biobert", "en", "clinical/models")\
    .setPredictionThreshold(0.9)\
    .setInputCols(["re_ner_chunks", "sentences"])\
    .setOutputCol("relations")

Benchmarks:

on benchmark datasets

| model | Spark NLP ML model | Spark NLP DL model | benchmark |
|---|---|---|---|
| re_temporal_events_clinical | 68.29 | 71.0 | 80.2 [1] |
| re_clinical | 56.45 | 69.2 | 68.2 [2] |
| re_human_phenotype_gene_clinical | - | 87.9 | 67.2 [3] |
| re_drug_drug_interaction | - | 72.1 | 83.8 [4] |
| re_chemprot | 76.69 | 94.1 | 83.64 [5] |

on in-house annotations

| model | Spark NLP ML model | Spark NLP DL model |
|---|---|---|
| re_bodypart_problem | 84.58 | 85.7 |
| re_bodypart_procedure | 61.0 | 63.3 |
| re_date_clinical | 83.0 | 84.0 |
| re_bodypart_direction | 93.5 | 92.5 |

For more details: Check out the notebook or modelshub.
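The usage snippet in this section restricts candidate pairs and sets a prediction threshold. As a rough pure-Python illustration of those two ideas (not the annotators' actual implementation), the sketch below keeps a candidate only when its entity types form an allowed pair and its score reaches the threshold. The `select_relations` helper, the order-insensitive pair matching and the confidence values are assumptions made for the example.

```python
# Hypothetical candidate relations, shaped like the example tables in these notes.
candidates = [
    {"entity1": "Test", "chunk1": "CT", "entity2": "Date",
     "chunk2": "1/12/95", "confidence": 0.97},
    {"entity1": "Symptom", "chunk1": "cognitive decline", "entity2": "Date",
     "chunk2": "8/11/94", "confidence": 0.62},
    {"entity1": "Test", "chunk1": "CT", "entity2": "Symptom",
     "chunk2": "cognitive decline", "confidence": 0.99},
]

def select_relations(candidates, allowed_pairs, threshold):
    """Keep candidates with an allowed entity-type pair and a score >= threshold."""
    # Assumption: pairs are matched regardless of which entity comes first.
    allowed = {frozenset(pair) for pair in allowed_pairs}
    return [
        c for c in candidates
        if frozenset((c["entity1"], c["entity2"])) in allowed
        and c["confidence"] >= threshold
    ]

kept = select_relations(
    candidates, [("Test", "Date"), ("Symptom", "Date")], threshold=0.9
)
for relation in kept:
    print(relation["chunk1"], "->", relation["chunk2"], relation["confidence"])
```

Raising the threshold trades recall for precision: fewer relations are returned, but those that remain have higher scores.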

4. Drug Normalizer:

Standardize units of drugs and handle abbreviations in raw text or drug chunks identified by any NER model. This normalization significantly improves performance of entity resolvers.

How to use:

drug_normalizer = DrugNormalizer()\
    .setInputCols("document")\
    .setOutputCol("document_normalized")\
    .setPolicy("all") #all/abbreviations/dosages

Examples:

drug_normalizer.transform("adalimumab 54.5 + 43.2 gm")

>>> "adalimumab 97700 mg"

Changes: combine 54.5 + 43.2 and normalize gm to mg

drug_normalizer.transform("Agnogenic one half cup")

>>> "Agnogenic 0.5 oral solution"

Changes: replace one half with 0.5 and normalize cup to oral solution

drug_normalizer.transform("interferon alfa-2b 10 million unit ( 1 ml ) injec")

>>> "interferon alfa - 2b 10000000 unt ( 1 ml ) injection "

Changes: convert 10 million unit to 10000000 unt and replace injec with injection

For more details: Check out this notebook
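The arithmetic behind the first example (54.5 + 43.2 gm becoming 97700 mg) can be reproduced with a few lines of plain Python. This is a minimal sketch of that single rule, not the DrugNormalizer implementation; the `normalize_grams` helper and its regular expression are hypothetical. `Decimal` is used so the sum is exact.

```python
import re
from decimal import Decimal

# Matches "<number> + <number> gm", as in the first example above.
SUM_IN_GRAMS = re.compile(r"(\d+(?:\.\d+)?)\s*\+\s*(\d+(?:\.\d+)?)\s*gm\b")

def normalize_grams(text):
    """Add the two gram amounts and rewrite the total in milligrams."""
    def to_mg(match):
        total_mg = (Decimal(match.group(1)) + Decimal(match.group(2))) * 1000
        if total_mg == total_mg.to_integral_value():
            return f"{int(total_mg)} mg"
        return f"{total_mg.normalize():f} mg"
    return SUM_IN_GRAMS.sub(to_mg, text)

print(normalize_grams("adalimumab 54.5 + 43.2 gm"))  # adalimumab 97700 mg
```

Text that does not match the pattern is returned unchanged.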

5. Assertion models to support confidence in output:

Just like NER output, assertion models now also provide confidence scores for each prediction.

| chunks | entities | assertion | confidence |
|---|---|---|---|
| a headache | PROBLEM | present | 0.9992 |
| anxious | PROBLEM | conditional | 0.9039 |
| alopecia | PROBLEM | absent | 0.9992 |
| pain | PROBLEM | absent | 0.9238 |

The .setClasses() method is deprecated in AssertionDLApproach; users no longer need to specify the number of classes while training, as it is inferred from the dataset.

6. New Relation Extraction Models:

We are also releasing new relation extraction models to link clinical entities to body parts and dates. These models are trained using a binary relation extraction approach for better accuracy.

- re_bodypart_direction : Relation Extraction between Body Part and Direction entities.

Example:

Text: “MRI demonstrated infarction in the upper brain stem , left cerebellum and right basil ganglia”

| relations | entity1 | chunk1 | entity2 | chunk2 | confidence |
|---|---|---|---|---|---|
| 1 | Direction | upper | bodyPart | brain stem | 0.999 |
| 0 | Direction | upper | bodyPart | cerebellum | 0.999 |
| 0 | Direction | upper | bodyPart | basil ganglia | 0.999 |
| 0 | bodyPart | brain stem | Direction | left | 0.999 |
| 0 | bodyPart | brain stem | Direction | right | 0.999 |
| 1 | Direction | left | bodyPart | cerebellum | 1.0 |
| 0 | Direction | left | bodyPart | basil ganglia | 0.976 |
| 0 | bodyPart | cerebellum | Direction | right | 0.953 |
| 1 | Direction | right | bodyPart | basil ganglia | 1.0 |

- re_bodypart_problem : Relation Extraction between Body Part and Problem entities.

Example:

Text: “No neurologic deficits other than some numbness in his left hand.”

| relation | entity1 | chunk1 | entity2 | chunk2 | confidence |
|---|---|---|---|---|---|
| 0 | Symptom | neurologic deficits | bodyPart | hand | 1 |
| 1 | Symptom | numbness | bodyPart | hand | 1 |

- re_bodypart_proceduretest : Relation Extraction between Body Part and Procedure, Test entities.

Example:

Text: “TECHNIQUE IN DETAIL: After informed consent was obtained from the patient and his mother, the chest was scanned with portable ultrasound.”

| relation | entity1 | chunk1 | entity2 | chunk2 | confidence |
|---|---|---|---|---|---|
| 1 | bodyPart | chest | Test | portable ultrasound | 0.999 |

- re_date_clinical : Relation Extraction between Date and different clinical entities.

Example:

Text: “This 73 y/o patient had CT on 1/12/95, with progressive memory and cognitive decline since 8/11/94.”

| relations | entity1 | chunk1 | entity2 | chunk2 | confidence |
|---|---|---|---|---|---|
| 1 | Test | CT | Date | 1/12/95 | 1.0 |
| 1 | Symptom | progressive memory and cognitive decline | Date | 8/11/94 | 1.0 |

How to use:

re_model = RelationExtractionModel()\
    .pretrained("re_bodypart_direction","en","clinical/models")\
    .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
    .setOutputCol("relations")\
    .setMaxSyntacticDistance(4)\
    .setRelationPairs(['Internal_organ_or_component', 'Direction'])

For more details: Check out the notebook or modelshub.

New matching scheme for entity resolvers - improved accuracy: Adding the option to use cosine similarity to resolve entities and find closest matches, resulting in better, more semantically correct results.
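As a reminder of what cosine-based matching does, here is a minimal pure-Python sketch that ranks candidate codes by cosine distance to a query embedding. The three-dimensional vectors, the `CODE_*` labels and the `resolve` helper are made up for illustration; real resolvers work on sentence or chunk embeddings with far more dimensions, and this is not the library's implementation.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; 0.0 means the vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def resolve(query, candidates, top_k=2):
    """Return the top_k (code, distance) pairs, closest first."""
    ranked = sorted(candidates, key=lambda item: cosine_distance(query, item[1]))
    return [(code, round(cosine_distance(query, vec), 4)) for code, vec in ranked[:top_k]]

# Hypothetical code embeddings.
candidates = [
    ("CODE_A", [1.0, 0.0, 0.0]),
    ("CODE_B", [0.9, 0.1, 0.0]),
    ("CODE_C", [0.0, 1.0, 0.0]),
]
best = resolve([1.0, 0.0, 0.0], candidates)
print(best)
```

Because cosine distance ignores vector length and compares direction only, two embeddings that differ mainly in magnitude are still treated as close.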

7. New Resolver Models using JSL SBERT:

  • sbiobertresolve_icd10cm_augmented

  • sbiobertresolve_cpt_augmented

  • sbiobertresolve_cpt_procedures_augmented

  • sbiobertresolve_icd10cm_augmented_billable_hcc

  • sbiobertresolve_hcc_augmented

Returning auxiliary columns mapped to resolutions: the chunk entity resolver and the sentence entity resolver now return auxiliary data that was mapped to the resolutions during training. This allows users to get multiple resolutions with a single model, without adding any other annotator to the pipeline (previously, getting billable codes required other modules in the same pipeline).

Example:

sbiobertresolve_icd10cm_augmented_billable_hcc
Input Text: “bladder cancer”

| idx | chunks | code | resolutions | all_codes | billable | hcc_status | hcc_score | all_distances |
|---|---|---|---|---|---|---|---|---|
| 0 | bladder cancer | C679 | ['bladder cancer', 'suspected bladder cancer', 'cancer in situ of urinary bladder', 'tumor of bladder neck', 'malignant tumour of bladder neck'] | ['C679', 'Z126', 'D090', 'D494', 'C7911'] | ['1', '1', '1', '1', '1'] | ['1', '0', '0', '0', '1'] | ['11', '0', '0', '0', '8'] | ['0.0000', '0.0904', '0.0978', '0.1080', '0.1281'] |
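Because the auxiliary columns are parallel lists, they can be zipped back into one record per candidate. The sketch below does this in plain Python with the values from the example row; it assumes that '1' in `billable` and `hcc_status` means "yes", which the example suggests but does not state.

```python
# Values copied from the example row above.
row = {
    "resolutions": [
        "bladder cancer",
        "suspected bladder cancer",
        "cancer in situ of urinary bladder",
        "tumor of bladder neck",
        "malignant tumour of bladder neck",
    ],
    "all_codes": ["C679", "Z126", "D090", "D494", "C7911"],
    "billable": ["1", "1", "1", "1", "1"],
    "hcc_status": ["1", "0", "0", "0", "1"],
    "hcc_score": ["11", "0", "0", "0", "8"],
}

# One dict per candidate resolution, built by aligning the parallel lists.
records = [dict(zip(row.keys(), values)) for values in zip(*row.values())]

# Assumption: '1' means the code is billable / has an HCC status.
hcc_codes = [
    r["all_codes"] for r in records
    if r["billable"] == "1" and r["hcc_status"] == "1"
]
print(hcc_codes)  # ['C679', 'C7911']
```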

sbiobertresolve_cpt_augmented
Input Text: “ct abdomen without contrast”

| idx | cpt code | distance | resolutions |
|---|---|---|---|
| 0 | 74150 | 0.0802 | Computed tomography, abdomen; without contrast material |
| 1 | 65091 | 0.1312 | Evisceration of ocular contents; without implant |
| 2 | 70450 | 0.1323 | Computed tomography, head or brain; without contrast material |
| 3 | 74176 | 0.1333 | Computed tomography, abdomen and pelvis; without contrast material |
| 4 | 74185 | 0.1343 | Magnetic resonance imaging without contrast |
| 5 | 77059 | 0.1343 | Magnetic resonance imaging without contrast |

8. New Pretrained Clinical NER Models

  • NER Radiology

Input Text: “Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.”

| idx | chunks | entities |
|---|---|---|
| 0 | Bilateral | Direction |
| 1 | breast | BodyPart |
| 2 | ultrasound | ImagingTest |
| 3 | ovoid mass | ImagingFindings |
| 4 | 0.5 x 0.5 x 0.4 | Measurements |
| 5 | cm | Units |
| 6 | anteromedial aspect | Direction |
| 7 | left | Direction |
| 8 | shoulder | BodyPart |
| 9 | mass | ImagingFindings |
| 10 | isoechoic echotexture | ImagingFindings |
| 11 | muscle | BodyPart |
| 12 | internal color flow | ImagingFindings |
| 13 | benign fibrous tissue | ImagingFindings |
| 14 | lipoma | Disease_Syndrome_Disorder |
