Healthcare NLP v2.7.2 Release Notes

2.7.2

We are glad to announce that Spark NLP for Healthcare 2.7.2 has been released !

In this release, we introduce the following features:

Far better accuracy for resolving medication terms to RxNorm codes:

ondansetron 8 mg tablet' -> '312086
Far better accuracy for resolving diagnosis terms to ICD-10-CM codes:

TIA -> transient ischemic attack (disorder) ‘S0690’

New ability to map medications to pharmacological actions (PA):

'metformin' -> ‘Hypoglycemic Agents’
2 new greedy named entity recognition models for medication details:

ner_drugs_greedy: ‘magnesium hydroxide 100mg/1ml PO’

` ner_posology _greedy: ‘12 units of insulin lispro’ `

New model to classify the gender of a patient in a given medical note:

'58yo patient with a family history of breast cancer' -> ‘female’

And starting customized spark sessions with rich parameters

        params = {"spark.driver.memory":"32G",
        "spark.kryoserializer.buffer.max":"2000M",
        "spark.driver.maxResultSize":"2000M"}

        spark = sparknlp_jsl.start(secret, params=params)

State-of-the-art accuracy is achieved using new healthcare-tuned BERT Sentence Embeddings (s-Bert). The following sections include more details, metrics, and examples.

Named Entity Recognizers for Medications

A new medication NER (ner_drugs_greedy) that joins the drug entities with neighboring entities such as dosage, route, form and strength; and returns a single entity drug. This greedy NER model would be highly useful if you want to extract a drug with its context and then use it to get a RxNorm code (drugs may get different RxNorm codes based on the dosage and strength information).

Metrics

label	tp	fp	fn	prec	rec	f1
I-DRUG	37423	4179	3773	0.899	0.908	0.904
B-DRUG	29699	2090	1983	0.934	0.937	0.936

A new medication NER (ner_posology_greedy) that joins the drug entities with neighboring entities such as dosage, route, form and strength. It also returns all the other medication entities even if not related to (or joined with) a drug.

Now we have five different medication-related NER models. You can see the outputs from each model below:

Text = ‘‘The patient was prescribed 1 capsule of Advil 10 mg for 5 days and magnesium hydroxide 100mg/1ml suspension PO. He was seen by the endocrinology service and she was discharged on 40 units of insulin glargine at night, 12 units of insulin lispro with meals, and metformin 1000 mg two times a day.’’

a. ner_drugs_greedy

	chunks	begin	end	entities
0	1 capsule of Advil 10 mg	27	50	DRUG
1	magnesium hydroxide 100mg/1ml PO	67	98	DRUG
2	40 units of insulin glargine	168	195	DRUG
3	12 units of insulin lispro	207	232	DRUG

b. ner_posology_greedy

	chunks	begin	end	entities
0	1 capsule of Advil 10 mg	27	50	DRUG
1	magnesium hydroxide 100mg/1ml PO	67	98	DRUG
2	for 5 days	52	61	DURATION
3	40 units of insulin glargine	168	195	DRUG
4	at night	197	204	FREQUENCY
5	12 units of insulin lispro	207	232	DRUG
6	with meals	234	243	FREQUENCY
7	metformin 1000 mg	250	266	DRUG
8	two times a day	268	282	FREQUENCY

c. ner_drugs

	chunks	begin	end	entities
0	Advil	40	44	DrugChem
1	magnesium hydroxide	67	85	DrugChem
2	metformin	261	269	DrugChem

d.ner_posology

	chunks	begin	end	entities
0	1	27	27	DOSAGE
1	capsule	29	35	FORM
2	Advil	40	44	DRUG
3	10 mg	46	50	STRENGTH
4	for 5 days	52	61	DURATION
5	magnesium hydroxide	67	85	DRUG
6	100mg/1ml	87	95	STRENGTH
7	PO	97	98	ROUTE
8	40 units	168	175	DOSAGE
9	insulin glargine	180	195	DRUG
10	at night	197	204	FREQUENCY
11	12 units	207	214	DOSAGE
12	insulin lispro	219	232	DRUG
13	with meals	234	243	FREQUENCY
14	metformin	250	258	DRUG
15	1000 mg	260	266	STRENGTH
16	two times a day	268	282	FREQUENCY

e. ner_drugs_large

	chunks	begin	end	entities
0	Advil 10 mg	40	50	DRUG
1	magnesium hydroxide 100mg/1ml PO.	67	99	DRUG
2	insulin glargine	180	195	DRUG
3	insulin lispro	219	232	DRUG
4	metformin 1000 mg	250	266	DRUG

Patient Gender Classification

This model detects the gender of the patient in the clinical document. It can classify the documents into Female, Male and Unknown.

We release two models:

‘Classifierdl_gender_sbert’ (more accurate, works with licensed sbiobert_base_cased_mli)
‘Classifierdl_gender_biobert’ (works with biobert_pubmed_base_cased)

The models are trained on more than four thousands clinical documents (radiology reports, pathology reports, clinical visits etc.), annotated internally.

Metrics `(Classifierdl_gender_sbert)`

	precision	recall	f1-score	support
Female	0.9224	0.8954	0.9087	239
Male	0.7895	0.8468	0.8171	124

Text= ‘‘social history: shows that does not smoke cigarettes or drink alcohol, lives in a nursing home. family history: shows a family history of breast cancer.’’

gender_classifier.annotate(text)['class'][0]
>> `Female`

See this Colab notebook for further details.

a. classifierdl_gender_sbert

document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sbert_embedder = BertSentenceEmbeddings\
    .pretrained("sbiobert_base_cased_mli", 'en', 'clinical/models')\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")\
    .setMaxSentenceLength(512)

gender_classifier = ClassifierDLModel\
    .pretrained('classifierdl_gender_sbert', 'en', 'clinical/models') \
    .setInputCols(["document", "sentence_embeddings"]) \
    .setOutputCol("class")

gender_pred_pipeline = Pipeline(
    stages = [
       document,
       sbert_embedder,
       gender_classifier
            ])

b. classifierdl_gender_biobert

documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

clf_tokenizer = Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")\

biobert_embeddings = BertEmbeddings().pretrained('biobert_pubmed_base_cased') \
    .setInputCols(["document",'token'])\
    .setOutputCol("bert_embeddings")

biobert_embeddings_avg = SentenceEmbeddings() \
    .setInputCols(["document", "bert_embeddings"]) \
    .setOutputCol("sentence_bert_embeddings") \
    .setPoolingStrategy("AVERAGE")

genderClassifier = ClassifierDLModel.pretrained('classifierdl_gender_biobert', 'en', 'clinical/models') \
    .setInputCols(["document", "sentence_bert_embeddings"]) \
    .setOutputCol("gender")

gender_pred_pipeline = Pipeline(
   stages = [
       documentAssembler,
       clf_tokenizer,
       biobert_embeddings,
       biobert_embeddings_avg,
       genderClassifier
   ])

New ICD10CM and RxCUI resolvers powered by s-Bert embeddings

The advent of s-Bert sentence embeddings changed the landscape of Clinical Entity Resolvers completely in Spark NLP. Since s-Bert is already tuned on MedNLI (medical natural language inference) dataset, it is now capable of populating the chunk embeddings in a more precise way than before.

We now release two new resolvers:

sbiobertresolve_icd10cm_augmented (augmented with synonyms, four times richer than previous resolver accuracy:

73% for top-1 (exact match), 89% for top-5 (previous accuracy was 59% and 64% respectively)
sbiobertresolve_rxcui (extract RxNorm concept unique identifiers to map with ATC or durg families) accuracy:

71% for top-1 (exact match), 72% for top-5 (previous accuracy was 22% and 41% respectively)

a. ICD10CM augmented resolver

Text = “This is an 82 year old male with a history of prior tobacco use , hypertension , chronic renal insufficiency , COPD , gastritis , and TIA who initially presented to Braintree with a non-ST elevation MI and Guaiac positive stools , transferred to St . Margaret's Center for Women & Infants for cardiac catheterization with PTCA to mid LAD lesion complicated by hypotension and bradycardia requiring Atropine , IV fluids and transient dopamine possibly secondary to vagal reaction , subsequently transferred to CCU for close monitoring , hemodynamically stable at the time of admission to the CCU . “

	chunk	begin	end	code	term
0	hypertension	66	77	I10	hypertension
1	chronic renal insufficiency	81	107	N189	chronic renal insufficiency
2	COPD	111	114	J449	copd - chronic obstructive pulmonary disease
3	gastritis	118	126	K2970	gastritis
4	TIA	134	136	S0690	transient ischemic attack (disorder)
5	a non-ST elevation MI	180	200	I219	silent myocardial infarction (disorder)
6	Guaiac positive stools	206	227	K921	guaiac-positive stools
7	mid LAD lesion	330	343	I2102	stemi involving left anterior descending coronary artery
8	hypotension	360	370	I959	hypotension
9	bradycardia	376	386	O9941	bradycardia

b. RxCUI resolver

Text= “He was seen by the endocrinology service and she was discharged on 50 mg of eltrombopag oral at night, 5 mg amlodipine with meals, and metformin 1000 mg two times a day . “

	chunk	begin	end	code	term
0	50 mg of eltrombopag oral	67	91	825427	eltrombopag 50 MG Oral Tablet
1	5 mg amlodipine	103	117	197361	amlodipine 5 MG Oral Tablet
2	metformin 1000 mg	135	151	861004	metformin hydrochloride 1000 MG Oral Tablet

Using this new resolver and some other resources like Snomed Resolver, RxTerm, MESHPA and ATC dictionary, you can link the drugs to the pharmacological actions (PA), ingredients and the disease treated with that.

Code sample:

(after getting the chunk from ChunkConverter)

c2doc = Chunk2Doc().setInputCols("ner_chunk").setOutputCol("ner_chunk_doc")

sbert_embedder = BertSentenceEmbeddings\
    .pretrained("sbiobert_base_cased_mli",'en','clinical/models')\
    .setInputCols(["ner_chunk_doc"])\
    .setOutputCol("sbert_embeddings")

icd10_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_augmented","en", "clinical/models") \
    .setInputCols(["sbert_embeddings"]) \
    .setOutputCol("icd10cm_code")\
    .setDistanceFunction("EUCLIDEAN")

See the notebook for details.

Versions

Version
Version
Version

PREVIOUSVersion Compatibility