Detect Problems, Tests and Treatments

Description

Pretrained named entity recognition (NER) deep learning model for healthcare text. It extracts Problem, Test, and Treatment entities. The Spark NLP deep learning model (NerDL) is inspired by a former state-of-the-art model for NER: Chiu & Nichols, "Named Entity Recognition with Bidirectional LSTM-CNNs".

Predicted Entities

PROBLEM, TEST, TREATMENT.


How to use

...
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_healthcare", "en", "clinical/models")\
  .setInputCols(["sentence", "token"])\
  .setOutputCol("embeddings")

clinical_ner = NerDLModel.pretrained("ner_healthcare", "en", "clinical/models") \
  .setInputCols(["sentence", "token", "embeddings"]) \
  .setOutputCol("ner")
...
nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter])

data = spark.createDataFrame([["A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG ."]]).toDF("text")

model = nlpPipeline.fit(data)

results = model.transform(data)
...
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_healthcare", "en", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val ner = NerDLModel.pretrained("ner_healthcare", "en", "clinical/models") 
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")
...
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter))

val data = Seq("A 28-year-old female with a history of gestational diabetes mellitus diagnosed eight years prior to presentation and subsequent type two diabetes mellitus ( T2DM ), one prior episode of HTG-induced pancreatitis three years prior to presentation , associated with an acute hepatitis , and obesity with a body mass index ( BMI ) of 33.5 kg/m2 , presented with a one-week history of polyuria , polydipsia , poor appetite , and vomiting . Two weeks prior to presentation , she was treated with a five-day course of amoxicillin for a respiratory tract infection . She was on metformin , glipizide , and dapagliflozin for T2DM and atorvastatin and gemfibrozil for HTG .").toDF("text")

val result = pipeline.fit(data).transform(data)
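The `ner` column produced by `NerDLModel` holds IOB-style tags (`B-PROBLEM`, `I-PROBLEM`, `O`, …); the `ner_converter` stage then merges consecutive B-/I- tags into the chunks shown in the results below. A minimal plain-Python sketch of that merging logic, with hypothetical token/tag pairs (no Spark required):

```python
# Illustration of the IOB-to-chunk merging that NerConverter performs.
# The token/tag pairs below are hypothetical examples, not model output.
def merge_iob(tokens, tags):
    """Merge B-/I- IOB tags into (chunk, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks

tokens = ["She", "was", "on", "metformin", "for", "T2DM"]
tags = ["O", "O", "O", "B-TREATMENT", "O", "B-PROBLEM"]
print(merge_iob(tokens, tags))  # [('metformin', 'TREATMENT'), ('T2DM', 'PROBLEM')]
```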


Results

|   | chunk                         | ner_label |
|---|-------------------------------|-----------|
| 0 | a respiratory tract infection | PROBLEM   |
| 1 | metformin                     | TREATMENT |
| 2 | glipizide                     | TREATMENT |
| 3 | dapagliflozin                 | TREATMENT |
| 4 | T2DM                          | PROBLEM   |
| 5 | atorvastatin                  | TREATMENT |
| 6 | gemfibrozil                   | TREATMENT |


Model Information

Model Name: ner_healthcare
Compatibility: Spark NLP for Healthcare 3.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en

Data Source

Trained on 2010 i2b2 challenge data. https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/

Benchmarking

| label         | tp    | fp   | fn   | prec     | rec      | f1       |
|---------------|-------|------|------|----------|----------|----------|
| I-TREATMENT   | 6625  | 1187 | 1329 | 0.848054 | 0.832914 | 0.840416 |
| I-PROBLEM     | 15142 | 1976 | 2542 | 0.884566 | 0.856254 | 0.87018  |
| B-PROBLEM     | 11005 | 1065 | 1587 | 0.911765 | 0.873968 | 0.892466 |
| I-TEST        | 6748  | 923  | 1264 | 0.879677 | 0.842237 | 0.86055  |
| B-TEST        | 8196  | 942  | 1029 | 0.896914 | 0.888455 | 0.892665 |
| B-TREATMENT   | 8271  | 1265 | 1073 | 0.867345 | 0.885167 | 0.876165 |
| Macro-average | 55987 | 7358 | 8824 | 0.881387 | 0.863166 | 0.872181 |
| Micro-average | 55987 | 7358 | 8824 | 0.883842 | 0.86385  | 0.873732 |
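The averages can be recomputed from the per-label tp/fp/fn counts. A quick plain-Python sanity check: micro-averages pool the counts before computing the metrics, while macro-averages take the mean of the per-label precisions and recalls (the table's macro F1 appears to be the harmonic mean of those two averages):

```python
# Recompute the benchmark averages from the per-label counts (tp, fp, fn).
counts = {
    "I-TREATMENT": (6625, 1187, 1329),
    "I-PROBLEM": (15142, 1976, 2542),
    "B-PROBLEM": (11005, 1065, 1587),
    "I-TEST": (6748, 923, 1264),
    "B-TEST": (8196, 942, 1029),
    "B-TREATMENT": (8271, 1265, 1073),
}

def prf(tp, fp, fn):
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return prec, rec, 2 * prec * rec / (prec + rec)

# Micro-average: pool the counts across labels, then compute the metrics once.
tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
micro_prec, micro_rec, micro_f1 = prf(tp, fp, fn)

# Macro-average: mean of the per-label precisions and recalls; F1 is taken
# as the harmonic mean of those two averages, matching the table.
per_label = [prf(*c) for c in counts.values()]
macro_prec = sum(p for p, _, _ in per_label) / len(per_label)
macro_rec = sum(r for _, r, _ in per_label) / len(per_label)
macro_f1 = 2 * macro_prec * macro_rec / (macro_prec + macro_rec)

print(round(micro_f1, 4))  # 0.8737
print(round(macro_f1, 4))  # 0.8722
```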