Detect Risk Factors (LangTest)

Description

Pretrained deep learning model for named entity recognition of Heart Disease Risk Factors and Personal Health Information (PHI). It is a version of the ner_risk_factors model augmented with the langtest library.

test_type         before_fail_count  after_fail_count  before_pass_count  after_pass_count  minimum_pass_rate  before_pass_rate  after_pass_rate
add_ocr_typo      376                255               5287               5408              95%                93%               95%
lowercase         616                403               5957               6170              92%                91%               94%
titlecase         603                360               6083               6326              94%                91%               95%
uppercase         1054               587               5715               6182              90%                84%               91%
weighted average  2649               1605              23042              24086             93%                89.69%            93.75%
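
The pass rates in the table can be recomputed from the fail and pass counts. The sketch below is only an illustration: the `pass_rate` helper and the `counts` dictionary are names introduced here, not part of langtest, and it assumes the "weighted average" row is the pooled rate over all four test types (total passes divided by total cases), which agrees with the reported figures.

```python
# Counts copied from the table above, per test type:
# (before_fail, after_fail, before_pass, after_pass)
counts = {
    "add_ocr_typo": (376, 255, 5287, 5408),
    "lowercase": (616, 403, 5957, 6170),
    "titlecase": (603, 360, 6083, 6326),
    "uppercase": (1054, 587, 5715, 6182),
}


def pass_rate(passed, failed):
    """Return the share of test cases that passed, as a percentage."""
    return 100.0 * passed / (passed + failed)


# Per-test pass rates before and after augmentation.
rates = {
    name: (pass_rate(bp, bf), pass_rate(ap, af))
    for name, (bf, af, bp, ap) in counts.items()
}

# Pooled rate: sum the counts first, then divide (assumed meaning of
# the "weighted average" row).
total_bf = sum(c[0] for c in counts.values())
total_af = sum(c[1] for c in counts.values())
total_bp = sum(c[2] for c in counts.values())
total_ap = sum(c[3] for c in counts.values())
overall_before = pass_rate(total_bp, total_bf)
overall_after = pass_rate(total_ap, total_af)

for name, (before, after) in rates.items():
    print(f"{name:<14}{before:6.2f}% -> {after:6.2f}%")
print(f"{'pooled':<14}{overall_before:6.2f}% -> {overall_after:6.2f}%")
```

Rounded to two decimals, the pooled values come out at 89.69% and 93.75%, the same as the last row of the table.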

Predicted Entities

CAD, DIABETES, FAMILY_HIST, HYPERLIPIDEMIA, HYPERTENSION, MEDICATION, OBESE, PHI, SMOKER

How to use

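# Python
# Assumes an active Spark session named `spark` with Spark NLP for Healthcare
# loaded (for example, one created with sparknlp_jsl.start(...)).
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, WordEmbeddingsModel, NerConverter
from sparknlp_jsl.annotator import MedicalNerModel

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
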
sentence_detector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_risk_factors_langtest", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter])

model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

result = model.transform(spark.createDataFrame([["""PAST SURGICAL HISTORY: She has also had a hysterectomy, salpingoophorectomy, appendectomy, tonsillectomy, two carpal tunnel releases. She also has had a parathyroidectomy but still has had some borderline elevated calcium. Also, hypertension, hyperlipidemia, as well as diabetes. She also has osteoporosis.

SOCIAL HISTORY: The patient still smokes about a third of a pack a day, also drinks only occasional alcoholic drinks. The patient is married. She has three grown sons, all of which are very successful in professional positions. One son is a gastroenterologist in San Diego, California.

MEDICATIONS: Nifedipine-XR 90 mg daily, furosemide 20 mg half tablet b.i.d., lisinopril 20 mg daily, gemfibrozil 600 mg b.i.d., Synthroid 0.1 mg daily, Miacalcin one spray in alternate nostrils daily, Ogen 0.625 mg daily, Daypro 600 mg t.i.d., also Lortab 7.5 two or three a day, also Flexeril occasionally, also other vitamin.

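ALLERGIES: She had some adverse reactions to penicillin, sulfa, perhaps contrast medium, and some mycins."""]], ["text"]))

// Scala
// Assumes an active SparkSession named `spark` with Spark NLP for Healthcare on the classpath.
import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.annotators.ner.MedicalNerModel
import spark.implicits._ // needed for .toDS() below

val document_assembler = new DocumentAssembler()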
    .setInputCol("text")
    .setOutputCol("document")
         
val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner = MedicalNerModel.pretrained("ner_risk_factors_langtest", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter))

val data = Seq("""PAST SURGICAL HISTORY: She has also had a hysterectomy, salpingoophorectomy, appendectomy, tonsillectomy, two carpal tunnel releases. She also has had a parathyroidectomy but still has had some borderline elevated calcium. Also, hypertension, hyperlipidemia, as well as diabetes. She also has osteoporosis.

SOCIAL HISTORY: The patient still smokes about a third of a pack a day, also drinks only occasional alcoholic drinks. The patient is married. She has three grown sons, all of which are very successful in professional positions. One son is a gastroenterologist in San Diego, California.

MEDICATIONS: Nifedipine-XR 90 mg daily, furosemide 20 mg half tablet b.i.d., lisinopril 20 mg daily, gemfibrozil 600 mg b.i.d., Synthroid 0.1 mg daily, Miacalcin one spray in alternate nostrils daily, Ogen 0.625 mg daily, Daypro 600 mg t.i.d., also Lortab 7.5 two or three a day, also Flexeril occasionally, also other vitamin.

ALLERGIES: She had some adverse reactions to penicillin, sulfa, perhaps contrast medium, and some mycins.""").toDS().toDF("text")

val result = pipeline.fit(data).transform(data)

Results

+------------------------------------------+--------------+
|chunk                                     |ner_label     |
+------------------------------------------+--------------+
|hypertension                              |HYPERTENSION  |
|hyperlipidemia                            |HYPERLIPIDEMIA|
|diabetes                                  |DIABETES      |
|still smokes about a third of a pack a day|SMOKER        |
|San Diego                                 |PHI           |
|California                                |PHI           |
|lisinopril                                |MEDICATION    |
|gemfibrozil                               |MEDICATION    |
+------------------------------------------+--------------+
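
The chunks in this table come from the NerConverter stage, which groups token-level B-/I- tags (the label scheme used in the Benchmarking section) into entity spans. The pure-Python sketch below only illustrates that grouping idea. It is not the library's implementation, `iob_to_chunks` is a hypothetical helper, and the token tags in the example are assumed for illustration rather than taken from real model output.

```python
def iob_to_chunks(tokens, tags):
    """Group tokens tagged B-X / I-X / O into (chunk_text, label) pairs."""
    chunks = []
    current_tokens, current_label = [], None

    def flush():
        # Close the chunk that is currently open, if any.
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A B- tag, or an I- tag that does not continue the open
            # chunk, starts a new chunk.
            flush()
            current_tokens, current_label = [token], label
        elif prefix == "I":
            current_tokens.append(token)
        else:
            # "O": the token is outside any entity.
            flush()
            current_tokens, current_label = [], None
    flush()
    return chunks


# Hypothetical tags for a fragment of the sample text.
tokens = ["gastroenterologist", "in", "San", "Diego", ",", "California", "."]
tags = ["O", "O", "B-PHI", "I-PHI", "O", "B-PHI", "O"]
print(iob_to_chunks(tokens, tags))
# [('San Diego', 'PHI'), ('California', 'PHI')]
```

Joining tokens with a single space is a simplification; the real converter works from character offsets, so it preserves the original spacing of the text.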

Model Information

Model Name: ner_risk_factors_langtest
Compatibility: Healthcare NLP 5.1.1+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Size: 14.7 MB

References

Trained on plain n2c2 2014: De-identification and Heart Disease Risk Factors Challenge datasets with embeddings_clinical. https://portal.dbmi.hms.harvard.edu/projects/n2c2-2014/

Benchmarking

label             precision  recall  f1-score  support 
B-CAD             0.55       0.48    0.51      183     
B-DIABETES        0.74       0.80    0.77      246     
B-FAMILY_HIST     0.00       0.00    0.00      3       
B-HYPERLIPIDEMIA  0.85       0.89    0.87      95      
B-HYPERTENSION    0.73       0.81    0.77      233     
B-MEDICATION      0.84       0.85    0.84      894     
B-OBESE           0.69       0.82    0.75      49      
B-PHI             0.81       0.80    0.81      2106    
B-SMOKER          0.66       0.19    0.30      118     
I-CAD             0.38       0.36    0.37      476     
I-DIABETES        0.59       0.59    0.59      214     
I-FAMILY_HIST     0.83       0.29    0.43      17      
I-HYPERLIPIDEMIA  0.55       0.35    0.43      17      
I-HYPERTENSION    0.24       0.31    0.27      108     
I-MEDICATION      0.73       0.20    0.31      347     
I-OBESE           0.23       0.43    0.30      7       
I-PHI             0.79       0.72    0.75      618     
I-SMOKER          0.74       0.30    0.43      220     
micro-avg         0.73       0.67    0.70      5951    
macro-avg         0.61       0.51    0.53      5951    
weighted-avg      0.73       0.67    0.69      5951
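
The summary rows can be sanity-checked from the per-label rows. The sketch below is an illustration under stated assumptions: the values are the rounded figures copied from the table, the helper name `f1_from` is introduced here, and the macro and weighted averages are assumed to follow the usual definitions (unweighted mean over labels and support-weighted mean). Because the inputs are rounded to two decimals, a recomputed value can differ from the reported one in the last digit.

```python
# (label, precision, recall, f1, support) copied from the table above.
rows = [
    ("B-CAD", 0.55, 0.48, 0.51, 183),
    ("B-DIABETES", 0.74, 0.80, 0.77, 246),
    ("B-FAMILY_HIST", 0.00, 0.00, 0.00, 3),
    ("B-HYPERLIPIDEMIA", 0.85, 0.89, 0.87, 95),
    ("B-HYPERTENSION", 0.73, 0.81, 0.77, 233),
    ("B-MEDICATION", 0.84, 0.85, 0.84, 894),
    ("B-OBESE", 0.69, 0.82, 0.75, 49),
    ("B-PHI", 0.81, 0.80, 0.81, 2106),
    ("B-SMOKER", 0.66, 0.19, 0.30, 118),
    ("I-CAD", 0.38, 0.36, 0.37, 476),
    ("I-DIABETES", 0.59, 0.59, 0.59, 214),
    ("I-FAMILY_HIST", 0.83, 0.29, 0.43, 17),
    ("I-HYPERLIPIDEMIA", 0.55, 0.35, 0.43, 17),
    ("I-HYPERTENSION", 0.24, 0.31, 0.27, 108),
    ("I-MEDICATION", 0.73, 0.20, 0.31, 347),
    ("I-OBESE", 0.23, 0.43, 0.30, 7),
    ("I-PHI", 0.79, 0.72, 0.75, 618),
    ("I-SMOKER", 0.74, 0.30, 0.43, 220),
]


def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


total_support = sum(support for *_, support in rows)
# Macro average: every label counts equally, regardless of support.
macro_f1 = sum(f1 for _, _, _, f1, _ in rows) / len(rows)
# Weighted average: each label's F1 is weighted by its support.
weighted_f1 = sum(f1 * support for _, _, _, f1, support in rows) / total_support

print(f"total support: {total_support}")   # 5951
print(f"macro F1:      {macro_f1:.2f}")    # 0.53
print(f"weighted F1:   {weighted_f1:.2f}")  # 0.69
print(f"B-CAD F1 from P/R: {f1_from(0.55, 0.48):.2f}")  # 0.51
```

The gap between the macro (0.53) and weighted (0.69) F1 reflects the weak scores on low-support labels such as B-FAMILY_HIST and I-OBESE, which count fully in the macro average but barely affect the weighted one.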