Detect Risk Factors (LangTest)

Description

Pretrained named entity recognition deep learning model for Heart Disease Risk Factors and Personal Health Information. It is the version of ner_risk_factors model augmented with langtest library.

test_type	before fail_count	after fail_count	before pass_count	after pass_count	minimum pass_rate	before pass_rate	after pass_rate
add_ocr_typo	376	255	5287	5408	95%	93%	95%
lowercase	616	403	5957	6170	92%	91%	94%
titlecase	603	360	6083	6326	94%	91%	95%
uppercase	1054	587	5715	6182	90%	84%	91%
weighted average	2649	1605	23042	24086	93%	89.69%	93.75%

Predicted Entities

CAD, DIABETES, FAMILY_HIST, HYPERLIPIDEMIA, HYPERTENSION, MEDICATION, OBESE, PHI, SMOKER

Download Copy S3 URI

How to use

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
         
sentence_detector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_risk_factors_langtest", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter()\
 	.setInputCols(["sentence", "token", "ner"])\
 	.setOutputCol("ner_chunk")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, ner_converter])

model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

result = model.transform(spark.createDataFrame([["""PAST SURGICAL HISTORY: She has also had a hysterectomy, salpingoophorectomy, appendectomy, tonsillectomy, two carpal tunnel releases. She also has had a parathyroidectomy but still has had some borderline elevated calcium. Also, hypertension, hyperlipidemia, as well as diabetes. She also has osteoporosis.

SOCIAL HISTORY: The patient still smokes about a third of a pack a day, also drinks only occasional alcoholic drinks. The patient is married. She has three grown sons, all of which are very successful in professional positions. One son is a gastroenterologist in San Diego, California.

MEDICATIONS: Nifedipine-XR 90 mg daily, furosemide 20 mg half tablet b.i.d., lisinopril 20 mg daily, gemfibrozil 600 mg b.i.d., Synthroid 0.1 mg daily, Miacalcin one spray in alternate nostrils daily, Ogen 0.625 mg daily, Daypro 600 mg t.i.d., also Lortab 7.5 two or three a day, also Flexeril occasionally, also other vitamin.

ALLERGIES: She had some adverse reactions to penicillin, sulfa, perhaps contrast medium, and some mycins."""]], ["text"]))

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
         
val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner = MedicalNerModel.pretrained("ner_risk_factors_langtest", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
 	.setInputCols(Array("sentence", "token", "ner"))
 	.setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, ner_converter))

val data = Seq("""PAST SURGICAL HISTORY: She has also had a hysterectomy, salpingoophorectomy, appendectomy, tonsillectomy, two carpal tunnel releases. She also has had a parathyroidectomy but still has had some borderline elevated calcium. Also, hypertension, hyperlipidemia, as well as diabetes. She also has osteoporosis.

SOCIAL HISTORY: The patient still smokes about a third of a pack a day, also drinks only occasional alcoholic drinks. The patient is married. She has three grown sons, all of which are very successful in professional positions. One son is a gastroenterologist in San Diego, California.

MEDICATIONS: Nifedipine-XR 90 mg daily, furosemide 20 mg half tablet b.i.d., lisinopril 20 mg daily, gemfibrozil 600 mg b.i.d., Synthroid 0.1 mg daily, Miacalcin one spray in alternate nostrils daily, Ogen 0.625 mg daily, Daypro 600 mg t.i.d., also Lortab 7.5 two or three a day, also Flexeril occasionally, also other vitamin.

ALLERGIES: She had some adverse reactions to penicillin, sulfa, perhaps contrast medium, and some mycins.""").toDS().toDF("text")

val result = pipeline.fit(data).transform(data)

Results

+------------------------------------------+--------------+
|chunk                                     |ner_label     |
+------------------------------------------+--------------+
|hypertension                              |HYPERTENSION  |
|hyperlipidemia                            |HYPERLIPIDEMIA|
|diabetes                                  |DIABETES      |
|still smokes about a third of a pack a day|SMOKER        |
|San Diego                                 |PHI           |
|California                                |PHI           |
|lisinopril                                |MEDICATION    |
|gemfibrozil                               |MEDICATION    |
+------------------------------------------+--------------+

Model Information

Model Name:	ner_risk_factors_langtest
Compatibility:	Healthcare NLP 5.1.1+
License:	Licensed
Edition:	Official
Input Labels:	[sentence, token, embeddings]
Output Labels:	[ner]
Language:	en
Size:	14.7 MB

References

Trained on plain n2c2 2014: De-identification and Heart Disease Risk Factors Challenge datasets with embeddings_clinical. https://portal.dbmi.hms.harvard.edu/projects/n2c2-2014/

Benchmarking

label             precision  recall  f1-score  support 
B-CAD             0.55       0.48    0.51      183     
B-DIABETES        0.74       0.80    0.77      246     
B-FAMILY_HIST     0.00       0.00    0.00      3       
B-HYPERLIPIDEMIA  0.85       0.89    0.87      95      
B-HYPERTENSION    0.73       0.81    0.77      233     
B-MEDICATION      0.84       0.85    0.84      894     
B-OBESE           0.69       0.82    0.75      49      
B-PHI             0.81       0.80    0.81      2106    
B-SMOKER          0.66       0.19    0.30      118     
I-CAD             0.38       0.36    0.37      476     
I-DIABETES        0.59       0.59    0.59      214     
I-FAMILY_HIST     0.83       0.29    0.43      17      
I-HYPERLIPIDEMIA  0.55       0.35    0.43      17      
I-HYPERTENSION    0.24       0.31    0.27      108     
I-MEDICATION      0.73       0.20    0.31      347     
I-OBESE           0.23       0.43    0.30      7       
I-PHI             0.79       0.72    0.75      618     
I-SMOKER          0.74       0.30    0.43      220     
micro-avg         0.73       0.67    0.70      5951    
macro-avg         0.61       0.51    0.53      5951    
weighted-avg      0.73       0.67    0.69      5951   

PREVIOUSPretrained Pipeline for Reading and Skewing Correction in Mixed Scanned and Digital PDF Documents

NEXTFinance E5 Embedding Large