Detect PHI for Deidentification (LangTest - Subentity - Augmented)

Description

The Named Entity Recognition (NER) annotator trains a generic model using a deep learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art NER model: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Deidentification NER is a Named Entity Recognition model that annotates text to find protected health information (PHI) that may need to be de-identified. It detects 23 entities. This NER model is trained on a combination of the i2b2 train set and a re-augmented version of the i2b2 train set. It is a version of the ner_deid_subentity_augmented model that has been augmented with the LangTest library.

test_type	before fail_count	after fail_count	before pass_count	after pass_count	minimum pass_rate	before pass_rate	after pass_rate
add_typo	306	256	17377	17416	95%	98%	99%
lowercase	910	336	15226	15800	95%	94%	98%
swap_entities	358	322	3688	3734	95%	91%	92%
titlecase	396	237	17014	17173	95%	98%	99%
uppercase	1096	500	16262	16858	95%	94%	97%
weighted average	3066	1651	69567	70981	95%	95.78%	97.73%
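
The pass rates in the table appear to follow from the counts as pass_count / (pass_count + fail_count), and the weighted average row matches the rate obtained by pooling the counts of all five tests. The Python sketch below recomputes them as a sanity check. The dictionary layout and function names are illustrative assumptions and are not part of the LangTest API.

```python
# Counts copied from the robustness table: (fail_count, pass_count)
# before and after augmentation. The layout is illustrative, not LangTest's API.
counts = {
    "add_typo":      {"before": (306, 17377),  "after": (256, 17416)},
    "lowercase":     {"before": (910, 15226),  "after": (336, 15800)},
    "swap_entities": {"before": (358, 3688),   "after": (322, 3734)},
    "titlecase":     {"before": (396, 17014),  "after": (237, 17173)},
    "uppercase":     {"before": (1096, 16262), "after": (500, 16858)},
}
MINIMUM_PASS_RATE = 95.0  # "minimum pass_rate" column of the table


def pass_rate(fail_count, pass_count):
    """Percentage of test cases that passed."""
    return 100.0 * pass_count / (fail_count + pass_count)


def pooled_rate(phase):
    """Pass rate over all tests combined, i.e. weighted by test-case count."""
    fails = sum(c[phase][0] for c in counts.values())
    passes = sum(c[phase][1] for c in counts.values())
    return pass_rate(fails, passes)


for test, phases in counts.items():
    before = pass_rate(*phases["before"])
    after = pass_rate(*phases["after"])
    status = "meets" if after >= MINIMUM_PASS_RATE else "below"
    print(f"{test:14s} {before:6.2f}% -> {after:6.2f}% ({status} minimum)")

print(f"{'pooled':14s} {pooled_rate('before'):6.2f}% -> {pooled_rate('after'):6.2f}%")
```

Under this reading, swap_entities is the only test that stays below the 95% minimum after augmentation, in line with the 92% shown in the table.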

We followed the official annotation guidelines (AG) of the 2014 i2b2 Deid challenge when annotating new datasets for this model. Details and explanations of the AG are available at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/

Predicted Entities

MEDICALRECORD, ORGANIZATION, DOCTOR, USERNAME, PROFESSION, HEALTHPLAN, URL, CITY, DATE, LOCATION-OTHER, STATE, PATIENT, DEVICE, COUNTRY, ZIP, PHONE, HOSPITAL, EMAIL, IDNUM, STREET, BIOID, FAX, AGE


How to use

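# Python
# Assumes an active Spark session named `spark` with Spark NLP for Healthcare loaded.
import pandas as pd
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, WordEmbeddingsModel, NerConverter
from sparknlp_jsl.annotator import MedicalNerModel

document_assembler = DocumentAssembler()\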
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

deid_ner = MedicalNerModel.pretrained("ner_deid_subentity_augmented_langtest", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk_subentity")

nlpPipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    deid_ner,
    ner_converter])

model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

results = model.transform(spark.createDataFrame(pd.DataFrame({"text": ["""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25-year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine."""]})))

// Scala
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val deid_ner = MedicalNerModel.pretrained("ner_deid_subentity_augmented_langtest", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk_subentity")

val nlpPipeline = new Pipeline().setStages(Array(
    document_assembler, 
    sentence_detector, 
    tokenizer, 
    word_embeddings, 
    deid_ner, 
    ner_converter))

val data = Seq("""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25-year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.""").toDS.toDF("text")

val result = nlpPipeline.fit(data).transform(data)

Results

+-----------------------------+-------------+
|chunk                        |ner_label    |
+-----------------------------+-------------+
|2093-01-13                   |DATE         |
|David Hale                   |DOCTOR       |
|Hendrickson, Ora             |PATIENT      |
|7194334                      |MEDICALRECORD|
|01/13/93                     |DATE         |
|Oliveira                     |DOCTOR       |
|25-year-old                  |AGE          |
|1-11-2000                    |DATE         |
|Cocke County Baptist Hospital|HOSPITAL     |
|0295 Keats Street            |STREET       |
|(302) 786-5227               |PHONE        |
|Brothers Coal-Mine           |ORGANIZATION |
+-----------------------------+-------------+
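
To illustrate how such output might feed a de-identification step, the plain-Python sketch below replaces each detected chunk in the sample text with its label. It is a simplified illustration and not the library's own de-identification annotator: the `mask_chunks` helper is hypothetical, and it relies on string replacement, whereas a real pipeline would more likely use the begin and end offsets carried by each chunk annotation.

```python
# Sample text and detected chunks copied from the example above.
text = (
    "A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora "
    "MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25-year-old, "
    "Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. "
    "Phone +1 (302) 786-5227. Patient's complaints first surfaced when he "
    "started working for Brothers Coal-Mine."
)

chunks = [
    ("2093-01-13", "DATE"),
    ("David Hale", "DOCTOR"),
    ("Hendrickson, Ora", "PATIENT"),
    ("7194334", "MEDICALRECORD"),
    ("01/13/93", "DATE"),
    ("Oliveira", "DOCTOR"),
    ("25-year-old", "AGE"),
    ("1-11-2000", "DATE"),
    ("Cocke County Baptist Hospital", "HOSPITAL"),
    ("0295 Keats Street", "STREET"),
    ("(302) 786-5227", "PHONE"),
    ("Brothers Coal-Mine", "ORGANIZATION"),
]


def mask_chunks(text, chunks):
    """Hypothetical helper: replace every detected chunk with <LABEL>.

    Longer chunks are replaced first so that a short chunk can never
    clobber part of a longer one.
    """
    masked = text
    for chunk, label in sorted(chunks, key=lambda c: len(c[0]), reverse=True):
        masked = masked.replace(chunk, f"<{label}>")
    return masked


masked = mask_chunks(text, chunks)
print(masked)
```

Note that the PHONE chunk does not include the "+1" prefix, so that prefix remains in the masked text.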

Model Information

Model Name:	ner_deid_subentity_augmented_langtest
Compatibility:	Healthcare NLP 5.1.0+
License:	Licensed
Edition:	Official
Input Labels:	[sentence, token, embeddings]
Output Labels:	[ner]
Language:	en
Size:	14.8 MB

References

The model was trained on a custom dataset created from the i2b2-PHI train set and a re-augmented version of the i2b2-PHI train set.

Benchmarking

label           precision  recall  f1-score  support 
AGE             0.96       0.93    0.94      377     
BIOID           0.00       0.00    0.00      1       
CITY            0.89       0.82    0.85      104     
COUNTRY         0.88       0.86    0.87      57      
DATE            0.98       0.99    0.99      2375    
DEVICE          0.83       0.71    0.77      7       
DOCTOR          0.96       0.91    0.93      918     
FAX             0.00       0.00    0.00      2       
HOSPITAL        0.90       0.91    0.90      410     
IDNUM           0.83       0.66    0.74      83      
LOCATION-OTHER  0.50       0.75    0.60      8       
MEDICALRECORD   0.91       0.97    0.94      164     
ORGANIZATION    0.78       0.63    0.70      57      
PATIENT         0.85       0.92    0.89      408     
PHONE           0.93       0.88    0.91      112     
PROFESSION      0.84       0.80    0.82      97      
STATE           0.94       0.92    0.93      73      
STREET          0.87       0.95    0.91      77      
USERNAME        0.96       0.90    0.93      52      
ZIP             0.96       1.00    0.98      45      
micro-avg       0.94       0.94    0.94      5427    
macro-avg       0.79       0.78    0.78      5427    
weighted-avg    0.94       0.94    0.94      5427    
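
The macro-avg row can be reproduced as the unweighted mean of the per-label scores. The Python sketch below does so from the rounded values in the table, so it is a consistency check under that assumption and not the original evaluation script. The variable and function names are illustrative.

```python
# (precision, recall, f1-score, support) per label, copied from the table.
scores = {
    "AGE":            (0.96, 0.93, 0.94, 377),
    "BIOID":          (0.00, 0.00, 0.00, 1),
    "CITY":           (0.89, 0.82, 0.85, 104),
    "COUNTRY":        (0.88, 0.86, 0.87, 57),
    "DATE":           (0.98, 0.99, 0.99, 2375),
    "DEVICE":         (0.83, 0.71, 0.77, 7),
    "DOCTOR":         (0.96, 0.91, 0.93, 918),
    "FAX":            (0.00, 0.00, 0.00, 2),
    "HOSPITAL":       (0.90, 0.91, 0.90, 410),
    "IDNUM":          (0.83, 0.66, 0.74, 83),
    "LOCATION-OTHER": (0.50, 0.75, 0.60, 8),
    "MEDICALRECORD":  (0.91, 0.97, 0.94, 164),
    "ORGANIZATION":   (0.78, 0.63, 0.70, 57),
    "PATIENT":        (0.85, 0.92, 0.89, 408),
    "PHONE":          (0.93, 0.88, 0.91, 112),
    "PROFESSION":     (0.84, 0.80, 0.82, 97),
    "STATE":          (0.94, 0.92, 0.93, 73),
    "STREET":         (0.87, 0.95, 0.91, 77),
    "USERNAME":       (0.96, 0.90, 0.93, 52),
    "ZIP":            (0.96, 1.00, 0.98, 45),
}


def macro_average(column):
    """Unweighted mean of one score column across all labels."""
    return sum(row[column] for row in scores.values()) / len(scores)


precision, recall, f1 = (round(macro_average(i), 2) for i in range(3))
total_support = sum(row[3] for row in scores.values())

print(f"macro-avg precision={precision} recall={recall} f1={f1}")
print(f"labels={len(scores)} total support={total_support}")
```

Because every label counts equally in the macro average, the two labels with almost no support (BIOID and FAX, both scored 0.00) pull it well below the micro and weighted averages of 0.94. The benchmark lists 20 of the 23 predicted entities; HEALTHPLAN, URL, and EMAIL do not appear in it.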
