Detect PHI for Deidentification (ner_deidentify_dl)

Description

The Named Entity Recognition annotator (NerDLModel) allows a generic model to be trained with a deep learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art NER model: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Deidentification NER (DL) is a Named Entity Recognition model that annotates text to find protected health information (PHI) that may need to be deidentified.

Predicted Entities

AGE, BIOID, CITY, COUNTRY, DATE, DEVICE, DOCTOR, EMAIL, FAX, HEALTHPLAN, HOSPITAL, IDNUM, LOCATION-OTHER, MEDICALRECORD, ORGANIZATION, PATIENT, PHONE, PROFESSION, STATE, STREET, URL, USERNAME, ZIP

How to use

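# Python
# Assumes `spark` is an active SparkSession started with the licensed Healthcare NLP library.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, WordEmbeddingsModel, NerConverter
from sparknlp_jsl.annotator import MedicalNerModel

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")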
         
sentence_detector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner = MedicalNerModel.pretrained("ner_deidentify_dl", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

# Merge token-level B-/I- tags into entity chunks
ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlp_pipeline = Pipeline(stages=[
    document_assembler, 
    sentence_detector, 
    tokenizer, 
    word_embeddings, 
    ner, 
    ner_converter])

model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

results = model.transform(spark.createDataFrame([["A . Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson , Ora MR . # 7194334 Date : 01/13/93 PCP : Oliveira , 25 month years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital . 0295 Keats Street"]], ["text"]))

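// Scala
// Assumes `spark` is an active SparkSession started with the licensed Healthcare NLP library.
import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.annotators.ner.MedicalNerModel
import spark.implicits._ // needed for .toDS below

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")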
         
val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner = MedicalNerModel.pretrained("ner_deidentify_dl", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

// Merge token-level B-/I- tags into entity chunks
val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(
    document_assembler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    ner,
    ner_converter))

val data = Seq("""A . Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson , Ora MR . # 7194334 Date : 01/13/93 PCP : Oliveira , 25 month years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital . 0295 Keats Street""").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)

# Python, using the NLU library
import nlu
nlu.load("en.med_ner.deid").predict("""A . Record date : 2093-01-13 , David Hale , M.D . , Name : Hendrickson , Ora MR . # 7194334 Date : 01/13/93 PCP : Oliveira , 25 month years-old , Record date : 2079-11-09 . Cocke County Baptist Hospital . 0295 Keats Street""")

Results

+---------------+-----+
|ner_label      |count|
+---------------+-----+
|O              |28   |
|I-HOSPITAL     |4    |
|B-DATE         |3    |
|I-STREET       |3    |
|I-PATIENT      |2    |
|B-DOCTOR       |2    |
|B-AGE          |1    |
|B-PATIENT      |1    |
|I-DOCTOR       |1    |
|B-MEDICALRECORD|1    |
+---------------+-----+

+-----------------------------+-------------+
|chunk                        |ner_label    |
+-----------------------------+-------------+
|2093-01-13                   |DATE         |
|David Hale                   |DOCTOR       |
|Hendrickson , Ora            |PATIENT      |
|7194334                      |MEDICALRECORD|
|01/13/93                     |DATE         |
|Oliveira                     |DOCTOR       |
|25                           |AGE          |
|2079-11-09                   |DATE         |
|Cocke County Baptist Hospital|HOSPITAL     |
|0295 Keats Street            |STREET       |
+-----------------------------+-------------+
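
The two tables above are related: the first counts token-level IOB tags, and the second lists the chunks obtained by merging each B- tag with the I- tags that follow it. The sketch below illustrates that grouping in plain Python. It is not the library's NerConverter implementation, the function name `iob_to_chunks` is hypothetical, and the per-token tags in the example are assumed for illustration (the page only shows tag counts and final chunks).

```python
from typing import List, Optional, Tuple


def iob_to_chunks(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (chunk_text, label) pairs.

    Illustrative only: a simplified stand-in for what a chunk converter does.
    """
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    def flush() -> None:
        # Close the chunk that is currently open, if any.
        nonlocal current_tokens, current_label
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_label))
        current_tokens, current_label = [], None

    for token, tag in zip(tokens, tags):
        if tag == "O":
            flush()
            continue
        # Split only on the first hyphen so labels such as LOCATION-OTHER survive.
        prefix, _, label = tag.partition("-")
        # A B- tag, or an I- tag with a different label, starts a new chunk.
        if prefix == "B" or label != current_label:
            flush()
            current_label = label
        current_tokens.append(token)

    flush()
    return chunks


# Assumed token tags for the start of the sample sentence (not taken from the model output).
tokens = ["Record", "date", ":", "2093-01-13", ",", "David", "Hale", ",", "M.D"]
tags = ["O", "O", "O", "B-DATE", "O", "B-DOCTOR", "I-DOCTOR", "O", "O"]

for chunk, label in iob_to_chunks(tokens, tags):
    print(f"{chunk}\t{label}")
```

Under these assumed tags, the sketch yields the first two rows of the chunk table (2093-01-13 as DATE and David Hale as DOCTOR).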

Model Information

Model Name:	ner_deidentify_dl
Compatibility:	Healthcare NLP 3.0.0+
License:	Licensed
Edition:	Official
Input Labels:	[sentence, token, embeddings]
Output Labels:	[ner]
Language:	en

Data Source

Trained on the JSL-enriched n2c2 2014 De-identification and Heart Disease Risk Factors Challenge datasets, using embeddings_clinical. https://portal.dbmi.hms.harvard.edu/projects/n2c2-2014/

Benchmarking

|    | label            |    tp |   fp |   fn |     prec |      rec |       f1 |
|---:|:-----------------|------:|-----:|-----:|---------:|---------:|---------:|
|  1 | I-AGE            |     7 |    3 |    6 | 0.7      | 0.538462 | 0.608696 |
|  2 | I-DOCTOR         |   800 |   27 |   94 | 0.967352 | 0.894855 | 0.929692 |
|  3 | I-IDNUM          |     6 |    0 |    2 | 1        | 0.75     | 0.857143 |
|  4 | B-DATE           |  1883 |   34 |   56 | 0.982264 | 0.971119 | 0.97666  |
|  5 | I-DATE           |   425 |   28 |   25 | 0.93819  | 0.944444 | 0.941307 |
|  6 | B-PHONE          |    29 |    7 |    9 | 0.805556 | 0.763158 | 0.783784 |
|  7 | B-STATE          |    87 |    4 |   11 | 0.956044 | 0.887755 | 0.920635 |
|  8 | B-CITY           |    35 |   11 |   26 | 0.76087  | 0.57377  | 0.654206 |
|  9 | I-ORGANIZATION   |    12 |    4 |   15 | 0.75     | 0.444444 | 0.55814  |
| 10 | B-DOCTOR         |   728 |   75 |   53 | 0.9066   | 0.932138 | 0.919192 |
| 11 | I-PROFESSION     |    43 |   11 |   13 | 0.796296 | 0.767857 | 0.781818 |
| 12 | I-PHONE          |    62 |    4 |    4 | 0.939394 | 0.939394 | 0.939394 |
| 13 | B-AGE            |   234 |   13 |   16 | 0.947368 | 0.936    | 0.94165  |
| 14 | B-STREET         |    20 |    7 |   16 | 0.740741 | 0.555556 | 0.634921 |
| 15 | I-ZIP            |    60 |    3 |    2 | 0.952381 | 0.967742 | 0.96     |
| 16 | I-MEDICALRECORD  |    54 |    5 |    2 | 0.915254 | 0.964286 | 0.93913  |
| 17 | B-ZIP            |     2 |    1 |    0 | 0.666667 | 1        | 0.8      |
| 18 | B-HOSPITAL       |   256 |   23 |   66 | 0.917563 | 0.795031 | 0.851913 |
| 19 | I-STREET         |   150 |   17 |   20 | 0.898204 | 0.882353 | 0.890208 |
| 20 | B-COUNTRY        |    22 |    2 |    8 | 0.916667 | 0.733333 | 0.814815 |
| 21 | I-COUNTRY        |     1 |    0 |    0 | 1        | 1        | 1        |
| 22 | I-STATE          |     6 |    0 |    1 | 1        | 0.857143 | 0.923077 |
| 23 | B-USERNAME       |    30 |    0 |    4 | 1        | 0.882353 | 0.9375   |
| 24 | I-HOSPITAL       |   295 |   37 |   64 | 0.888554 | 0.821727 | 0.853835 |
| 25 | I-PATIENT        |   243 |   26 |   41 | 0.903346 | 0.855634 | 0.878843 |
| 26 | B-PROFESSION     |    52 |    8 |   17 | 0.866667 | 0.753623 | 0.806202 |
| 27 | B-IDNUM          |    32 |    3 |   12 | 0.914286 | 0.727273 | 0.810127 |
| 28 | I-CITY           |    76 |   15 |   13 | 0.835165 | 0.853933 | 0.844444 |
| 29 | B-PATIENT        |   337 |   29 |   40 | 0.920765 | 0.893899 | 0.907133 |
| 30 | B-MEDICALRECORD  |    74 |    6 |    4 | 0.925    | 0.948718 | 0.936709 |
| 31 | B-ORGANIZATION   |    20 |    5 |   13 | 0.8      | 0.606061 | 0.689655 |
| 32 | Macro-average    | 6083  | 408  |  673 | 0.7976   | 0.697533 | 0.744218 |
| 33 | Micro-average    | 6083  | 408  |  673 | 0.937144 | 0.900385 | 0.918397 |
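
The prec, rec, and f1 columns follow from the tp, fp, and fn counts using the standard definitions. The sketch below recomputes a few rows and the micro-average from the counts in the table; the helper name `prf` is hypothetical and the code is a minimal check, not the evaluation script used to produce the benchmark.

```python
from typing import Dict, Tuple


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
    return prec, rec, f1


# (tp, fp, fn) copied from three rows of the table above.
rows: Dict[str, Tuple[int, int, int]] = {
    "I-AGE": (7, 3, 6),
    "B-DATE": (1883, 34, 56),
    "B-CITY": (35, 11, 26),
}

for label, counts in rows.items():
    prec, rec, f1 = prf(*counts)
    print(f"{label:<8} prec={prec:.6f} rec={rec:.6f} f1={f1:.6f}")

# Micro-average: pool the counts over all labels first, then compute the metrics once.
# The totals are the ones reported in the Micro-average row.
micro = prf(6083, 408, 673)
print("micro    prec={:.6f} rec={:.6f} f1={:.6f}".format(*micro))
```

The macro-average, by contrast, averages per-label scores so that rare labels weigh as much as frequent ones, which is why it is lower than the micro-average here. The macro f1 reported in the table is consistent with the harmonic mean of the macro prec and macro rec values.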
