Detect PHI for Deidentification purposes (French)

Description

Named Entity Recognition annotators allow a generic model to be trained using a deep learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.

Deidentification NER (French) is a Named Entity Recognition model that annotates text to find protected health information (PHI) that may need to be de-identified. It detects 15 entities. The model was trained on an internally annotated custom dataset, the French WikiNER dataset, a public dataset of French company names, and a public dataset of French hospital names, together with several data augmentation mechanisms.
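To make the de-identification step concrete, here is a minimal sketch of what masking detected PHI could look like once entity spans are known. It is plain Python and not the library's own de-identification method; the function name `mask_spans` and the character offsets are assumptions made for illustration.

```python
def mask_spans(text, spans):
    """Replace each (start, end, label) span with <LABEL>; `end` is exclusive."""
    masked = text
    # Work from the rightmost span backwards so earlier offsets stay valid.
    for start, end, label in sorted(spans, reverse=True):
        masked = masked[:start] + f"<{label}>" + masked[end:]
    return masked


# Hypothetical spans, as a downstream step might receive them.
sample = "Michel Martinez (49 ans)"
sample_spans = [(0, 15, "PATIENT"), (17, 19, "AGE")]

masked_sample = mask_spans(sample, sample_spans)
print(masked_sample)
```

Processing spans from right to left avoids recomputing offsets after each replacement changes the string length.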

Predicted Entities

PATIENT, HOSPITAL, DATE, ORGANIZATION, E-MAIL, USERNAME, ZIP, MEDICALRECORD, PROFESSION, PHONE, DOCTOR, AGE, STREET, CITY, COUNTRY

How to use

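# Python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, WordEmbeddingsModel
from sparknlp_jsl.annotator import MedicalNerModel

# Assumes an active Spark session named `spark` with Healthcare NLP loaded.
documentAssembler = DocumentAssembler()\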
	.setInputCol("text")\
	.setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
	.setInputCols(["document"])\
	.setOutputCol("sentence")

tokenizer = Tokenizer()\
	.setInputCols(["sentence"])\
	.setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "fr")\
	.setInputCols(["sentence", "token"])\
	.setOutputCol("word_embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity", "fr", "clinical/models")\
	.setInputCols(["sentence","token", "word_embeddings"])\
	.setOutputCol("ner")

nlpPipeline = Pipeline(stages=[
	documentAssembler,
	sentenceDetector,
	tokenizer,
	embeddings,
	clinical_ner])

text = ["J'ai vu en consultation Michel Martinez (49 ans) adressé au Centre Hospitalier De Plaisir pour un diabète mal contrôlé avec des symptômes datant de Mars 2015."]

data = spark.createDataFrame([text]).toDF("text")

results = nlpPipeline.fit(data).transform(data)
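
// Scala
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.annotators.ner.MedicalNerModel
import org.apache.spark.ml.Pipeline
import spark.implicits._ // needed for Seq(...).toDF

val documentAssembler = new DocumentAssembler()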
	.setInputCol("text")
	.setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
	.setInputCols("document")
	.setOutputCol("sentence")

val tokenizer = new Tokenizer()
	.setInputCols("sentence")
	.setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "fr")
	.setInputCols(Array("sentence", "token"))
	.setOutputCol("embeddings")

val clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity", "fr", "clinical/models")
	.setInputCols(Array("sentence","token","embeddings"))
	.setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(
	documentAssembler, 
	sentenceDetector, 
	tokenizer, 
	embeddings, 
	clinical_ner))

val text = "J'ai vu en consultation Michel Martinez (49 ans) adressé au Centre Hospitalier De Plaisir pour un diabète mal contrôlé avec des symptômes datant de Mars 2015."

val data = Seq(text).toDF("text")

val results = pipeline.fit(data).transform(data)

# Python, using the NLU library
import nlu
nlu.load("fr.med_ner.deid_subentity").predict("""J'ai vu en consultation Michel Martinez (49 ans) adressé au Centre Hospitalier De Plaisir pour un diabète mal contrôlé avec des symptômes datant de Mars 2015.""")

Results

+------------+----------+
|       token| ner_label|
+------------+----------+
|        J'ai|         O|
|          vu|         O|
|          en|         O|
|consultation|         O|
|      Michel| B-PATIENT|
|    Martinez| I-PATIENT|
|           (|         O|
|          49|     B-AGE|
|         ans|         O|
|           )|         O|
|     adressé|         O|
|          au|         O|
|      Centre|B-HOSPITAL|
| Hospitalier|I-HOSPITAL|
|          De|I-HOSPITAL|
|     Plaisir|I-HOSPITAL|
|        pour|         O|
|          un|         O|
|     diabète|         O|
|         mal|         O|
|    contrôlé|         O|
|        avec|         O|
|         des|         O|
|   symptômes|         O|
|      datant|         O|
|          de|         O|
|        Mars|    B-DATE|
|        2015|    I-DATE|
|           .|         O|
+------------+----------+
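
The table above is token-level, with B- and I- prefixes marking where an entity begins and continues. A minimal sketch of how such tags could be grouped into whole entities follows. It is plain Python, independent of the pipeline, and the function name `group_bio` is an assumption made for illustration; inside a Spark NLP pipeline this grouping is normally handled by a converter stage rather than by hand.

```python
def group_bio(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, label) pairs."""
    entities = []
    current_label = None
    current_tokens = []

    def flush():
        # Emit the entity collected so far, if any.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A B- tag, or an I- tag with no matching open entity, starts a new entity.
            flush()
            current_label, current_tokens = label, [token]
        elif prefix == "I":
            current_tokens.append(token)
        else:
            # "O" closes any open entity.
            flush()
            current_label, current_tokens = None, []
    flush()
    return entities


# Tokens and labels copied from part of the Results table.
tokens = ["consultation", "Michel", "Martinez", "(", "49", "ans", ")",
          "au", "Centre", "Hospitalier", "De", "Plaisir",
          "de", "Mars", "2015", "."]
tags = ["O", "B-PATIENT", "I-PATIENT", "O", "B-AGE", "O", "O",
        "O", "B-HOSPITAL", "I-HOSPITAL", "I-HOSPITAL", "I-HOSPITAL",
        "O", "B-DATE", "I-DATE", "O"]

entities = group_bio(tokens, tags)
for text, label in entities:
    print(f"{label}: {text}")
```

Joining tokens with a single space is a simplification; recovering the exact original text would require the character offsets that the annotations carry.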

Model Information

Model Name: ner_deid_subentity
Compatibility: Healthcare NLP 3.4.1+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: fr
Size: 15.0 MB

Benchmarking

label              tp      fp      fn   total  precision  recall      f1
PATIENT        1966.0   124.0   135.0  2101.0     0.9407  0.9357  0.9382
HOSPITAL        315.0    23.0    19.0   334.0      0.932  0.9431  0.9375
DATE           2605.0    31.0    49.0  2654.0     0.9882  0.9815  0.9849
ORGANIZATION    503.0   142.0   159.0   662.0     0.7798  0.7598  0.7697
CITY           2296.0   370.0   351.0  2647.0     0.8612  0.8674  0.8643
MAIL             46.0     0.0     0.0    46.0        1.0     1.0     1.0
STREET           31.0     4.0     3.0    34.0     0.8857  0.9118  0.8986
USERNAME         91.0     1.0    14.0   105.0     0.9891  0.8667  0.9239
ZIP              33.0     0.0     0.0    33.0        1.0     1.0     1.0
MEDICALRECORD   100.0    11.0     2.0   102.0     0.9009  0.9804   0.939
PROFESSION      321.0    59.0    87.0   408.0     0.8447  0.7868  0.8147
PHONE           114.0     3.0     2.0   116.0     0.9744  0.9828  0.9785
COUNTRY         287.0    14.0    51.0   338.0     0.9535  0.8491  0.8983
DOCTOR          622.0     7.0     4.0   626.0     0.9889  0.9936  0.9912
AGE             370.0    52.0    71.0   441.0     0.8768   0.839  0.8575
macro               -       -       -       -          -       -  0.9197
micro               -       -       -       -          -       -  0.9154
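
For readers who want to check how the per-label columns relate to each other, the sketch below recomputes precision, recall and F1 from the tp, fp and fn counts in the table, then averages the per-label F1 scores to get the macro figure. It assumes the standard definitions (precision = tp / (tp + fp), recall = tp / (tp + fn), total = tp + fn); the helper name `prf` is hypothetical and this is not necessarily the evaluation script used to produce the table.

```python
def prf(tp, fp, fn):
    """Return (precision, recall, f1) from raw counts, guarding against zero division."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# (tp, fp, fn) per label, copied from the benchmarking table.
counts = {
    "PATIENT": (1966, 124, 135),
    "HOSPITAL": (315, 23, 19),
    "DATE": (2605, 31, 49),
    "ORGANIZATION": (503, 142, 159),
    "CITY": (2296, 370, 351),
    "MAIL": (46, 0, 0),
    "STREET": (31, 4, 3),
    "USERNAME": (91, 1, 14),
    "ZIP": (33, 0, 0),
    "MEDICALRECORD": (100, 11, 2),
    "PROFESSION": (321, 59, 87),
    "PHONE": (114, 3, 2),
    "COUNTRY": (287, 14, 51),
    "DOCTOR": (622, 7, 4),
    "AGE": (370, 52, 71),
}

scores = {label: prf(*c) for label, c in counts.items()}

# Macro F1 is the unweighted mean of the per-label F1 scores.
macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(scores)

for label, (p, r, f1) in scores.items():
    print(f"{label:<13} {p:.4f} {r:.4f} {f1:.4f}")
print(f"macro F1: {macro_f1:.4f}")
```

Because macro averaging weights every label equally, low-support labels such as ZIP and MAIL influence the macro score as much as high-support labels such as DATE and CITY.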