Detect PHI for Deidentification purposes (Italian)

Description

Named Entity Recognition annotators allow for a generic model to be trained by using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired on a former state of the art model for NER: Chiu & Nicols, Named Entity Recognition with Bidirectional LSTM, CNN.

Deidentification NER (Italian) is a Named Entity Recognition model that annotates text to find protected health information that may need to be de-identified. It detects 19 entities. This NER model is trained with a custom dataset internally annotated, a COVID-19 Italian de-identification research dataset making up 15% of the total data (Catelli et al.) and several data augmentation mechanisms.

Predicted Entities

DATE, AGE, SEX, PROFESSION, ORGANIZATION, PHONE, EMAIL, ZIP, STREET, CITY, COUNTRY, PATIENT, DOCTOR, HOSPITAL, MEDICALRECORD, SSN, IDNUM, USERNAME, URL

Live Demo Open in Colab Copy S3 URI

How to use

documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "it")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("word_embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity", "it", "clinical/models")\
    .setInputCols(["sentence","token", "word_embeddings"])\
    .setOutputCol("ner")

nlpPipeline = Pipeline(stages=[
documentAssembler,
sentenceDetector,
tokenizer,
embeddings,
clinical_ner])

text = ["Ho visto Gastone Montanariello (49 anni) riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015."]

data = spark.createDataFrame([text]).toDF("text")

results = nlpPipeline.fit(data).transform(data)
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "it")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity", "it", "clinical/models")
    .setInputCols(Array("sentence","token","embeddings"))
    .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(
    documentAssembler, 
    sentenceDetector, 
    tokenizer, 
    embeddings, 
    clinical_ner))

val text = "Ho visto Gastone Montanariello (49 anni) riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015."

val data = Seq(text).toDF("text")

val results = pipeline.fit(data).transform(data)
import nlu
nlu.load("it.med_ner.deid_subentity").predict("""Ho visto Gastone Montanariello (49 anni) riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015.""")

Results

+-------------+----------+
|        token| ner_label|
+-------------+----------+
|           Ho|         O|
|        visto|         O|
|      Gastone| B-PATIENT|
|Montanariello| I-PATIENT|
|            (|         O|
|           49|     B-AGE|
|         anni|         O|
|            )|         O|
|     riferito|         O|
|          all|         O|
|            '|         O|
|     Ospedale|B-HOSPITAL|
|          San|I-HOSPITAL|
|      Camillo|I-HOSPITAL|
|          per|         O|
|      diabete|         O|
|          mal|         O|
|  controllato|         O|
|          con|         O|
|      sintomi|         O|
|    risalenti|         O|
|            a|         O|
|        marzo|    B-DATE|
|         2015|    I-DATE|
|            .|         O|
+-------------+----------+

Model Information

Model Name: ner_deid_subentity
Compatibility: Healthcare NLP 3.4.2+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: it
Size: 15.0 MB

References

Benchmarking

label      tp    fp    fn   total  precision  recall      f1
PATIENT   263.0  29.0  25.0   288.0     0.9007  0.9132  0.9069
HOSPITAL   365.0  36.0  48.0   413.0     0.9102  0.8838  0.8968
DATE  1164.0  13.0  26.0  1190.0      0.989  0.9782  0.9835
ORGANIZATION    72.0  25.0  26.0    98.0     0.7423  0.7347  0.7385
URL    41.0   0.0   0.0    41.0        1.0     1.0     1.0
CITY   421.0   9.0  19.0   440.0     0.9791  0.9568  0.9678
STREET   198.0   4.0   6.0   204.0     0.9802  0.9706  0.9754
USERNAME    20.0   2.0   2.0    22.0     0.9091  0.9091  0.9091
SEX   753.0  26.0  21.0   774.0     0.9666  0.9729  0.9697
IDNUM   113.0   3.0   7.0   120.0     0.9741  0.9417  0.9576
EMAIL   148.0   0.0   0.0   148.0        1.0     1.0     1.0
ZIP   148.0   3.0   1.0   149.0     0.9801  0.9933  0.9867
MEDICALRECORD    19.0   3.0   6.0    25.0     0.8636    0.76  0.8085
SSN    13.0   1.0   1.0    14.0     0.9286  0.9286  0.9286
PROFESSION   316.0  28.0  53.0   369.0     0.9186  0.8564  0.8864
PHONE    53.0   0.0   2.0    55.0        1.0  0.9636  0.9815
COUNTRY   182.0  14.0  15.0   197.0     0.9286  0.9239  0.9262
DOCTOR   769.0  77.0  62.0   831.0      0.909  0.9254  0.9171
AGE   763.0   8.0  18.0   781.0     0.9896   0.977  0.9832
macro       -     -     -       -          -       -  0.9328
micro       -     -     -       -          -       -  0.9494