Description
Named Entity Recognition annotators allow a generic model to be trained using a deep learning architecture (character CNNs, BiLSTM, CRF, word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.
Deidentification NER (Italian) is a Named Entity Recognition model that annotates text to find protected health information (PHI) that may need to be de-identified. It detects 8 entities. The model was trained on an internally annotated custom dataset and a COVID-19 Italian de-identification research dataset (Catelli et al.), the latter making up 15% of the total data, with several data augmentation mechanisms applied.
Predicted Entities
CONTACT, NAME, DATE, ID, LOCATION, PROFESSION, AGE, SEX
How to use
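# Python: assumes an active Spark session named `spark` with Spark NLP for Healthcare installed
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, WordEmbeddingsModel
from sparknlp_jsl.annotator import MedicalNerModel

documentAssembler = DocumentAssembler()\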
    .setInputCol("text")\
    .setOutputCol("document")
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")
tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")
embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "it")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("word_embeddings")
clinical_ner = MedicalNerModel.pretrained("ner_deid_generic", "it", "clinical/models")\
    .setInputCols(["sentence","token", "word_embeddings"])\
    .setOutputCol("ner")
nlpPipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    clinical_ner])
text = ["Ho visto Gastone Montanariello (49 anni) riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015."]
data = spark.createDataFrame([text]).toDF("text")
results = nlpPipeline.fit(data).transform(data)
// Scala
val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols(Array("document"))
    .setOutputCol("sentence")
val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "it")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")
val clinical_ner = MedicalNerModel.pretrained("ner_deid_generic", "it", "clinical/models")
    .setInputCols(Array("sentence","token","embeddings"))
    .setOutputCol("ner")
val pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    clinical_ner))
val text = "Ho visto Gastone Montanariello (49 anni) riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015."
val data = Seq(text).toDF("text")
val results = pipeline.fit(data).transform(data)
# Python, using the NLU library
import nlu
nlu.load("it.med_ner.deid_generic").predict("""Ho visto Gastone Montanariello (49 anni) riferito all' Ospedale San Camillo per diabete mal controllato con sintomi risalenti a marzo 2015.""")
Results
+-------------+----------+
|        token| ner_label|
+-------------+----------+
|           Ho|         O|
|        visto|         O|
|      Gastone|    B-NAME|
|Montanariello|    I-NAME|
|            (|         O|
|           49|     B-AGE|
|         anni|         O|
|            )|         O|
|     riferito|         O|
|          all|         O|
|            '|         O|
|     Ospedale|B-LOCATION|
|          San|I-LOCATION|
|      Camillo|I-LOCATION|
|          per|         O|
|      diabete|         O|
|          mal|         O|
|  controllato|         O|
|          con|         O|
|      sintomi|         O|
|    risalenti|         O|
|            a|         O|
|        marzo|    B-DATE|
|         2015|    I-DATE|
|            .|         O|
+-------------+----------+
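The labels in this table follow the BIO scheme: `B-` opens an entity, `I-` continues it, and `O` marks tokens outside any entity. As a minimal sketch of how such token-level output can be grouped into whole entities, the plain-Python helper below merges consecutive tags into chunks. The function name `bio_to_chunks` is hypothetical and is not part of Spark NLP; in a real pipeline a converter stage such as `NerConverter` would normally do this job.

```python
from typing import List, Optional, Tuple


def bio_to_chunks(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) chunks.

    Hypothetical helper for illustration; not a Spark NLP API.
    """
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    def flush() -> None:
        # Emit the open chunk, if any, joining its tokens with spaces.
        if current_tokens and current_type is not None:
            chunks.append((" ".join(current_tokens), current_type))

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A B- tag always starts a new chunk.
            flush()
            current_tokens, current_type = [token], label[2:]
        elif label.startswith("I-") and current_type == label[2:]:
            # An I- tag of the same type extends the open chunk.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not match the open chunk,
            # closes the open chunk; the token itself is not kept.
            flush()
            current_tokens, current_type = [], None
    flush()
    return chunks


# Tokens and labels copied from the Results table above.
tokens = ["Ho", "visto", "Gastone", "Montanariello", "(", "49", "anni", ")",
          "riferito", "all", "'", "Ospedale", "San", "Camillo", "per",
          "diabete", "mal", "controllato", "con", "sintomi", "risalenti",
          "a", "marzo", "2015", "."]
labels = ["O", "O", "B-NAME", "I-NAME", "O", "B-AGE", "O", "O",
          "O", "O", "O", "B-LOCATION", "I-LOCATION", "I-LOCATION", "O",
          "O", "O", "O", "O", "O", "O",
          "O", "B-DATE", "I-DATE", "O"]

chunks = bio_to_chunks(tokens, labels)
for text, entity_type in chunks:
    print(f"{entity_type}: {text}")
```

For the example sentence this yields four chunks: a NAME, an AGE, a LOCATION, and a DATE. Joining tokens with single spaces is a simplification; recovering the exact original span would require character offsets, which the table does not show.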
Model Information
| Model Name: | ner_deid_generic | 
| Compatibility: | Healthcare NLP 3.4.2+ | 
| License: | Licensed | 
| Edition: | Official | 
| Input Labels: | [sentence, token, embeddings] | 
| Output Labels: | [ner] | 
| Language: | it | 
| Size: | 15.0 MB | 
References
Benchmarking
label           tp    fp     fn   total  precision  recall      f1
CONTACT      244.0   1.0    0.0   244.0     0.9959     1.0   0.998
NAME        1082.0  69.0   59.0  1141.0     0.9401  0.9483  0.9442
DATE        1173.0  26.0   17.0  1190.0     0.9783  0.9857   0.982
ID           138.0   2.0   21.0   159.0     0.9857  0.8679  0.9231
SEX          742.0  21.0   32.0   774.0     0.9725  0.9587  0.9655
LOCATION    1039.0  64.0  108.0  1147.0      0.942  0.9058  0.9236
PROFESSION   300.0  15.0   69.0   369.0     0.9524   0.813  0.8772
AGE          746.0   5.0   35.0   781.0     0.9933  0.9552  0.9739
macro            -     -      -       -          -       -  0.9484
micro            -     -      -       -          -       -  0.9521
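The per-label columns are related by the usual definitions: precision = tp / (tp + fp), recall = tp / (tp + fn), and F1 is their harmonic mean; `total` equals tp + fn. The sketch below recomputes the per-label scores and the macro F1 from the counts in the table. It is an illustration under those standard definitions, not the evaluation script used to produce the benchmark, and the names `counts`, `prf`, and `macro_f1` are hypothetical. The micro row is not covered.

```python
from typing import Dict, Tuple

# (tp, fp, fn) per label, copied from the benchmarking table above.
counts: Dict[str, Tuple[int, int, int]] = {
    "CONTACT": (244, 1, 0),
    "NAME": (1082, 69, 59),
    "DATE": (1173, 26, 17),
    "ID": (138, 2, 21),
    "SEX": (742, 21, 32),
    "LOCATION": (1039, 64, 108),
    "PROFESSION": (300, 15, 69),
    "AGE": (746, 5, 35),
}


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return unrounded (precision, recall, f1) from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = 2 * tp + fp + fn
    # 2*tp / (2*tp + fp + fn) is algebraically the harmonic mean of P and R.
    f1 = 2 * tp / denom if denom else 0.0
    return precision, recall, f1


scores = {label: prf(*c) for label, c in counts.items()}

# Macro F1: unweighted mean of the per-label F1 values.
macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(scores)

for label, (p, r, f1) in scores.items():
    print(f"{label:<10} precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
print(f"macro f1={macro_f1:.4f}")
```

Rounded to four decimals, the values printed here agree with the per-label rows and the macro F1 reported in the table.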