Detect PHI for Deidentification purposes (Spanish, RoBERTa embeddings)

Description

Named Entity Recognition annotators allow a generic model to be trained by using a Deep Learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.

Deidentification NER (Spanish) is a Named Entity Recognition model that annotates text to find protected health information (PHI) that may need to be de-identified. It detects 13 entities. The model is trained on a combination of custom datasets, the Spanish CoNLL 2002 corpus, and the MEDDOPROF dataset, with several data augmentation mechanisms, and it uses the roberta_base_biomedical clinical embeddings.

Predicted Entities

PATIENT, HOSPITAL, DATE, ORGANIZATION, MAIL, USERNAME, LOCATION, ZIP, MEDICALRECORD, PROFESSION, PHONE, DOCTOR, AGE
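
If you want to verify the tag set programmatically, the pretrained NER stage can list its classes. A minimal sketch, assuming the clinical_ner stage defined in the How to use section below:

# Prints the BIO tags the model can emit, e.g. ['B-PATIENT', 'I-PATIENT', 'O', ...].
print(clinical_ner.getClasses())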

How to use

# Python
from johnsnowlabs import nlp, medical

spark = nlp.start()  # start a Spark session with the licensed libraries

documentAssembler = nlp.DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

sentenceDetector = nlp.SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

tokenizer = nlp.Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

roberta_embeddings = nlp.RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es")\
        .setInputCols(["sentence", "token"])\
        .setOutputCol("embeddings")

clinical_ner = medical.NerModel.pretrained("ner_deid_subentity_roberta", "es", "clinical/models")\
        .setInputCols(["sentence","token","embeddings"])\
        .setOutputCol("ner")

nlpPipeline = nlp.Pipeline(stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        roberta_embeddings,
        clinical_ner])

text = ['''
Antonio Pérez Juan, nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos.
''']

df = spark.createDataFrame([text]).toDF("text")

results = nlpPipeline.fit(df).transform(df)
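
The pipeline above only tags tokens. For actual de-identification you would typically also group the tags into chunks and mask them. A minimal sketch of that follow-up, assuming the stages defined above (the stage names, columns, and masking mode here are illustrative, not part of this model card's pipeline):

# Group BIO tags into entity chunks, e.g. "Antonio Pérez Juan" -> PATIENT.
ner_converter = medical.NerConverterInternal()\
        .setInputCols(["sentence", "token", "ner"])\
        .setOutputCol("ner_chunk")

# Sketch: replace each detected PHI chunk with its entity label, e.g. <PATIENT>.
deid = medical.DeIdentification()\
        .setInputCols(["sentence", "token", "ner_chunk"])\
        .setOutputCol("deidentified")\
        .setMode("mask")

deidPipeline = nlp.Pipeline(stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        roberta_embeddings,
        clinical_ner,
        ner_converter,
        deid])

masked = deidPipeline.fit(df).transform(df)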

// Scala
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.annotators.ner.MedicalNerModel
import org.apache.spark.ml.Pipeline
import spark.implicits._

val documentAssembler = new DocumentAssembler()
        .setInputCol("text")
        .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","xx")
        .setInputCols(Array("document"))
        .setOutputCol("sentence")

val tokenizer = new Tokenizer()
        .setInputCols(Array("sentence"))
        .setOutputCol("token")

val roberta_embeddings = RoBertaEmbeddings.pretrained("roberta_base_biomedical", "es")
        .setInputCols(Array("sentence", "token"))
        .setOutputCol("embeddings")

val clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity_roberta", "es", "clinical/models")
        .setInputCols(Array("sentence","token","embeddings"))
        .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(
        documentAssembler, 
        sentenceDetector, 
        tokenizer, 
        roberta_embeddings, 
        clinical_ner))

val text = """Antonio Pérez Juan, nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos."""

val df = Seq(text).toDS.toDF("text")

val results = pipeline.fit(df).transform(df)

# NLU
import nlu
nlu.load("es.med_ner.deid.subentity_roberta").predict("""
Antonio Pérez Juan, nacido en Cadiz, España. Aún no estaba vacunado, se infectó con Covid-19 el dia 14 de Marzo y tuvo que ir al Hospital. Fue tratado con anticuerpos monoclonales en la Clinica San Carlos.
""")

Results

+------------+----------+
|       token| ner_label|
+------------+----------+
|     Antonio| B-PATIENT|
|       Pérez| I-PATIENT|
|        Juan| I-PATIENT|
|           ,|         O|
|      nacido|         O|
|          en|         O|
|       Cadiz|B-LOCATION|
|           ,|         O|
|      España|B-LOCATION|
|           .|         O|
|         Aún|         O|
|          no|         O|
|      estaba|         O|
|    vacunado|         O|
|           ,|         O|
|          se|         O|
|     infectó|         O|
|         con|         O|
|    Covid-19|         O|
|          el|         O|
|         dia|         O|
|          14|    B-DATE|
|          de|    I-DATE|
|       Marzo|    I-DATE|
|           y|         O|
|        tuvo|         O|
|         que|         O|
|          ir|         O|
|          al|         O|
|    Hospital|         O|
|         Fue|         O|
|     tratado|         O|
|         con|         O|
| anticuerpos|         O|
|monoclonales|         O|
|          en|         O|
|          la|         O|
|     Clinica|B-HOSPITAL|
|         San|I-HOSPITAL|
|      Carlos|I-HOSPITAL|
|           .|         O|
+------------+----------+
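
The token/label table above can be reproduced from the Python results DataFrame by zipping the token and NER annotation arrays and exploding them into one row per token (a standard Spark NLP pattern; the column names follow the pipeline above):

from pyspark.sql import functions as F

# One row per (token, predicted tag) pair.
results.select(F.explode(F.arrays_zip(results.token.result,
                                      results.ner.result)).alias("cols"))\
       .select(F.expr("cols['0']").alias("token"),
               F.expr("cols['1']").alias("ner_label"))\
       .show(50, truncate=False)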

Model Information

Model Name: ner_deid_subentity_roberta
Compatibility: Healthcare NLP 3.3.4+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: es
Size: 16.3 MB
Dependencies: roberta_base_biomedical

Data Source

Custom datasets combined with the Spanish CoNLL 2002 corpus and the MEDDOPROF dataset, plus several data augmentation mechanisms (see Description).

Benchmarking

        label      tp     fp     fn   total  precision  recall      f1
      PATIENT  1946.0  157.0  213.0  2159.0     0.9253  0.9013  0.9132
     HOSPITAL   272.0   82.0   87.0   359.0     0.7684  0.7577   0.763
         DATE  1632.0   24.0   35.0  1667.0     0.9855   0.979  0.9822
 ORGANIZATION  2460.0  479.0  513.0  2973.0      0.837  0.8274  0.8322
         MAIL    58.0    0.0    0.0    58.0        1.0     1.0     1.0
     USERNAME    95.0    1.0   10.0   105.0     0.9896  0.9048  0.9453
     LOCATION  1734.0  416.0  381.0  2115.0     0.8065  0.8199  0.8131
          ZIP    13.0    0.0    4.0    17.0        1.0  0.7647  0.8667
MEDICALRECORD   111.0   11.0   10.0   121.0     0.9098  0.9174  0.9136
   PROFESSION   273.0   72.0  116.0   389.0     0.7913  0.7018  0.7439
        PHONE   108.0   12.0    8.0   116.0        0.9   0.931  0.9153
       DOCTOR   641.0   32.0   46.0   687.0     0.9525   0.933  0.9426
          AGE   284.0   37.0   64.0   348.0     0.8847  0.8161   0.849
        macro     -      -      -       -         -       -    0.88308
        micro     -      -      -       -         -       -    0.87258
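
The macro row is the unweighted mean of the per-label F1 scores, while the micro row is computed from the pooled tp/fp/fn counts. A quick sanity check of both aggregates, with the counts copied from the table above (small differences come from the rounding of the per-label scores):

# (tp, fp, fn, f1) per label, copied from the benchmarking table.
rows = [
    (1946, 157, 213, 0.9132), (272, 82, 87, 0.7630), (1632, 24, 35, 0.9822),
    (2460, 479, 513, 0.8322), (58, 0, 0, 1.0000), (95, 1, 10, 0.9453),
    (1734, 416, 381, 0.8131), (13, 0, 4, 0.8667), (111, 11, 10, 0.9136),
    (273, 72, 116, 0.7439), (108, 12, 8, 0.9153), (641, 32, 46, 0.9426),
    (284, 37, 64, 0.8490),
]

macro = sum(f1 for _, _, _, f1 in rows) / len(rows)    # ~0.8831

tp = sum(r[0] for r in rows)
fp = sum(r[1] for r in rows)
fn = sum(r[2] for r in rows)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
micro = 2 * precision * recall / (precision + recall)  # ~0.8726

print(round(macro, 5), round(micro, 5))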