Detect Subentity PHI for Deidentification (Arabic)

Description

Named Entity Recognition annotators allow a generic model to be trained using a deep learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art model for NER: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.

Deidentification NER (Arabic) is a Named Entity Recognition model that annotates text to find protected health information (PHI) that may need to be de-identified. It detects 17 entities. The model is trained on a combination of custom datasets and several data augmentation mechanisms. It uses Word2Vec Arabic Clinical Embeddings.

Predicted Entities

PATIENT, HOSPITAL, DATE, ORGANIZATION, CITY, STREET, USERNAME, SEX, IDNUM, EMAIL, ZIP, MEDICALRECORD, PROFESSION, PHONE, COUNTRY, DOCTOR, AGE
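
NER models of this kind usually emit one tag per token in IOB form (for example B-AGE, I-AGE, O), and a converter stage then groups the tagged tokens into labeled chunks. The sketch below illustrates that grouping step in plain Python. It is not the library's converter: the function name `iob_to_chunks` and the sample tags are hypothetical, chosen only to illustrate the idea.

```python
from typing import List, Tuple


def iob_to_chunks(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (chunk_text, label) pairs."""
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label = None
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        # An I- tag with the same label extends the chunk that is open.
        if prefix == "I" and label == current_label:
            current_tokens.append(token)
            continue
        # Any other tag closes the open chunk, if there is one.
        if current_label is not None:
            chunks.append((" ".join(current_tokens), current_label))
        if prefix in ("B", "I"):
            # B- (or a stray I-) starts a new chunk.
            current_tokens, current_label = [token], label
        else:
            # "O" means the token is outside any entity.
            current_tokens, current_label = [], None
    if current_label is not None:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks


# Hypothetical token-level tags for the start of an Arabic clinical sentence.
tokens = ["عالج", "الدكتور", "محمد", "المريض", "أحمد", "البالغ", "من", "العمر", "55", "سنة"]
tags = ["O", "O", "B-DOCTOR", "O", "O", "O", "O", "O", "B-AGE", "I-AGE"]
chunks = iob_to_chunks(tokens, tags)
print(chunks)
```

Joining tokens with a single space is a simplification; a real converter would rebuild the chunk from the character offsets of the first and last token.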


How to use

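# Python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, WordEmbeddingsModel
from sparknlp_jsl.annotator import MedicalNerModel, NerConverterInternal

# Assumes `spark` is an active session started with the licensed Healthcare NLP library.

# Wrap the raw text column in a document annotation
documentAssembler = DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

# Split documents into sentences (multilingual model)
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

tokenizer = Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

# Arabic Word2Vec embeddings, computed per sentence to match the NER stage inputs
embeddings = WordEmbeddingsModel.pretrained("arabic_w2v_cc_300d", "ar")\
        .setInputCols(["sentence", "token"])\
        .setOutputCol("embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity", "ar", "clinical/models")\
        .setInputCols(["sentence", "token", "embeddings"])\
        .setOutputCol("ner")

# Merge token-level tags into entity chunks
ner_converter = NerConverterInternal()\
        .setInputCols(["sentence", "token", "ner"])\
        .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        embeddings,
        clinical_ner,
        ner_converter])

text = '''
عالج الدكتور محمد المريض أحمد البالغ من العمر 55 سنة  في 15/05/2000  في مستشفى مدينة الرباط. رقم هاتفه هو  0610948235 وبريده الإلكتروني
mohamedmell@gmail.com.
 '''

data = spark.createDataFrame([[text]]).toDF("text")

results = nlpPipeline.fit(data).transform(data)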

// Scala
import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
// Licensed Healthcare NLP classes; package path assumed from the Healthcare NLP Scala API
import com.johnsnowlabs.nlp.annotators.ner.{MedicalNerModel, NerConverterInternal}
// Needed for .toDS; assumes `spark` is an active SparkSession
import spark.implicits._

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained("arabic_w2v_cc_300d", "ar")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity", "ar", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    embeddings,
    clinical_ner,
    ner_converter))

val text = """
عالج الدكتور محمد المريض أحمد البالغ من العمر 55 سنة  في 15/05/2000  في مستشفى مدينة الرباط. رقم هاتفه هو  0610948235 وبريده الإلكتروني
mohamedmell@gmail.com.
 """

val data = Seq(text).toDS.toDF("text")

val results = pipeline.fit(data).transform(data)

Results

+---------------------+---------+
|chunk                |ner_label|
+---------------------+---------+
|محمد                 |DOCTOR   |
|55 سنة               |AGE      |
|15/05/2000           |DATE     |
|الرباط               |CITY     |
|0610948235           |PHONE    |
|mohamedmell@gmail.com|EMAIL    |
+---------------------+---------+
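
To illustrate what de-identifying these chunks could look like, here is a minimal sketch that replaces each detected chunk with a `<LABEL>` placeholder. It is plain Python, not the library's de-identification annotator. The helper name `mask_chunks` is hypothetical, and the plain string substitution is a simplifying assumption; a real pipeline would more likely use the character offsets stored in the annotations.

```python
from typing import List, Tuple


def mask_chunks(text: str, chunks: List[Tuple[str, str]]) -> str:
    """Replace each detected chunk with a <LABEL> placeholder."""
    # Longest chunks first, so a short chunk never clobbers part of a longer one.
    for chunk, label in sorted(chunks, key=lambda c: len(c[0]), reverse=True):
        text = text.replace(chunk, f"<{label}>")
    return text


text = (
    "عالج الدكتور محمد المريض أحمد البالغ من العمر 55 سنة في 15/05/2000 "
    "في مستشفى مدينة الرباط. رقم هاتفه هو 0610948235 وبريده الإلكتروني "
    "mohamedmell@gmail.com."
)

# (chunk, ner_label) pairs copied from the results table above.
chunks = [
    ("محمد", "DOCTOR"),
    ("55 سنة", "AGE"),
    ("15/05/2000", "DATE"),
    ("الرباط", "CITY"),
    ("0610948235", "PHONE"),
    ("mohamedmell@gmail.com", "EMAIL"),
]

masked = mask_chunks(text, chunks)
print(masked)
```

Because the substitution is string-based, every occurrence of a chunk's text is masked, including occurrences the model did not tag. Offset-based replacement avoids that.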

Model Information

Model Name: ner_deid_subentity
Compatibility: Healthcare NLP 4.4.2+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: ar
Size: 15.0 MB

References

Custom John Snow Labs datasets

Data augmentation techniques

Benchmarking

        label     tp    fp    fn  total  precision  recall      f1
      PATIENT  196.0  26.0  32.0  228.0     0.8829  0.8596  0.8711
     HOSPITAL  193.0  41.0  37.0  230.0     0.8248  0.8391  0.8319
         DATE  877.0  14.0   8.0  885.0     0.9843   0.991  0.9876
 ORGANIZATION   41.0  11.0   6.0   47.0     0.7885  0.8723  0.8283
         CITY  260.0   8.0   5.0  265.0     0.9701  0.9811  0.9756
       STREET  103.0   3.0   0.0  103.0     0.9717     1.0  0.9856
     USERNAME    8.0   0.0   0.0    8.0        1.0     1.0     1.0
          SEX  300.0   9.0  69.0  369.0     0.9709   0.813   0.885
        IDNUM   13.0   1.0   0.0   13.0     0.9286     1.0   0.963
        EMAIL  112.0   5.0   0.0  112.0     0.9573     1.0  0.9782
          ZIP   80.0   4.0   0.0   80.0     0.9524     1.0  0.9756
MEDICALRECORD   17.0   1.0   0.0   17.0     0.9444     1.0  0.9714
   PROFESSION  303.0  27.0  32.0  335.0     0.9182  0.9045  0.9113
        PHONE   38.0   4.0   2.0   40.0     0.9048    0.95  0.9268
      COUNTRY  158.0  10.0   8.0  166.0     0.9405  0.9518  0.9461
       DOCTOR  440.0  23.0  34.0  474.0     0.9503  0.9283  0.9392
          AGE  610.0  18.0   7.0  617.0     0.9713  0.9887  0.9799
        macro      -     -     -      -          -       -  0.9386
        micro      -     -     -      -          -       -  0.9434
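
The per-label columns follow from the tp, fp, and fn counts, and the total column equals tp + fn. As a sketch of that arithmetic, the Python below recomputes precision, recall, and F1 for each label and averages the per-label F1 values, which matches the reported macro figure to within rounding. The names `counts` and `prf` are illustrative. The micro figure depends on how the evaluation tool aggregates and is not recomputed here.

```python
from typing import Dict, Tuple

# (tp, fp, fn) per label, copied from the benchmark table above.
counts: Dict[str, Tuple[int, int, int]] = {
    "PATIENT": (196, 26, 32),
    "HOSPITAL": (193, 41, 37),
    "DATE": (877, 14, 8),
    "ORGANIZATION": (41, 11, 6),
    "CITY": (260, 8, 5),
    "STREET": (103, 3, 0),
    "USERNAME": (8, 0, 0),
    "SEX": (300, 9, 69),
    "IDNUM": (13, 1, 0),
    "EMAIL": (112, 5, 0),
    "ZIP": (80, 4, 0),
    "MEDICALRECORD": (17, 1, 0),
    "PROFESSION": (303, 27, 32),
    "PHONE": (38, 4, 2),
    "COUNTRY": (158, 10, 8),
    "DOCTOR": (440, 23, 34),
    "AGE": (610, 18, 7),
}


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1


scores = {label: prf(*c) for label, c in counts.items()}

# Macro F1: unweighted mean of the per-label F1 values.
macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(scores)

for label, (p, r, f1) in scores.items():
    print(f"{label:>13}  {p:.4f}  {r:.4f}  {f1:.4f}")
print(f"{'macro':>13}  {macro_f1:.4f}")
```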