Description
Named Entity Recognition annotators allow a generic model to be trained using a deep learning architecture (character CNNs, BiLSTM, CRF, and word embeddings) inspired by a former state-of-the-art NER model: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.
Deidentification NER (Arabic) is a Named Entity Recognition model that annotates text to find protected health information (PHI) that may need to be de-identified. It detects 17 entities. The model was trained on a combination of custom datasets and several data augmentation mechanisms, and it uses CamelBERT Arabic embeddings.
Predicted Entities
PATIENT, HOSPITAL, DATE, ORGANIZATION, CITY, STREET, USERNAME, SEX, IDNUM, EMAIL, ZIP, MEDICALRECORD, PROFESSION, PHONE, COUNTRY, DOCTOR, AGE
How to use
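# Python
# Assumes an active Spark session named `spark` with Healthcare NLP loaded.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, BertEmbeddings
from sparknlp_jsl.annotator import MedicalNerModel, NerConverterInternal

# Wrap the raw text column in a document annotation
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Split documents into sentences with the multilingual ("xx") detector
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# CamelBERT embeddings, computed on the same sentences the NER model reads
embeddings = BertEmbeddings.pretrained("bert_embeddings_base_arabic_camel_msa", "ar")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity_camelbert", "ar", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

# Merge token-level tags into entity chunks
ner_converter = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        embeddings,
        clinical_ner,
        ner_converter
    ])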
text = '''
عالج الدكتور محمد المريض أحمد البالغ من العمر 55 سنة في 15/05/2000 في مستشفى مدينة الرباط. رقم هاتفه هو 0610948234 وبريده الإلكتروني
abcd@gmail.com.
'''
data = spark.createDataFrame([[text]]).toDF("text")
results = nlpPipeline.fit(data).transform(data)
// Scala
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("bert_embeddings_base_arabic_camel_msa", "ar")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")
  .setCaseSensitive(true)

// The third input column must match the embeddings output column name
val clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity_camelbert", "ar", "clinical/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

val ner_converter = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  clinical_ner,
  ner_converter))
val text = """
عالج الدكتور محمد المريض أحمد البالغ من العمر 55 سنة في 15/05/2000 في مستشفى مدينة الرباط. رقم هاتفه هو 0610948234 وبريده الإلكتروني
abcd@gmail.com.
"""
// toDS needs the session's implicits in scope
import spark.implicits._
val data = Seq(text).toDS.toDF("text")
val results = pipeline.fit(data).transform(data)
Results
+--------------+---------+
|chunk         |ner_label|
+--------------+---------+
|الدكتور محمد  |DOCTOR   |
|55 سنة        |AGE      |
|15/05/2000    |DATE     |
|الرباط        |CITY     |
|0610948234    |PHONE    |
|abcd@gmail.com|EMAIL    |
+--------------+---------+
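To illustrate what the detected chunks are for, the following is a minimal sketch that masks each chunk from the table above with a placeholder for its label. It is plain Python and not part of the Healthcare NLP API: the helper `mask_entities` and the hard-coded `detected` list are hypothetical, and simple string replacement stands in for the character offsets a real pipeline would use. In practice the library's own de-identification annotators would normally handle this step.

```python
from typing import List, Tuple


def mask_entities(text: str, chunks: List[Tuple[str, str]]) -> str:
    """Replace every occurrence of each chunk with a <LABEL> placeholder.

    Hypothetical helper for illustration only; it matches on the chunk
    string rather than on begin/end offsets.
    """
    # Replace longer chunks first so a short chunk cannot split a longer one.
    for chunk, label in sorted(chunks, key=lambda pair: len(pair[0]), reverse=True):
        text = text.replace(chunk, f"<{label}>")
    return text


# (chunk, ner_label) pairs copied from the results table above
detected = [
    ("الدكتور محمد", "DOCTOR"),
    ("55 سنة", "AGE"),
    ("15/05/2000", "DATE"),
    ("الرباط", "CITY"),
    ("0610948234", "PHONE"),
    ("abcd@gmail.com", "EMAIL"),
]

text = (
    "عالج الدكتور محمد المريض أحمد البالغ من العمر 55 سنة في 15/05/2000 "
    "في مستشفى مدينة الرباط. رقم هاتفه هو 0610948234 وبريده الإلكتروني abcd@gmail.com."
)

masked = mask_entities(text, detected)
print(masked)
```

Any span the model does not return as a chunk stays in the text unchanged, so the masked output should still be reviewed before it is treated as de-identified.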
Model Information
Model Name: ner_deid_subentity_camelbert
Compatibility: Healthcare NLP 5.1.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: ar
Size: 16.4 MB
Benchmarking
label Precision Recall F1
AGE 98.08 99.51 98.79
CITY 87.17 92.03 89.53
COUNTRY 90.32 84.85 87.50
DATE 98.07 97.30 97.68
DOCTOR 92.09 91.90 91.99
EMAIL 100.00 100.00 100.00
HOSPITAL 84.32 91.28 87.67
IDNUM 92.86 100.00 96.30
MEDICALRECORD 61.54 94.12 74.42
ORGANIZATION 81.58 68.89 74.70
PATIENT 75.00 83.41 78.98
PHONE 97.06 82.50 89.19
PROFESSION 89.72 85.97 87.80
SEX 98.71 81.33 89.18
STREET 100.00 96.77 98.36
USERNAME 100.00 100.00 100.00
ZIP 95.24 100.00 97.56
Macro - - 94.00
Micro - - 95.00
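As a reading aid for the per-label rows, here is a small sketch that recomputes F1 from the precision and recall columns, assuming the F1 column is the usual harmonic mean of the two. The function name `f1_score` is chosen for illustration, and only a few rows are copied from the table.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# label -> (precision, recall, reported F1), copied from the table above
rows = {
    "AGE": (98.08, 99.51, 98.79),
    "MEDICALRECORD": (61.54, 94.12, 74.42),
    "PHONE": (97.06, 82.50, 89.19),
    "EMAIL": (100.00, 100.00, 100.00),
}

for label, (p, r, reported) in rows.items():
    print(f"{label:<14} computed={f1_score(p, r):.2f} reported={reported:.2f}")
```

Because the harmonic mean is pulled toward the lower of the two values, a label such as MEDICALRECORD ends up with a modest F1 despite its high recall: its precision of 61.54 dominates the score.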