Description
Named Entity Recognition annotators allow a generic model to be trained using a deep learning architecture (character CNNs, BiLSTM, CRF, and word embeddings) inspired by a former state-of-the-art NER model: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.
Deidentification NER (Arabic) is a Named Entity Recognition model that annotates text to find protected health information (PHI) that may need to be de-identified. It detects 17 entities. The model was trained on a combination of custom datasets with several data augmentation mechanisms, and it uses AraBERT Arabic embeddings.
Predicted Entities
PATIENT, HOSPITAL, DATE, ORGANIZATION, CITY, STREET, USERNAME, SEX, IDNUM, EMAIL, ZIP, MEDICALRECORD, PROFESSION, PHONE, COUNTRY, DOCTOR, AGE
How to use
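# Python. Assumes `spark` is an active Spark session with Spark NLP and Healthcare NLP loaded.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, BertEmbeddings
from sparknlp_jsl.annotator import MedicalNerModel, NerConverterInternal

# Wrap the raw text column in a document annotation.
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Split documents into sentences with the multilingual sentence detector.
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Embeddings are computed per sentence so they align with the NER inputs.
embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_arabert", "ar")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_deid_subentity_arabert", "ar", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

# Merge token-level tags into entity chunks.
ner_converter = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(
    stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        embeddings,
        clinical_ner,
        ner_converter
    ])

text = '''
عالج الدكتور محمد المريض أحمد البالغ من العمر 55 سنة في 15/05/2000 في مستشفى مدينة الرباط. رقم هاتفه هو 0610948234 وبريده الإلكتروني
abcd@gmail.com.
'''

data = spark.createDataFrame([[text]]).toDF("text")
results = nlpPipeline.fit(data).transform(data)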
// Scala. Assumes `spark` is an active SparkSession with Spark NLP and Healthcare NLP on the classpath.
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
// Licensed Healthcare NLP classes.
import com.johnsnowlabs.nlp.annotators.ner.{MedicalNerModel, NerConverterInternal}
import org.apache.spark.ml.Pipeline
import spark.implicits._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel
  .pretrained("sentence_detector_dl", "xx")
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

// Embeddings are computed per sentence so they align with the NER inputs.
val embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_base_arabert", "ar")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")
  .setCaseSensitive(true)

val clinicalNer = MedicalNerModel.pretrained("ner_deid_subentity_arabert", "ar", "clinical/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

// Merge token-level tags into entity chunks.
val nerConverter = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val nlpPipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  embeddings,
  clinicalNer,
  nerConverter
))

val text = """
عالج الدكتور محمد المريض أحمد البالغ من العمر 55 سنة في 15/05/2000 في مستشفى مدينة الرباط. رقم هاتفه هو 0610948234 وبريده الإلكتروني
abcd@gmail.com.
"""

val data = Seq(text).toDF("text")
val results = nlpPipeline.fit(data).transform(data)
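The last pipeline stage turns token-level tags into entity chunks. As a rough illustration of that idea only, and not the library's implementation, the pure-Python sketch below merges IOB-style tags (`B-`, `I-`, `O`) into chunks. The function name `merge_iob`, the tag scheme, and the sample tokens and tags are assumptions made for this example.

```python
from typing import List, Optional, Tuple


def merge_iob(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge token-level IOB tags into (chunk_text, label) pairs."""
    chunks: List[Tuple[str, str]] = []
    current: List[str] = []
    label: Optional[str] = None
    for token, tag in zip(tokens, tags):
        prefix, _, entity = tag.partition("-")
        if prefix == "B" or (prefix == "I" and entity != label):
            # A new entity starts: flush the one in progress, if any.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [token], entity
        elif prefix == "I":
            # Continuation of the current entity.
            current.append(token)
        else:
            # "O" (outside) closes any open entity.
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
    if current:
        chunks.append((" ".join(current), label))
    return chunks


# Hypothetical tokens and tags, for illustration only (not model output).
tokens = ["55", "سنة", "في", "15/05/2000"]
tags = ["B-AGE", "I-AGE", "O", "B-DATE"]
chunks = merge_iob(tokens, tags)
print(chunks)
```

An `I-` tag whose label differs from the open entity is treated as the start of a new chunk, which is one common way to handle inconsistent tag sequences.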
Results
+-------------------+---------+
|chunk              |ner_label|
+-------------------+---------+
|الدكتور محمد المريض|DOCTOR   |
|55 سنة             |AGE      |
|15/05/2000         |DATE     |
|مستشفى مدينة الرباط|HOSPITAL |
|abcd@gmail.com     |EMAIL    |
+-------------------+---------+
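Once chunks like those in the table above are available, a typical next step is to mask them in the text. The sketch below is a minimal pure-Python illustration of that step, not the library's de-identification annotator. The helper `mask_entities`, the `<LABEL>` placeholder format, the exclusive `end` offset, and the English sample sentence are all assumptions made for this example.

```python
from typing import List, Tuple


def mask_entities(text: str, spans: List[Tuple[int, int, str]]) -> str:
    """Replace each (begin, end, label) span with a <LABEL> placeholder.

    `end` is exclusive here, which is an assumption of this sketch.
    """
    # Work right to left so earlier offsets stay valid after each replacement.
    for begin, end, label in sorted(spans, reverse=True):
        text = text[:begin] + f"<{label}>" + text[end:]
    return text


# Hypothetical sample: offsets are computed here rather than taken from the model.
sample = "Seen on 15/05/2000, email abcd@gmail.com"
spans = []
for value, label in [("15/05/2000", "DATE"), ("abcd@gmail.com", "EMAIL")]:
    start = sample.index(value)
    spans.append((start, start + len(value), label))

masked = mask_entities(sample, spans)
print(masked)  # Seen on <DATE>, email <EMAIL>
```

If the offsets come from annotations that use an inclusive end index, add 1 to `end` before calling the helper.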
Model Information
Model Name: ner_deid_subentity_arabert
Compatibility: Healthcare NLP 5.1.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: ar
Size: 16.3 MB
Benchmarking
label          precision  recall  f1
AGE            97.86      97.24   97.56
CITY           90.68      85.26   87.89
COUNTRY        87.58      85.45   86.50
DATE           97.75      97.75   97.75
DOCTOR         88.04      91.04   89.52
EMAIL          100.0      99.11   99.55
HOSPITAL       79.56      82.11   80.81
IDNUM          92.86      100.0   96.30
MEDICALRECORD  83.33      88.24   85.71
ORGANIZATION   63.41      57.78   60.47
PATIENT        90.0       68.61   77.86
PHONE          90.24      92.5    91.36
PROFESSION     85.90      80.0    82.84
SEX            97.76      81.6    88.95
STREET         96.88      100.0   98.41
USERNAME       100.0      100.0   100.0
ZIP            91.95      100.0   95.81
Macro          -          -       91.78
Micro          -          -       94.19
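The per-label f1 values above can be related to precision and recall as follows, assuming the f1 column is the usual harmonic mean of the two. The helper name `f1_score` is chosen for this sketch. Because the published precision and recall are already rounded, recomputed values can differ from the table in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall for three labels, copied from the table above.
rows = {
    "CITY": (90.68, 85.26),
    "PATIENT": (90.0, 68.61),
    "ORGANIZATION": (63.41, 57.78),
}
for label, (p, r) in rows.items():
    print(label, round(f1_score(p, r), 2))
```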