Description
This Named Entity Recognition (NER) annotator is trained using the ai4privacy/pii-masking-400k
dataset. It leverages a deep learning architecture (Char CNNs - BiLSTM - CRF - word embeddings), inspired by the state-of-the-art model from Chiu & Nichols in their work “Named Entity Recognition with Bidirectional LSTM-CNN”. This model is particularly effective in identifying and labeling various entities, making it useful for detecting protected health information (PHI) that may need to be masked or de-identified.
Predicted Entities
LICENSE
, SSN
, ZIP
, NAME
, PHONE
, CITY
, EMAIL
, DATE
, IDNUM
, STREET
, ACCOUNT
How to use
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "en")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
clinical_embeddings = WordEmbeddingsModel.pretrained('embeddings_clinical', "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
ner_model = MedicalNerModel.pretrained('ner_deid_aipii', "en", "clinical/models")\
.setInputCols(["sentence", "token","embeddings"])\
.setOutputCol("ner")
ner_converter = NerConverterInternal()\
.setInputCols(['sentence', 'token', 'ner'])\
.setOutputCol('ner_chunk')
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
clinical_embeddings,
ner_model,
ner_converter
])
sample_texts = ["""
Ora Hendrickson, is 50 years old, Patient's ID no: 3454362A, SSN: 333-44-6666,
Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: ora@gmail.com.
"""]
data = spark.createDataFrame(sample_texts, StringType()).toDF("text")
result = pipeline.fit(data).transform(data)
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl","en","clinical/models")
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val clinical_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val ner_model = MedicalNerModel.pretrained("ner_deid_aipii", "en", "clinical/models")
.setInputCols(Array("sentence", "token","embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverterInternal()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(
document_assembler,
sentenceDetector,
tokenizer,
clinical_embeddings,
ner_model,
ner_converter
))
val sample_texts = Seq("""Ora Hendrickson, is 50 years old, Patient's ID no: 3454362A, SSN: 333-44-6666,
Phone (302) 786-5227, 0295 Keats Street, San Francisco, E-MAIL: ora@gmail.com.""").toDF("text")
val result = pipeline.fit(sample_texts).transform(sample_texts)
Results
+-----------------+-----+---+---------+
|chunk |begin|end|ner_label|
+-----------------+-----+---+---------+
|Ora Hendrickson |2 |16 |NAME |
|3454362A |54 |61 |IDNUM |
|333-44-6666 |69 |79 |SSN |
|(302) 786-5227 |88 |101|PHONE |
|0295 Keats Street|104 |120|STREET |
|San Francisco |123 |135|CITY |
|ora@gmail.com |146 |158|EMAIL |
+-----------------+-----+---+---------+
Model Information
Model Name: | ner_deid_aipii |
Compatibility: | Healthcare NLP 5.4.1+ |
License: | Licensed |
Edition: | Official |
Input Labels: | [sentence, token, embeddings] |
Output Labels: | [ner] |
Language: | en |
Size: | 2.9 MB |
Benchmarking
label precision recall f1-score support
ACCOUNT 0.86 0.59 0.70 867
CITY 0.95 0.94 0.94 2735
DATE 0.91 0.66 0.77 1408
EMAIL 1.00 1.00 1.00 1469
IDNUM 0.87 0.87 0.87 2763
LICENSE 0.95 0.93 0.94 691
NAME 0.96 0.97 0.97 6071
PHONE 0.99 0.99 0.99 2182
SSN 0.83 0.90 0.86 914
STREET 0.93 0.91 0.92 2882
ZIP 0.91 0.98 0.94 1271
micro-avg 0.94 0.91 0.92 23253
macro-avg 0.92 0.89 0.90 23253
weighted-avg 0.93 0.91 0.92 23253