Description
The Named Entity Recognition (NER) annotator trains a generic NER model with a deep learning architecture (character-level CNNs, BiLSTM, CRF, and word embeddings) inspired by a former state-of-the-art NER model: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Deidentification NER is a named entity recognition model that annotates text to find protected health information (PHI) that may need to be de-identified. It detects 23 entities. The model is trained on a combination of the i2b2 train set and a re-augmented version of the i2b2 train set. It is a version of the ner_deid_subentity_augmented model augmented with the langtest library. The table below compares the langtest results before and after augmentation.
test_type | before fail_count | after fail_count | before pass_count | after pass_count | minimum pass_rate | before pass_rate | after pass_rate |
---|---|---|---|---|---|---|---|
add_typo | 306 | 256 | 17377 | 17416 | 95% | 98% | 99% |
lowercase | 910 | 336 | 15226 | 15800 | 95% | 94% | 98% |
swap_entities | 358 | 322 | 3688 | 3734 | 95% | 91% | 92% |
titlecase | 396 | 237 | 17014 | 17173 | 95% | 98% | 99% |
uppercase | 1096 | 500 | 16262 | 16858 | 95% | 94% | 97% |
weighted average | 3066 | 1651 | 69567 | 70981 | 95% | 95.78% | 97.73% |
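The pass rates in the table follow from the counts as pass_count / (pass_count + fail_count). The plain-Python sketch below recomputes them from the numbers in the table. The `pass_rate` helper and the `counts` dictionary are illustrative names and are not part of langtest.

```python
# Counts copied from the table above:
# (fail_before, fail_after, pass_before, pass_after)
counts = {
    "add_typo":      (306, 256, 17377, 17416),
    "lowercase":     (910, 336, 15226, 15800),
    "swap_entities": (358, 322, 3688, 3734),
    "titlecase":     (396, 237, 17014, 17173),
    "uppercase":     (1096, 500, 16262, 16858),
}

MINIMUM_PASS_RATE = 0.95  # threshold listed in the table


def pass_rate(fail_count: int, pass_count: int) -> float:
    """Fraction of test cases that passed."""
    return pass_count / (pass_count + fail_count)


for test_type, (fail_b, fail_a, pass_b, pass_a) in counts.items():
    before = pass_rate(fail_b, pass_b)
    after = pass_rate(fail_a, pass_a)
    status = "meets" if after >= MINIMUM_PASS_RATE else "below"
    print(f"{test_type:14s} before={before:.2%} after={after:.2%} ({status} minimum)")

# Pooling all test cases reproduces the "weighted average" row.
total_fail_before = sum(c[0] for c in counts.values())
total_fail_after = sum(c[1] for c in counts.values())
total_pass_before = sum(c[2] for c in counts.values())
total_pass_after = sum(c[3] for c in counts.values())
overall_before = pass_rate(total_fail_before, total_pass_before)
overall_after = pass_rate(total_fail_after, total_pass_after)
print(f"overall        before={overall_before:.2%} after={overall_after:.2%}")
```

With these counts, swap_entities stays below the 95% minimum after augmentation (about 92%), while the other four tests meet it.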
We followed the official annotation guideline (AG) of the 2014 i2b2 Deid challenge when annotating new datasets for this model. Details and explanations of the AG can be found at https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/
Predicted Entities
MEDICALRECORD, ORGANIZATION, DOCTOR, USERNAME, PROFESSION, HEALTHPLAN, URL, CITY, DATE, LOCATION-OTHER, STATE, PATIENT, DEVICE, COUNTRY, ZIP, PHONE, HOSPITAL, EMAIL, IDNUM, STREET, BIOID, FAX, AGE
How to use
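# Python
import pandas as pd
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, WordEmbeddingsModel, NerConverter
from sparknlp_jsl.annotator import MedicalNerModel

# Assumes an active Spark session named `spark` started with Healthcare NLP.
document_assembler = DocumentAssembler()\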
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
deid_ner = MedicalNerModel.pretrained("ner_deid_subentity_augmented_langtest", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk_subentity")
nlpPipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
deid_ner,
ner_converter])
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame(pd.DataFrame({"text": ["""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25-year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine."""]})))
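// Scala
import spark.implicits._ // needed for .toDS below; assumes an active session named `spark`
val document_assembler = new DocumentAssembler()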
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val deid_ner = MedicalNerModel.pretrained("ner_deid_subentity_augmented_langtest", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk_subentity")
val nlpPipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
deid_ner,
ner_converter))
val data = Seq("""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25-year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.""").toDS.toDF("text")
val result = nlpPipeline.fit(data).transform(data)
Results
+-----------------------------+-------------+
|chunk |ner_label |
+-----------------------------+-------------+
|2093-01-13 |DATE |
|David Hale |DOCTOR |
|Hendrickson, Ora |PATIENT |
|7194334 |MEDICALRECORD|
|01/13/93 |DATE |
|Oliveira |DOCTOR |
|25-year-old |AGE |
|1-11-2000 |DATE |
|Cocke County Baptist Hospital|HOSPITAL |
|0295 Keats Street |STREET |
|(302) 786-5227 |PHONE |
|Brothers Coal-Mine |ORGANIZATION |
+-----------------------------+-------------+
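The `NerConverter` stage turns token-level IOB tags (for example `B-DOCTOR`, `I-DOCTOR`, `O`) into chunks like the ones listed above. The plain-Python sketch below illustrates that merging idea on a hand-written example. The function `merge_iob_chunks`, the token list, and the tags are hypothetical; they do not reproduce the library's implementation or the model's actual token-level output.

```python
from typing import List, Optional, Tuple


def merge_iob_chunks(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge consecutive B-/I- tagged tokens into (chunk_text, label) pairs."""
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    for token, tag in zip(tokens, tags):
        # "B-DOCTOR" -> ("B", "DOCTOR"); "O" -> ("O", "")
        prefix, _, label = tag.partition("-")

        # An I- tag with the same label extends the open chunk.
        if prefix == "I" and label == current_label:
            current_tokens.append(token)
            continue

        # Anything else closes the open chunk, if there is one.
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_label))

        if prefix in ("B", "I"):
            current_tokens, current_label = [token], label
        else:  # "O": outside any entity
            current_tokens, current_label = [], None

    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks


# Hand-written tags for illustration only.
tokens = ["David", "Hale", ",", "M.D.", "Cocke", "County", "Baptist", "Hospital"]
tags = ["B-DOCTOR", "I-DOCTOR", "O", "O",
        "B-HOSPITAL", "I-HOSPITAL", "I-HOSPITAL", "I-HOSPITAL"]

chunks = merge_iob_chunks(tokens, tags)
print(chunks)
# [('David Hale', 'DOCTOR'), ('Cocke County Baptist Hospital', 'HOSPITAL')]
```

Joining tokens with a single space is a simplification: a chunk such as "Hendrickson, Ora" would come out as "Hendrickson , Ora" here, whereas a converter working from character offsets can recover the original text span.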
Model Information
Model Name: | ner_deid_subentity_augmented_langtest |
Compatibility: | Healthcare NLP 5.1.0+ |
License: | Licensed |
Edition: | Official |
Input Labels: | [sentence, token, embeddings] |
Output Labels: | [ner] |
Language: | en |
Size: | 14.8 MB |
References
The model was trained on a custom dataset built from the i2b2-PHI train set and a re-augmented version of the i2b2-PHI train set.
Benchmarking
label | precision | recall | f1-score | support |
---|---|---|---|---|
AGE | 0.96 | 0.93 | 0.94 | 377 |
BIOID | 0.00 | 0.00 | 0.00 | 1 |
CITY | 0.89 | 0.82 | 0.85 | 104 |
COUNTRY | 0.88 | 0.86 | 0.87 | 57 |
DATE | 0.98 | 0.99 | 0.99 | 2375 |
DEVICE | 0.83 | 0.71 | 0.77 | 7 |
DOCTOR | 0.96 | 0.91 | 0.93 | 918 |
FAX | 0.00 | 0.00 | 0.00 | 2 |
HOSPITAL | 0.90 | 0.91 | 0.90 | 410 |
IDNUM | 0.83 | 0.66 | 0.74 | 83 |
LOCATION-OTHER | 0.50 | 0.75 | 0.60 | 8 |
MEDICALRECORD | 0.91 | 0.97 | 0.94 | 164 |
ORGANIZATION | 0.78 | 0.63 | 0.70 | 57 |
PATIENT | 0.85 | 0.92 | 0.89 | 408 |
PHONE | 0.93 | 0.88 | 0.91 | 112 |
PROFESSION | 0.84 | 0.80 | 0.82 | 97 |
STATE | 0.94 | 0.92 | 0.93 | 73 |
STREET | 0.87 | 0.95 | 0.91 | 77 |
USERNAME | 0.96 | 0.90 | 0.93 | 52 |
ZIP | 0.96 | 1.00 | 0.98 | 45 |
micro-avg | 0.94 | 0.94 | 0.94 | 5427 |
macro-avg | 0.79 | 0.78 | 0.78 | 5427 |
weighted-avg | 0.94 | 0.94 | 0.94 | 5427 |
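The macro and weighted averages in the last rows can be recomputed from the per-label rows: the macro average is the unweighted mean over labels, while the weighted average weights each label by its support. The sketch below does this for the precision column only, using the rounded values from the table, so the results agree with the table only to two decimals. The variable names are illustrative.

```python
# (label, precision, support) copied from the benchmark table above.
rows = [
    ("AGE", 0.96, 377), ("BIOID", 0.00, 1), ("CITY", 0.89, 104),
    ("COUNTRY", 0.88, 57), ("DATE", 0.98, 2375), ("DEVICE", 0.83, 7),
    ("DOCTOR", 0.96, 918), ("FAX", 0.00, 2), ("HOSPITAL", 0.90, 410),
    ("IDNUM", 0.83, 83), ("LOCATION-OTHER", 0.50, 8),
    ("MEDICALRECORD", 0.91, 164), ("ORGANIZATION", 0.78, 57),
    ("PATIENT", 0.85, 408), ("PHONE", 0.93, 112), ("PROFESSION", 0.84, 97),
    ("STATE", 0.94, 73), ("STREET", 0.87, 77), ("USERNAME", 0.96, 52),
    ("ZIP", 0.96, 45),
]

total_support = sum(support for _, _, support in rows)

# Macro average: every label counts equally, regardless of support.
macro_precision = sum(precision for _, precision, _ in rows) / len(rows)

# Weighted average: each label contributes in proportion to its support.
weighted_precision = (
    sum(precision * support for _, precision, support in rows) / total_support
)

print(f"total support      = {total_support}")
print(f"macro precision    = {macro_precision:.2f}")
print(f"weighted precision = {weighted_precision:.2f}")
```

The two zero-scoring labels with very small support (BIOID and FAX) pull the macro average down to 0.79 while barely affecting the weighted average of 0.94.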