Description
The Named Entity Recognition annotator trains a generic NER model using a deep learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by a former state-of-the-art NER model: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs. Deidentification NER is a Named Entity Recognition model that annotates text to find protected health information (PHI) that may need to be de-identified. It detects 23 entities. This NER model is trained on a combination of the i2b2 train set and a re-augmented version of the i2b2 train set.
We followed the official annotation guidelines (AG) of the 2014 i2b2 Deid challenge while annotating new datasets for this model. Details on the nuances and explanations of the AG can be found here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4978170/
Predicted Entities
MEDICALRECORD, ORGANIZATION, DOCTOR, USERNAME, PROFESSION, HEALTHPLAN, URL, CITY, DATE, LOCATION-OTHER, STATE, PATIENT, DEVICE, COUNTRY, ZIP, PHONE, HOSPITAL, EMAIL, IDNUM, STREET, BIOID, FAX, AGE
How to use
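import pandas as pd
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, WordEmbeddingsModel, NerConverter
from sparknlp_jsl.annotator import MedicalNerModel

# Assumes an active Spark session named `spark` started with Healthcare NLP.
document_assembler = DocumentAssembler()\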
.setInputCol("text")\
.setOutputCol("document")
sentence_detector = SentenceDetector()\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("embeddings")
deid_ner = MedicalNerModel.pretrained("ner_deid_subentity_augmented", "en", "clinical/models") \
.setInputCols(["sentence", "token", "embeddings"]) \
.setOutputCol("ner")
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk_subentity")
nlpPipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
deid_ner,
ner_converter])
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
results = model.transform(spark.createDataFrame(pd.DataFrame({"text": ["""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25-year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine."""]})))
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols("sentence")
.setOutputCol("token")
val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("embeddings")
val deid_ner = MedicalNerModel.pretrained("ner_deid_subentity_augmented", "en", "clinical/models")
.setInputCols(Array("sentence", "token", "embeddings"))
.setOutputCol("ner")
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk_subentity")
val nlpPipeline = new Pipeline().setStages(Array(
document_assembler,
sentence_detector,
tokenizer,
word_embeddings,
deid_ner,
ner_converter))
val data = Seq("""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25-year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.""").toDS.toDF("text")
val result = nlpPipeline.fit(data).transform(data)
import nlu
nlu.load("en.med_ner.deid.subentity_augmented").predict("""A. Record date : 2093-01-13, David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25-year-old, Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227. Patient's complaints first surfaced when he started working for Brothers Coal-Mine.""")
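In the pipelines above, the NER stage emits one tag per token and the converter stage groups those tags into entity chunks. The following is a minimal pure-Python sketch of that grouping idea, assuming IOB2-style tags (`B-`, `I-`, `O`). The `merge_iob` helper and the sample tokenization are hypothetical illustrations, not the library's implementation, and joining tokens with spaces is a simplification.

```python
def merge_iob(tokens, tags):
    """Group IOB2-tagged tokens into (chunk_text, label) pairs."""
    chunks = []
    current_tokens, current_label = [], None

    def flush():
        # Close the chunk being built, if any.
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        # partition splits on the first "-" only, so labels such as
        # LOCATION-OTHER keep their internal hyphen.
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            flush()
            current_tokens, current_label = [token], label
        elif prefix == "I":
            current_tokens.append(token)
        else:  # "O": outside any entity
            flush()
            current_tokens, current_label = [], None
    flush()
    return chunks


# Hypothetical tokens and tags, for illustration only.
tokens = ["David", "Hale", ",", "M.D.", "works", "for", "Brothers", "Coal-Mine"]
tags = ["B-DOCTOR", "I-DOCTOR", "O", "O", "O", "O",
        "B-ORGANIZATION", "I-ORGANIZATION"]
chunks = merge_iob(tokens, tags)
print(chunks)
# [('David Hale', 'DOCTOR'), ('Brothers Coal-Mine', 'ORGANIZATION')]
```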
Results
+-----------------------------+-------------+
|chunk |ner_label |
+-----------------------------+-------------+
|2093-01-13 |DATE |
|David Hale |DOCTOR |
|Hendrickson, Ora |PATIENT |
|7194334 |MEDICALRECORD|
|01/13/93 |DATE |
|Oliveira |DOCTOR |
|25-year-old |AGE |
|1-11-2000 |DATE |
|Cocke County Baptist Hospital|HOSPITAL |
|0295 Keats Street. |STREET |
|(302) 786-5227 |PHONE |
|Brothers Coal-Mine |ORGANIZATION |
+-----------------------------+-------------+
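Once the chunks are identified, a downstream step can mask them in the text. Here is a minimal sketch in plain Python, independent of Spark NLP, assuming the chunks are available as (chunk, label) pairs in document order, as in the table above. `mask_chunks` is a hypothetical helper, not part of the library.

```python
def mask_chunks(text, chunks):
    """Replace each (chunk, label) pair, in document order, with <LABEL>."""
    pieces = []
    cursor = 0
    for chunk, label in chunks:
        start = text.find(chunk, cursor)
        if start == -1:
            continue  # chunk not found after the cursor; skip it
        pieces.append(text[cursor:start])
        pieces.append(f"<{label}>")
        cursor = start + len(chunk)
    pieces.append(text[cursor:])
    return "".join(pieces)


# Opening of the example sentence, with the first four rows of the table.
text = ("Record date : 2093-01-13, David Hale, M.D., "
        "Name : Hendrickson, Ora MR. # 7194334")
chunks = [
    ("2093-01-13", "DATE"),
    ("David Hale", "DOCTOR"),
    ("Hendrickson, Ora", "PATIENT"),
    ("7194334", "MEDICALRECORD"),
]
masked = mask_chunks(text, chunks)
print(masked)
# Record date : <DATE>, <DOCTOR>, M.D., Name : <PATIENT> MR. # <MEDICALRECORD>
```

Searching from a moving cursor keeps repeated strings (for example, two identical dates) from being matched out of order. A production pipeline would more likely rely on the character offsets carried by each chunk annotation.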
Model Information
Model Name: ner_deid_subentity_augmented
Compatibility: Healthcare NLP 3.2.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Data Source
The model was trained on a custom dataset built from the i2b2-PHI train set and a re-augmented version of that train set.
Benchmarking
label tp fp fn total precision recall f1
PATIENT 1465.0 159.0 162.0 1627.0 0.9021 0.9004 0.9013
HOSPITAL 1417.0 120.0 167.0 1584.0 0.9219 0.8946 0.908
DATE 5513.0 57.0 129.0 5642.0 0.9898 0.9771 0.9834
ORGANIZATION 101.0 25.0 37.0 138.0 0.8016 0.7319 0.7652
CITY 277.0 47.0 64.0 341.0 0.8549 0.8123 0.8331
STREET 405.0 7.0 10.0 415.0 0.983 0.9759 0.9794
USERNAME 88.0 2.0 13.0 101.0 0.9778 0.8713 0.9215
DEVICE 10.0 0.0 0.0 10.0 1.0 1.0 1.0
IDNUM 168.0 27.0 42.0 210.0 0.8615 0.8 0.8296
STATE 172.0 15.0 33.0 205.0 0.9198 0.839 0.8776
ZIP 137.0 0.0 2.0 139.0 1.0 0.9856 0.9928
MEDICALRECORD 416.0 14.0 28.0 444.0 0.9674 0.9369 0.9519
OTHER 16.0 4.0 5.0 21.0 0.8 0.7619 0.7805
PROFESSION 261.0 22.0 75.0 336.0 0.9223 0.7768 0.8433
PHONE 328.0 21.0 20.0 348.0 0.9398 0.9425 0.9412
COUNTRY 97.0 15.0 31.0 128.0 0.8661 0.7578 0.8083
DOCTOR 3279.0 139.0 268.0 3547.0 0.9593 0.9244 0.9416
AGE 715.0 39.0 47.0 762.0 0.9483 0.9383 0.9433
macro - - - - - - 0.7715
micro - - - - - - 0.9406
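The per-label precision, recall, and F1 values above follow from the tp, fp, and fn counts, and the total column matches tp + fn in each row. Here is a minimal sketch that recomputes a few rows using the standard definitions. The `prf` helper is hypothetical, and the counts are copied from the table.

```python
def prf(tp, fp, fn):
    """Return (precision, recall, f1), each rounded to 4 decimals."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return round(precision, 4), round(recall, 4), round(f1, 4)


# (tp, fp, fn) counts copied from the benchmark table.
rows = {
    "PATIENT": (1465, 159, 162),
    "ORGANIZATION": (101, 25, 37),
    "ZIP": (137, 0, 2),
}
scores = {label: prf(*counts) for label, counts in rows.items()}
for label, (p, r, f1) in scores.items():
    print(label, p, r, f1)
# PATIENT 0.9021 0.9004 0.9013
# ORGANIZATION 0.8016 0.7319 0.7652
# ZIP 1.0 0.9856 0.9928
```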