Description
The Named Entity Recognition (NER) annotator works at the document level, allowing it to identify and annotate entities throughout an entire document. It leverages a deep learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by the former state-of-the-art model for NER developed by Chiu & Nichols: “Named Entity Recognition with Bidirectional LSTM-CNN”. This NER model is particularly useful for detecting protected health information (PHI) that may need to be de-identified. It can recognize and annotate specific entities: MEDICALRECORD
, ORGANIZATION
, PROFESSION
, HEALTHPLAN
, DOCTOR
, USERNAME
, URL
, DEVICE
, CITY
, DATE
, ZIP
, PATIENT
, COUNTRY
, AGE
, FAX
, HOSPITAL
, BIOID
, IDNUM
, STREET
, EMAIL
.
Predicted Entities
MEDICALRECORD
, ORGANIZATION
, PROFESSION
, HEALTHPLAN
, DOCTOR
, USERNAME
, URL
, DEVICE
, CITY
, DATE
, ZIP
, PATIENT
, COUNTRY
, AGE
, FAX
, HOSPITAL
, BIOID
, IDNUM
, STREET
, EMAIL
How to use
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = Tokenizer()\
.setInputCols(["document"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical") \
.setInputCols(["document", "token"])\
.setOutputCol("embeddings")
ner_deid = MedicalNerModel.pretrained("ner_deid_subentity_docwise", "en", "clinical/models") \
.setInputCols(["document", "token", "embeddings"]) \
.setOutputCol("ner_deid_subentity_docwise")
ner_deid_converter = NerConverterInternal()\
.setInputCols(["document", "token", "ner_deid_subentity_docwise"])\
.setOutputCol("ner_chunk_subentity_docwise")
nlpPipeline = Pipeline(stages=[
documentAssembler,
tokenizer,
word_embeddings,
ner_deid,
ner_deid_converter,
])
text = '''Emily Davis, a 34-year-old woman, Dr. Michael Johnson cares wit her, at CarePlus Clinic, located at 456 Elm Street, NewYork, NY has recommended starting insulin therapy. She has an appointment scheduled for March 15, 2024.'''
data = spark.createDataFrame([[text]]).toDF("text")
result = nlpPipeline.fit(data).transform(data)
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val wordEmbeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val nerDeid = MedicalNerModel.pretrained("ner_deid_subentity_docwise", "en", "clinical/models")
.setInputCols(Array("document", "token", "embeddings"))
.setOutputCol("ner_deid_subentity_docwise")
val nerDeidConverter = new NerConverterInternal()
.setInputCols(Array("document", "token", "ner_deid_subentity_docwise"))
.setOutputCol("ner_chunk_subentity_docwise")
val nlpPipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
wordEmbeddings,
nerDeid,
nerDeidConverter
))
val text = Seq("Emily Davis, a 34-year-old woman, Dr. Michael Johnson cares with her, at CarePlus Clinic, located at 456 Elm Street, New York, NY has recommended starting insulin therapy. She has an appointment scheduled for March 15, 2024.").toDF("text")
val data = Seq((text)).toDF("text")
val result = nlpPipeline.fit(data).transform(data)
Results
+---------------+-----+---+---------+
|chunk |begin|end|ner_label|
+---------------+-----+---+---------+
|Emily Davis |0 |10 |PATIENT |
|34-year-old |15 |25 |AGE |
|Michael Johnson|38 |52 |DOCTOR |
|CarePlus Clinic|72 |86 |HOSPITAL |
|456 Elm Street |100 |113|STREET |
|NewYork |116 |122|CITY |
|NY |125 |126|STATE |
|March 15, 2024 |207 |220|DATE |
+---------------+-----+---+---------+
Model Information
Model Name: | ner_deid_subentity_docwise |
Compatibility: | Healthcare NLP 5.4.1+ |
License: | Licensed |
Edition: | Official |
Input Labels: | [sentence, token, embeddings] |
Output Labels: | [ner] |
Language: | en |
Size: | 9.3 MB |
Benchmarking
label tp fp fn total precision recall f1
PATIENT 1216.0 60.0 64.0 1280.0 0.9530 0.9500 0.9515
HOSPITAL 1030.0 68.0 125.0 1155.0 0.9381 0.8918 0.9143
DATE 3859.0 63.0 89.0 3948.0 0.9839 0.9775 0.9807
ORGANIZATION 60.0 41.0 27.0 87.0 0.5941 0.6897 0.6383
CITY 226.0 20.0 30.0 256.0 0.9187 0.8828 0.9004
STREET 305.0 6.0 2.0 307.0 0.9807 0.9935 0.9871
USERNAME 57.0 10.0 11.0 68.0 0.8507 0.8382 0.8444
DEVICE 5.0 0.0 3.0 8.0 1.0000 0.6250 0.7692
FAX 0.0 0.0 4.0 4.0 0.0000 0.0000 0.0000
IDNUM 128.0 12.0 35.0 163.0 0.9143 0.7853 0.8449
STATE 156.0 19.0 16.0 172.0 0.8914 0.9070 0.8991
EMAIL 0.0 0.0 1.0 1.0 0.0000 0.0000 0.0000
ZIP 102.0 1.0 0.0 102.0 0.9903 1.0000 0.9951
MEDICALRECORD 310.0 6.0 16.0 326.0 0.9810 0.9509 0.9657
OTHER 4.0 9.0 0.0 4.0 0.3077 1.0000 0.4706
PROFESSION 234.0 20.0 64.0 298.0 0.9213 0.7852 0.8478
PHONE 253.0 10.0 7.0 260.0 0.9620 0.9731 0.9675
COUNTRY 35.0 3.0 18.0 53.0 0.9211 0.6604 0.7692
DOCTOR 2566.0 25.0 211.0 2777.0 0.9904 0.9240 0.9560
AGE 498.0 23.0 26.0 524.0 0.9559 0.9504 0.9531
MACRO - - - - - - 0.7826
MICRO - - - - - - 0.9504