Description
The Named Entity Recognition (NER) annotator works at the document level, allowing it to identify and annotate entities throughout an entire document. It leverages a deep learning architecture (Char CNNs - BiLSTM - CRF - word embeddings) inspired by the former state-of-the-art model for NER developed by Chiu & Nichols: “Named Entity Recognition with Bidirectional LSTM-CNN”. This NER model is particularly useful for detecting protected health information (PHI) that may need to be de-identified. It can recognize and annotate 7 specific entities: DATE
, NAME
, LOCATION
, PROFESSION
, CONTACT
, AGE
, and ID
.
Predicted Entities
DATE
, NAME
, LOCATION
, PROFESSION
, CONTACT
, AGE
, ID
How to use
documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")
tokenizer = Tokenizer()\
.setInputCols(["document"])\
.setOutputCol("token")
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
.setInputCols(["document", "token"])\
.setOutputCol("embeddings")
ner_deid_generic = MedicalNerModel.pretrained("ner_deid_generic_docwise", "en", "clinical/models") \
.setInputCols(["document", "token", "embeddings"]) \
.setOutputCol("ner_deid_generic_docwise")
ner_deid_generic_converter = NerConverterInternal()\
.setInputCols(["document", "token", "ner_deid_generic_docwise"])\
.setOutputCol("ner_chunk_generic_docwise")
nlpPipeline = Pipeline(stages=[
documentAssembler,
tokenizer,
word_embeddings,
ner_deid_generic,
ner_deid_generic_converter,
])
text = '''Dr. John Taylor, ID 982345, a cardiologist at St. Mary's Hospital in Boston, was contacted on 05/10/2023 regarding a 45-year-old male patient.'''
data = spark.createDataFrame([[text]]).toDF("text")
result = nlpPipeline.fit(data).transform(data)
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")
val wordEmbeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
.setInputCols(Array("document", "token"))
.setOutputCol("embeddings")
val nerDeidGeneric = MedicalNerModel.pretrained("ner_deid_generic_docwise", "en", "clinical/models")
.setInputCols(Array("document", "token", "embeddings"))
.setOutputCol("ner_deid_generic_docwise")
val nerDeidGenericConverter = new NerConverterInternal()
.setInputCols(Array("document", "token", "ner_deid_generic_docwise"))
.setOutputCol("ner_chunk_generic_docwise")
val nlpPipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
wordEmbeddings,
nerDeidGeneric,
nerDeidGenericConverter
))
val text = Seq("""Dr. John Taylor, ID 982345, a cardiologist at St. Mary's Hospital in Boston, was contacted on 05/10/2023 regarding a 45-year-old male patient.""").toDF("text")
val data = Seq((text)).toDF("text")
val result = nlpPipeline.fit(data).transform(data)
Results
+-------------------+-----+---+----------+
|chunk |begin|end|ner_label |
+-------------------+-----+---+----------+
|John Taylor |5 |15 |NAME |
|982345 |21 |26 |CONTACT |
|cardiologist |31 |42 |PROFESSION|
|St. Mary's Hospital|47 |65 |LOCATION |
|Boston |70 |75 |LOCATION |
|05/10/2023 |95 |104|DATE |
|45-year-old |118 |128|AGE |
+-------------------+-----+---+----------+
Model Information
Model Name: | ner_deid_generic_docwise |
Compatibility: | Healthcare NLP 5.4.1+ |
License: | Licensed |
Edition: | Official |
Input Labels: | [sentence, token, embeddings] |
Output Labels: | [ner] |
Language: | en |
Size: | 9.3 MB |
Benchmarking
label tp fp fn total precision recall f1
CONTACT 254.0 7.0 13.0 267.0 0.9732 0.9513 0.9621
NAME 3616.0 92.0 200.0 3816.0 0.9752 0.9476 0.9612
DATE 3844.0 55.0 93.0 3937.0 0.9859 0.9764 0.9811
ID 521.0 67.0 328.0 849.0 0.8861 0.6137 0.7251
LOCATION 1918.0 126.0 224.0 2142.0 0.9384 0.8954 0.9164
PROFESSION 272.0 21.0 241.0 513.0 0.9283 0.5302 0.6749
AGE 470.0 25.0 29.0 499.0 0.9495 0.9419 0.9457
MACRO - - - - - - 0.8809
MICRO - - - - - - 0.9302