Description
The Named Entity Recognition annotator trains a generic NER model with a deep learning architecture (character CNNs, BiLSTM, CRF, word embeddings) inspired by a former state-of-the-art NER model: Chiu & Nichols, "Named Entity Recognition with Bidirectional LSTM-CNNs". Deidentification NER is a Named Entity Recognition model that annotates German text to find protected health information (PHI) that may need to be deidentified. It was trained with in-house annotations and detects 12 entities.
Predicted Entities
PATIENT, HOSPITAL, DATE, ORGANIZATION, CITY, STREET, USERNAME, PROFESSION, PHONE, COUNTRY, DOCTOR, AGE
How to use
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, WordEmbeddingsModel
from sparknlp_jsl.annotator import MedicalNerModel, NerConverterInternal

# Assumes an active Spark session named `spark` with Healthcare NLP available.

# Wrap the raw text column into documents.
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Split documents into sentences.
sentence_detector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Split sentences into tokens.
tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# German word embeddings consumed by the NER model.
word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "de", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

# Token-level PHI tagger.
deid_ner = MedicalNerModel.pretrained("ner_deid_subentity_langtest", "de", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

# Merge token-level tags into entity chunks.
ner_converter = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_deid_subentity_chunk")

nlpPipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    deid_ner,
    ner_converter])

data = spark.createDataFrame([["""Am Morgen des 12. Dezember 2018 wird der Patient Michael Berger ins St. Elisabeth-Krankenhaus, einer Klinik in der Stadt Bad Kissingen, eingeliefer. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen."""]]).toDF("text")

result = nlpPipeline.fit(data).transform(data)
// Scala version of the same pipeline.
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
// Package path for the licensed annotators is an assumption based on the Healthcare NLP Scala API.
import com.johnsnowlabs.nlp.annotators.ner.{MedicalNerModel, NerConverterInternal}
import org.apache.spark.ml.Pipeline
import spark.implicits._ // needed for .toDS; assumes a SparkSession named `spark`

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence_detector = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "de", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")

val deid_ner = MedicalNerModel.pretrained("ner_deid_subentity_langtest", "de", "clinical/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")

val ner_converter = new NerConverterInternal()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_deid_subentity_chunk")

val nlpPipeline = new Pipeline().setStages(Array(
  document_assembler,
  sentence_detector,
  tokenizer,
  word_embeddings,
  deid_ner,
  ner_converter))

val data = Seq("""Am Morgen des 12. Dezember 2018 wird der Patient Michael Berger ins St. Elisabeth-Krankenhaus, einer Klinik in der Stadt Bad Kissingen, eingeliefer. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen.""").toDS.toDF("text")

val result = nlpPipeline.fit(data).transform(data)
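The final pipeline stage, `NerConverterInternal`, turns token-level NER tags into entity chunks. As a rough illustration of that idea only (this is not the library's implementation), the plain-Python sketch below groups tagged tokens into chunks, assuming the tags follow the common IOB scheme (`B-`, `I-`, `O`). The function name `iob_to_chunks` and the sample tags are hypothetical.

```python
from typing import List, Optional, Tuple


def iob_to_chunks(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (chunk_text, label) pairs.

    Hypothetical helper for illustration; tokens in a chunk are joined
    with single spaces, which ignores the original character offsets.
    """
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    def flush() -> None:
        # Close the chunk that is currently open, if any.
        nonlocal current_tokens, current_label
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_label))
        current_tokens, current_label = [], None

    for token, tag in zip(tokens, tags):
        if tag == "O":
            flush()
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continuation of the open chunk.
            current_tokens.append(token)
        else:
            # A B- tag, or an I- tag that does not continue the open chunk,
            # starts a new chunk.
            flush()
            current_tokens, current_label = [token], tag[2:]
    flush()
    return chunks


# Hypothetical tags for a fragment of the example sentence.
tokens = ["der", "Patient", "Michael", "Berger", "ist", "76", "Jahre"]
tags = ["O", "O", "B-PATIENT", "I-PATIENT", "O", "B-AGE", "O"]
print(iob_to_chunks(tokens, tags))
```

In the real pipeline the converter also carries sentence and offset metadata, which this sketch leaves out.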
Results
+---------------------+-----+---+---------+
|chunk                |begin|end|ner_label|
+---------------------+-----+---+---------+
|Dezember 2018        |18   |30 |DATE     |
|Michael Berger       |49   |62 |PATIENT  |
|Elisabeth-Krankenhaus|72   |92 |HOSPITAL |
|Bad Kissingen        |121  |133|CITY     |
|Berger               |154  |159|PATIENT  |
|76                   |165  |166|AGE      |
+---------------------+-----+---+---------+
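The `begin` and `end` columns are zero-based character offsets into the input text, with `end` inclusive. The sketch below checks this against the example sentence and then shows one possible way to use the spans: replacing each chunk with its label. The helper `mask_entities` is hypothetical and is not part of Healthcare NLP.

```python
# Input text copied verbatim from the example, since the offsets depend on it.
text = (
    "Am Morgen des 12. Dezember 2018 wird der Patient Michael Berger ins "
    "St. Elisabeth-Krankenhaus, einer Klinik in der Stadt Bad Kissingen, "
    "eingeliefer. Herr Berger ist 76 Jahre alt und hat zu viel Wasser in den Beinen."
)

# (chunk, begin, end, ner_label) rows from the results table.
chunks = [
    ("Dezember 2018", 18, 30, "DATE"),
    ("Michael Berger", 49, 62, "PATIENT"),
    ("Elisabeth-Krankenhaus", 72, 92, "HOSPITAL"),
    ("Bad Kissingen", 121, 133, "CITY"),
    ("Berger", 154, 159, "PATIENT"),
    ("76", 165, 166, "AGE"),
]


def mask_entities(text, chunks):
    """Replace each chunk span with <LABEL>; hypothetical helper."""
    parts = []
    cursor = 0
    for chunk, begin, end, label in sorted(chunks, key=lambda c: c[1]):
        # `end` is inclusive, so the Python slice needs end + 1.
        if text[begin:end + 1] != chunk:
            raise ValueError(f"offsets {begin}-{end} do not match {chunk!r}")
        parts.append(text[cursor:begin])
        parts.append(f"<{label}>")
        cursor = end + 1
    parts.append(text[cursor:])
    return "".join(parts)


masked = mask_entities(text, chunks)
print(masked)
```

Only the spans the model found are masked; in this example "12." and "St." remain in the text because they fall outside the detected chunks.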
Model Information
| Model Name:    | ner_deid_subentity_langtest   |
| Compatibility: | Healthcare NLP 5.5.0+         |
| License:       | Licensed                      |
| Edition:       | Official                      |
| Input Labels:  | [sentence, token, embeddings] |
| Output Labels: | [ner]                         |
| Language:      | de                            |
| Size:          | 3.0 MB                        |
Benchmarking
label          precision  recall  f1-score  support
AGE            0.96       0.97    0.96      476
CITY           0.87       0.78    0.82      200
COUNTRY        0.96       0.47    0.63      312
DATE           1.00       1.00    1.00      4049
DOCTOR         0.96       0.95    0.95      1453
HOSPITAL       0.95       0.94    0.94      1598
MEDICALRECORD  0.88       0.92    0.90      217
ORGANIZATION   0.82       0.73    0.77      1355
PATIENT        0.90       0.91    0.90      2154
PHONE          0.80       0.81    0.81      80
PROFESSION     0.98       0.70    0.82      262
STREET         0.85       0.96    0.90      128
USERNAME       0.91       0.93    0.92      45
micro-avg      0.94       0.91    0.93      12329
macro-avg      0.91       0.85    0.87      12329
weighted-avg   0.94       0.91    0.92      12329
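To make the table easier to read, the sketch below recomputes a few of its figures in plain Python: F1 as the harmonic mean of precision and recall, and the macro average as the unweighted mean over labels. It works from the rounded values printed above, so a recomputed per-label F1 may differ from the table in the last digit. The function names are illustrative.

```python
# (precision, recall, f1, support) per label, copied from the benchmark table.
rows = {
    "AGE": (0.96, 0.97, 0.96, 476),
    "CITY": (0.87, 0.78, 0.82, 200),
    "COUNTRY": (0.96, 0.47, 0.63, 312),
    "DATE": (1.00, 1.00, 1.00, 4049),
    "DOCTOR": (0.96, 0.95, 0.95, 1453),
    "HOSPITAL": (0.95, 0.94, 0.94, 1598),
    "MEDICALRECORD": (0.88, 0.92, 0.90, 217),
    "ORGANIZATION": (0.82, 0.73, 0.77, 1355),
    "PATIENT": (0.90, 0.91, 0.90, 2154),
    "PHONE": (0.80, 0.81, 0.81, 80),
    "PROFESSION": (0.98, 0.70, 0.82, 262),
    "STREET": (0.85, 0.96, 0.90, 128),
    "USERNAME": (0.91, 0.93, 0.92, 45),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_average(values):
    """Unweighted mean over labels."""
    values = list(values)
    return sum(values) / len(values)


# COUNTRY has high precision but low recall, which pulls its F1 down.
country_p, country_r, _, _ = rows["COUNTRY"]
country_f1 = f1_score(country_p, country_r)

macro_p = macro_average(p for p, _, _, _ in rows.values())
macro_r = macro_average(r for _, r, _, _ in rows.values())
macro_f1 = macro_average(f for _, _, f, _ in rows.values())
total_support = sum(s for _, _, _, s in rows.values())

print(round(country_f1, 2), round(macro_p, 2), round(macro_r, 2),
      round(macro_f1, 2), total_support)
```

The macro average treats every label equally, so low-recall labels such as COUNTRY and PROFESSION weigh on it more than on the micro and weighted averages, which are dominated by frequent labels such as DATE.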