Detect Symptoms, Treatments and Other Entities in German

Description

This model detects symptoms, treatments, and other entities in German medical text.

Predicted Entities

DIAGLAB_PROCEDURE, MEDICAL_SPECIFICATION, MEDICAL_DEVICE, MEASUREMENT, BIOLOGICAL_CHEMISTRY, BODY_FLUID, TIME_INFORMATION, LOCAL_SPECIFICATION, BIOLOGICAL_PARAMETER, PROCESS, MEDICATION, DOSING, DEGREE, MEDICAL_CONDITION, PERSON, TISSUE, STATE_OF_HEALTH, BODY_PART, TREATMENT

How to use

Use as part of an NLP pipeline with the following stages: DocumentAssembler, SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel. Add a NerConverter at the end of the pipeline to merge the token-level entity tags into full entity chunks.
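
To make the last step concrete, here is a minimal pure-Python sketch of what merging token-level tags into chunks means. It assumes IOB-style tags (B-X, I-X, O) and inclusive character offsets; the function name `iob_to_chunks`, the sample sentence, and its tags are hypothetical illustrations, not model output and not the NerConverter implementation.

```python
import re

def iob_to_chunks(text, token_spans, tags):
    """Merge IOB-tagged tokens into (chunk_text, begin, end, label) tuples.

    token_spans: list of (begin, end) inclusive character offsets into text.
    tags:        one IOB tag per token, e.g. "B-MEDICATION", "I-MEDICATION", "O".
    """
    spans = []      # finished (label, begin, end) triples
    current = None  # chunk under construction, or None
    for (begin, end), tag in zip(token_spans, tags):
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and current is not None and current[0] == label
        if continues:
            current = (label, current[1], end)  # extend the open chunk
            continue
        if current is not None:
            spans.append(current)               # close the previous chunk
        # A stray I- tag with no open chunk is treated leniently as a chunk start.
        current = (label, begin, end) if prefix in ("B", "I") else None
    if current is not None:
        spans.append(current)
    return [(text[b:e + 1], b, e, label) for label, b, e in spans]

# Hypothetical input: whitespace tokens and hand-written tags, for illustration only.
sample_text = "akute Bronchitis seit gestern"
sample_spans = [(m.start(), m.end() - 1) for m in re.finditer(r"\S+", sample_text)]
sample_tags = ["B-MEDICAL_CONDITION", "I-MEDICAL_CONDITION", "O", "B-TIME_INFORMATION"]

sample_chunks = iob_to_chunks(sample_text, sample_spans, sample_tags)
print(sample_chunks)
# [('akute Bronchitis', 0, 15, 'MEDICAL_CONDITION'), ('gestern', 22, 28, 'TIME_INFORMATION')]
```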

...
# Embeddings are computed on the same "sentence" column that the NER model consumes
word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "de", "clinical/models") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")
clinical_ner = NerDLModel.pretrained("ner_healthcare", "de", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")
...
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, clinical_ner, clinical_ner_converter])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
annotations = light_pipeline.fullAnnotate("Das Kleinzellige Bronchialkarzinom (Kleinzelliger Lungenkrebs, SCLC) ist ein hochmalignes bronchogenes Karzinom")

...
// Embeddings are computed on the same "sentence" column that the NER model consumes
val word_embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d", "de", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("embeddings")
val ner = NerDLModel.pretrained("ner_healthcare", "de", "clinical/models")
  .setInputCols(Array("sentence", "token", "embeddings"))
  .setOutputCol("ner")
...
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, word_embeddings, ner, clinical_ner_converter))
val data = Seq("Das Kleinzellige Bronchialkarzinom (Kleinzelliger Lungenkrebs, SCLC) ist ein hochmalignes bronchogenes Karzinom").toDF("text")
val result = pipeline.fit(data).transform(data)

import nlu
nlu.load("de.med_ner.healthcare").predict("""Das Kleinzellige Bronchialkarzinom (Kleinzelliger Lungenkrebs, SCLC) ist ein hochmalignes bronchogenes Karzinom""")
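
The list returned by `fullAnnotate` in the Python example can be flattened into rows like those in the Results table. The sketch below assumes the elided NerConverter writes to a column named "ner_chunk" (an assumption, since its configuration is not shown) and uses a stand-in `FakeAnnotation` type, hypothetical, so that it runs without a Spark session. It relies only on the `result`, `begin`, `end`, and `metadata` attributes of an annotation.

```python
from collections import namedtuple

# Stand-in exposing the attributes read below; hypothetical, for illustration
# only. With a real pipeline, pass the output of light_pipeline.fullAnnotate(...).
FakeAnnotation = namedtuple("FakeAnnotation", "result begin end metadata")

def chunks_to_rows(annotations, chunk_col="ner_chunk"):
    """Flatten fullAnnotate-style output into (chunk, begin, end, entity) rows.

    annotations: list with one dict per input text, keyed by output column.
    chunk_col:   name of the NerConverter output column (assumed "ner_chunk").
    """
    rows = []
    for doc in annotations:
        for ann in doc.get(chunk_col, []):
            rows.append((ann.result, ann.begin, ann.end, ann.metadata.get("entity")))
    return rows

# Two chunks copied from the Results table, wrapped in the stand-in type.
demo_annotations = [{
    "ner_chunk": [
        FakeAnnotation("Kleinzellige", 4, 15, {"entity": "MEDICAL_SPECIFICATION"}),
        FakeAnnotation("Bronchialkarzinom", 17, 33, {"entity": "MEDICAL_CONDITION"}),
    ]
}]

demo_rows = chunks_to_rows(demo_annotations)
for row in demo_rows:
    print(row)
```

If the converter uses a different output column, pass its name as `chunk_col`.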

Results

+----+-------------------+---------+-------+-----------------------+
|    | chunk             |   begin |   end | entity                |
+====+===================+=========+=======+=======================+
|  0 | Kleinzellige      |       4 |    15 | MEDICAL_SPECIFICATION |
+----+-------------------+---------+-------+-----------------------+
|  1 | Bronchialkarzinom |      17 |    33 | MEDICAL_CONDITION     |
+----+-------------------+---------+-------+-----------------------+
|  2 | Kleinzelliger     |      36 |    48 | MEDICAL_SPECIFICATION |
+----+-------------------+---------+-------+-----------------------+
|  3 | Lungenkrebs       |      50 |    60 | MEDICAL_CONDITION     |
+----+-------------------+---------+-------+-----------------------+
|  4 | SCLC              |      63 |    66 | MEDICAL_CONDITION     |
+----+-------------------+---------+-------+-----------------------+
|  5 | hochmalignes      |      77 |    88 | MEASUREMENT           |
+----+-------------------+---------+-------+-----------------------+
|  6 | bronchogenes      |      90 |   101 | BODY_PART             |
+----+-------------------+---------+-------+-----------------------+
|  7 | Karzinom          |     103 |   110 | MEDICAL_CONDITION     |
+----+-------------------+---------+-------+-----------------------+
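
In the table above, `begin` and `end` are 0-based character offsets into the input text, and `end` is inclusive. A short check, using only the example sentence and the rows from the table, illustrates how to slice a chunk back out of the text (the helper name `span_matches` is hypothetical):

```python
text = ("Das Kleinzellige Bronchialkarzinom (Kleinzelliger Lungenkrebs, SCLC) "
        "ist ein hochmalignes bronchogenes Karzinom")

# (chunk, begin, end, entity) rows copied from the Results table
result_rows = [
    ("Kleinzellige", 4, 15, "MEDICAL_SPECIFICATION"),
    ("Bronchialkarzinom", 17, 33, "MEDICAL_CONDITION"),
    ("Kleinzelliger", 36, 48, "MEDICAL_SPECIFICATION"),
    ("Lungenkrebs", 50, 60, "MEDICAL_CONDITION"),
    ("SCLC", 63, 66, "MEDICAL_CONDITION"),
    ("hochmalignes", 77, 88, "MEASUREMENT"),
    ("bronchogenes", 90, 101, "BODY_PART"),
    ("Karzinom", 103, 110, "MEDICAL_CONDITION"),
]

def span_matches(source, chunk, begin, end):
    """True if source[begin..end] (end inclusive) equals the chunk text."""
    return source[begin:end + 1] == chunk

all_spans_ok = all(span_matches(text, c, b, e) for c, b, e, _ in result_rows)
print(all_spans_ok)  # True
```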

Model Information

Model Name: ner_healthcare
Type: ner
Compatibility: Healthcare NLP 2.6.0+
Edition: Official
License: Licensed
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: [de]
Case sensitive: false
Dependencies: w2v_cc_300d

Data Source

Trained on an augmented version of the 2010 i2b2/VA dataset on concepts, assertions, and relations in clinical text, using w2v_cc_300d embeddings.

Benchmarking

|    | label               |     tp |    fp |   fn | precision|    recall|       f1 |
|---:|--------------------:|-------:|------:|-----:|---------:|---------:|---------:|
|  0 | BIOLOGICAL_PARAMETER|    103 |    52 |   57 | 0.6645   | 0.6438   |  0.654   |
|  1 | BODY_FLUID          |    166 |    16 |   24 | 0.9121   | 0.8737   | 0.8925   |
|  2 | PERSON              |    475 |    74 |  142 | 0.8652   | 0.7699   | 0.8148   |
|  3 | DOSING              |     38 |    14 |   31 | 0.7308   | 0.5507   | 0.6281   |
|  4 | DIAGLAB_PROCEDURE   |    236 |    58 |   68 | 0.8027   | 0.7763   | 0.7893   |
|  5 | BODY_PART           |    690 |    72 |   79 | 0.9055   | 0.8973   | 0.9014   |
|  6 | MEDICATION          |    391 |   117 |  167 | 0.7697   | 0.7007   | 0.7336   |
|  7 | STATE_OF_HEALTH     |    321 |    41 |   76 | 0.8867   | 0.8086   | 0.8458   |
|  8 | LOCAL_SPECIFICATION |     57 |    19 |   24 |   0.75   | 0.7037   | 0.7261   |
|  9 | MEASUREMENT         |    574 |   260 |  222 | 0.6882   | 0.7211   | 0.7043   |
| 10 | TREATMENT           |    476 |   131 |  135 | 0.7842   | 0.7791   | 0.7816   |
| 11 | MEDICAL_CONDITION   |   1741 |   442 |  271 | 0.7975   | 0.8653   |   0.83   |
| 12 | TIME_INFORMATION    |    651 |   126 |  161 | 0.8378   | 0.8017   | 0.8194   |
| 13 | BIOLOGICAL_CHEMISTRY|    192 |    55 |   60 | 0.7773   | 0.7619   | 0.7695   |
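
The precision, recall, and f1 columns follow from the tp, fp, and fn counts using the standard definitions. The sketch below recomputes them for three rows of the table; the function name `prf` is hypothetical, and rounding to four decimals is an assumption chosen to match the table's display.

```python
def prf(tp, fp, fn):
    """Return (precision, recall, f1) rounded to 4 decimals from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0  # equals 2PR / (P + R)
    return round(precision, 4), round(recall, 4), round(f1, 4)

# (tp, fp, fn) counts copied from the benchmark table
counts = {
    "BODY_FLUID": (166, 16, 24),
    "BODY_PART": (690, 72, 79),
    "DOSING": (38, 14, 31),
}

scores = {label: prf(*c) for label, c in counts.items()}
for label, (p, r, f) in scores.items():
    print(f"{label}: precision={p} recall={r} f1={f}")
# BODY_FLUID: precision=0.9121 recall=0.8737 f1=0.8925
# BODY_PART: precision=0.9055 recall=0.8973 f1=0.9014
# DOSING: precision=0.7308 recall=0.5507 f1=0.6281
```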