Detect Problems, Tests and Treatments (ner_clinical) in Norwegian

Description

Pretrained named entity recognition deep learning model for clinical terms in Norwegian. The SparkNLP deep learning model (MedicalNerModel) is inspired by a former state of the art model for NER: Chiu & Nicols, Named Entity Recognition with Bidirectional LSTM-CNN.

Predicted Entities

PROBLEM, TEST, TREATMENT

Live Demo Open in Colab Copy S3 URI

How to use

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","no") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = MedicalNerModel.pretrained("ner_clinical", "no", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    document_assembler, 
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter   
    ])

sample_text = """Natrium var 140, kalium 3,7 ,klorid 96, bikarbonat 30, BUN og kreatinin 14/0,9 , glukose105, hematokrit42, hvittblodtall 8,6 , blodplater 644, protrombintid 10,4 , delvis tromboplastintid 28,7 , urinanalyse spor av hvite blodceller, svake skjulte røde blodceller. Natrium 148, kalium 3.4, glukose 174, P02 102, PC02 115, PH 7.11 på 40% 02."""


data = spark.createDataFrame([[sample_text]]).toDF("text")

result = pipeline.fit(data).transform(data)
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","no")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner_model = MedicalNerModel.pretrained("ner_clinical", "no", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(
    document_assembler, 
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter   
))

sample_data = Seq("""Natrium var 140, kalium 3,7 ,klorid 96, bikarbonat 30, BUN og kreatinin 14/0,9 , glukose105, hematokrit42, hvittblodtall 8,6 , blodplater 644, protrombintid 10,4 , delvis tromboplastintid 28,7 , urinanalyse spor av hvite blodceller, svake skjulte røde blodceller. Natrium 148, kalium 3.4, glukose 174, P02 102, PC02 115, PH 7.11 på 40% 02.""").toDS.toDF("text")


val result = pipeline.fit(sample_data).transform(sample_data)

Results

+-----------------------------+-----+---+---------+
|chunk                        |begin|end|ner_label|
+-----------------------------+-----+---+---------+
|Natrium                      |0    |6  |TEST     |
|kalium                       |17   |22 |TEST     |
|klorid                       |29   |34 |TEST     |
|bikarbonat                   |40   |49 |TEST     |
|BUN                          |55   |57 |TEST     |
|kreatinin                    |62   |70 |TEST     |
|glukose105                   |81   |90 |TEST     |
|hematokrit42                 |93   |104|TEST     |
|hvittblodtall                |107  |119|TEST     |
|blodplater                   |127  |136|TEST     |
|protrombintid                |143  |155|TEST     |
|delvis tromboplastintid      |164  |186|TEST     |
|urinanalyse                  |195  |205|TEST     |
|spor av hvite blodceller     |207  |230|PROBLEM  |
|svake skjulte røde blodceller|233  |261|PROBLEM  |
|Natrium                      |264  |270|TEST     |
|kalium                       |277  |282|TEST     |
|glukose                      |289  |295|TEST     |
|P02                          |302  |304|TEST     |
|PC02                         |311  |314|TEST     |
|PH                           |321  |322|TEST     |
|40% 02                       |332  |337|TREATMENT|
+-----------------------------+-----+---+---------+

Model Information

Model Name: ner_clinical
Compatibility: Healthcare NLP 5.1.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: no
Size: 2.9 MB

Benchmarking

       label  precision    recall  f1-score   support
   TREATMENT       0.69      0.75      0.72       358
        TEST       0.89      0.87      0.88       415
     PROBLEM       0.88      0.74      0.81       749
   micro-avg       0.83      0.78      0.81      1522
   macro-avg       0.82      0.79      0.80      1522
weighted-avg       0.84      0.78      0.81      1522