Detect Problems, Tests and Treatments (ner_clinical) in Finnish

Description

Pretrained named entity recognition deep learning model for clinical terms in Finnish. The SparkNLP deep learning model (MedicalNerModel) is inspired by a former state of the art model for NER: Chiu & Nicols, Named Entity Recognition with Bidirectional LSTM-CNN.

Predicted Entities

PROBLEM, TEST, TREATMENT

Live Demo Open in Colab Copy S3 URI

How to use

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","fi") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = MedicalNerModel.pretrained("ner_clinical", "fi", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    document_assembler, 
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter   
    ])

sample_text = """Potilaan virtsanalyysi 19.10.97 osoitti kohtalaisen okkulttista verta , pH 5 , albumiini 1+, valkosoluja läsnä ,2-50 punasoluja , 10-20 valkosoluja , joitain bakteereja ja kohtalaisesti virtsarakon epiteelisoluja. FBS alle 200 = 0 yksikköä CZI FBS 201-250 = 2 yksikköä CZI FBS 251-300 = 4 yksikköä CZI FBS 301-350 = 6 yksikköä CZI FBS351-400 = 8 yksikköä CZI FBS suurempi kuin 400 = 10 yksikköä CZI."""

data = spark.createDataFrame([[sample_text]]).toDF("text")

result = pipeline.fit(data).transform(data)
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained("w2v_cc_300d","fi")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner_model = MedicalNerModel.pretrained("ner_clinical", "fi", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(
    document_assembler, 
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter   
))

sample_data = Seq("""Potilaan virtsanalyysi 19.10.97 osoitti kohtalaisen okkulttista verta , pH 5 , albumiini 1+, valkosoluja läsnä ,2-50 punasoluja , 10-20 valkosoluja , joitain bakteereja ja kohtalaisesti virtsarakon epiteelisoluja. FBS alle 200 = 0 yksikköä CZI FBS 201-250 = 2 yksikköä CZI FBS 251-300 = 4 yksikköä CZI FBS 301-350 = 6 yksikköä CZI FBS351-400 = 8 yksikköä CZI FBS suurempi kuin 400 = 10 yksikköä CZI.""").toDS.toDF("text")


val result = pipeline.fit(sample_data).transform(sample_data)

Results

+----------------------------------------+-----+---+---------+
|chunk                                   |begin|end|ner_label|
+----------------------------------------+-----+---+---------+
|Potilaan virtsanalyysi                  |0    |21 |TEST     |
|kohtalaisen okkulttista verta           |40   |68 |PROBLEM  |
|pH                                      |72   |73 |TEST     |
|albumiini                               |79   |87 |TEST     |
|valkosoluja                             |93   |103|TEST     |
|punasoluja                              |117  |126|TEST     |
|valkosoluja                             |136  |146|TEST     |
|joitain bakteereja                      |150  |167|PROBLEM  |
|kohtalaisesti virtsarakon epiteelisoluja|172  |211|PROBLEM  |
|FBS                                     |214  |216|TEST     |
|CZI                                     |240  |242|TREATMENT|
|FBS                                     |244  |246|TEST     |
|CZI                                     |269  |271|TREATMENT|
|FBS                                     |273  |275|TEST     |
|CZI                                     |298  |300|TREATMENT|
|FBS                                     |302  |304|TEST     |
|CZI                                     |327  |329|TREATMENT|
|FBS351-400                              |331  |340|TEST     |
|CZI                                     |355  |357|TREATMENT|
|FBS                                     |359  |361|TEST     |
|CZI                                     |395  |397|TREATMENT|
+----------------------------------------+-----+---+---------+

Model Information

Model Name: ner_clinical
Compatibility: Healthcare NLP 5.1.1+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: fi
Size: 2.9 MB

Benchmarking

       label  precision    recall  f1-score   support
   TREATMENT       0.67      0.74      0.70       275
        TEST       0.93      0.73      0.82       350
     PROBLEM       0.73      0.76      0.75       678
   micro-avg       0.76      0.75      0.75      1303
   macro-avg       0.78      0.74      0.76      1303
weighted-avg       0.77      0.75      0.76      1303