Detect Problems, Tests and Treatments (ner_clinical) in Japanese

Description

Pretrained named entity recognition (NER) deep learning model for clinical terms in Japanese. The Spark NLP deep learning model (MedicalNerModel) is inspired by a former state-of-the-art NER model: Chiu & Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs.

Predicted Entities

PROBLEM, TEST, TREATMENT

How to use

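# Assumes an active Spark session named `spark` with Healthcare NLP available.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, BertEmbeddings
from sparknlp_jsl.annotator import MedicalNerModel, NerConverterInternal

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")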

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

clinical_embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_large_japanese", "ja")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

ner_model = MedicalNerModel.pretrained("ner_clinical", "ja", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = NerConverterInternal()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    clinical_embeddings,
    ner_model,
    ner_converter
])

sample_df = spark.createDataFrame([["""中等度肺高血圧 、 PA圧 48/24、 1+僧帽弁逆流 、 重度大動脈弁狭窄 、 LVEDP 19、 駆出率 43%。 クロトリマゾール 、1錠 p.o . q.i.d .;"""]]).toDF("text")

result = pipeline.fit(sample_df).transform(sample_df)

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val clinical_embeddings = BertEmbeddings.pretrained("bert_embeddings_bert_large_japanese", "ja")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner_model = MedicalNerModel.pretrained("ner_clinical", "ja", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverterInternal()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(
    document_assembler,
    sentence_detector,
    tokenizer,
    clinical_embeddings,
    ner_model,
    ner_converter))

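// Needed for .toDS outside spark-shell, where these implicits are not pre-imported.
import spark.implicits._

val sample_data = Seq("""中等度肺高血圧 、 PA圧 48/24、 1+僧帽弁逆流 、 重度大動脈弁狭窄 、 LVEDP 19、 駆出率 43%。 クロトリマゾール 、1錠 p.o . q.i.d .;""").toDS.toDF("text")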

val result = pipeline.fit(sample_data).transform(sample_data)
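
To make the role of the last stage more concrete: the NER model emits one tag per token, and NerConverterInternal groups consecutive tagged tokens into chunks. The sketch below illustrates that grouping in plain Python, assuming the tags follow the common IOB2 scheme (B-/I- prefixes, O for outside). The function name `merge_iob_tags` and the sample tokens and tags are hypothetical and chosen for illustration only; this is not the library's implementation.

```python
def merge_iob_tags(tokens, tags):
    """Group consecutive B-/I- tagged tokens into (chunk_text, label) pairs.

    Illustrative only: tokens are joined with a space here, whereas a real
    converter would typically recover the chunk text from character offsets.
    """
    chunks = []
    current = []   # tokens of the chunk being built
    label = None   # entity label of the chunk being built
    for token, tag in zip(tokens, tags):
        prefix, _, entity = tag.partition("-")
        # An I- tag with the same label continues the open chunk.
        if prefix == "I" and current and entity == label:
            current.append(token)
            continue
        # Anything else closes the open chunk, if there is one.
        if current:
            chunks.append((" ".join(current), label))
            current, label = [], None
        # A B- tag (or a stray I- tag) opens a new chunk; O opens nothing.
        if prefix in ("B", "I"):
            current, label = [token], entity
    if current:
        chunks.append((" ".join(current), label))
    return chunks


# Hypothetical per-token tags for the start of the sample sentence.
tokens = ["中等度肺高血圧", "、", "PA圧", "48/24"]
tags = ["B-PROBLEM", "O", "B-TEST", "O"]
print(merge_iob_tags(tokens, tags))

# Hypothetical multi-token entity to show the merge step.
print(merge_iob_tags(["t1", "t2", "t3"], ["B-PROBLEM", "I-PROBLEM", "O"]))
```

In the pipeline itself this step is handled by NerConverterInternal, which writes the merged chunks to the ner_chunk column.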

Results

+----------------+-----+---+---------+
|chunk           |begin|end|ner_label|
+----------------+-----+---+---------+
|中等度肺高血圧  |0    |6  |PROBLEM  |
|PA圧            |10   |12 |TEST     |
|1+僧帽弁逆流    |21   |27 |PROBLEM  |
|重度大動脈弁狭窄|31   |38 |PROBLEM  |
|LVEDP           |42   |46 |TEST     |
|駆出率          |52   |54 |TEST     |
|クロトリマゾール|61   |68 |TREATMENT|
+----------------+-----+---+---------+
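
The begin and end columns are character offsets into the input text, and end is inclusive, so each chunk can be recovered with `text[begin:end + 1]`. The short check below illustrates this in plain Python, using the sample sentence and the rows of the table above; the variable names are only illustrative.

```python
# Sample sentence from the "How to use" section.
text = "中等度肺高血圧 、 PA圧 48/24、 1+僧帽弁逆流 、 重度大動脈弁狭窄 、 LVEDP 19、 駆出率 43%。 クロトリマゾール 、1錠 p.o . q.i.d .;"

# (chunk, begin, end, ner_label) rows copied from the results table.
rows = [
    ("中等度肺高血圧", 0, 6, "PROBLEM"),
    ("PA圧", 10, 12, "TEST"),
    ("1+僧帽弁逆流", 21, 27, "PROBLEM"),
    ("重度大動脈弁狭窄", 31, 38, "PROBLEM"),
    ("LVEDP", 42, 46, "TEST"),
    ("駆出率", 52, 54, "TEST"),
    ("クロトリマゾール", 61, 68, "TREATMENT"),
]

# The end offset is inclusive, hence the + 1 in the slice.
recovered = [text[begin:end + 1] for _, begin, end, _ in rows]

for (chunk, begin, end, label), span in zip(rows, recovered):
    status = "ok" if span == chunk else "MISMATCH"
    print(f"{label:<9} {begin:>2}-{end:<2} {span} [{status}]")
```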

Model Information

Model Name: ner_clinical
Compatibility: Healthcare NLP 5.1.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: ja
Size: 4.3 MB

Benchmarking

       label  precision    recall  f1-score   support
        TEST       0.90      0.90      0.90       105
     PROBLEM       0.86      0.90      0.89       134
   TREATMENT       0.71      0.61      0.66        36
   micro-avg       0.86      0.86      0.86       275
   macro-avg       0.83      0.80      0.81       275
weighted-avg       0.86      0.86      0.86       275
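
The weighted-avg row is the support-weighted mean of the per-label scores. As a rough check, the sketch below recomputes it in plain Python from the rounded per-label values in the table. Because those inputs are already rounded to two decimals, averages recomputed this way can differ from reported ones in the last digit. The dictionary and function names are illustrative.

```python
# Per-label scores copied from the benchmarking table.
scores = {
    "TEST":      {"precision": 0.90, "recall": 0.90, "f1": 0.90, "support": 105},
    "PROBLEM":   {"precision": 0.86, "recall": 0.90, "f1": 0.89, "support": 134},
    "TREATMENT": {"precision": 0.71, "recall": 0.61, "f1": 0.66, "support": 36},
}


def weighted_average(metric):
    """Mean of a metric over labels, weighted by each label's support."""
    total_support = sum(row["support"] for row in scores.values())
    weighted_sum = sum(row[metric] * row["support"] for row in scores.values())
    return weighted_sum / total_support


for metric in ("precision", "recall", "f1"):
    print(f"weighted {metric}: {weighted_average(metric):.2f}")
```

With these inputs, all three weighted averages round to 0.86, in line with the weighted-avg row.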