Extract Pharmacological Entities From Spanish Medical Texts (BertForTokenClassification)

Description

This Named Entity Recognition (NER) model detects pharmacological entities in Spanish medical texts. It was trained with the BertForTokenClassification method from the transformers library on top of BERT-based embeddings, and it predicts two entity types: PROTEINAS and NORMALIZABLES.

Predicted Entities

PROTEINAS, NORMALIZABLES

How to use

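# Python. Assumes a licensed Spark NLP for Healthcare session is already available as `spark`.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, NerConverter
from sparknlp_jsl.annotator import MedicalBertForTokenClassifier

documentAssembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")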

sentenceDetector = SentenceDetectorDLModel.pretrained()\
  .setInputCols(["document"])\
  .setOutputCol("sentence")

tokenizer = Tokenizer()\
  .setInputCols("sentence")\
  .setOutputCol("token")

tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_pharmacology", "es", "clinical/models")\
  .setInputCols("token", "sentence")\
  .setOutputCol("label")\
  .setCaseSensitive(True)

ner_converter = NerConverter()\
  .setInputCols(["sentence","token","label"])\
  .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    tokenClassifier,
    ner_converter])

data = spark.createDataFrame([["""Se realiza analítica destacando creatinkinasa 736 UI, LDH 545 UI, urea 63 mg/dl, CA 19.9 64,1 U/ml. Inmunofenotípicamente el tumor expresó vimentina, S-100, HMB-45 y actina. Se instauró el tratamiento con quimioterapia (Cisplatino, Interleukina II, Dacarbacina e Interferon alfa)."""]]).toDF("text")

result = pipeline.fit(data).transform(data)

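// Scala. Assumes the Spark NLP for Healthcare classes are already imported and `spark` is an active session.
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()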
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_pharmacology", "es", "clinical/models")
  .setInputCols(Array("token", "sentence"))
  .setOutputCol("label")
  .setCaseSensitive(true)

val ner_converter = new NerConverter()
  .setInputCols(Array("sentence","token","label"))
  .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  tokenClassifier,
  ner_converter))

// A Seq of plain strings yields a single string column named "text", which DocumentAssembler expects.
import spark.implicits._

val data = Seq("Se realiza analítica destacando creatinkinasa 736 UI, LDH 545 UI, urea 63 mg/dl, CA 19.9 64,1 U/ml. Inmunofenotípicamente el tumor expresó vimentina, S-100, HMB-45 y actina. Se instauró el tratamiento con quimioterapia (Cisplatino, Interleukina II, Dacarbacina e Interferon alfa).").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)

# Python, using the NLU library
import nlu
nlu.load("es.classify.bert_token.pharmacology").predict("""Se realiza analítica destacando creatinkinasa 736 UI, LDH 545 UI, urea 63 mg/dl, CA 19.9 64,1 U/ml. Inmunofenotípicamente el tumor expresó vimentina, S-100, HMB-45 y actina. Se instauró el tratamiento con quimioterapia (Cisplatino, Interleukina II, Dacarbacina e Interferon alfa).""")

Results

+---------------+-------------+
|chunk          |ner_label    |
+---------------+-------------+
|creatinkinasa  |PROTEINAS    |
|LDH            |PROTEINAS    |
|urea           |NORMALIZABLES|
|CA 19.9        |PROTEINAS    |
|vimentina      |PROTEINAS    |
|S-100          |PROTEINAS    |
|HMB-45         |PROTEINAS    |
|actina         |PROTEINAS    |
|Cisplatino     |NORMALIZABLES|
|Interleukina II|PROTEINAS    |
|Dacarbacina    |NORMALIZABLES|
|Interferon alfa|PROTEINAS    |
+---------------+-------------+
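
The benchmark for this model reports labels with B- and I- prefixes, which suggests the classifier emits one IOB tag per token and that the chunks in the table above come from merging consecutive tagged tokens. The following is a minimal plain-Python sketch of that merging idea, not the NerConverter implementation; the function name `merge_bio_tags` and the token-level tags in the example are assumptions made for illustration.

```python
def merge_bio_tags(tokens, tags):
    """Group IOB-tagged tokens into (chunk_text, label) pairs.

    Hypothetical helper for illustration only; it is not part of Spark NLP.
    """
    chunks = []
    words, label = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, entity = tag.partition("-")
        # An I- tag with the same entity type extends the open chunk.
        if prefix == "I" and words and entity == label:
            words.append(token)
            continue
        # Anything else closes the open chunk, if there is one.
        if words:
            chunks.append((" ".join(words), label))
            words, label = [], None
        # A B- tag (or a stray I- tag) opens a new chunk; "O" opens nothing.
        if prefix in ("B", "I"):
            words, label = [token], entity
    if words:
        chunks.append((" ".join(words), label))
    return chunks


# Illustrative tags for the last part of the example sentence (assumed, not model output).
tokens = ["Cisplatino", ",", "Interleukina", "II", ",", "Dacarbacina", "e", "Interferon", "alfa"]
tags = ["B-NORMALIZABLES", "O", "B-PROTEINAS", "I-PROTEINAS", "O",
        "B-NORMALIZABLES", "O", "B-PROTEINAS", "I-PROTEINAS"]

for chunk, ner_label in merge_bio_tags(tokens, tags):
    print(f"{chunk:<16}{ner_label}")
```

Multi-token entities such as "Interleukina II" come out as a single chunk because the I- tag continues the chunk opened by the preceding B- tag.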

Model Information

Model Name: bert_token_classifier_pharmacology
Compatibility: Healthcare NLP 4.0.2+
License: Licensed
Edition: Official
Input Labels: [sentence, token]
Output Labels: [ner]
Language: es
Size: 410.0 MB
Case sensitive: true
Max sentence length: 512

Benchmarking

          label  precision    recall  f1-score   support
B-NORMALIZABLES     0.9458    0.9694    0.9575      3076
I-NORMALIZABLES     0.8788    0.8969    0.8878       291
    B-PROTEINAS     0.9164    0.9369    0.9265      2234
    I-PROTEINAS     0.8825    0.7634    0.8186       748
      micro-avg     0.9257    0.9304    0.9280      6349
      macro-avg     0.9059    0.8917    0.8976      6349
   weighted-avg     0.9249    0.9304    0.9270      6349
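
As a reading aid for the table above, the sketch below recomputes the f1-score of each label from its precision and recall, and the macro and weighted averages from the four per-label rows. It assumes the usual definitions (F1 as the harmonic mean of precision and recall, macro as the unweighted mean, weighted as the support-weighted mean); the helper names are illustrative, and because the inputs are already rounded to four decimals the results can differ from the table in the last digit.

```python
# Per-label rows copied from the benchmark table: (precision, recall, f1, support).
rows = {
    "B-NORMALIZABLES": (0.9458, 0.9694, 0.9575, 3076),
    "I-NORMALIZABLES": (0.8788, 0.8969, 0.8878, 291),
    "B-PROTEINAS": (0.9164, 0.9369, 0.9265, 2234),
    "I-PROTEINAS": (0.8825, 0.7634, 0.8186, 748),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def macro_avg(rows, column):
    """Unweighted mean of one metric column over all labels."""
    return sum(row[column] for row in rows.values()) / len(rows)


def weighted_avg(rows, column):
    """Mean of one metric column weighted by each label's support."""
    total_support = sum(row[3] for row in rows.values())
    return sum(row[column] * row[3] for row in rows.values()) / total_support


for label, (precision, recall, reported_f1, support) in rows.items():
    print(label, round(f1_score(precision, recall), 4), "reported:", reported_f1)

for name, column in (("precision", 0), ("recall", 1), ("f1-score", 2)):
    print(name, "macro:", round(macro_avg(rows, column), 4),
          "weighted:", round(weighted_avg(rows, column), 4))
```

The gap between the macro and weighted rows comes mostly from I-PROTEINAS, which has the lowest recall (0.7634) but a small support (748) relative to the two B- labels.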