Description
This Named Entity Recognition (NER) model detects pharmacological entities in Spanish medical texts. It was trained with the BertForTokenClassification method from the transformers library on top of BERT-based embeddings, and it predicts two entity types: PROTEINAS and NORMALIZABLES.
Predicted Entities
PROTEINAS, NORMALIZABLES
How to use
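# Python. Assumes an active Spark session named `spark`, started with Spark NLP
# and the licensed Healthcare NLP library available.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, NerConverter
from sparknlp_jsl.annotator import MedicalBertForTokenClassifier

# Wrap the raw text column into Spark NLP documents.
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Split each document into sentences.
sentenceDetector = SentenceDetectorDLModel.pretrained()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Split each sentence into tokens.
tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Assign a token-level label to every token.
tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_pharmacology", "es", "clinical/models")\
    .setInputCols(["token", "sentence"])\
    .setOutputCol("label")\
    .setCaseSensitive(True)

# Merge token-level labels into entity chunks.
ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "label"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    documentAssembler,
    sentenceDetector,
    tokenizer,
    tokenClassifier,
    ner_converter])

data = spark.createDataFrame([["""Se realiza analítica destacando creatinkinasa 736 UI, LDH 545 UI, urea 63 mg/dl, CA 19.9 64,1 U/ml. Inmunofenotípicamente el tumor expresó vimentina, S-100, HMB-45 y actina. Se instauró el tratamiento con quimioterapia (Cisplatino, Interleukina II, Dacarbacina e Interferon alfa)."""]]).toDF("text")

result = pipeline.fit(data).transform(data)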
// Scala. Assumes an active Spark session named `spark` and that the Spark NLP
// and Healthcare NLP annotator classes used below are already imported.
import org.apache.spark.ml.Pipeline
import spark.implicits._ // needed for toDS() on a local Seq

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_pharmacology", "es", "clinical/models")
    .setInputCols(Array("token", "sentence"))
    .setOutputCol("label")
    .setCaseSensitive(true)

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence","token","label"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    tokenClassifier,
    ner_converter))

// A Seq of plain strings yields a string column, which DocumentAssembler expects.
val data = Seq("Se realiza analítica destacando creatinkinasa 736 UI, LDH 545 UI, urea 63 mg/dl, CA 19.9 64,1 U/ml. Inmunofenotípicamente el tumor expresó vimentina, S-100, HMB-45 y actina. Se instauró el tratamiento con quimioterapia (Cisplatino, Interleukina II, Dacarbacina e Interferon alfa).").toDS().toDF("text")

val result = pipeline.fit(data).transform(data)
import nlu
nlu.load("es.classify.bert_token.pharmacology").predict("""Se realiza analítica destacando creatinkinasa 736 UI, LDH 545 UI, urea 63 mg/dl, CA 19.9 64,1 U/ml. Inmunofenotípicamente el tumor expresó vimentina, S-100, HMB-45 y actina. Se instauró el tratamiento con quimioterapia (Cisplatino, Interleukina II, Dacarbacina e Interferon alfa).""")
Results
+---------------+-------------+
|chunk          |ner_label    |
+---------------+-------------+
|creatinkinasa  |PROTEINAS    |
|LDH            |PROTEINAS    |
|urea           |NORMALIZABLES|
|CA 19.9        |PROTEINAS    |
|vimentina      |PROTEINAS    |
|S-100          |PROTEINAS    |
|HMB-45         |PROTEINAS    |
|actina         |PROTEINAS    |
|Cisplatino     |NORMALIZABLES|
|Interleukina II|PROTEINAS    |
|Dacarbacina    |NORMALIZABLES|
|Interferon alfa|PROTEINAS    |
+---------------+-------------+
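The token classifier emits one label per token using B- (begin) and I- (inside) prefixes, and the converter stage merges those into the chunks shown in the table. The following is a minimal pure-Python sketch of that merging idea, not the NerConverter implementation. The function name iob_to_chunks and the example token tags are hypothetical, chosen only to be consistent with the last three rows of the table.

```python
from typing import List, Optional, Tuple


def iob_to_chunks(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge token-level IOB tags into (chunk_text, entity_label) pairs.

    Simplified illustration: tokens of a chunk are joined with a single space.
    """
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    def flush() -> None:
        # Emit the chunk collected so far, if any.
        if current_tokens and current_label is not None:
            chunks.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A B- tag (or an I- tag that does not continue the open chunk)
            # starts a new chunk.
            flush()
            current_tokens, current_label = [token], label
        elif prefix == "I":
            # Continue the open chunk.
            current_tokens.append(token)
        else:
            # "O" (outside): close any open chunk.
            flush()
            current_tokens, current_label = [], None

    flush()
    return chunks


# Hypothetical token-level tags for the end of the example sentence.
example_tokens = ["Interleukina", "II", ",", "Dacarbacina", "e", "Interferon", "alfa"]
example_tags = ["B-PROTEINAS", "I-PROTEINAS", "O", "B-NORMALIZABLES", "O",
                "B-PROTEINAS", "I-PROTEINAS"]

for chunk, label in iob_to_chunks(example_tokens, example_tags):
    print(f"{chunk} -> {label}")
```

Multi-token entities such as "Interleukina II" depend on the I- tags being predicted correctly, which is why the I- rows in the benchmark matter for chunk boundaries.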
Model Information
Model Name: bert_token_classifier_pharmacology
Compatibility: Healthcare NLP 4.0.2+
License: Licensed
Edition: Official
Input Labels: [sentence, token]
Output Labels: [ner]
Language: es
Size: 410.0 MB
Case sensitive: true
Max sentence length: 512
Benchmarking
label            precision  recall  f1-score  support
B-NORMALIZABLES  0.9458     0.9694  0.9575    3076
I-NORMALIZABLES  0.8788     0.8969  0.8878    291
B-PROTEINAS      0.9164     0.9369  0.9265    2234
I-PROTEINAS      0.8825     0.7634  0.8186    748
micro-avg        0.9257     0.9304  0.9280    6349
macro-avg        0.9059     0.8917  0.8976    6349
weighted-avg     0.9249     0.9304  0.9270    6349
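For readers who want to see how the columns relate, the sketch below recomputes the f1-score from precision and recall, and the macro and weighted averages from the per-label rows. It assumes the usual definitions (F1 as the harmonic mean of precision and recall, macro as the unweighted mean over labels, weighted as the support-weighted mean). The helper names are hypothetical, and small differences from the table are expected because the inputs are already rounded to four decimals.

```python
# Per-label rows copied from the benchmark table:
# label -> (precision, recall, f1, support)
REPORTED = {
    "B-NORMALIZABLES": (0.9458, 0.9694, 0.9575, 3076),
    "I-NORMALIZABLES": (0.8788, 0.8969, 0.8878, 291),
    "B-PROTEINAS": (0.9164, 0.9369, 0.9265, 2234),
    "I-PROTEINAS": (0.8825, 0.7634, 0.8186, 748),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def macro_average(values) -> float:
    """Unweighted mean over labels."""
    values = list(values)
    return sum(values) / len(values)


def weighted_average(values, weights) -> float:
    """Mean over labels, weighted by each label's support."""
    values, weights = list(values), list(weights)
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


supports = [n for _, _, _, n in REPORTED.values()]
total_support = sum(supports)

# F1 recomputed from the precision and recall columns.
recomputed_f1 = {
    label: f1_score(p, r) for label, (p, r, _, _) in REPORTED.items()
}

macro_f1 = macro_average(f1 for _, _, f1, _ in REPORTED.values())
weighted_f1 = weighted_average(
    (f1 for _, _, f1, _ in REPORTED.values()), supports
)

for label, value in recomputed_f1.items():
    print(f"{label}: recomputed f1 = {value:.4f}")
print(f"macro f1 = {macro_f1:.4f}, weighted f1 = {weighted_f1:.4f}, "
      f"support = {total_support}")
```

The gap between the macro and weighted averages comes mostly from I-PROTEINAS, which has the lowest recall but a relatively small support.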