Description
Detect chemical compounds and genes in medical text using this pretrained NER model. The model was trained with the BertForTokenClassification method from the transformers library and imported into Spark NLP.
Predicted Entities
CHEMICAL, GENE-Y, GENE-N
How to use
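# Imports assume Spark NLP and Spark NLP for Healthcare (sparknlp_jsl) are installed
# and that an active Spark session is available as `spark`.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, NerConverter
from sparknlp_jsl.annotator import MedicalBertForTokenClassifier

documentAssembler = DocumentAssembler()\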
.setInputCol("text")\
.setOutputCol("document")
tokenizer = Tokenizer()\
.setInputCols(["document"])\
.setOutputCol("token")
tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_chemprot", "en", "clinical/models")\
.setInputCols("token", "document")\
.setOutputCol("ner")\
.setCaseSensitive(True)
ner_converter = NerConverter()\
.setInputCols(["document","token","ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier, ner_converter])
data = spark.createDataFrame([["""Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium."""]]).toDF("text")
result = pipeline.fit(data).transform(data)
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")
val tokenClassifier = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_chemprot", "en", "clinical/models")
.setInputCols(Array("document","token"))
.setOutputCol("ner")
.setCaseSensitive(true)
val ner_converter = new NerConverter()
.setInputCols(Array("document","token","ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier, ner_converter))
val data = Seq("""Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
import nlu
nlu.load("en.med_ner.chemprot.bert").predict("""Keratinocyte growth factor and acidic fibroblast growth factor are mitogens for primary cultures of mammary epithelium.""")
Results
+-------------------------------+---------+
|chunk |ner_label|
+-------------------------------+---------+
|Keratinocyte growth factor |GENE-Y |
|acidic fibroblast growth factor|GENE-Y |
+-------------------------------+---------+
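The `NerConverter` stage groups token-level IOB tags (the `B-` and `I-` labels listed under Benchmarking) into the entity chunks shown above. The following is a minimal pure-Python sketch of that grouping idea, not Spark NLP's actual implementation. The function name `merge_iob` is hypothetical, and the per-token tags are assumed for illustration so that they agree with the Results table; they are not real model output.

```python
def merge_iob(tokens, tags):
    """Group IOB-tagged tokens into (chunk_text, label) pairs."""
    chunks = []
    words, label = [], None
    for token, tag in zip(tokens, tags):
        # "B-GENE-Y" -> ("B", "-", "GENE-Y"); "O" -> ("O", "", "")
        prefix, _, tag_label = tag.partition("-")
        if prefix == "I" and tag_label == label:
            # Same entity continues: extend the open chunk.
            words.append(token)
            continue
        # Any other tag closes the open chunk, if there is one.
        if words:
            chunks.append((" ".join(words), label))
        if prefix in ("B", "I"):
            # "B-" starts a chunk; a stray "I-" is treated as a start too.
            words, label = [token], tag_label
        else:
            words, label = [], None
    if words:
        chunks.append((" ".join(words), label))
    return chunks


# Assumed tags for the first part of the example sentence (illustrative only).
tokens = ["Keratinocyte", "growth", "factor", "and",
          "acidic", "fibroblast", "growth", "factor", "are", "mitogens"]
tags = ["B-GENE-Y", "I-GENE-Y", "I-GENE-Y", "O",
        "B-GENE-Y", "I-GENE-Y", "I-GENE-Y", "I-GENE-Y", "O", "O"]

for chunk, ner_label in merge_iob(tokens, tags):
    print(f"{chunk} | {ner_label}")
```

Because entity names such as GENE-Y contain a hyphen, the sketch splits each tag only at the first hyphen.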
Model Information
Model Name: | bert_token_classifier_ner_chemprot |
Compatibility: | Healthcare NLP 3.3.4+ |
License: | Licensed |
Edition: | Official |
Input Labels: | [sentence, token] |
Output Labels: | [ner] |
Language: | en |
Size: | 404.3 MB |
Case sensitive: | true |
Max sentence length: | 512 |
Data Source
This model is trained on the ChemProt corpus.
Benchmarking
label         precision  recall  f1-score  support
B-CHEMICAL    0.93       0.79    0.85      8649
B-GENE-N      0.63       0.56    0.59      2752
B-GENE-Y      0.82       0.73    0.77      5490
I-CHEMICAL    0.90       0.79    0.84      1313
I-GENE-N      0.72       0.62    0.67      1993
I-GENE-Y      0.81       0.72    0.77      2420
accuracy      -          -       0.73      22617
macro-avg     0.75       0.74    0.75      22617
weighted-avg  0.83       0.73    0.78      22617
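As a reading aid for the table above, here is a small sketch of how the f1-score and support-weighted columns relate to precision, recall, and support. The helper names `f1_score` and `weighted_average` are hypothetical, and the inputs are the rounded figures from the table, so recomputed values can differ from the published ones in the last digit.

```python
# (precision, recall, f1-score, support) copied from the benchmark table.
rows = {
    "B-CHEMICAL": (0.93, 0.79, 0.85, 8649),
    "B-GENE-N":   (0.63, 0.56, 0.59, 2752),
    "B-GENE-Y":   (0.82, 0.73, 0.77, 5490),
    "I-CHEMICAL": (0.90, 0.79, 0.84, 1313),
    "I-GENE-N":   (0.72, 0.62, 0.67, 1993),
    "I-GENE-Y":   (0.81, 0.72, 0.77, 2420),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def weighted_average(values, weights):
    """Average of values, each weighted by its support."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


supports = [row[3] for row in rows.values()]
total_support = sum(supports)

# Recompute f1 per label from the rounded precision and recall.
recomputed_f1 = {
    label: f1_score(p, r) for label, (p, r, _, _) in rows.items()
}

# Support-weighted precision over the listed labels.
weighted_precision = weighted_average(
    [row[0] for row in rows.values()], supports
)

for label, value in recomputed_f1.items():
    print(f"{label:<12} published={rows[label][2]:.2f} recomputed={value:.4f}")
print(f"total support: {total_support}")
print(f"weighted precision: {weighted_precision:.4f}")
```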