Description
The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps.
This model was trained with the BertForTokenClassification class from the transformers library and imported into Spark NLP. It detects chemical entity mentions in medical text.
Predicted Entities
CHEM
How to use
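# Assumes an active Spark session `spark` started with Spark NLP for Healthcare.
# Import paths follow the usual Spark NLP / Healthcare NLP package layout.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, NerConverter
from sparknlp_jsl.annotator import MedicalBertForTokenClassifier

# Wrap the raw "text" column in a document annotation.
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Split each document into sentences.
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Split each sentence into tokens.
tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Tag every token with a chemical NER label.
ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_bc4chemd_chemicals", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("ner")\
    .setCaseSensitive(True)\
    .setMaxSentenceLength(512)

# Merge token-level tags into entity chunks.
ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    ner_model,
    ner_converter
])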
data = spark.createDataFrame([["""The main isolated compounds were triterpenes (alpha - amyrin, beta - amyrin, lupeol, betulin, betulinic acid, uvaol, erythrodiol and oleanolic acid) and phenolic acid derivatives from 4 - hydroxybenzoic acid (gallic and protocatechuic acids and isocorilagin)."""]]).toDF("text")
result = pipeline.fit(data).transform(data)
val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_bc4chemd_chemicals", "en", "clinical/models")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("ner")
  .setCaseSensitive(true)
  .setMaxSentenceLength(512)

val ner_converter = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(
  document_assembler,
  sentence_detector,
  tokenizer,
  ner_model,
  ner_converter))
val data = Seq("""The main isolated compounds were triterpenes (alpha - amyrin, beta - amyrin, lupeol, betulin, betulinic acid, uvaol, erythrodiol and oleanolic acid) and phenolic acid derivatives from 4 - hydroxybenzoic acid (gallic and protocatechuic acids and isocorilagin).""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
import nlu
nlu.load("en.classify.token_bert.bc4chemd_chemicals").predict("""The main isolated compounds were triterpenes (alpha - amyrin, beta - amyrin, lupeol, betulin, betulinic acid, uvaol, erythrodiol and oleanolic acid) and phenolic acid derivatives from 4 - hydroxybenzoic acid (gallic and protocatechuic acids and isocorilagin).""")
Results
+-------------------------------+-----+
|ner_chunk                      |label|
+-------------------------------+-----+
|triterpenes                    |CHEM |
|alpha - amyrin                 |CHEM |
|beta - amyrin                  |CHEM |
|lupeol                         |CHEM |
|betulin                        |CHEM |
|betulinic acid                 |CHEM |
|uvaol                          |CHEM |
|erythrodiol                    |CHEM |
|oleanolic acid                 |CHEM |
|phenolic acid                  |CHEM |
|4 - hydroxybenzoic acid        |CHEM |
|gallic and protocatechuic acids|CHEM |
|isocorilagin                   |CHEM |
+-------------------------------+-----+
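The model tags individual tokens (the benchmark reports B-CHEM and I-CHEM labels), while the table above lists whole chunks; the NerConverter stage is what bridges the two. The sketch below is a simplified, plain-Python illustration of that kind of IOB merging and is not NerConverter's actual implementation. The function name `merge_iob` and the hand-written tags are hypothetical, chosen only to be consistent with the chunks shown above.

```python
from typing import List, Optional, Tuple


def merge_iob(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (chunk_text, label) pairs."""
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    def flush() -> None:
        # Close the chunk being built, if any, and reset the state.
        nonlocal current_tokens, current_label
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_label))
        current_tokens, current_label = [], None

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "I" and current_label == label:
            # Continuation of the chunk that is currently open.
            current_tokens.append(token)
        elif prefix in ("B", "I"):
            # "B-" starts a new chunk; an "I-" with no matching open
            # chunk is treated as a chunk start as well.
            flush()
            current_tokens, current_label = [token], label
        else:
            # "O" (outside any entity) ends the open chunk.
            flush()
    flush()
    return chunks


# Hypothetical token-level tags for a fragment of the example sentence.
tokens = ["triterpenes", "(", "alpha", "-", "amyrin", ",", "lupeol", ")"]
tags = ["B-CHEM", "O", "B-CHEM", "I-CHEM", "I-CHEM", "O", "B-CHEM", "O"]

chunks = merge_iob(tokens, tags)
print(chunks)
# [('triterpenes', 'CHEM'), ('alpha - amyrin', 'CHEM'), ('lupeol', 'CHEM')]
```

Joining tokens with a single space is why a multi-token chunk such as "alpha - amyrin" appears with spaces around the hyphen.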
Model Information
Model Name: | bert_token_classifier_ner_bc4chemd_chemicals |
Compatibility: | Healthcare NLP 4.0.0+ |
License: | Licensed |
Edition: | Official |
Input Labels: | [sentence, token] |
Output Labels: | [ner] |
Language: | en |
Size: | 404.2 MB |
Case sensitive: | true |
Max sentence length: | 512 |
References
https://github.com/cambridgeltl/MTL-Bioinformatics-2016
Benchmarking
label         precision  recall  f1-score  support
B-CHEM           0.7642  0.9536    0.8485    25346
I-CHEM           0.9446  0.9502    0.9474    29642
micro-avg        0.8517  0.9518    0.8990    54988
macro-avg        0.8544  0.9519    0.8979    54988
weighted-avg     0.8614  0.9518    0.9018    54988
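To make the relationship between the rows explicit, the sketch below recomputes the F1 scores and the three averages from the per-label precision, recall, and support. It assumes the table follows the usual classification-report definitions (macro = unweighted mean, weighted = support-weighted mean, micro = pooled counts); the source does not say which tool produced it. The micro-average is approximate, because the underlying counts are back-derived from the rounded per-label values.

```python
# Per-label rows copied from the benchmarking table:
# label -> (precision, recall, support)
rows = {
    "B-CHEM": (0.7642, 0.9536, 25346),
    "I-CHEM": (0.9446, 0.9502, 29642),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


support_total = sum(s for _, _, s in rows.values())
label_f1 = {label: f1(p, r) for label, (p, r, _) in rows.items()}

# Macro average: plain mean over labels.
macro_p = sum(p for p, _, _ in rows.values()) / len(rows)
macro_r = sum(r for _, r, _ in rows.values()) / len(rows)
macro_f1 = sum(label_f1.values()) / len(rows)

# Weighted average: mean over labels, weighted by support.
weighted_p = sum(p * s for p, _, s in rows.values()) / support_total
weighted_r = sum(r * s for _, r, s in rows.values()) / support_total
weighted_f1 = sum(label_f1[l] * s for l, (_, _, s) in rows.items()) / support_total

# Micro average: pool counts across labels.
# true positives ~= recall * support; predicted positives ~= TP / precision
true_pos = {l: r * s for l, (_, r, s) in rows.items()}
predicted = {l: true_pos[l] / p for l, (p, _, _) in rows.items()}
micro_p = sum(true_pos.values()) / sum(predicted.values())
micro_r = sum(true_pos.values()) / support_total
micro_f1 = f1(micro_p, micro_r)

averages = {
    "micro-avg": (micro_p, micro_r, micro_f1),
    "macro-avg": (macro_p, macro_r, macro_f1),
    "weighted-avg": (weighted_p, weighted_r, weighted_f1),
}
for name, (p, r, f) in averages.items():
    print(f"{name:<13} {p:.4f}  {r:.4f}  {f:.4f}")
```

Under these assumptions the recomputed values should agree with the table up to rounding. The gap between B-CHEM precision (0.7642) and recall (0.9536) suggests the model more often proposes a spurious entity start than misses a true one.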