Description
The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps.
This model was trained with the BertForTokenClassification method from the Hugging Face transformers library and imported into Spark NLP. It detects chemical entity mentions in medical text.
Predicted Entities
CHEM
How to use
import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")
tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")
ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_bc4chemd_chemicals", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("ner")\
    .setCaseSensitive(True)\
    .setMaxSentenceLength(512)
ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[
    document_assembler, 
    sentence_detector,
    tokenizer,
    ner_model,
    ner_converter   
    ])
data = spark.createDataFrame([["""The main isolated compounds were triterpenes (alpha - amyrin, beta - amyrin, lupeol, betulin, betulinic acid, uvaol, erythrodiol and oleanolic acid) and phenolic acid derivatives from 4 - hydroxybenzoic acid (gallic and protocatechuic acids and isocorilagin)."""]]).toDF("text")
result = pipeline.fit(data).transform(data)
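To reproduce the table shown under Results from the Python pipeline above, the ner_chunk annotations can be exploded and their text and entity metadata selected. The snippet below is a minimal sketch using standard PySpark SQL functions; the column names follow the pipeline defined above.
from pyspark.sql import functions as F

result.select(F.explode("ner_chunk").alias("chunk")) \
      .select(F.col("chunk.result").alias("ner_chunk"),
              F.col("chunk.metadata").getItem("entity").alias("label")) \
      .show(truncate=False)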
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
    .setInputCols(Array("document"))
    .setOutputCol("sentence")
val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")
val ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_bc4chemd_chemicals", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("ner")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)
val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(
    document_assembler,
    sentence_detector,
    tokenizer,
    ner_model,
    ner_converter))
val data = Seq("""The main isolated compounds were triterpenes (alpha - amyrin, beta - amyrin, lupeol, betulin, betulinic acid, uvaol, erythrodiol and oleanolic acid) and phenolic acid derivatives from 4 - hydroxybenzoic acid (gallic and protocatechuic acids and isocorilagin).""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
import nlu
nlu.load("en.classify.token_bert.bc4chemd_chemicals").predict("""The main isolated compounds were triterpenes (alpha - amyrin, beta - amyrin, lupeol, betulin, betulinic acid, uvaol, erythrodiol and oleanolic acid) and phenolic acid derivatives from 4 - hydroxybenzoic acid (gallic and protocatechuic acids and isocorilagin).""")
Results
+-------------------------------+-----+
|ner_chunk                      |label|
+-------------------------------+-----+
|triterpenes                    |CHEM |
|alpha - amyrin                 |CHEM |
|beta - amyrin                  |CHEM |
|lupeol                         |CHEM |
|betulin                        |CHEM |
|betulinic acid                 |CHEM |
|uvaol                          |CHEM |
|erythrodiol                    |CHEM |
|oleanolic acid                 |CHEM |
|phenolic acid                  |CHEM |
|4 - hydroxybenzoic acid        |CHEM |
|gallic and protocatechuic acids|CHEM |
|isocorilagin                   |CHEM |
+-------------------------------+-----+
Model Information
| Model Name: | bert_token_classifier_ner_bc4chemd_chemicals | 
| Compatibility: | Healthcare NLP 4.0.0+ | 
| License: | Licensed | 
| Edition: | Official | 
| Input Labels: | [sentence, token] | 
| Output Labels: | [ner] | 
| Language: | en | 
| Size: | 404.2 MB | 
| Case sensitive: | true | 
| Max sentence length: | 512 | 
References
https://github.com/cambridgeltl/MTL-Bioinformatics-2016
Benchmarking
 label         precision  recall  f1-score  support 
 B-CHEM        0.7642     0.9536  0.8485    25346   
 I-CHEM        0.9446     0.9502  0.9474    29642   
 micro-avg     0.8517     0.9518  0.8990    54988   
 macro-avg     0.8544     0.9519  0.8979    54988   
 weighted-avg  0.8614     0.9518  0.9018    54988