Detect Chemicals in Medical Text

Description

Recognizing chemical entity mentions is one of the key steps in the automatic extraction of chemical information from text.

This model was trained with the BertForTokenClassification class from the Hugging Face transformers library and imported into Spark NLP. It detects chemical entity mentions in medical text.

Predicted Entities

CHEM

How to use

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, NerConverter
from sparknlp_jsl.annotator import MedicalBertForTokenClassifier
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_bc4chemd_chemicals", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("ner")\
    .setCaseSensitive(True)\
    .setMaxSentenceLength(512)

ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    document_assembler, 
    sentence_detector,
    tokenizer,
    ner_model,
    ner_converter   
    ])

data = spark.createDataFrame([["""The main isolated compounds were triterpenes (alpha - amyrin, beta - amyrin, lupeol, betulin, betulinic acid, uvaol, erythrodiol and oleanolic acid) and phenolic acid derivatives from 4 - hydroxybenzoic acid (gallic and protocatechuic acids and isocorilagin)."""]]).toDF("text")

result = pipeline.fit(data).transform(data)
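
For quick checks on raw strings without creating a Spark DataFrame, the fitted pipeline can also be wrapped in a LightPipeline. The snippet below is a minimal sketch that assumes the Python pipeline and data objects defined above; the input sentence is only illustrative.

from sparknlp.base import LightPipeline

# Wrap the fitted PipelineModel for fast, local (non-distributed) inference
light_model = LightPipeline(pipeline.fit(data))

annotations = light_model.fullAnnotate(
    "The main isolated compounds were triterpenes, lupeol and oleanolic acid."
)

# Each ner_chunk annotation exposes the chunk text and its entity label
for chunk in annotations[0]["ner_chunk"]:
    print(chunk.result, chunk.metadata["entity"])
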
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_bc4chemd_chemicals", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("ner")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler, 
                                                   sentence_detector,
                                                   tokenizer,
                                                   ner_model,
                                                   ner_converter))

val data = Seq("""The main isolated compounds were triterpenes (alpha - amyrin, beta - amyrin, lupeol, betulin, betulinic acid, uvaol, erythrodiol and oleanolic acid) and phenolic acid derivatives from 4 - hydroxybenzoic acid (gallic and protocatechuic acids and isocorilagin).""").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)

import nlu
nlu.load("en.classify.token_bert.bc4chemd_chemicals").predict("""The main isolated compounds were triterpenes (alpha - amyrin, beta - amyrin, lupeol, betulin, betulinic acid, uvaol, erythrodiol and oleanolic acid) and phenolic acid derivatives from 4 - hydroxybenzoic acid (gallic and protocatechuic acids and isocorilagin).""")

Results

+-------------------------------+-----+
|ner_chunk                      |label|
+-------------------------------+-----+
|triterpenes                    |CHEM |
|alpha - amyrin                 |CHEM |
|beta - amyrin                  |CHEM |
|lupeol                         |CHEM |
|betulin                        |CHEM |
|betulinic acid                 |CHEM |
|uvaol                          |CHEM |
|erythrodiol                    |CHEM |
|oleanolic acid                 |CHEM |
|phenolic acid                  |CHEM |
|4 - hydroxybenzoic acid        |CHEM |
|gallic and protocatechuic acids|CHEM |
|isocorilagin                   |CHEM |
+-------------------------------+-----+
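
The table above can be produced from the result DataFrame by exploding the ner_chunk column. This is a minimal sketch assuming the Python pipeline above; the column names follow the setOutputCol settings in that pipeline.

from pyspark.sql import functions as F

# Flatten the ner_chunk annotations into one row per detected chunk
result.select(F.explode("ner_chunk").alias("chunk")) \
      .select(F.col("chunk.result").alias("ner_chunk"),
              F.col("chunk.metadata").getItem("entity").alias("label")) \
      .show(truncate=False)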

Model Information

Model Name: bert_token_classifier_ner_bc4chemd_chemicals
Compatibility: Healthcare NLP 4.0.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token]
Output Labels: [ner]
Language: en
Size: 404.2 MB
Case sensitive: true
Max sentence length: 512

References

https://github.com/cambridgeltl/MTL-Bioinformatics-2016

Benchmarking

 label         precision  recall  f1-score  support 
 B-CHEM        0.7642     0.9536  0.8485    25346   
 I-CHEM        0.9446     0.9502  0.9474    29642   
 micro-avg     0.8517     0.9518  0.8990    54988   
 macro-avg     0.8544     0.9519  0.8979    54988   
 weighted-avg  0.8614     0.9518  0.9018    54988