Detect Cellular/Molecular Biology Entities (BertForTokenClassification)

Description

This model detects cellular and molecular biology terms (DNA, RNA, proteins, cell types, and cell lines) in medical text. It was trained with BertForTokenClassification from the transformers library and imported into Spark NLP.

Predicted Entities

DNA, RNA, cell_line, cell_type, protein

How to use

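import pandas as pd
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import (
    SentenceDetector, Tokenizer, BertForTokenClassification, NerConverter
)

# Assumes an active Spark session named `spark` with Healthcare NLP available.
documentAssembler = DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")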
         
sentence_detector = SentenceDetector()\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

tokenizer = Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

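# Tokens are produced per sentence, so the classifier and the converter read the
# "sentence" column (matching the input labels listed under Model Information).
tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_ner_cellular", "en", "clinical/models")\
        .setInputCols(["sentence", "token"])\
        .setOutputCol("ner")\
        .setCaseSensitive(True)

# Merge token-level B-/I- tags into entity chunks.
ner_converter = NerConverter()\
        .setInputCols(["sentence", "token", "ner"])\
        .setOutputCol("ner_chunk")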

pipeline = Pipeline(stages=[documentAssembler, sentence_detector, tokenizer, tokenClassifier, ner_converter])

# No stage needs training, so fitting on an empty DataFrame only builds the PipelineModel.
p_model = pipeline.fit(spark.createDataFrame(pd.DataFrame({'text': ['']})))

test_sentence = """Detection of various other intracellular signaling proteins is also described. Genetic characterization of transactivation of the human T-cell leukemia virus type 1 promoter: Binding of Tax to Tax-responsive element 1 is mediated by the cyclic AMP-responsive members of the CREB/ATF family of transcription factors. To achieve a better understanding of the mechanism of transactivation by Tax of human T-cell leukemia virus type 1 Tax-responsive element 1 (TRE-1), we developed a genetic approach with Saccharomyces cerevisiae. We constructed a yeast reporter strain containing the lacZ gene under the control of the CYC1 promoter associated with three copies of TRE-1. Expression of either the cyclic AMP response element-binding protein (CREB) or CREB fused to the GAL4 activation domain (GAD) in this strain did not modify the expression of the reporter gene. Tax alone was also inactive."""

result = p_model.transform(spark.createDataFrame(pd.DataFrame({'text': [test_sentence]})))

Results

+-------------------------------------------+---------+
|chunk                                      |ner_label|
+-------------------------------------------+---------+
|intracellular signaling proteins           |protein  |
|human T-cell leukemia virus type 1 promoter|DNA      |
|Tax                                        |protein  |
|Tax-responsive element 1                   |DNA      |
|cyclic AMP-responsive members              |protein  |
|CREB/ATF family                            |protein  |
|transcription factors                      |protein  |
|Tax                                        |protein  |
|human T-cell leukemia virus type 1         |DNA      |
|Tax-responsive element 1                   |DNA      |
|TRE-1                                      |DNA      |
|lacZ gene                                  |DNA      |
|CYC1 promoter                              |DNA      |
|TRE-1                                      |DNA      |
|cyclic AMP response element-binding protein|protein  |
|CREB                                       |protein  |
|CREB                                       |protein  |
|GAL4 activation domain                     |protein  |
|GAD                                        |protein  |
|reporter gene                              |DNA      |
|Tax                                        |protein  |
+-------------------------------------------+---------+
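The chunks in the table above come from merging token-level B-/I- tags into spans. The following is a minimal pure-Python sketch of that merge, not Spark NLP's NerConverter implementation. The function name merge_bio_tags and the hand-written tags in the example are hypothetical, and the sketch simply drops an I- tag that does not continue an open chunk, which may differ from the library's behavior.

```python
from typing import List, Optional, Tuple


def merge_bio_tags(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge token-level B-/I-/O tags into (chunk_text, label) pairs."""
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    def flush() -> None:
        # Close the open chunk, if any, and reset the state.
        nonlocal current_tokens, current_label
        if current_tokens and current_label is not None:
            chunks.append((" ".join(current_tokens), current_label))
        current_tokens, current_label = [], None

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B":
            # A B- tag always starts a new chunk.
            flush()
            current_tokens, current_label = [token], label
        elif prefix == "I" and label == current_label:
            # An I- tag extends the open chunk of the same label.
            current_tokens.append(token)
        else:
            # "O" or an I- tag that does not continue the open chunk.
            flush()
    flush()
    return chunks


# Hand-written tags for illustration only (not actual model output).
example_tokens = ["the", "lacZ", "gene", "under", "the", "control",
                  "of", "the", "CYC1", "promoter"]
example_tags = ["O", "B-DNA", "I-DNA", "O", "O", "O",
                "O", "O", "B-DNA", "I-DNA"]
example_chunks = merge_bio_tags(example_tokens, example_tags)
print(example_chunks)
```

With these tags the sketch yields the chunks "lacZ gene" and "CYC1 promoter", both labeled DNA, mirroring two rows of the results table.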

Model Information

Model Name: bert_token_classifier_ner_cellular
Compatibility: Healthcare NLP 3.3.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token]
Output Labels: [ner]
Language: en
Case sensitive: true
Max sentence length: 512

Data Source

Trained on the JNLPBA corpus, which contains more than 2,404 publication abstracts. https://www.geniaproject.org/

Benchmarking

label           precision    recall   f1-score   support
B-DNA             0.87        0.77      0.82      1056
B-RNA             0.85        0.79      0.82       118
B-cell_line       0.66        0.70      0.68       500
B-cell_type       0.87        0.75      0.81      1921
B-protein         0.90        0.85      0.88      5067
I-DNA             0.93        0.86      0.90      1789
I-RNA             0.92        0.84      0.88       187
I-cell_line       0.67        0.76      0.71       989
I-cell_type       0.92        0.76      0.84      2991
I-protein         0.94        0.80      0.87      4774
accuracy           -           -        0.80     19392
macro-avg         0.76        0.81      0.78     19392
weighted-avg      0.89        0.80      0.85     19392
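As a reading aid for the table above, here is a small sketch of how a per-label F1 score follows from precision and recall, and how the support-weighted averages are formed. It uses only the rounded figures listed in the table, so a recomputed value can differ from the published one by about 0.01. The helper names f1_score and weighted_average are illustrative and do not belong to any library.

```python
# (label, precision, recall, support), copied from the benchmark table.
rows = [
    ("B-DNA", 0.87, 0.77, 1056),
    ("B-RNA", 0.85, 0.79, 118),
    ("B-cell_line", 0.66, 0.70, 500),
    ("B-cell_type", 0.87, 0.75, 1921),
    ("B-protein", 0.90, 0.85, 5067),
    ("I-DNA", 0.93, 0.86, 1789),
    ("I-RNA", 0.92, 0.84, 187),
    ("I-cell_line", 0.67, 0.76, 989),
    ("I-cell_type", 0.92, 0.76, 2991),
    ("I-protein", 0.94, 0.80, 4774),
]


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def weighted_average(values, weights) -> float:
    """Average of values, each weighted by its support."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


supports = [support for _, _, _, support in rows]
total_support = sum(supports)
weighted_precision = weighted_average([p for _, p, _, _ in rows], supports)
weighted_recall = weighted_average([r for _, _, r, _ in rows], supports)

print(total_support)                 # 19392
print(round(weighted_precision, 2))  # 0.89
print(round(weighted_recall, 2))     # 0.8
print(round(f1_score(0.87, 0.77), 2))  # 0.82 (B-DNA)
```

The total support and the weighted precision and recall computed this way agree with the accuracy and weighted-avg rows of the table.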