Description
The BioCreative II Gene Mention Recognition (BC2GM) Dataset contains data where participants are asked to identify a gene mentioned in a sentence by giving its start and end characters. The training set consists of a set of sentences and a set of gene mentions (GENE annotations) in the English language for each sentence.
This model is trained with the BertForTokenClassification
method from the transformers
library and imported into Spark NLP. The model detects genes/proteins from a medical text.
Predicted Entities
GENE/PROTEIN
How to use
document_assembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")\
sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
.setInputCols(["document"])\
.setOutputCol("sentence")
tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")
ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_bc2gm_gene", "en", "clinical/models")\
.setInputCols(["sentence", "token"])\
.setOutputCol("ner")\
.setCaseSensitive(True)\
.setMaxSentenceLength(512)
ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")
pipeline = Pipeline(stages=[
document_assembler,
sentence_detector,
tokenizer,
ner_model,
ner_converter
])
data = spark.createDataFrame([["""ROCK-I, Kinectin, and mDia2 can bind the wild type forms of both RhoA and Cdc42 in a GTP-dependent manner in vitro. These results support the hypothesis that in the presence of tryptophan the ribosome translating tnaC blocks Rho ' s access to the boxA and rut sites, thereby preventing transcription termination."""]]).toDF("text")
result = pipeline.fit(data).transform(data)
val document_assembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")
val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
.setInputCols(Array("document"))
.setOutputCol("sentence")
val tokenizer = new Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")
val ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_bc2gm_gene", "en", "clinical/models")
.setInputCols(Array("sentence", "token"))
.setOutputCol("ner")
.setCaseSensitive(True)
.setMaxSentenceLength(512)
val ner_converter = new NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")
val pipeline = new Pipeline().setStages(Array(document_assembler,
sentence_detector,
tokenizer,
ner_model,
ner_converter))
val data = Seq("""ROCK-I, Kinectin, and mDia2 can bind the wild type forms of both RhoA and Cdc42 in a GTP-dependent manner in vitro. These results support the hypothesis that in the presence of tryptophan the ribosome translating tnaC blocks Rho ' s access to the boxA and rut sites, thereby preventing transcription termination.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
import nlu
nlu.load("en.classify.token_bert.bc2gm_gene").predict("""ROCK-I, Kinectin, and mDia2 can bind the wild type forms of both RhoA and Cdc42 in a GTP-dependent manner in vitro. These results support the hypothesis that in the presence of tryptophan the ribosome translating tnaC blocks Rho ' s access to the boxA and rut sites, thereby preventing transcription termination.""")
Results
+---------+------------+
|ner_chunk|label |
+---------+------------+
|ROCK-I |GENE/PROTEIN|
|Kinectin |GENE/PROTEIN|
|mDia2 |GENE/PROTEIN|
|RhoA |GENE/PROTEIN|
|Cdc42 |GENE/PROTEIN|
|tnaC |GENE/PROTEIN|
|Rho |GENE/PROTEIN|
|boxA |GENE/PROTEIN|
|rut sites|GENE/PROTEIN|
+---------+------------+
Model Information
Model Name: | bert_token_classifier_ner_bc2gm_gene |
Compatibility: | Healthcare NLP 4.0.0+ |
License: | Licensed |
Edition: | Official |
Input Labels: | [sentence, token] |
Output Labels: | [ner] |
Language: | en |
Size: | 404.2 MB |
Case sensitive: | true |
Max sentence length: | 512 |
References
https://github.com/cambridgeltl/MTL-Bioinformatics-2016
Benchmarking
label precision recall f1-score support
B-GENE/PROTEIN 0.7907 0.9320 0.8556 6325
I-GENE/PROTEIN 0.8350 0.8651 0.8498 8776
micro-avg 0.8151 0.8931 0.8523 15101
macro-avg 0.8129 0.8986 0.8527 15101
weighted-avg 0.8165 0.8931 0.8522 15101