Detect Cancer Genetics (BertForTokenClassification)

Description

This model extracts biological and genetics terms in cancer-related texts using pre-trained NER model. This model is trained with the BertForTokenClassification method from the transformers library and imported into Spark NLP.

Predicted Entities

Amino_acid, Anatomical_system, Cancer, Cell, Cellular_component, Developing_anatomical_Structure, Gene_or_gene_product, Immaterial_anatomical_entity, Multi-tissue_structure, Organ, Organism, Organism_subdivision, Simple_chemical, Tissue, Organism_substance, Pathological_formation

Copy S3 URI

How to use

documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")

tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")

tokenClassifier = MedicalBertForTokenClassification.pretrained("bert_token_classifier_ner_bionlp", "en", "clinical/models")\
.setInputCols("token", "document")\
.setOutputCol("ner")\
.setCaseSensitive(True)

ner_converter = NerConverter()\
.setInputCols(["document","token","ner"])\
.setOutputCol("ner_chunk") 

pipeline = Pipeline(stages=[documentAssembler, 
tokenizer, 
tokenClassifier, 
ner_converter])

data = spark.createDataFrame([["""Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay."""]]).toDF("text")

result = pipeline.fit(data).transform(data)
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")

val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")

val tokenClassifier = MedicalBertForTokenClassification.pretrained("bert_token_classifier_ner_bionlp", "en", "clinical/models")
.setInputCols(Array("document","token"))
.setOutputCol("ner")
.setCaseSensitive(True)

val ner_converter = new NerConverter()
.setInputCols(Array("document","token","ner"))
.setOutputCol("ner_chunk")

val pipeline =  new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier, ner_converter))

val data = Seq("""Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay.""").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)
import nlu
nlu.load("en.classify.token_bert.bionlp").predict("""Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay.""")

Results

+-------------------+----------------------+
|chunk              |ner_label             |
+-------------------+----------------------+
|erbA IRES          |Organism              |
|erbA/myb virus     |Organism              |
|erythroid cells    |Cell                  |
|bone marrow        |Multi-tissue_structure|
|blastoderm cultures|Cell                  |
|erbA/myb IRES virus|Organism              |
|erbA IRES virus    |Organism              |
|blastoderm         |Cell                  |
+-------------------+----------------------+

Model Information

Model Name: bert_token_classifier_ner_bionlp
Compatibility: Healthcare NLP 3.4.0+
License: Licensed
Edition: Official
Input Labels: [sentence, token]
Output Labels: [ner]
Language: en
Size: 404.4 MB
Case sensitive: true
Max sentense length: 256

Data Source

Trained on Cancer Genetics (CG) task of the BioNLP Shared Task 2013. https://aclanthology.org/W13-2008/

Benchmarking

Label  precision    recall  f1-score   support
B-Cancer       0.88      0.82      0.85       924
B-Cell       0.84      0.86      0.85      1013
B-Cellular_component       0.87      0.84      0.86       180
B-Developing_anatomical_structure       0.65      0.65      0.65        17
B-Gene_or_gene_product       0.62      0.79      0.69      2520
B-Immaterial_anatomical_entity       0.68      0.74      0.71        31
B-Multi-tissue_structure       0.84      0.76      0.80       303
B-Organ       0.78      0.74      0.76       156
B-Organism       0.93      0.86      0.89       518
B-Organism_subdivision       0.74      0.51      0.61        39
B-Organism_substance       0.93      0.66      0.77       102
B-Pathological_formation       0.85      0.60      0.71        88
B-Simple_chemical       0.61      0.75      0.68       727
B-Tissue       0.74      0.83      0.78       184
I-Amino_acid       0.60      1.00      0.75         3
I-Cancer       0.91      0.69      0.78       604
I-Cell       0.98      0.74      0.84      1091
I-Cellular_component       0.88      0.62      0.73        69
I-Multi-tissue_structure       0.89      0.86      0.87       162
I-Organ       0.67      0.59      0.62        17
I-Organism       0.84      0.45      0.59       120
I-Organism_substance       0.80      0.50      0.62        24
I-Pathological_formation       0.81      0.56      0.67        39
I-Tissue       0.83      0.86      0.84       111
accuracy       -         -         0.64     12129
macro-avg       0.73      0.56      0.60     12129
weighted-avg       0.83      0.64      0.68     12129