Description
This pre-trained NER model extracts biological and genetics terms from cancer-related texts. It was trained with the BertForTokenClassification method from the transformers library and imported into Spark NLP.
Predicted Entities
Amino_acid, Anatomical_system, Cancer, Cell, Cellular_component, Developing_anatomical_structure, Gene_or_gene_product, Immaterial_anatomical_entity, Multi-tissue_structure, Organ, Organism, Organism_subdivision, Simple_chemical, Tissue, Organism_substance, Pathological_formation
How to use
# Python. Assumes `spark` is an active session with Spark NLP and Healthcare NLP loaded.
from pyspark.ml import Pipeline
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *

# Wrap the raw text column as a document annotation
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

# Token-level tagger; emits one tag per token in the "ner" column
tokenClassifier = MedicalBertForTokenClassification.pretrained("bert_token_classifier_ner_bionlp", "en", "clinical/models")\
    .setInputCols(["document", "token"])\
    .setOutputCol("ner")\
    .setCaseSensitive(True)

# Merge token-level tags into entity chunks
ner_converter = NerConverter()\
    .setInputCols(["document", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[documentAssembler,
                            tokenizer,
                            tokenClassifier,
                            ner_converter])

data = spark.createDataFrame([["""Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay."""]]).toDF("text")
result = pipeline.fit(data).transform(data)
// Scala. Assumes `spark` is an active SparkSession and the Spark NLP annotator classes are already imported.
import org.apache.spark.ml.Pipeline
import spark.implicits._ // needed for .toDS on a Seq

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")

val tokenClassifier = MedicalBertForTokenClassification.pretrained("bert_token_classifier_ner_bionlp", "en", "clinical/models")
  .setInputCols(Array("document", "token"))
  .setOutputCol("ner")
  .setCaseSensitive(true)

val ner_converter = new NerConverter()
  .setInputCols(Array("document", "token", "ner"))
  .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier, ner_converter))

val data = Seq("""Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
import nlu
nlu.load("en.classify.token_bert.bionlp").predict("""Both the erbA IRES and the erbA/myb virus constructs transformed erythroid cells after infection of bone marrow or blastoderm cultures. The erbA/myb IRES virus exhibited a 5-10-fold higher transformed colony forming efficiency than the erbA IRES virus in the blastoderm assay.""")
Results
+-------------------+----------------------+
|chunk              |ner_label             |
+-------------------+----------------------+
|erbA IRES          |Organism              |
|erbA/myb virus     |Organism              |
|erythroid cells    |Cell                  |
|bone marrow        |Multi-tissue_structure|
|blastoderm cultures|Cell                  |
|erbA/myb IRES virus|Organism              |
|erbA IRES virus    |Organism              |
|blastoderm         |Cell                  |
+-------------------+----------------------+
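The benchmark labels carry B- and I- prefixes, which suggests the model tags individual tokens and that the chunks in the table above come from merging consecutive tags in the converter stage. The following is a minimal pure-Python sketch of that kind of merging, not the Spark NLP implementation. The function name iob_to_chunks, the token list, and the tag sequence are hypothetical and were chosen only to be consistent with two rows of the table.

```python
from typing import List, Optional, Tuple


def iob_to_chunks(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge token-level IOB tags into (chunk_text, label) pairs.

    Illustrative only: a B- tag opens a chunk, a matching I- tag extends it,
    and anything else closes it. An I- tag with no matching open chunk is
    treated as the start of a new chunk (an assumption of this sketch).
    """
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    def flush() -> None:
        # Emit the open chunk, if any, and reset the state.
        nonlocal current_tokens, current_label
        if current_tokens and current_label is not None:
            chunks.append((" ".join(current_tokens), current_label))
        current_tokens, current_label = [], None

    for token, tag in zip(tokens, tags):
        # Split only at the first hyphen so labels such as
        # "Multi-tissue_structure" stay intact.
        prefix, _, label = tag.partition("-")
        if prefix == "B":
            flush()
            current_tokens, current_label = [token], label
        elif prefix == "I" and label == current_label:
            current_tokens.append(token)
        else:
            flush()
            if prefix == "I":
                current_tokens, current_label = [token], label
    flush()
    return chunks


# Hypothetical tags for a fragment of the example sentence.
tokens = ["infection", "of", "bone", "marrow", "or", "blastoderm", "cultures"]
tags = ["O", "O", "B-Multi-tissue_structure", "I-Multi-tissue_structure",
        "O", "B-Cell", "I-Cell"]
chunks = iob_to_chunks(tokens, tags)
for text, label in chunks:
    print(f"{text}\t{label}")
```

With these assumed tags, the sketch yields the chunks "bone marrow" and "blastoderm cultures" with the same labels as in the table.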
Model Information
| Model Name: | bert_token_classifier_ner_bionlp | 
| Compatibility: | Healthcare NLP 3.4.0+ | 
| License: | Licensed | 
| Edition: | Official | 
| Input Labels: | [sentence, token] | 
| Output Labels: | [ner] | 
| Language: | en | 
| Size: | 404.4 MB | 
| Case sensitive: | true | 
| Max sentence length: | 256 | 
Data Source
Trained on the Cancer Genetics (CG) task of the BioNLP Shared Task 2013. https://aclanthology.org/W13-2008/
Benchmarking
Label                              precision    recall  f1-score   support
B-Cancer                                0.88      0.82      0.85       924
B-Cell                                  0.84      0.86      0.85      1013
B-Cellular_component                    0.87      0.84      0.86       180
B-Developing_anatomical_structure       0.65      0.65      0.65        17
B-Gene_or_gene_product                  0.62      0.79      0.69      2520
B-Immaterial_anatomical_entity          0.68      0.74      0.71        31
B-Multi-tissue_structure                0.84      0.76      0.80       303
B-Organ                                 0.78      0.74      0.76       156
B-Organism                              0.93      0.86      0.89       518
B-Organism_subdivision                  0.74      0.51      0.61        39
B-Organism_substance                    0.93      0.66      0.77       102
B-Pathological_formation                0.85      0.60      0.71        88
B-Simple_chemical                       0.61      0.75      0.68       727
B-Tissue                                0.74      0.83      0.78       184
I-Amino_acid                            0.60      1.00      0.75         3
I-Cancer                                0.91      0.69      0.78       604
I-Cell                                  0.98      0.74      0.84      1091
I-Cellular_component                    0.88      0.62      0.73        69
I-Multi-tissue_structure                0.89      0.86      0.87       162
I-Organ                                 0.67      0.59      0.62        17
I-Organism                              0.84      0.45      0.59       120
I-Organism_substance                    0.80      0.50      0.62        24
I-Pathological_formation                0.81      0.56      0.67        39
I-Tissue                                0.83      0.86      0.84       111
accuracy                                   -         -      0.64     12129
macro-avg                               0.73      0.56      0.60     12129
weighted-avg                            0.83      0.64      0.68     12129
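The f1-score column is the harmonic mean of precision and recall. As a quick sanity check, the sketch below recomputes it for three rows copied from the table. The helper name f1_score is hypothetical, and because the published precision and recall are rounded to two decimals, a recomputed value can differ from the reported one by about 0.01.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported f1) copied from the benchmark table.
rows = {
    "B-Cancer": (0.88, 0.82, 0.85),
    "B-Gene_or_gene_product": (0.62, 0.79, 0.69),
    "I-Organism": (0.84, 0.45, 0.59),
}

for label, (p, r, reported) in rows.items():
    computed = f1_score(p, r)
    print(f"{label}: computed {computed:.2f}, reported {reported:.2f}")
```

The I-Organism row shows how a large gap between precision and recall pulls the F1 score toward the lower of the two values.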