Detect Bacterial Species (BertForTokenClassification)

Description

Detect bacterial species mentioned in text using a pretrained NER model. The model was trained with the BertForTokenClassification method from the transformers library and imported into Spark NLP.

Predicted Entities

SPECIES


How to use

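# Python
# Assumes an active Spark session named `spark` with Healthcare NLP loaded.
import pandas as pd
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, NerConverter
from sparknlp_jsl.annotator import MedicalBertForTokenClassification

documentAssembler = DocumentAssembler()\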
        .setInputCol("text")\
        .setOutputCol("document")

tokenizer = Tokenizer()\
        .setInputCols(["document"])\
        .setOutputCol("token")

tokenClassifier = MedicalBertForTokenClassification.pretrained("bert_token_classifier_ner_bacteria", "en", "clinical/models")\
        .setInputCols("token", "document")\
        .setOutputCol("ner")\
        .setCaseSensitive(True)

ner_converter = NerConverter()\
        .setInputCols(["document","token","ner"])\
        .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[documentAssembler, tokenizer, tokenClassifier, ner_converter])

p_model = pipeline.fit(spark.createDataFrame(pd.DataFrame({'text': ['']})))

test_sentence = """Based on these genetic and phenotypic properties, we propose that strain SMSP (T) represents \
a novel species of the genus Methanoregula, for which we propose the name Methanoregula formicica \
sp. nov., with the type strain SMSP (T) (= NBRC 105244 (T) = DSM 22288 (T))."""

result = p_model.transform(spark.createDataFrame(pd.DataFrame({'text': [test_sentence]})))

// Scala
import spark.implicits._ // needed for .toDS on the Seq below

val documentAssembler = new DocumentAssembler()
        .setInputCol("text")
        .setOutputCol("document")

val tokenizer = new Tokenizer()
        .setInputCols("document")
        .setOutputCol("token")

val tokenClassifier = MedicalBertForTokenClassification.pretrained("bert_token_classifier_ner_bacteria", "en", "clinical/models")
        .setInputCols(Array("token", "document"))
        .setOutputCol("ner")
        .setCaseSensitive(true)

val ner_converter = new NerConverter()
        .setInputCols(Array("document","token","ner"))
        .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, tokenClassifier, ner_converter))

val data = Seq("""Based on these genetic and phenotypic properties, we propose that strain SMSP (T) represents a novel species of the genus Methanoregula, for which we propose the name Methanoregula formicica sp. nov., with the type strain SMSP (T) (= NBRC 105244 (T) = DSM 22288 (T)).""").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)

# Python (NLU)
import nlu
nlu.load("en.classify.token_bert.ner_bacteria").predict("""Based on these genetic and phenotypic properties, we propose that strain SMSP (T) represents \
a novel species of the genus Methanoregula, for which we propose the name Methanoregula formicica \
sp. nov., with the type strain SMSP (T) (= NBRC 105244 (T) = DSM 22288 (T)).""")

Results

+-----------------------+---------+
|chunk                  |ner_label|
+-----------------------+---------+
|SMSP (T)               |SPECIES  |
|Methanoregula formicica|SPECIES  |
|SMSP (T)               |SPECIES  |
+-----------------------+---------+
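
The token classifier emits per-token IOB tags (B-SPECIES, I-SPECIES, O), and the converter stage merges them into the SPECIES chunks listed above. The following is a minimal pure-Python sketch of that merging idea, not Spark NLP's NerConverter implementation. The function name `iob_to_chunks` is hypothetical, and the tokens and tags are hand-written for illustration rather than taken from model output.

```python
from typing import List, Optional, Tuple


def iob_to_chunks(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge IOB-tagged tokens into (chunk_text, label) pairs."""
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    for token, tag in zip(tokens, tags):
        # "B-SPECIES" -> ("B", "SPECIES"); "O" -> ("O", "")
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A B- tag, or an I- tag with no matching open chunk, starts a new chunk.
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [token], label
        elif prefix == "I":
            # Continuation of the open chunk.
            current_tokens.append(token)
        else:
            # "O" closes any open chunk.
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_label))
            current_tokens, current_label = [], None

    # Flush a chunk that runs to the end of the sentence.
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_label))
    return chunks


# Hand-labelled illustration, not actual model output.
tokens = ["the", "name", "Methanoregula", "formicica", "sp", ".", "nov"]
tags = ["O", "O", "B-SPECIES", "I-SPECIES", "O", "O", "O"]
chunks = iob_to_chunks(tokens, tags)
print(chunks)
```

This sketch joins tokens with a single space, whereas the real converter works from character offsets in the original text, so spacing in its chunks follows the source sentence.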

Model Information

Model Name: bert_token_classifier_ner_bacteria
Compatibility: Healthcare NLP 3.2.2+
License: Licensed
Edition: Official
Input Labels: [sentence, token]
Output Labels: [ner]
Language: en
Case sensitive: true
Max sentence length: 512

Data Source

Trained on a custom dataset by John Snow Labs.

Benchmarking

label          precision   recall   f1-score   support
B-SPECIES      0.98        0.84     0.91       767
I-SPECIES      0.99        0.84     0.91       1043
accuracy       -           -        0.84       1810
macro-avg      0.85        0.89     0.87       1810
weighted-avg   0.99        0.84     0.91       1810
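
For readers who want to see how the columns relate, here is a small sketch that recomputes F1 as the harmonic mean of precision and recall, and the support-weighted precision, from the per-label rows. It assumes the standard definitions of these metrics. Because the table values are already rounded to two decimals, recomputed figures can differ from the reported ones in the last digit. The helper name `f1_score` is illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, support) copied from the per-label rows of the table.
reported = {
    "B-SPECIES": (0.98, 0.84, 767),
    "I-SPECIES": (0.99, 0.84, 1043),
}

for label, (p, r, _) in reported.items():
    print(f"{label}: F1 from rounded inputs = {f1_score(p, r):.3f}")

# Support-weighted average: each label contributes in proportion to its support.
total_support = sum(n for _, _, n in reported.values())
weighted_precision = sum(p * n for p, _, n in reported.values()) / total_support
print(f"total support = {total_support}")
print(f"weighted precision = {weighted_precision:.3f}")
```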