Detect Organism in Medical Text

Description

This model detects organism mentions in medical text. Species-800 is a corpus for species entities built from 800 manually annotated PubMed abstracts, each containing identified organism mentions.

This model was trained with the BertForTokenClassification class from the transformers library and imported into Spark NLP.

Predicted Entities

SPECIES


How to use

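# Python. Assumes an active, licensed Healthcare NLP session available as `spark`.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, NerConverter
from sparknlp_jsl.annotator import MedicalBertForTokenClassifier

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")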

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_species", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("ner")\
    .setCaseSensitive(True)\
    .setMaxSentenceLength(512)

ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    document_assembler, 
    sentence_detector,
    tokenizer,
    ner_model,
    ner_converter   
    ])

data = spark.createDataFrame([["""As determined by 16S rRNA gene sequence analysis, strain 6C (T) represents a distinct species belonging to the class Betaproteobacteria and is most closely related to Thiomonas intermedia DSM 18155 (T) and Thiomonas perometabolis DSM 18570 (T) ."""]]).toDF("text")

result = pipeline.fit(data).transform(data)

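// Scala. Assumes the Spark NLP and Healthcare NLP classes are already imported
// and that spark.implicits._ is in scope (needed for toDS below).
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")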

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

val ner_model = MedicalBertForTokenClassifier.pretrained("bert_token_classifier_ner_species", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("ner")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler, 
                                                   sentence_detector,
                                                   tokenizer,
                                                   ner_model,
                                                   ner_converter))

val data = Seq("""As determined by 16S rRNA gene sequence analysis, strain 6C (T) represents a distinct species belonging to the class Betaproteobacteria and is most closely related to Thiomonas intermedia DSM 18155 (T) and Thiomonas perometabolis DSM 18570 (T) .""").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)

# NLU (Python)
import nlu
nlu.load("en.classify.token_bert.species").predict("""As determined by 16S rRNA gene sequence analysis, strain 6C (T) represents a distinct species belonging to the class Betaproteobacteria and is most closely related to Thiomonas intermedia DSM 18155 (T) and Thiomonas perometabolis DSM 18570 (T) .""")

Results

+-----------------------+-------+
|ner_chunk              |label  |
+-----------------------+-------+
|6C (T)                 |SPECIES|
|Betaproteobacteria     |SPECIES|
|Thiomonas intermedia   |SPECIES|
|DSM 18155 (T)          |SPECIES|
|Thiomonas perometabolis|SPECIES|
|DSM 18570 (T)          |SPECIES|
+-----------------------+-------+
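
Each row in the table above is a chunk assembled from token-level B-SPECIES / I-SPECIES tags. The following is a minimal, standalone Python sketch of that kind of merging, not the NerConverter implementation itself. The function name `merge_iob`, the regex tokenizer, and the tag sequence are all assumptions made for illustration; the tags were chosen by hand to be consistent with two rows of the table and are not actual model output.

```python
import re
from typing import List, Tuple

# (begin offset, inclusive end offset, IOB tag) for one token
Token = Tuple[int, int, str]


def merge_iob(text: str, tokens: List[Token]) -> List[Tuple[str, str]]:
    """Merge B-/I- tagged tokens into (chunk_text, label) pairs.

    Chunk text is sliced from the original string by character offset,
    so spacing inside a chunk is preserved exactly as written.
    """
    chunks = []
    start = end = None
    label = None
    for begin, stop, tag in tokens:
        prefix, _, entity = tag.partition("-")
        continues = prefix == "I" and label == entity
        # Close the open chunk when the current token does not extend it.
        if label is not None and not continues:
            chunks.append((text[start:end + 1], label))
            label = None
        if prefix == "B" or (prefix == "I" and label is None):
            # Start a new chunk (a stray I- tag is treated leniently as a start).
            start, end, label = begin, stop, entity
        elif continues:
            end = stop
    if label is not None:
        chunks.append((text[start:end + 1], label))
    return chunks


# Hypothetical input: a fragment of the example sentence with hand-written tags.
text = "related to Thiomonas intermedia DSM 18155 (T) ."
tags = ["O", "O", "B-SPECIES", "I-SPECIES",
        "B-SPECIES", "I-SPECIES", "I-SPECIES", "I-SPECIES", "I-SPECIES", "O"]
spans = [(m.start(), m.end() - 1) for m in re.finditer(r"\w+|[^\w\s]", text)]
tokens = [(b, e, t) for (b, e), t in zip(spans, tags)]

chunks = merge_iob(text, tokens)
for chunk, label in chunks:
    print(f"{chunk} -> {label}")
```

Slicing by offset rather than joining token strings is what keeps "(T)" intact instead of producing "( T )".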

Model Information

Model Name:	bert_token_classifier_ner_species
Compatibility:	Healthcare NLP 4.0.0+
License:	Licensed
Edition:	Official
Input Labels:	[sentence, token]
Output Labels:	[ner]
Language:	en
Size:	404.2 MB
Case sensitive:	true
Max sentence length:	512

References

https://species.jensenlab.org/

Benchmarking

 label         precision  recall  f1-score  support
 B-SPECIES     0.6073     0.9374  0.7371    767     
 I-SPECIES     0.7418     0.8648  0.7986    1043    
 micro-avg     0.6754     0.8956  0.7701    1810    
 macro-avg     0.6745     0.9011  0.7678    1810    
 weighted-avg  0.6848     0.8956  0.7725    1810 
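
The average rows can be related back to the two per-label rows. The sketch below recomputes them in plain Python from the rounded precision, recall, and support values in the table. It assumes the usual definitions (F1 as the harmonic mean of precision and recall, macro as an unweighted mean, weighted as a support-weighted mean). For the micro average it reconstructs approximate true-positive and predicted counts from the rounded figures, so its results are estimates rather than the original counts.

```python
# Per-label rows copied from the benchmark table: (precision, recall, support).
rows = {
    "B-SPECIES": (0.6073, 0.9374, 767),
    "I-SPECIES": (0.7418, 0.8648, 1043),
}


def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


per_label_f1 = {k: f1(p, r) for k, (p, r, _) in rows.items()}
total_support = sum(s for _, _, s in rows.values())

# Macro average: unweighted mean over labels.
macro_recall = sum(r for _, r, _ in rows.values()) / len(rows)
macro_f1 = sum(per_label_f1.values()) / len(rows)

# Weighted average: mean over labels weighted by support.
weighted_precision = sum(p * s for p, _, s in rows.values()) / total_support
weighted_recall = sum(r * s for _, r, s in rows.values()) / total_support
weighted_f1 = sum(per_label_f1[k] * s for k, (_, _, s) in rows.items()) / total_support

# Micro average: pool approximate counts across labels.
# true positives ~= recall * support; predicted ~= true positives / precision.
true_pos = {k: r * s for k, (_, r, s) in rows.items()}
predicted = {k: true_pos[k] / p for k, (p, _, _) in rows.items()}
micro_precision = sum(true_pos.values()) / sum(predicted.values())
micro_recall = sum(true_pos.values()) / total_support
micro_f1 = f1(micro_precision, micro_recall)

for name, value in [
    ("micro precision", micro_precision),
    ("micro recall", micro_recall),
    ("micro f1", micro_f1),
    ("macro recall", macro_recall),
    ("weighted precision", weighted_precision),
    ("weighted f1", weighted_f1),
]:
    print(f"{name}: {value:.4f}")
```

The recomputed values agree with the table to about three decimal places. Small differences are expected because the inputs are already rounded to four decimals.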
