Detect Bacterial Species (LangTest)

Description

This model detects bacterial species mentioned in clinical texts. It is a version of the ner_bacterial_species model augmented with the langtest library. The table below compares robustness test results before and after augmentation.

test_type         before fail_count  after fail_count  before pass_count  after pass_count  minimum pass_rate  before pass_rate  after pass_rate
lowercase         200                43                864                1021              90%                81%               96%
swap_entities     66                 56                264                268               75%                80%               83%
titlecase         273                116               791                948               85%                74%               89%
uppercase         305                114               760                951               90%                71%               89%
weighted average  844                329               2679               3188              83%                76.04%            90.65%
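
The pass rates in the table above follow from the fail and pass counts as pass_count / (fail_count + pass_count). The short Python sketch below recomputes them from the counts in the table; the dictionary layout and the helper names (`pass_rate`, `summarize`, `overall_rate`) are illustrative assumptions, not part of langtest.

```python
# Counts copied from the table above as (fail_count, pass_count).
# The dictionary layout and helper names are hypothetical, for illustration only.
results = {
    "lowercase":     {"before": (200, 864), "after": (43, 1021), "min_rate": 0.90},
    "swap_entities": {"before": (66, 264),  "after": (56, 268),  "min_rate": 0.75},
    "titlecase":     {"before": (273, 791), "after": (116, 948), "min_rate": 0.85},
    "uppercase":     {"before": (305, 760), "after": (114, 951), "min_rate": 0.90},
}


def pass_rate(fail_count, pass_count):
    """Fraction of test cases that passed."""
    return pass_count / (fail_count + pass_count)


def summarize(results):
    """Per-test pass rates before and after, plus a check against the minimum."""
    summary = {}
    for test_type, row in results.items():
        after = pass_rate(*row["after"])
        summary[test_type] = {
            "before": pass_rate(*row["before"]),
            "after": after,
            "meets_minimum": after >= row["min_rate"],
        }
    return summary


def overall_rate(results, stage):
    """Pooled pass rate over all test types for 'before' or 'after'."""
    fails = sum(row[stage][0] for row in results.values())
    passes = sum(row[stage][1] for row in results.values())
    return pass_rate(fails, passes)


if __name__ == "__main__":
    for test_type, row in summarize(results).items():
        print(f"{test_type}: {row['before']:.2%} -> {row['after']:.2%}, "
              f"meets minimum: {row['meets_minimum']}")
    print(f"overall: {overall_rate(results, 'before'):.2%} -> "
          f"{overall_rate(results, 'after'):.2%}")
```

Pooling the counts across all four test types reproduces the 76.04% and 90.65% figures in the weighted average row. By the same arithmetic, the unrounded uppercase rate after augmentation (951 / 1065) is still slightly under its 90% minimum.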

Predicted Entities

SPECIES

How to use

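# Python. Assumes an active Spark session named `spark` with Healthcare NLP loaded.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, WordEmbeddingsModel, NerConverter
from sparknlp_jsl.annotator import MedicalNerModel

document_assembler = DocumentAssembler()\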
    .setInputCol("text")\
    .setOutputCol("document")
         
sentence_detector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")

clinical_ner = MedicalNerModel.pretrained("ner_bacterial_species_langtest", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings_clinical, clinical_ner, ner_converter])

model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

result = model.transform(spark.createDataFrame([["""The PRP8 intein, the most widespread among fungi, occurs in important pathogens such as Histoplasma capsulatum and Paracoccidioides brasiliensis, from the Ajellomycetaceae family."""]], ["text"]))
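
// Scala. Assumes an active Spark session named `spark` with Healthcare NLP on the classpath.
import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.annotators.ner.MedicalNerModel
import spark.implicits._ // needed for .toDS()

val document_assembler = new DocumentAssembler()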
    .setInputCol("text")
    .setOutputCol("document")
         
val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val embeddings_clinical = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner = MedicalNerModel.pretrained("ner_bacterial_species_langtest", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")
    
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings_clinical, ner, ner_converter))

val data = Seq("""The PRP8 intein, the most widespread among fungi, occurs in important pathogens such as Histoplasma capsulatum and Paracoccidioides brasiliensis, from the Ajellomycetaceae family.""").toDS().toDF("text")

val result = pipeline.fit(data).transform(data)

Results

+-----------------------------+---------+
|chunk                        |ner_label|
+-----------------------------+---------+
|Histoplasma capsulatum       |SPECIES  |
|Paracoccidioides brasiliensis|SPECIES  |
+-----------------------------+---------+
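
Each row in the table above is a chunk: a run of adjacent tokens that the NER stage tagged as belonging to one entity, merged by the converter stage. The sketch below illustrates that merging idea in plain Python, assuming IOB-style tags such as B-SPECIES and I-SPECIES. The function `iob_to_chunks` and the sample tags are hypothetical; this is a simplified illustration, not the library's implementation.

```python
def iob_to_chunks(tokens, tags):
    """Merge consecutive B-/I- tagged tokens into (chunk_text, label) pairs.

    Simplified illustration: an I- tag with no matching open chunk
    is treated as the start of a new chunk.
    """
    chunks = []
    current_tokens, current_label = [], None

    def flush():
        # Emit the chunk collected so far, if any.
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_label))

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            flush()
            current_tokens, current_label = [token], label
        elif prefix == "I":
            current_tokens.append(token)
        else:  # "O": outside any entity
            flush()
            current_tokens, current_label = [], None
    flush()
    return chunks


if __name__ == "__main__":
    # Hypothetical token-level tags for part of the example sentence.
    tokens = ["such", "as", "Histoplasma", "capsulatum", "and",
              "Paracoccidioides", "brasiliensis", ","]
    tags = ["O", "O", "B-SPECIES", "I-SPECIES", "O",
            "B-SPECIES", "I-SPECIES", "O"]
    for chunk, label in iob_to_chunks(tokens, tags):
        print(f"{chunk}\t{label}")
```

With these sample tags, the function yields the same two chunks as the table, each labeled SPECIES.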

Model Information

Model Name: ner_bacterial_species_langtest
Compatibility: Healthcare NLP 5.1.1+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Size: 14.7 MB

Benchmarking

label         precision  recall  f1-score  support 
SPECIES       0.84       0.90    0.87      521     
micro-avg     0.84       0.90    0.87      521     
macro-avg     0.84       0.90    0.87      521     
weighted-avg  0.84       0.90    0.87      521
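
The f1-score column is the harmonic mean of precision and recall. As a quick check, the Python sketch below recomputes it from the rounded values in the table; the helper name `f1_score` is an illustrative choice, not a call into any particular library.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Rounded precision and recall for SPECIES, copied from the table above.
    print(round(f1_score(0.84, 0.90), 2))  # prints 0.87
```

Because SPECIES is the only label, the micro, macro, and weighted averages all reduce to that single row, which is why the four rows are identical.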