Detect Living Species (LangTest)

Description

Extract living species from clinical texts which is critical to scientific disciplines like medicine, biology, ecology/biodiversity, nutrition, and agriculture.

It is trained on the LivingNER corpus that is composed of clinical case reports extracted from miscellaneous medical specialties including COVID, oncology, infectious diseases, tropical medicine, urology, pediatrics, and others. This model is the version of ner_living_species model augmented with langtest library.

NOTE:

  1. The text files were translated from Spanish with a neural machine translation system.
  2. The annotations were translated with the same neural machine translation system.
  3. The translated annotations were transferred to the translated text files using an annotation transfer technology.
test_type before fail_count after fail_count before pass_count after pass_count minimum pass_rate before pass_rate after pass_rate
add_ocr_typo 1537 128 832 2241 90% 35% 95%
add_typo 220 169 2120 2186 90% 91% 93%
lowercase 306 146 2090 2250 90% 87% 94%
titlecase 478 290 1960 2148 80% 80% 88%
uppercase 1243 363 1196 2076 80% 49% 85%
weighted average 3784 1096 8198 10901 86% 68.42% 90.86%

Predicted Entities

HUMAN, SPECIES

Copy S3 URI

How to use

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols("sentence", "token")\
    .setOutputCol("embeddings")

ner_model = MedicalNerModel.pretrained("ner_living_species_langtest", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")\

ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    document_assembler, 
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter   
])

data = spark.createDataFrame([["""42-year-old woman with end-stage chronic kidney disease, secondary to lupus nephropathy, and on peritoneal dialysis. History of four episodes of bacterial peritonitis and change of Tenckhoff catheter six months prior to admission due to catheter dysfunction. Three peritoneal fluid samples during her hospitalisation tested positive for Fusarium spp. The patient responded favourably and continued outpatient treatment with voriconazole (4mg/kg every 12 hours orally). All three isolates were identified as species of the Fusarium solani complex. In vitro susceptibility to itraconazole, voriconazole and posaconazole, according to Clinical and Laboratory Standards Institute - CLSI (M38-A) methodology, showed a minimum inhibitory concentration (MIC) in all three isolates and for all three antifungals of >16 μg/mL."""]]).toDF("text")

result = pipeline.fit(data).transform(data)
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner_model = MedicalNerModel.pretrained("ner_living_species_langtest", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler, 
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter
))

val data = Seq("""42-year-old woman with end-stage chronic kidney disease, secondary to lupus nephropathy, and on peritoneal dialysis. History of four episodes of bacterial peritonitis and change of Tenckhoff catheter six months prior to admission due to catheter dysfunction. Three peritoneal fluid samples during her hospitalisation tested positive for Fusarium spp. The patient responded favourably and continued outpatient treatment with voriconazole (4mg/kg every 12 hours orally). All three isolates were identified as species of the Fusarium solani complex. In vitro susceptibility to itraconazole, voriconazole and posaconazole, according to Clinical and Laboratory Standards Institute - CLSI (M38-A) methodology, showed a minimum inhibitory concentration (MIC) in all three isolates and for all three antifungals of >16 μg/mL.""").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)

Results

+-----------------------+---------+
|chunk                  |ner_label|
+-----------------------+---------+
|woman                  |HUMAN    |
|bacterial              |SPECIES  |
|Fusarium spp           |SPECIES  |
|patient                |HUMAN    |
|species                |SPECIES  |
|Fusarium solani complex|SPECIES  |
|antifungals            |SPECIES  |
+-----------------------+---------+

Model Information

Model Name: ner_living_species_langtest
Compatibility: Healthcare NLP 5.1.1+
License: Licensed
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: en
Size: 14.8 MB

References

https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/

Benchmarking

label         precision  recall  f1-score  support 
HUMAN         0.94       0.96    0.95      1830    
SPECIES       0.87       0.87    0.87      2143    
micro-avg     0.90       0.91    0.91      3973    
macro-avg     0.91       0.91    0.91      3973    
weighted-avg  0.90       0.91    0.91      3973