Detect Living Species (LangTest)

Description

Extract living species from clinical texts, a task that is critical to scientific disciplines such as medicine, biology, ecology/biodiversity, nutrition, and agriculture.

The model is trained on the LivingNER corpus, which is composed of clinical case reports drawn from a range of medical specialties, including COVID, oncology, infectious diseases, tropical medicine, urology, pediatrics, and others. This model is a version of the ner_living_species model augmented with the langtest library.
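
As a rough illustration of what the case-related robustness tests in the table below perturb, the following sketch applies lowercase, uppercase, and title case to a phrase from the example text. It uses only Python's built-in string methods as stand-ins; the helper name `case_variants` is hypothetical, and the langtest library's own implementation may differ.

```python
def case_variants(text: str) -> dict:
    """Return case-perturbed copies of `text`.

    Hypothetical helper for illustration only; not part of the langtest API.
    """
    return {
        "lowercase": text.lower(),
        "uppercase": text.upper(),
        "titlecase": text.title(),
    }


variants = case_variants("positive for Fusarium spp")
for name, perturbed in variants.items():
    print(f"{name:10s} {perturbed}")
```

A robust NER model should ideally tag "Fusarium spp" as SPECIES in all three variants, which is what the pass counts in the table measure.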

NOTE:

- The text files were translated from Spanish with a neural machine translation system.
- The annotations were translated with the same neural machine translation system.
- The translated annotations were transferred to the translated text files using an annotation transfer technology.

test_type         before fail_count  after fail_count  before pass_count  after pass_count  minimum pass_rate  before pass_rate  after pass_rate
add_ocr_typo      1537               128               832                2241              90%                35%               95%
add_typo          220                169               2120               2186              90%                91%               93%
lowercase         306                146               2090               2250              90%                87%               94%
titlecase         478                290               1960               2148              80%                80%               88%
uppercase         1243               363               1196               2076              80%                49%               85%
weighted average  3784               1096              8198               10901             86%                68.42%            90.86%
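
The pass rates in the table can be reproduced from the fail and pass counts. The sketch below assumes that pass rate is pass_count / (fail_count + pass_count) and that the "weighted average" row pools the counts across all test types; both assumptions are consistent with the numbers shown, but the page does not state them explicitly. The names `counts`, `pass_rate`, and `overall` are illustrative.

```python
# (fail_count, pass_count) before and after augmentation, copied from the table.
counts = {
    "add_ocr_typo": {"before": (1537, 832), "after": (128, 2241)},
    "add_typo": {"before": (220, 2120), "after": (169, 2186)},
    "lowercase": {"before": (306, 2090), "after": (146, 2250)},
    "titlecase": {"before": (478, 1960), "after": (290, 2148)},
    "uppercase": {"before": (1243, 1196), "after": (363, 2076)},
}


def pass_rate(fail: int, passed: int) -> float:
    """Fraction of test samples that passed."""
    return passed / (fail + passed)


def overall(stage: str):
    """Pool counts over all test types and return (fails, passes, pass_rate)."""
    fails = sum(c[stage][0] for c in counts.values())
    passes = sum(c[stage][1] for c in counts.values())
    return fails, passes, pass_rate(fails, passes)


for name, c in counts.items():
    before = pass_rate(*c["before"])
    after = pass_rate(*c["after"])
    print(f"{name:14s} before={before:.0%} after={after:.0%}")

for stage in ("before", "after"):
    fails, passes, rate = overall(stage)
    print(f"overall {stage}: fail={fails} pass={passes} rate={rate:.2%}")
```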

Predicted Entities

HUMAN, SPECIES


How to use

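# Python
# Assumes `spark` is an active Spark session with Spark NLP and Healthcare NLP loaded.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, WordEmbeddingsModel, NerConverter
from sparknlp_jsl.annotator import MedicalNerModel

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")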

sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols("sentence", "token")\
    .setOutputCol("embeddings")

ner_model = MedicalNerModel.pretrained("ner_living_species_langtest", "en", "clinical/models")\
    .setInputCols(["sentence", "token", "embeddings"])\
    .setOutputCol("ner")

ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

pipeline = Pipeline(stages=[
    document_assembler, 
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter   
])

data = spark.createDataFrame([["""42-year-old woman with end-stage chronic kidney disease, secondary to lupus nephropathy, and on peritoneal dialysis. History of four episodes of bacterial peritonitis and change of Tenckhoff catheter six months prior to admission due to catheter dysfunction. Three peritoneal fluid samples during her hospitalisation tested positive for Fusarium spp. The patient responded favourably and continued outpatient treatment with voriconazole (4mg/kg every 12 hours orally). All three isolates were identified as species of the Fusarium solani complex. In vitro susceptibility to itraconazole, voriconazole and posaconazole, according to Clinical and Laboratory Standards Institute - CLSI (M38-A) methodology, showed a minimum inhibitory concentration (MIC) in all three isolates and for all three antifungals of >16 μg/mL."""]]).toDF("text")

result = pipeline.fit(data).transform(data)

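// Scala
// Assumes `spark` is an active SparkSession with Spark NLP and Healthcare NLP on the classpath.
import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import com.johnsnowlabs.nlp.annotators.ner.MedicalNerModel
import spark.implicits._ // needed for .toDS below

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")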

val sentence_detector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare", "en", "clinical/models")
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("embeddings")

val ner_model = MedicalNerModel.pretrained("ner_living_species_langtest", "en", "clinical/models")
    .setInputCols(Array("sentence", "token", "embeddings"))
    .setOutputCol("ner")

val ner_converter = new NerConverter()
    .setInputCols(Array("sentence", "token", "ner"))
    .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(document_assembler, 
    sentence_detector,
    tokenizer,
    embeddings,
    ner_model,
    ner_converter
))

val data = Seq("""42-year-old woman with end-stage chronic kidney disease, secondary to lupus nephropathy, and on peritoneal dialysis. History of four episodes of bacterial peritonitis and change of Tenckhoff catheter six months prior to admission due to catheter dysfunction. Three peritoneal fluid samples during her hospitalisation tested positive for Fusarium spp. The patient responded favourably and continued outpatient treatment with voriconazole (4mg/kg every 12 hours orally). All three isolates were identified as species of the Fusarium solani complex. In vitro susceptibility to itraconazole, voriconazole and posaconazole, according to Clinical and Laboratory Standards Institute - CLSI (M38-A) methodology, showed a minimum inhibitory concentration (MIC) in all three isolates and for all three antifungals of >16 μg/mL.""").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)

Results

+-----------------------+---------+
|chunk                  |ner_label|
+-----------------------+---------+
|woman                  |HUMAN    |
|bacterial              |SPECIES  |
|Fusarium spp           |SPECIES  |
|patient                |HUMAN    |
|species                |SPECIES  |
|Fusarium solani complex|SPECIES  |
|antifungals            |SPECIES  |
+-----------------------+---------+
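
The chunks in the table above are produced by the converter stage, which merges token-level NER tags into multi-token entities. The following is a minimal pure-Python sketch of that kind of merging, assuming B-/I-/O style tags. The function `merge_iob` and the token and tag lists are hypothetical and written for illustration; they are not taken from the model's actual output, and the real converter recovers chunk text from character offsets rather than joining tokens with spaces.

```python
def merge_iob(tokens, tags):
    """Group IOB-tagged tokens into (chunk_text, label) pairs."""
    chunks, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        # A B- tag, or an I- tag whose label differs from the open chunk, starts a new chunk.
        starts_new = tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label)
        if tag == "O" or starts_new:
            if current:
                chunks.append((" ".join(current), label))
            current, label = [], None
        if tag != "O":
            if not current:
                label = tag[2:]
            current.append(token)
    if current:
        chunks.append((" ".join(current), label))
    return chunks


# Hypothetical tokens and tags, made up to mirror two rows of the table.
tokens = ["positive", "for", "Fusarium", "spp", ".", "The", "patient", "responded"]
tags = ["O", "O", "B-SPECIES", "I-SPECIES", "O", "O", "B-HUMAN", "O"]
print(merge_iob(tokens, tags))
```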

Model Information

Model Name:	ner_living_species_langtest
Compatibility:	Healthcare NLP 5.1.1+
License:	Licensed
Edition:	Official
Input Labels:	[sentence, token, embeddings]
Output Labels:	[ner]
Language:	en
Size:	14.8 MB

References

https://temu.bsc.es/livingner/2022/05/03/multilingual-corpus/

Benchmarking

label         precision  recall  f1-score  support 
HUMAN         0.94       0.96    0.95      1830    
SPECIES       0.87       0.87    0.87      2143    
micro-avg     0.90       0.91    0.91      3973    
macro-avg     0.91       0.91    0.91      3973    
weighted-avg  0.90       0.91    0.91      3973    
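
As a sanity check on the table above, the per-label F1 scores and the macro and weighted averages can be recomputed from the reported precision, recall, and support. The sketch below assumes the standard definition F1 = 2PR / (P + R). Because the inputs are already rounded to two decimals, the results only approximate the underlying values. The variable names are illustrative.

```python
# Precision, recall, and support copied from the benchmarking table.
rows = {
    "HUMAN": {"precision": 0.94, "recall": 0.96, "support": 1830},
    "SPECIES": {"precision": 0.87, "recall": 0.87, "support": 2143},
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


per_label_f1 = {label: f1(r["precision"], r["recall"]) for label, r in rows.items()}
total_support = sum(r["support"] for r in rows.values())

# Macro average: unweighted mean over labels.
macro_f1 = sum(per_label_f1.values()) / len(per_label_f1)
# Weighted average: mean over labels weighted by support.
weighted_f1 = sum(per_label_f1[label] * rows[label]["support"] for label in rows) / total_support

for label, score in per_label_f1.items():
    print(f"{label:8s} f1={score:.2f}")
print(f"macro-avg f1={macro_f1:.2f}, weighted-avg f1={weighted_f1:.2f}, support={total_support}")
```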
