Description
Multi-label document classification assigns one or more semantic categories to an entire document. In the biomedical field, these categories help readers identify the main topics of a publication and support literature search. Unlike multi-class classification, which assigns exactly one label to each instance, multi-label classification can assign multiple labels, up to all N available, to a single instance. LitCovid is a manually annotated multi-label document classification dataset for COVID-19 literature, with seven topic labels.
The objective of the multi-label classification model is to determine which COVID-19 topics an article covers, based on its abstract. The model decides whether the article relates to any of the following seven topics, each with its own definition:
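To make the difference between the two settings concrete, here is a minimal sketch; it is not the model's actual decision logic. Given per-label scores, a multi-class decision keeps only the top-scoring label, while a multi-label decision keeps every label whose score reaches a threshold. The scores, the 0.5 threshold, and the function names are illustrative assumptions.

```python
# Hypothetical per-label scores for one abstract (illustrative values only).
scores = {
    "Mechanism": 0.81,
    "Transmission": 0.07,
    "Diagnosis": 0.64,
    "Treatment": 0.58,
    "Prevention": 0.12,
    "Case_Report": 0.03,
    "Epidemic_Forecasting": 0.02,
}


def multi_class(scores):
    """Multi-class: exactly one label, the highest-scoring one."""
    return max(scores, key=scores.get)


def multi_label(scores, threshold=0.5):
    """Multi-label: every label whose score reaches the threshold (possibly none)."""
    return [label for label, score in scores.items() if score >= threshold]


print(multi_class(scores))
print(multi_label(scores))
```

With these assumed scores, the multi-class rule returns a single topic, whereas the multi-label rule can return several topics or, with a strict threshold, none.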
- Mechanism: explores the underlying causes of COVID-19 infections and transmission, as well as potential mechanisms of action for drugs.
- Transmission: focuses on the characteristics and modes of COVID-19 transmission.
- Diagnosis: pertains to the assessment of COVID-19 through symptoms, test results, and radiological features.
- Treatment: involves strategies for treating COVID-19, including therapeutic procedures and vaccine development.
- Prevention: addresses prevention, control, mitigation, and management strategies for COVID-19.
- Case_Report: entails descriptions of specific patient cases related to COVID-19.
- Epidemic_Forecasting: involves estimating the trend of COVID-19 spread and utilizing related modeling approaches.
This model is inspired by the paper "Large language models in biomedical natural language processing: benchmarks, baselines, and recommendations" and was augmented internally for better coverage and performance.
Predicted Entities
Mechanism, Transmission, Diagnosis, Treatment, Prevention, Case_Report, Epidemic_Forecasting
How to use
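# Python
# Assumes an active Spark session named `spark` with Spark NLP for Healthcare installed.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import (
    Tokenizer,
    WordEmbeddingsModel,
    SentenceEmbeddings,
    MultiClassifierDLModel,
)

# Wrap the raw text column in a document annotation
document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

# Token-level clinical embeddings
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["document", "token"])\
    .setOutputCol("word_embeddings")

# Average the token embeddings into one vector per document
sentence_embeddings = SentenceEmbeddings()\
    .setInputCols(["document", "word_embeddings"])\
    .setOutputCol("sentence_embeddings")\
    .setPoolingStrategy("AVERAGE")

multi_classifier_dl = MultiClassifierDLModel.pretrained("multiclassifierdl_litcovid", "en", "clinical/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")

pipeline = Pipeline(
    stages=[
        document_assembler,
        tokenizer,
        word_embeddings,
        sentence_embeddings,
        multi_classifier_dl
    ])

# All stages are pretrained or stateless, so fitting on an empty DataFrame is enough
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = pipeline.fit(empty_data)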
text = """Low level of plasminogen increases risk for mortality in COVID-19 patients. The pathophysiology of coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), and especially of its complications is still not fully understood. In fact, a very high number of patients with COVID-19 die because of thromboembolic causes. A role of plasminogen, as precursor of fibrinolysis, has been hypothesized. In this study, we aimed to investigate the association between plasminogen levels and COVID-19-related outcomes in a population of 55 infected Caucasian patients (mean age: 69.8 +/- 14.3, 41.8% female). Low levels of plasminogen were significantly associated with inflammatory markers (CRP, PCT, and IL-6), markers of coagulation (D-dimer, INR, and APTT), and markers of organ dysfunctions (high fasting blood glucose and decrease in the glomerular filtration rate). A multidimensional analysis model, including the correlation of the expression of coagulation with inflammatory parameters, indicated that plasminogen tended to cluster together with IL-6, hence suggesting a common pathway of activation during disease's complication. Moreover, low levels of plasminogen strongly correlated with mortality in COVID-19 patients even after multiple adjustments for presence of confounding. These data suggest that plasminogen may play a pivotal role in controlling the complex mechanisms beyond the COVID-19 complications, and may be useful both as biomarker for prognosis and for therapeutic target against this extremely aggressive infection."""
df = spark.createDataFrame([[text]]).toDF("text")
result = model.transform(df)
result.select("text", "category.result").show(truncate=120)
// Scala
// Assumes an active SparkSession named `spark` with Spark NLP for Healthcare on the classpath.
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._ // needed for .toDS outside spark-shell

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
  .setInputCols(Array("document", "token"))
  .setOutputCol("word_embeddings")

// Average the token embeddings into one vector per document
val sentence_embeddings = new SentenceEmbeddings()
  .setInputCols(Array("document", "word_embeddings"))
  .setOutputCol("sentence_embeddings")
  .setPoolingStrategy("AVERAGE")

val multi_classifier_dl = MultiClassifierDLModel.pretrained("multiclassifierdl_litcovid", "en", "clinical/models")
  .setInputCols(Array("sentence_embeddings"))
  .setOutputCol("category")

val pipeline = new Pipeline().setStages(Array(
  document_assembler,
  tokenizer,
  word_embeddings,
  sentence_embeddings,
  multi_classifier_dl))
val data = Seq("""Low level of plasminogen increases risk for mortality in COVID-19 patients. The pathophysiology of coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), and especially of its complications is still not fully understood. In fact, a very high number of patients with COVID-19 die because of thromboembolic causes. A role of plasminogen, as precursor of fibrinolysis, has been hypothesized. In this study, we aimed to investigate the association between plasminogen levels and COVID-19-related outcomes in a population of 55 infected Caucasian patients (mean age: 69.8 +/- 14.3, 41.8% female). Low levels of plasminogen were significantly associated with inflammatory markers (CRP, PCT, and IL-6), markers of coagulation (D-dimer, INR, and APTT), and markers of organ dysfunctions (high fasting blood glucose and decrease in the glomerular filtration rate). A multidimensional analysis model, including the correlation of the expression of coagulation with inflammatory parameters, indicated that plasminogen tended to cluster together with IL-6, hence suggesting a common pathway of activation during disease's complication. Moreover, low levels of plasminogen strongly correlated with mortality in COVID-19 patients even after multiple adjustments for presence of confounding. These data suggest that plasminogen may play a pivotal role in controlling the complex mechanisms beyond the COVID-19 complications, and may be useful both as biomarker for prognosis and for therapeutic target against this extremely aggressive infection.""").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)
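In both pipelines, the `SentenceEmbeddings` stage with `setPoolingStrategy("AVERAGE")` reduces the per-token vectors of a document to a single fixed-size vector, which is what the classifier consumes. The sketch below illustrates the idea of average pooling in plain Python. It is a conceptual illustration with made-up 3-dimensional vectors, not Spark NLP's implementation, and `average_pool` is a hypothetical helper name.

```python
def average_pool(token_vectors):
    """Return the element-wise mean of equally sized token vectors."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    if any(len(vec) != dim for vec in token_vectors):
        raise ValueError("all token vectors must have the same dimension")
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# Three hypothetical token embeddings; real embeddings have many more dimensions.
tokens = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
]
doc_vector = average_pool(tokens)
print(doc_vector)  # [2.0, 2.0, 1.0]
```

Whatever the number of tokens, the pooled vector keeps the embedding dimension, so abstracts of different lengths all map to inputs of the same size.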
Results
+------------------------------------------------------------------------------------------------------------------------+---------------------------------+
| text| result|
+------------------------------------------------------------------------------------------------------------------------+---------------------------------+
|Low level of plasminogen increases risk for mortality in COVID-19 patients. The pathophysiology of coronavirus diseas...|[Mechanism, Treatment, Diagnosis]|
+------------------------------------------------------------------------------------------------------------------------+---------------------------------+
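For downstream filtering or evaluation, a predicted label list like the one above is often converted into a fixed-order indicator vector. The following is a minimal sketch, assuming the labels have already been collected from the `result` column into a plain Python list; the label order in `LABELS` and the helper name `to_multi_hot` are illustrative choices, not part of the model.

```python
# Fixed label order (an assumption for this sketch; any consistent order works).
LABELS = [
    "Mechanism", "Transmission", "Diagnosis", "Treatment",
    "Prevention", "Case_Report", "Epidemic_Forecasting",
]


def to_multi_hot(predicted, labels=LABELS):
    """Map a list of predicted label names to a 0/1 vector in `labels` order."""
    chosen = set(predicted)
    unknown = chosen - set(labels)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return [1 if label in chosen else 0 for label in labels]


# Labels taken from the Results table above.
predicted = ["Mechanism", "Treatment", "Diagnosis"]
vector = to_multi_hot(predicted)
print(dict(zip(LABELS, vector)))
```

The vector has one position per topic, so predictions for many abstracts can be stacked into a matrix and compared against gold annotations label by label.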
Model Information
| Model Name:    | multiclassifierdl_litcovid |
| Compatibility: | Healthcare NLP 4.4.4+      |
| License:       | Licensed                   |
| Edition:       | Official                   |
| Input Labels:  | [sentence_embeddings]      |
| Output Labels: | [class]                    |
| Language:      | en                         |
| Size:          | 11.7 MB                    |
References
The training dataset is available here
Benchmarking
label                 precision  recall  f1-score  support
Case_Report                0.88    0.85      0.86      252
Diagnosis                  0.86    0.86      0.86      886
Epidemic_Forecasting       0.72    0.69      0.70       77
Mechanism                  0.84    0.85      0.85      609
Prevention                 0.92    0.91      0.92     1419
Transmission               0.67    0.61      0.64      146
Treatment                  0.90    0.87      0.88     1221
micro-avg                  0.88    0.86      0.87     4610
macro-avg                  0.83    0.80      0.82     4610
weighted-avg               0.88    0.86      0.87     4610
samples-avg                0.89    0.89      0.88     4610
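As a reading aid for the summary rows, the sketch below recomputes the macro and weighted averages from the per-label rows of the table. It assumes the usual definitions: the macro average is the unweighted mean over labels, and the weighted average is the support-weighted mean. Because the per-label inputs are already rounded to two decimals, the recomputed values can differ from the reported ones in the last digit. The micro and samples averages are omitted because they generally require per-document predictions, which the table does not provide.

```python
# (precision, recall, f1, support) per label, copied from the table above.
per_label = {
    "Case_Report":          (0.88, 0.85, 0.86, 252),
    "Diagnosis":            (0.86, 0.86, 0.86, 886),
    "Epidemic_Forecasting": (0.72, 0.69, 0.70, 77),
    "Mechanism":            (0.84, 0.85, 0.85, 609),
    "Prevention":           (0.92, 0.91, 0.92, 1419),
    "Transmission":         (0.67, 0.61, 0.64, 146),
    "Treatment":            (0.90, 0.87, 0.88, 1221),
}

rows = list(per_label.values())
total_support = sum(row[3] for row in rows)


def macro_avg(rows, idx):
    """Unweighted mean of one metric over all labels."""
    return sum(row[idx] for row in rows) / len(rows)


def weighted_avg(rows, idx):
    """Mean of one metric over all labels, weighted by each label's support."""
    return sum(row[idx] * row[3] for row in rows) / sum(row[3] for row in rows)


# Index 0 = precision, 1 = recall, 2 = f1-score.
macro = [macro_avg(rows, i) for i in range(3)]
weighted = [weighted_avg(rows, i) for i in range(3)]

print("total support:", total_support)
print("macro-avg:    ", [round(value, 3) for value in macro])
print("weighted-avg: ", [round(value, 3) for value in weighted])
```

The gap between the macro and weighted rows reflects label imbalance: the low-support labels Transmission and Epidemic_Forecasting score lowest, which pulls the macro average down more than the weighted one.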