Multilabel Classification For LitCovid

Description

Multi-label document classification assigns semantic categories to entire documents. In the biomedical field, these categories help readers identify a document's main topics and support literature search. Unlike multi-class classification, which assigns exactly one label to each instance, multi-label classification can assign several labels to a single instance, up to all N available labels. LitCovid is a manually annotated multi-label document classification dataset for COVID-19 literature with seven topic labels.

The multilabel classification model determines which COVID-19 topics an article covers, based on its abstract. An article can be assigned any of the following seven topics, each with its own definition:

  • Mechanism: explores the underlying causes of COVID-19 infections and transmission, as well as potential mechanisms of action for drugs.
  • Transmission: focuses on the characteristics and modes of COVID-19 transmission.
  • Diagnosis: pertains to the assessment of COVID-19 through symptoms, test results, and radiological features.
  • Treatment: involves strategies for treating COVID-19, including therapeutic procedures and vaccine development.
  • Prevention: addresses prevention, control, mitigation, and management strategies for COVID-19.
  • Case_Report: entails descriptions of specific patient cases related to COVID-19.
  • Epidemic_Forecasting: involves estimating the trend of COVID-19 spread and utilizing related modeling approaches.
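
To make the multi-label setup concrete, one common way to represent a prediction is a multi-hot vector with one slot per topic. The sketch below is illustrative only: the helper names `to_multi_hot` and `from_multi_hot` and the ordering of `LABELS` are assumptions made for this example, not part of the model or of Spark NLP.

```python
# The seven LitCovid topics listed above (ordering chosen for this example).
LABELS = [
    "Mechanism",
    "Transmission",
    "Diagnosis",
    "Treatment",
    "Prevention",
    "Case_Report",
    "Epidemic_Forecasting",
]


def to_multi_hot(assigned):
    """Encode a collection of topic names as a 0/1 vector aligned with LABELS."""
    unknown = set(assigned) - set(LABELS)
    if unknown:
        raise ValueError(f"Unknown topics: {sorted(unknown)}")
    return [1 if label in assigned else 0 for label in LABELS]


def from_multi_hot(vector):
    """Decode a 0/1 vector back into topic names, in LABELS order."""
    if len(vector) != len(LABELS):
        raise ValueError("Vector length must match the number of labels")
    return [label for label, flag in zip(LABELS, vector) if flag]


# An article can carry several topics at once, unlike multi-class output.
encoded = to_multi_hot(["Mechanism", "Treatment", "Diagnosis"])
decoded = from_multi_hot(encoded)
print(encoded)
print(decoded)
```

A multi-class model would always produce exactly one `1` in such a vector; here any number of slots may be set.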

This model is inspired by the paper "Large language models in biomedical natural language processing: benchmarks, baselines, and recommendations" and was augmented internally for better coverage and performance.

Predicted Entities

Mechanism, Transmission, Diagnosis, Treatment, Prevention, Case_Report, Epidemic_Forecasting

How to use

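# Python example. Assumes `spark` is an active Spark NLP for Healthcare session.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import (
    Tokenizer,
    WordEmbeddingsModel,
    SentenceEmbeddings,
    MultiClassifierDLModel,
)

document_assembler = DocumentAssembler()\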
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["document", "token"])\
    .setOutputCol("word_embeddings")

sentence_embeddings = SentenceEmbeddings()\
    .setInputCols(["document", "word_embeddings"])\
    .setOutputCol("sentence_embeddings")\
    .setPoolingStrategy("AVERAGE")

multi_classifier_dl = MultiClassifierDLModel.pretrained("multiclassifierdl_litcovid", "en", "clinical/models")\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("category")
    
pipeline = Pipeline(
    stages = [
        document_assembler,
        tokenizer,
        word_embeddings,
        sentence_embeddings,
        multi_classifier_dl
    ])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

text = """Low level of plasminogen increases risk for mortality in COVID-19 patients. The pathophysiology of coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), and especially of its complications is still not fully understood. In fact, a very high number of patients with COVID-19 die because of thromboembolic causes. A role of plasminogen, as precursor of fibrinolysis, has been hypothesized. In this study, we aimed to investigate the association between plasminogen levels and COVID-19-related outcomes in a population of 55 infected Caucasian patients (mean age: 69.8 +/- 14.3, 41.8% female). Low levels of plasminogen were significantly associated with inflammatory markers (CRP, PCT, and IL-6), markers of coagulation (D-dimer, INR, and APTT), and markers of organ dysfunctions (high fasting blood glucose and decrease in the glomerular filtration rate). A multidimensional analysis model, including the correlation of the expression of coagulation with inflammatory parameters, indicated that plasminogen tended to cluster together with IL-6, hence suggesting a common pathway of activation during disease's complication. Moreover, low levels of plasminogen strongly correlated with mortality in COVID-19 patients even after multiple adjustments for presence of confounding. These data suggest that plasminogen may play a pivotal role in controlling the complex mechanisms beyond the COVID-19 complications, and may be useful both as biomarker for prognosis and for therapeutic target against this extremely aggressive infection."""

df = spark.createDataFrame([[text]]).toDF("text")

result = model.transform(df)

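result.select("text", "category.result").show(truncate=120)

// Scala example. Assumes `spark` is an active Spark NLP for Healthcare session.
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._

val document_assembler = new DocumentAssembler()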
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols(Array("document"))
    .setOutputCol("token")

val word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
    .setInputCols(Array("document", "token"))
    .setOutputCol("word_embeddings")

val sentence_embeddings = new SentenceEmbeddings()
    .setInputCols(Array("document", "word_embeddings"))
    .setOutputCol("sentence_embeddings")
    .setPoolingStrategy("AVERAGE")

val multi_classifier_dl = MultiClassifierDLModel.pretrained("multiclassifierdl_litcovid", "en", "clinical/models")
    .setInputCols(Array("sentence_embeddings"))
    .setOutputCol("category")
    
val pipeline = new Pipeline().setStages(Array(
     document_assembler, 
     tokenizer,
     word_embeddings, 
     sentence_embeddings, 
     multi_classifier_dl))

val data = Seq("""Low level of plasminogen increases risk for mortality in COVID-19 patients. The pathophysiology of coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), and especially of its complications is still not fully understood. In fact, a very high number of patients with COVID-19 die because of thromboembolic causes. A role of plasminogen, as precursor of fibrinolysis, has been hypothesized. In this study, we aimed to investigate the association between plasminogen levels and COVID-19-related outcomes in a population of 55 infected Caucasian patients (mean age: 69.8 +/- 14.3, 41.8% female). Low levels of plasminogen were significantly associated with inflammatory markers (CRP, PCT, and IL-6), markers of coagulation (D-dimer, INR, and APTT), and markers of organ dysfunctions (high fasting blood glucose and decrease in the glomerular filtration rate). A multidimensional analysis model, including the correlation of the expression of coagulation with inflammatory parameters, indicated that plasminogen tended to cluster together with IL-6, hence suggesting a common pathway of activation during disease's complication. Moreover, low levels of plasminogen strongly correlated with mortality in COVID-19 patients even after multiple adjustments for presence of confounding. These data suggest that plasminogen may play a pivotal role in controlling the complex mechanisms beyond the COVID-19 complications, and may be useful both as biomarker for prognosis and for therapeutic target against this extremely aggressive infection.""").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)

Results

+------------------------------------------------------------------------------------------------------------------------+---------------------------------+
|                                                                                                                    text|                           result|
+------------------------------------------------------------------------------------------------------------------------+---------------------------------+
|Low level of plasminogen increases risk for mortality in COVID-19 patients. The pathophysiology of coronavirus diseas...|[Mechanism, Treatment, Diagnosis]|
+------------------------------------------------------------------------------------------------------------------------+---------------------------------+
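
When the pipeline is run over many abstracts, it can be useful to summarize how often each topic is predicted. The following is a minimal sketch in plain Python; the function name `topic_counts` and the sample `predictions` list are hypothetical. With real pipeline output, the per-document label lists could presumably be collected with something like `[row["result"] for row in result.select("category.result").collect()]`.

```python
from collections import Counter


def topic_counts(predictions):
    """Count in how many documents each topic was predicted."""
    counts = Counter()
    for labels in predictions:
        # Use a set so a topic counts at most once per document.
        counts.update(set(labels))
    return counts


# Hypothetical predictions for three abstracts. The first list matches the
# Results table above; the other two are made up for illustration.
predictions = [
    ["Mechanism", "Treatment", "Diagnosis"],
    ["Prevention"],
    ["Treatment", "Case_Report"],
]

counts = topic_counts(predictions)
for topic, n in counts.most_common():
    print(f"{topic}: {n}")
```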

Model Information

Model Name: multiclassifierdl_litcovid
Compatibility: Healthcare NLP 4.4.4+
License: Licensed
Edition: Official
Input Labels: [sentence_embeddings]
Output Labels: [class]
Language: en
Size: 11.7 MB

References

The training dataset is available here

Benchmarking

label                 precision  recall  f1-score  support 
Case_Report           0.88       0.85    0.86      252     
Diagnosis             0.86       0.86    0.86      886     
Epidemic_Forecasting  0.72       0.69    0.70      77      
Mechanism             0.84       0.85    0.85      609     
Prevention            0.92       0.91    0.92      1419    
Transmission          0.67       0.61    0.64      146     
Treatment             0.90       0.87    0.88      1221    
micro-avg             0.88       0.86    0.87      4610    
macro-avg             0.83       0.80    0.82      4610    
weighted-avg          0.88       0.86    0.87      4610    
samples-avg           0.89       0.89    0.88      4610
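
The macro and weighted summary rows can be approximately reproduced from the per-label rows, which helps when reading the table. The sketch below uses only the rounded values shown above, so small rounding differences from the published averages are expected. The micro and samples averages need the underlying per-document counts and are not recomputed here.

```python
# Per-label rows copied from the benchmark table:
# label -> (precision, recall, f1-score, support)
rows = {
    "Case_Report": (0.88, 0.85, 0.86, 252),
    "Diagnosis": (0.86, 0.86, 0.86, 886),
    "Epidemic_Forecasting": (0.72, 0.69, 0.70, 77),
    "Mechanism": (0.84, 0.85, 0.85, 609),
    "Prevention": (0.92, 0.91, 0.92, 1419),
    "Transmission": (0.67, 0.61, 0.64, 146),
    "Treatment": (0.90, 0.87, 0.88, 1221),
}

total_support = sum(r[3] for r in rows.values())


def macro_avg(idx):
    """Unweighted mean of a metric over the seven labels."""
    return sum(r[idx] for r in rows.values()) / len(rows)


def weighted_avg(idx):
    """Mean of a metric weighted by each label's support."""
    return sum(r[idx] * r[3] for r in rows.values()) / total_support


print(f"total support: {total_support}")
for name, idx in (("precision", 0), ("recall", 1), ("f1-score", 2)):
    print(f"{name:10s} macro={macro_avg(idx):.3f} weighted={weighted_avg(idx):.3f}")
```

The macro average is lower than the weighted average because Transmission and Epidemic_Forecasting, the two weakest labels, also have the smallest support and therefore contribute little to the weighted figure.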