DistilBERT Sequence Classification - Policy (distilbert_sequence_classifier_policy)

Description

This model was trained on 129,669 manually annotated sentences to classify text into one of seven political categories: ‘Economy’, ‘External Relations’, ‘Fabric of Society’, ‘Freedom and Democracy’, ‘Political System’, ‘Welfare and Quality of Life’ or ‘Social Groups’.

Training data

Policy-DistilBERT-7d was trained on the English-speaking subset of the Manifesto Project Dataset (MPDS2020a). The model was trained on 129,669 sentences from 164 political manifestos from 55 political parties in 8 English-speaking countries (Australia, Canada, Ireland, Israel, New Zealand, South Africa, United Kingdom, United States). The manifestos were published between 1992 and 2019.

The Manifesto Project manually annotates individual sentences from political party manifestos, assigning each to one of 7 main political domains: ‘Economy’, ‘External Relations’, ‘Fabric of Society’, ‘Freedom and Democracy’, ‘Political System’, ‘Welfare and Quality of Life’ and ‘Social Groups’. See the codebook for the exact definitions of each domain.

Limitations and bias

The model was trained on sentences in political manifestos from parties in the 8 countries mentioned above between 1992-2019, manually annotated by the Manifesto Project. The model output, therefore, reproduces the limitations of the dataset in terms of country coverage, time span, domain definitions, and potential biases of the annotators - as any supervised machine learning model would. Applying the model to other types of data (other types of texts, countries, etc.) will reduce performance.

Predicted Entities

Economy, External Relations, Fabric of Society, Freedom and Democracy, Political System, Welfare and Quality of Life, Social Groups


How to use

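# Python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, DistilBertForSequenceClassification
from pyspark.ml import Pipeline

# Start a Spark session with Spark NLP (skip if `spark` already exists)
spark = sparknlp.start()

# Wrap the raw text column as a Spark NLP document
document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

# Split each document into tokens
tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

# Load the pretrained policy classifier
sequenceClassifier = DistilBertForSequenceClassification \
    .pretrained('distilbert_sequence_classifier_policy', 'en') \
    .setInputCols(['token', 'document']) \
    .setOutputCol('class') \
    .setMaxSentenceLength(512)

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    sequenceClassifier
])

example = spark.createDataFrame([['70-85% of the population needs to get vaccinated against the novel coronavirus to achieve herd immunity.']]).toDF("text")
result = pipeline.fit(example).transform(example)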

// Scala
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
// Needed for .toDS; assumes an active SparkSession named `spark`
import spark.implicits._

val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

// Named sequenceClassifier so it matches the pipeline stages below
val sequenceClassifier = DistilBertForSequenceClassification.pretrained("distilbert_sequence_classifier_policy", "en")
  .setInputCols("document", "token")
  .setOutputCol("class")
  .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier))

val example = Seq("70-85% of the population needs to get vaccinated against the novel coronavirus to achieve herd immunity.").toDS.toDF("text")

val result = pipeline.fit(example).transform(example)

import nlu
nlu.load("en.classify.distilbert_sequence.policy").predict("""70-85% of the population needs to get vaccinated against the novel coronavirus to achieve herd immunity.""")
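
Once the pipeline has run, a common next step is to summarise the predictions, for example as the share of sentences assigned to each policy domain. The sketch below is illustrative only: `label_shares` is a hypothetical helper, not part of Spark NLP, and the sample labels are made up. The commented lines assume the `result` DataFrame from the Python example and that the predicted label sits in the `result` field of the `class` annotation column.

```python
from collections import Counter
from typing import Dict, Iterable


def label_shares(predicted_labels: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of predictions per label, most frequent first."""
    counts = Counter(predicted_labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.most_common()}


# With a live Spark session, the labels could be collected like this
# (assumption: one predicted label per annotation in `class.result`):
#   from pyspark.sql import functions as F
#   rows = result.select(F.explode("class.result").alias("label")).collect()
#   labels = [row["label"] for row in rows]

# Made-up labels standing in for collected predictions
labels = [
    "Welfare and Quality of Life",
    "Economy",
    "Welfare and Quality of Life",
    "External Relations",
]

for label, share in label_shares(labels).items():
    print(f"{label}: {share:.2f}")
```

Collecting to the driver is only sensible for small result sets; for large corpora, a `groupBy` on the exploded label column would keep the aggregation in Spark.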

Model Information

Model Name:	distilbert_sequence_classifier_policy
Compatibility:	Spark NLP 3.3.3+
License:	Open Source
Edition:	Official
Input Labels:	[token, document]
Output Labels:	[class]
Language:	en
Case sensitive:	true
Max sentence length:	512

Data Source

https://huggingface.co/MoritzLaurer/policy-distilbert-7d

Benchmarking

The model was evaluated using 15% of the sentences (85-15 train-test split).

Accuracy (balanced) | F1 (weighted) | Precision | Recall | Accuracy (not balanced)
--------------------|---------------|-----------|--------|------------------------
0.745               | 0.773         | 0.772     | 0.771  | 0.771


Please note that the label distribution in the dataset is imbalanced:


Welfare and Quality of Life    0.327225
Economy                        0.259191
Fabric of Society              0.111800
Political System               0.095081
Social Groups                  0.094371
External Relations             0.063724
Freedom and Democracy          0.048608
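
Because the classes are imbalanced, it can help to compare the reported scores with trivial baselines. The following is a minimal sketch that uses only the label shares listed above. It assumes the test split follows the same distribution, which the source does not state. A classifier that always predicts the most frequent label would reach an accuracy equal to that label's share, while its balanced accuracy (the mean of per-class recall) would be 1 divided by the number of classes.

```python
# Label shares copied from the distribution listed above
label_distribution = {
    "Welfare and Quality of Life": 0.327225,
    "Economy": 0.259191,
    "Fabric of Society": 0.111800,
    "Political System": 0.095081,
    "Social Groups": 0.094371,
    "External Relations": 0.063724,
    "Freedom and Democracy": 0.048608,
}

# Sanity check: the shares should sum to (approximately) 1
total_share = sum(label_distribution.values())

# Baseline 1: always predict the most frequent label.
# Plain accuracy then equals the share of that label.
majority_label = max(label_distribution, key=label_distribution.get)
majority_accuracy = label_distribution[majority_label]

# Balanced accuracy is the mean recall over classes. The majority-label
# predictor has recall 1 on one class and 0 on the rest.
majority_balanced_accuracy = 1 / len(label_distribution)

print(f"Sum of shares:              {total_share:.6f}")
print(f"Majority label:             {majority_label}")
print(f"Majority accuracy:          {majority_accuracy:.3f}")
print(f"Majority balanced accuracy: {majority_balanced_accuracy:.3f}")
```

Under that assumption, the reported accuracy of 0.771 and balanced accuracy of 0.745 are both well above what the majority-label baseline would achieve.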
