DistilBERT Sequence Classification - Amazon Polarity (distilbert_base_sequence_classifier_amazon_polarity)

Description

DistilBERT model with a sequence classification/regression head on top (a linear layer over the pooled output), e.g. for multi-class document classification tasks.
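To make "a linear layer on top of the pooled output" more concrete, here is a minimal pure-Python sketch of such a head followed by a softmax. It is a simplification for illustration only: the three-element pooled vector, the weights, and the bias are made-up toy values, and `softmax` and `classify` are hypothetical helpers, not part of Spark NLP or of the real model.

```python
import math

LABELS = ["negative", "positive"]  # the two classes this model predicts


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(pooled, weights, bias):
    """Apply a linear layer (one weight row per label), then softmax."""
    logits = [
        sum(w * h for w, h in zip(row, pooled)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(logits)
    best = max(range(len(LABELS)), key=lambda i: probs[i])
    return LABELS[best], probs


# Toy values, purely illustrative (assumed, not taken from the real model).
pooled = [0.5, -1.0, 2.0]
weights = [
    [0.1, 0.4, -0.3],   # row for "negative"
    [-0.2, 0.1, 0.6],   # row for "positive"
]
bias = [0.0, 0.1]

label, probs = classify(pooled, weights, bias)
print(label, [round(p, 3) for p in probs])
```

The predicted label is simply the class with the highest probability, which is the same decision rule a two-class sentiment head applies to each document.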

distilbert_base_sequence_classifier_amazon_polarity is a DistilBERT model fine-tuned on the Amazon Polarity dataset for sequence classification, here binary sentiment analysis of English text. It labels each input document as negative or positive and reaches 0.95 accuracy on the benchmark reported in the Benchmarking section.

We used TFDistilBertForSequenceClassification to train this model and the DistilBertForSequenceClassification annotator in Spark NLP 🚀 for prediction at scale!

Predicted Entities

negative, positive


How to use

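# Python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, DistilBertForSequenceClassification
from pyspark.ml import Pipeline

spark = sparknlp.start()

# Wrap the raw text column in a Spark NLP document annotation
document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

# Split each document into tokens
tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

# Load the pretrained classifier; predictions go to the 'class' column
sequenceClassifier = DistilBertForSequenceClassification \
    .pretrained('distilbert_base_sequence_classifier_amazon_polarity', 'en') \
    .setInputCols(['token', 'document']) \
    .setOutputCol('class') \
    .setCaseSensitive(True) \
    .setMaxSentenceLength(512)

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    sequenceClassifier
])

example = spark.createDataFrame([['I really liked that movie!']]).toDF("text")
result = pipeline.fit(example).transform(example)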

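// Scala
import spark.implicits._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

// Wrap the raw text column in a Spark NLP document annotation
val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

// Split each document into tokens
val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

// Load the pretrained classifier; predictions go to the "class" column
val sequenceClassifier = DistilBertForSequenceClassification
  .pretrained("distilbert_base_sequence_classifier_amazon_polarity", "en")
  .setInputCols("document", "token")
  .setOutputCol("class")
  .setCaseSensitive(true)
  .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier))

val example = Seq("I really liked that movie!").toDS.toDF("text")

val result = pipeline.fit(example).transform(example)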

import nlu
nlu.load("en.classify.distilbert_sequence.amazon_polarity").predict("""I really liked that movie!""")

Model Information

Model Name:	distilbert_base_sequence_classifier_amazon_polarity
Compatibility:	Spark NLP 3.3.3+
License:	Open Source
Edition:	Official
Input Labels:	[token, document]
Output Labels:	[class]
Language:	en
Case sensitive:	true
Max sentence length:	512

Data Source

https://huggingface.co/datasets/amazon_polarity

Benchmarking

              precision    recall  f1-score   support

    negative       0.94      0.96      0.95     24285
    positive       0.96      0.95      0.95     25715

    accuracy                           0.95     50000
   macro avg       0.95      0.95      0.95     50000
weighted avg       0.95      0.95      0.95     50000
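As a sanity check on how the columns of the report relate, the sketch below recomputes the F1 scores and their macro and weighted averages from the per-class precision, recall, and support listed above. Because it starts from values already rounded to two decimals, it can only reproduce the report to that precision; the variable and function names are illustrative.

```python
# Per-class precision, recall, and support copied from the report above.
report = {
    "negative": {"precision": 0.94, "recall": 0.96, "support": 24285},
    "positive": {"precision": 0.96, "recall": 0.95, "support": 25715},
}


def f1(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


scores = {label: f1(r["precision"], r["recall"]) for label, r in report.items()}
total = sum(r["support"] for r in report.values())

# Macro average: every class counts equally.
macro_f1 = sum(scores.values()) / len(scores)
# Weighted average: every class counts in proportion to its support.
weighted_f1 = sum(scores[l] * report[l]["support"] for l in report) / total

for label, score in scores.items():
    print(f"{label}: f1 = {score:.2f}")
print(f"macro avg f1 = {macro_f1:.2f}, weighted avg f1 = {weighted_f1:.2f}")
```

The two averages nearly coincide here because the classes are close to balanced (24285 vs. 25715 examples).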
