DistilBERT Sequence Classification Base - AG News (distilbert_base_sequence_classifier_ag_news)

Description

DistilBERT Model with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks.

distilbert_base_sequence_classifier_ag_news is a fine-tuned DistilBERT model that is ready to be used for Sequence Classification tasks such as sentiment analysis or multi-class text classification and it achieves state-of-the-art performance.

We used TFDistilBertForSequenceClassification to train this model and used BertForSequenceClassification annotator in Spark NLP 🚀 for prediction at scale!

Predicted Entities

Business, Sci/Tech, Sports, World

Download Copy S3 URI

How to use

document_assembler = DocumentAssembler() \
.setInputCol('text') \
.setOutputCol('document')

tokenizer = Tokenizer() \
.setInputCols(['document']) \
.setOutputCol('token')

sequenceClassifier = DistilBertForSequenceClassification \
.pretrained('distilbert_base_sequence_classifier_ag_news', 'en') \
.setInputCols(['token', 'document']) \
.setOutputCol('class') \
.setCaseSensitive(True) \
.setMaxSentenceLength(512)

pipeline = Pipeline(stages=[
document_assembler, 
tokenizer,
sequenceClassifier    
])

example = spark.createDataFrame([['Disney Comics was a comic book publishing company operated by The Walt Disney Company which ran from 1990 to 1993.']]).toDF("text")
result = pipeline.fit(example).transform(example)

val document_assembler = DocumentAssembler() 
.setInputCol("text") 
.setOutputCol("document")

val tokenizer = Tokenizer() 
.setInputCols("document") 
.setOutputCol("token")

val tokenClassifier = DistilBertForSequenceClassification.pretrained("distilbert_base_sequence_classifier_ag_news", "en")
.setInputCols("document", "token")
.setOutputCol("class")
.setCaseSensitive(true)
.setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier))

val example = Seq("Disney Comics was a comic book publishing company operated by The Walt Disney Company which ran from 1990 to 1993.").toDS.toDF("text")

val result = pipeline.fit(example).transform(example)

import nlu
nlu.load("en.classify.distilbert_sequence.ag_news").predict("""Disney Comics was a comic book publishing company operated by The Walt Disney Company which ran from 1990 to 1993.""")

Model Information

Model Name:	distilbert_base_sequence_classifier_ag_news
Compatibility:	Spark NLP 3.3.3+
License:	Open Source
Edition:	Official
Input Labels:	[token, document]
Output Labels:	[class]
Language:	en
Case sensitive:	true
Max sentense length:	512

Data Source

https://huggingface.co/datasets/ag_news

Benchmarking

precision    recall  f1-score   support

Business       0.90      0.90      0.90      1904
Sci/Tech       0.91      0.90      0.91      1935
Sports       0.99      0.98      0.98      1921
World       0.92      0.95      0.94      1840

accuracy                           0.93      7600
macro avg       0.93      0.93      0.93      7600
weighted avg       0.93      0.93      0.93      7600

PREVIOUSBERT Sequence Classifier - Classify the Music Genre

NEXTDistilBERT Sequence Classification - Amazon Polarity (distilbert_base_sequence_classifier_amazon_polarity)