BERT Sequence Classification - German Sentiment Analysis (bert_sequence_classifier_sentiment)

Description

German Sentiment Classification with BERT

This model was trained for sentiment classification of German-language texts. To achieve the best results, all model inputs need to be preprocessed with the same procedure that was applied during training. To simplify usage of the model, we provide a Python package that bundles the code needed for preprocessing and inference.
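If you only need the underlying Hugging Face model, the author's `germansentiment` package (assumed here to be the Python package referred to above) bundles that preprocessing and inference; a minimal sketch:

```python
# Minimal sketch using the germansentiment package (pip install germansentiment).
# Assumption: this is the preprocessing/inference package mentioned above;
# it wraps the underlying Hugging Face model, not the Spark NLP pipeline below.
from germansentiment import SentimentModel

model = SentimentModel()
# predict_sentiment returns one label per input text: positive, negative, or neutral
print(model.predict_sentiment(["Mit keinem guten Ergebniss"]))
```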

The model uses Google's BERT architecture and was trained on 1.834 million German-language samples. The training data contains texts from various domains, such as Twitter, Facebook, and movie, app, and hotel reviews.

You can find more information about the dataset and the training process in the paper cited below.

Cite

Please cite the following paper if you find this model useful:

```bibtex
@InProceedings{guhr-EtAl:2020:LREC,
  author    = {Guhr, Oliver and Schumann, Anne-Kathrin and Bahrmann, Frank and Böhme, Hans Joachim},
  title     = {Training a Broad-Coverage German Sentiment Classification Model for Dialog Systems},
  booktitle = {Proceedings of The 12th Language Resources and Evaluation Conference},
  month     = {May},
  year      = {2020},
  address   = {Marseille, France},
  publisher = {European Language Resources Association},
  pages     = {1620--1625},
  url       = {https://www.aclweb.org/anthology/2020.lrec-1.202}
}
```

Predicted Entities

positive, negative, neutral


How to use

```python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, BertForSequenceClassification
from pyspark.ml import Pipeline

# Start a Spark session with Spark NLP (skip if one is already running)
spark = sparknlp.start()

# Wrap raw text into a document annotation
document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

# Split the document into tokens
tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

# Load the pretrained German sentiment classifier
sequenceClassifier = BertForSequenceClassification \
    .pretrained('bert_sequence_classifier_sentiment', 'de') \
    .setInputCols(['document', 'token']) \
    .setOutputCol('class') \
    .setCaseSensitive(True) \
    .setMaxSentenceLength(512)

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    sequenceClassifier
])

example = spark.createDataFrame([['Mit keinem guten Ergebniss']]).toDF("text")
result = pipeline.fit(example).transform(example)
```
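Once the pipeline has run, the predicted label sits in the `result` field of the `class` annotation column; one way to inspect it:

```python
# Show each input text next to its predicted sentiment label
result.select('text', 'class.result').show(truncate=False)
```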
```scala
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.{BertForSequenceClassification, Tokenizer}
import org.apache.spark.ml.Pipeline
import spark.implicits._

// Wrap raw text into a document annotation
val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

// Split the document into tokens
val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

// Load the pretrained German sentiment classifier
val sequenceClassifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_sentiment", "de")
    .setInputCols("document", "token")
    .setOutputCol("class")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier))

val example = Seq("Mit keinem guten Ergebniss").toDS.toDF("text")

val result = pipeline.fit(example).transform(example)
```
```python
import nlu
nlu.load("de.classify.bert.sentiment").predict("""Mit keinem guten Ergebniss""")
```
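For quick single-string inference without building a DataFrame, you can wrap the fitted pipeline in a LightPipeline; a minimal sketch, reusing the Python pipeline and example DataFrame from above:

```python
from sparknlp.base import LightPipeline

# LightPipeline runs the fitted pipeline in memory, which is faster for small inputs
light_model = LightPipeline(pipeline.fit(example))
annotations = light_model.annotate("Mit keinem guten Ergebniss")
print(annotations['class'])  # list with the predicted label
```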

Model Information

Model Name: bert_sequence_classifier_sentiment
Compatibility: Spark NLP 3.3.2+
License: Open Source
Edition: Official
Input Labels: [document, token]
Output Labels: [class]
Language: de
Case sensitive: true
Max sentence length: 512

Data Source

https://huggingface.co/oliverguhr/german-sentiment-bert

Benchmarking

| Dataset                                                      | F1 micro Score |
| :----------------------------------------------------------- | -------------: |
| [holidaycheck](https://github.com/oliverguhr/german-sentiment) |         0.9568 |
| [scare](https://www.romanklinger.de/scare/)                  |         0.9418 |
| [filmstarts](https://github.com/oliverguhr/german-sentiment) |         0.9021 |
| [germeval](https://sites.google.com/view/germeval2017-absa/home) |         0.7536 |
| [PotTS](https://www.aclweb.org/anthology/L16-1181/)          |         0.6780 |
| [emotions](https://github.com/oliverguhr/german-sentiment)  |         0.9649 |
| [sb10k](https://www.spinningbytes.com/resources/germansentiment/) |         0.7376 |
| [Leipzig Wikipedia Corpus 2016](https://wortschatz.uni-leipzig.de/de/download/german) |         0.9967 |
| all                                                          |         0.9639 |