BERT Sequence Classification - Detecting Hate Speech (bert_sequence_classifier_dehatebert_mono)

Description

This model detects hate speech in English. The "mono" in the name refers to the monolingual setting, in which the model is trained using only English-language data. It is fine-tuned from the multilingual BERT model. The model was trained with different learning rates; the best validation score achieved is 0.726030, at a learning rate of 2e-5. Training code is available at https://github.com/punyajoy/DE-LIMIT

For more details, see our paper:

Sai Saketh Aluru, Binny Mathew, Punyajoy Saha and Animesh Mukherjee. “Deep Learning Models for Multilingual Hate Speech Detection”. Accepted at ECML-PKDD 2020.

@article{aluru2020deep,
title={Deep Learning Models for Multilingual Hate Speech Detection},
author={Aluru, Sai Saketh and Mathew, Binny and Saha, Punyajoy and Mukherjee, Animesh},
journal={arXiv preprint arXiv:2004.06465},
year={2020}
}

Predicted Entities

NON_HATE, HATE


How to use

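# Python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, BertForSequenceClassification
from pyspark.ml import Pipeline

# Start (or reuse) a Spark session with Spark NLP loaded
spark = sparknlp.start()

# Wrap the raw text column into Spark NLP documents
document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

# Split each document into tokens
tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

# Load the pretrained hate speech classifier
sequenceClassifier = BertForSequenceClassification \
    .pretrained('bert_sequence_classifier_dehatebert_mono', 'en') \
    .setInputCols(['token', 'document']) \
    .setOutputCol('class') \
    .setCaseSensitive(True) \
    .setMaxSentenceLength(512)

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    sequenceClassifier
])

example = spark.createDataFrame([['I love you!']]).toDF("text")
result = pipeline.fit(example).transform(example)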
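
// Scala
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._ // needed for toDS; assumes a SparkSession named spark

// Wrap the raw text column into Spark NLP documents
val document_assembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

// Split each document into tokens
val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

// Load the pretrained hate speech classifier
val sequenceClassifier = BertForSequenceClassification.pretrained("bert_sequence_classifier_dehatebert_mono", "en")
  .setInputCols("document", "token")
  .setOutputCol("class")
  .setCaseSensitive(true)
  .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, sequenceClassifier))

val example = Seq("I love you!").toDS.toDF("text")

val result = pipeline.fit(example).transform(example)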

# Python (NLU)
import nlu
nlu.load("en.classify.bert_sequence.dehatebert_mono").predict("""I love you!""")
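
The pipeline examples above write their predictions to the `class` column as Spark NLP annotations, whose `result` field holds the predicted label string. The following is a minimal sketch of turning those labels into a boolean flag. The helper name `is_hate` is hypothetical and not part of Spark NLP; the label strings come from the Predicted Entities list, and the Spark usage shown in the comments assumes the `result` DataFrame from the Python example.

```python
from typing import Iterable

# Label name taken from the "Predicted Entities" list of this model card.
HATE_LABEL = "HATE"


def is_hate(labels: Iterable[str]) -> bool:
    """Return True if any predicted label for a document is HATE.

    `labels` is expected to be the list of strings found in `class.result`
    for a single row (hypothetical helper, not a Spark NLP API).
    """
    return any(label == HATE_LABEL for label in labels)


# Assumed usage with the `result` DataFrame from the pipeline above
# (requires a running Spark session, so it is left as a comment here):
#   rows = result.select("text", "class.result").collect()
#   flagged = [(row["text"], is_hate(row["result"])) for row in rows]

# Stand-alone check with hand-written label lists (not real model output).
print(is_hate(["NON_HATE"]))
print(is_hate(["HATE"]))
```

Keeping the label check in a plain function means it can be unit-tested without Spark and reused on labels collected from either the Python or the Scala pipeline.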

Model Information

Model Name: bert_sequence_classifier_dehatebert_mono
Compatibility: Spark NLP 3.3.2+
License: Open Source
Edition: Official
Input Labels: [token, document]
Output Labels: [class]
Language: en
Case sensitive: false
Max sentence length: 512

Data Source

https://huggingface.co/Hate-speech-CNERG/dehatebert-mono-english