Cyberbullying Classifier for Turkish Texts

Description

Identifies whether a Turkish text contains cyberbullying or not.

Predicted Entities

Negative, Positive

How to use

...
# BERTurk word embeddings, computed over the lemmatized tokens
berturk_embeddings = BertEmbeddings.pretrained("bert_base_turkish_uncased", "tr") \
      .setInputCols(["document", "lemma"]) \
      .setOutputCol("embeddings")

# Average the token embeddings into a single vector per document
embeddingsSentence = SentenceEmbeddings() \
      .setInputCols(["document", "embeddings"]) \
      .setOutputCol("sentence_embeddings") \
      .setPoolingStrategy("AVERAGE")

# Pretrained cyberbullying classifier on top of the pooled embeddings
document_classifier = ClassifierDLModel.pretrained("classifierdl_berturk_cyberbullying", "tr") \
      .setInputCols(["document", "sentence_embeddings"]) \
      .setOutputCol("class")

# document_assembler through lemma are defined in the elided setup above
berturk_pipeline = Pipeline(stages=[document_assembler, tokenizer, normalizer, stopwords_cleaner, lemma, berturk_embeddings, embeddingsSentence, document_classifier])

# Fit on an empty DataFrame to obtain a PipelineModel, then wrap it for fast single-text inference
light_pipeline = LightPipeline(berturk_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))

result = light_pipeline.annotate("""Gidişin olsun, dönüşün olmasın inşallah senin..""")
result["class"]
...
// BERTurk word embeddings, computed over the lemmatized tokens
val berturk_embeddings = BertEmbeddings.pretrained("bert_base_turkish_uncased", "tr")
      .setInputCols(Array("document", "lemma"))
      .setOutputCol("embeddings")

// Average the token embeddings into a single vector per document
val embeddingsSentence = new SentenceEmbeddings()
      .setInputCols(Array("document", "embeddings"))
      .setOutputCol("sentence_embeddings")
      .setPoolingStrategy("AVERAGE")

// Pretrained cyberbullying classifier on top of the pooled embeddings
val document_classifier = ClassifierDLModel.pretrained("classifierdl_berturk_cyberbullying", "tr")
      .setInputCols(Array("document", "sentence_embeddings"))
      .setOutputCol("class")

// document_assembler through lemma are defined in the elided setup above
val berturk_pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, normalizer, stopwords_cleaner, lemma, berturk_embeddings, embeddingsSentence, document_classifier))

val light_pipeline = new LightPipeline(berturk_pipeline.fit(spark.createDataFrame(Seq(Tuple1(""))).toDF("text")))

val result = light_pipeline.annotate("Gidişin olsun, dönüşün olmasın inşallah senin..")

Results

['Negative']
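
When a list of texts is passed to `annotate`, the Python `LightPipeline` returns one dict per text, each mapping an output column to a list of strings. A minimal sketch of tallying the predicted labels over such a batch follows. The helper name `tally_labels` is hypothetical, and the sample dicts are hand-written stand-ins shaped like `annotate` output, not real predictions from this model.

```python
from collections import Counter


def tally_labels(results, column="class"):
    """Count predicted labels across annotate() outputs (hypothetical helper)."""
    # Accept a single dict (one text) or a list of dicts (a batch of texts).
    if isinstance(results, dict):
        results = [results]
    counts = Counter()
    for res in results:
        # Each output column holds a list of strings; a missing column adds nothing.
        counts.update(res.get(column, []))
    return counts


# Hand-written stand-ins shaped like annotate() output, NOT real predictions.
sample = [{"class": ["Negative"]}, {"class": ["Positive"]}, {"class": ["Negative"]}]
print(tally_labels(sample))
```

With a fitted pipeline, the same helper could be applied as `tally_labels(light_pipeline.annotate(texts))`, where `texts` is a list of strings.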

Model Information

Model Name: classifierdl_berturk_cyberbullying
Compatibility: Spark NLP 3.1.2+
License: Open Source
Edition: Official
Input Labels: [sentence_embeddings]
Output Labels: [class]
Language: tr

Data Source

Trained on a custom dataset using Turkish BERT embeddings (BERTurk).

Benchmarking

              precision    recall  f1-score   support

    Negative       0.83      0.80      0.81       970
    Positive       0.84      0.87      0.86      1225

    accuracy                           0.84      2195
   macro avg       0.84      0.83      0.84      2195
weighted avg       0.84      0.84      0.84      2195
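
The "weighted avg" row can be reproduced from the per-class rows: each metric is averaged with the class support as the weight. The sketch below uses only the rounded figures from the table, so it matches the report to two decimals rather than exactly. The names `rows` and `weighted_avg` are illustrative.

```python
# Per-class rows copied from the report: (precision, recall, f1, support)
rows = {
    "Negative": (0.83, 0.80, 0.81, 970),
    "Positive": (0.84, 0.87, 0.86, 1225),
}

# Total number of evaluated examples
total = sum(r[3] for r in rows.values())


def weighted_avg(index):
    """Support-weighted mean of one metric column (0=precision, 1=recall, 2=f1)."""
    return sum(r[index] * r[3] for r in rows.values()) / total


weighted = [round(weighted_avg(i), 2) for i in range(3)]
print(total, weighted)  # prints: 2195 [0.84, 0.84, 0.84]
```

Since Positive has the larger support (1225 of 2195), the weighted averages sit closer to the Positive row than a plain mean of the two classes would.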