News Classifier for Urdu texts

Description

Classify Urdu news into 7 categories.

Predicted Entities

business, entertainment, health, inland, science, sports, weird_news


How to use

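# Python
import sparknlp
from sparknlp.base import DocumentAssembler, LightPipeline
from sparknlp.annotator import BertSentenceEmbeddings, ClassifierDLModel
from pyspark.ml import Pipeline

spark = sparknlp.start()

# Wrap the raw text column in a document annotation
document_assembler = DocumentAssembler() \
    .setInputCol("news") \
    .setOutputCol("document")

# Multilingual LaBSE sentence embeddings
embeddings = BertSentenceEmbeddings.pretrained("labse", "xx") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

classifierdl = ClassifierDLModel.pretrained("classifierdl_bert_news", "ur") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")

urdu_news_pipeline = Pipeline(stages=[document_assembler, embeddings, classifierdl])
# All stages are pretrained, so fitting on an empty DataFrame is enough
light_pipeline = LightPipeline(urdu_news_pipeline.fit(spark.createDataFrame([['']]).toDF("news")))

result = light_pipeline.annotate("گزشتہ ہفتے ایپل کے حصص میں 11 فیصد اضافہ ہوا ہے۔")
result["class"]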
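
// Scala (assumes an active SparkSession named `spark`, e.g. in spark-shell)
import com.johnsnowlabs.nlp.{DocumentAssembler, LightPipeline}
import com.johnsnowlabs.nlp.embeddings.BertSentenceEmbeddings
import com.johnsnowlabs.nlp.annotators.classifier.dl.ClassifierDLModel
import org.apache.spark.ml.Pipeline
import spark.implicits._

val document = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val embeddings = BertSentenceEmbeddings
  .pretrained("labse", "xx")
  .setInputCols("document")
  .setOutputCol("sentence_embeddings")

val document_classifier = ClassifierDLModel.pretrained("classifierdl_bert_news", "ur")
  .setInputCols("sentence_embeddings")
  .setOutputCol("class")

val nlpPipeline = new Pipeline().setStages(Array(document, embeddings, document_classifier))
// All stages are pretrained, so fitting on an empty DataFrame is enough
val light_pipeline = new LightPipeline(nlpPipeline.fit(Seq("").toDF("text")))
val result = light_pipeline.annotate("گزشتہ ہفتے ایپل کے حصص میں 11 فیصد اضافہ ہوا ہے۔")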

import nlu
nlu.load("ur.classify.news").predict("""گزشتہ ہفتے ایپل کے حصص میں 11 فیصد اضافہ ہوا ہے۔""")

Results

['business']
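
The `class` key of the annotated result holds a list of predicted labels, as in the output above. The following is a minimal sketch of post-processing such results, assuming `annotate` returns a dict for a single string and a list of dicts for a list of strings. The helper name `extract_labels` and the `sample` value are hypothetical and not part of Spark NLP; the label set is copied from the Predicted Entities section.

```python
from typing import Dict, List, Union

# Label set copied from the "Predicted Entities" section.
KNOWN_LABELS = {
    "business", "entertainment", "health", "inland",
    "science", "sports", "weird_news",
}

Annotation = Dict[str, List[str]]


def extract_labels(
    result: Union[Annotation, List[Annotation]],
    output_col: str = "class",
) -> List[str]:
    """Return the first predicted label per annotated text (hypothetical helper)."""
    rows = [result] if isinstance(result, dict) else list(result)
    labels = []
    for row in rows:
        predicted = row.get(output_col, [])
        if not predicted:
            raise ValueError(f"no prediction found under '{output_col}'")
        label = predicted[0]
        if label not in KNOWN_LABELS:
            raise ValueError(f"unexpected label: {label}")
        labels.append(label)
    return labels


# Stand-in for the output of light_pipeline.annotate(...), so the sketch runs without Spark.
sample = {"class": ["business"]}
print(extract_labels(sample))
```

Validating against the known label set is a defensive choice: it surfaces a mismatched output column or a different model early instead of passing unknown strings downstream.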

Model Information

Model Name: classifierdl_bert_news
Compatibility: Spark NLP 3.3.0+
License: Open Source
Edition: Official
Input Labels: [sentence_embeddings]
Output Labels: [class]
Language: ur
Size: 23.6 MB

Data Source

A combination of multiple open-source datasets.

Benchmarking

label           precision    recall  f1-score   support
business             0.83      0.86      0.85      2365
entertainment        0.87      0.85      0.86      3081
health               0.68      0.67      0.68       430
inland               0.80      0.82      0.81      3964
science              0.62      0.60      0.61       558
sports               0.88      0.89      0.89      4022
weird_news           0.60      0.54      0.57       826
accuracy                -         -      0.82     15246
macro-avg            0.76      0.75      0.75     15246
weighted-avg         0.82      0.82      0.82     15246
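
The summary rows of the report can be related to the per-label rows with a short calculation. The sketch below uses the usual definitions: F1 as the harmonic mean of precision and recall, macro average as the unweighted mean over labels, and weighted average as the support-weighted mean. The per-label numbers are copied from the table; the function names are illustrative only.

```python
# (precision, recall, f1, support) per label, copied from the benchmarking table.
REPORT = {
    "business": (0.83, 0.86, 0.85, 2365),
    "entertainment": (0.87, 0.85, 0.86, 3081),
    "health": (0.68, 0.67, 0.68, 430),
    "inland": (0.80, 0.82, 0.81, 3964),
    "science": (0.62, 0.60, 0.61, 558),
    "sports": (0.88, 0.89, 0.89, 4022),
    "weird_news": (0.60, 0.54, 0.57, 826),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def macro_avg(values):
    """Unweighted mean over labels."""
    return sum(values) / len(values)


def weighted_avg(values, weights):
    """Mean over labels, weighted by each label's support."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


precisions = [row[0] for row in REPORT.values()]
supports = [row[3] for row in REPORT.values()]

print("total support:", sum(supports))
print("macro precision: %.3f" % macro_avg(precisions))
print("weighted precision: %.3f" % weighted_avg(precisions, supports))
for label, (p, r, f1, _) in REPORT.items():
    print(f"{label}: reported f1={f1:.2f}, recomputed={f1_score(p, r):.3f}")
```

Because the table values are already rounded to two decimals, recomputed figures can differ from the reported ones by about 0.01.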