NER Model for 10 African Languages

Description

  • This model is imported from Hugging Face.

  • It’s been trained using xlm_roberta_large fine-tuned model on 10 African languages (Amharic, Hausa, Igbo, Kinyarwanda, Luganda, Nigerian, Pidgin, Swahilu, Wolof, and Yorùbá).

Predicted Entities

DATE, LOC, PER, ORG

Live Demo Open in Colab Download Copy S3 URI

How to use

documentAssembler = DocumentAssembler()\
.setInputCol("text")\
.setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
.setInputCols(["document"])\
.setOutputCol("sentence")

tokenizer = Tokenizer()\
.setInputCols(["sentence"])\
.setOutputCol("token")

tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlm_roberta_large_token_classifier_masakhaner", "xx"))\
.setInputCols(["sentence",'token'])\
.setOutputCol("ner")

ner_converter = NerConverter()\
.setInputCols(["sentence", "token", "ner"])\
.setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)
text = """አህመድ ቫንዳ ከ3-10-2000 ጀምሮ በአዲስ አበባ ኖሯል።"""
result = model.transform(spark.createDataFrame([[text]]).toDF("text"))
val documentAssembler = DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols(Array("document"))
.setOutputCol("sentence")

val tokenizer = Tokenizer()
.setInputCols(Array("sentence"))
.setOutputCol("token")

val tokenClassifier = XlmRoBertaForTokenClassification.pretrained("xlm_roberta_large_token_classifier_masakhaner", "xx"))
.setInputCols(Array("sentence","token"))
.setOutputCol("ner")

ner_converter = NerConverter()
.setInputCols(Array("sentence", "token", "ner"))
.setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter))

val example = Seq.empty["አህመድ ቫንዳ ከ3-10-2000 ጀምሮ በአዲስ አበባ ኖሯል።"].toDS.toDF("text")

val result = pipeline.fit(example).transform(example)
import nlu
nlu.load("xx.ner.masakhaner").predict("""አህመድ ቫንዳ ከ3-10-2000 ጀምሮ በአዲስ አበባ ኖሯል።""")

Results

+----------------+---------+
|chunk           |ner_label|
+----------------+---------+
|አህመድ ቫንዳ      |PER      |
|ከ3-10-2000 ጀምሮ|DATE      |
|በአዲስ አበባ       |LOC      |
+----------------+---------+

Model Information

Model Name: xlm_roberta_large_token_classifier_masakhaner
Compatibility: Spark NLP 3.3.2+
License: Open Source
Edition: Official
Input Labels: [sentence, token]
Output Labels: [ner]
Language: xx
Case sensitive: true
Max sentense length: 256

Data Source

https://huggingface.co/Davlan/xlm-roberta-large-masakhaner

Benchmarking

language:   F1-score:
--------    --------
amh	     75.76
hau	     91.75
ibo	     86.26
kin	     76.38
lug	     84.64
luo	     80.65
pcm	     89.55
swa	     89.48
wol	     70.70
yor	     82.05