Part of Speech for Amharic (pos_ud_att)

Description

This model annotates each token in a text with its part of speech. The tag set includes PRON (pronoun), CCONJ (coordinating conjunction), and the other Universal Dependencies tags listed under Predicted Entities. Part-of-speech tags are useful for automatically extracting the grammatical structure of a text.

Predicted Entities

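| POS tag |        Description        |
|:-------:|:-------------------------:|
|   ADJ   |         adjective         |
|   ADP   |         adposition        |
|   ADV   |           adverb          |
|   AUX   |         auxiliary         |
|  CCONJ  | coordinating conjunction  |
|   DET   |         determiner        |
|   INTJ  |        interjection       |
|   NOUN  |            noun           |
|   NUM   |          numeral          |
|   PART  |          particle         |
|   PRON  |          pronoun          |
|  PROPN  |        proper noun        |
|  PUNCT  |        punctuation        |
|  SCONJ  | subordinating conjunction |
|   SYM   |           symbol          |
|   VERB  |            verb           |
|    X    |           other           |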
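
When post-processing the tagger's output, it can help to expand the tags listed above into readable names. The following is a minimal sketch in plain Python; `UD_TAGS` and `describe` are hypothetical helper names introduced for illustration and are not part of Spark NLP.

```python
# Hypothetical lookup table built from the tag list above (not a Spark NLP API).
UD_TAGS = {
    "ADJ": "adjective",
    "ADP": "adposition",
    "ADV": "adverb",
    "AUX": "auxiliary",
    "CCONJ": "coordinating conjunction",
    "DET": "determiner",
    "INTJ": "interjection",
    "NOUN": "noun",
    "NUM": "numeral",
    "PART": "particle",
    "PRON": "pronoun",
    "PROPN": "proper noun",
    "PUNCT": "punctuation",
    "SCONJ": "subordinating conjunction",
    "SYM": "symbol",
    "VERB": "verb",
    "X": "other",
}


def describe(tags):
    """Map each POS tag to its description; an unknown tag raises KeyError."""
    return [UD_TAGS[tag] for tag in tags]


print(describe(["NOUN", "DET", "PART"]))  # ['noun', 'determiner', 'particle']
```

Raising on an unknown tag, rather than returning a default, makes it obvious if a different tag set is fed in by mistake.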


How to use

Use the model as a stage in a Spark NLP pipeline, after tokenization.

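# Python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, PerceptronModel
from pyspark.ml import Pipeline

# Start (or reuse) a Spark session with Spark NLP loaded
spark = sparknlp.start()

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# Input columns follow the model's input labels: [sentence, token]
pos = PerceptronModel.pretrained("pos_ud_att", "am") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("pos")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    pos
])

example = spark.createDataFrame([['ልጅ ኡ ን ሥራ ው ን አስጨርስ ኧው ኣል ኧሁ ።']], ["text"])

result = pipeline.fit(example).transform(example)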

// Scala
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._  // needed for toDF; assumes a SparkSession named `spark`

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

// Input columns follow the model's input labels: [sentence, token]
val pos = PerceptronModel.pretrained("pos_ud_att", "am")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("pos")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))

val data = Seq("ልጅ ኡ ን ሥራ ው ን አስጨርስ ኧው ኣል ኧሁ ።").toDF("text")
val result = pipeline.fit(data).transform(data)

# NLU
import nlu

text = ["ልጅ ኡ ን ሥራ ው ን አስጨርስ ኧው ኣል ኧሁ ።"]
pos_df = nlu.load('am.pos').predict(text)
pos_df

Results

+------------------------------+----------------------------------------------------------------+
|text                          |result                                                          |
+------------------------------+----------------------------------------------------------------+
|ልጅ ኡ ን ሥራ ው ን አስጨርስ ኧው ኣል ኧሁ ።|[NOUN, DET, PART, NOUN, DET, PART, VERB, PRON, AUX, PRON, PUNCT]|
+------------------------------+----------------------------------------------------------------+
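
The `result` array lines up one-to-one with the tokens of the input, so the two can be zipped together for downstream use. The sketch below works on the sample row above in plain Python rather than on a live Spark DataFrame. It assumes that whitespace splitting reproduces the tokenizer's output, which holds for this pre-segmented sentence but is not guaranteed in general.

```python
from collections import Counter

# Sample text and tags copied from the results table above.
text = "ልጅ ኡ ን ሥራ ው ን አስጨርስ ኧው ኣል ኧሁ ።"
tags = ["NOUN", "DET", "PART", "NOUN", "DET", "PART",
        "VERB", "PRON", "AUX", "PRON", "PUNCT"]

# Assumption: splitting on whitespace matches the tokenizer for this sentence.
tokens = text.split()
if len(tokens) != len(tags):
    raise ValueError("tokens and tags are not aligned")

# Pair each token with its tag, then derive two simple views of the data.
tagged = list(zip(tokens, tags))
nouns = [token for token, tag in tagged if tag == "NOUN"]
tag_counts = Counter(tags)

print(tagged[:3])
print(nouns)
print(tag_counts["PRON"])
```

With a real pipeline the same pairing would be done on the `token` and `pos` columns of the transformed DataFrame instead of on hard-coded lists.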

Model Information

Model Name: pos_ud_att
Compatibility: Spark NLP 2.7.0+
License: Open Source
Edition: Official
Input Labels: [sentence, token]
Output Labels: [pos]
Language: am

Data Source

The model was trained on data from Universal Dependencies version 2.7.

Reference:

  • Binyam Ephrem Seyoum, Yusuke Miyao, and Baye Yimam Mekonnen. 2018. Universal Dependencies for Amharic. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), pp. 2216–2222, Miyazaki, Japan. European Language Resources Association (ELRA).

Benchmarking

|              | precision | recall | f1-score | support |
|:------------:|:---------:|:------:|:--------:|:-------:|
|      ADJ     |    1.00   |  0.97  |   0.99   |   116   |
|      ADP     |    0.99   |  1.00  |   0.99   |   681   |
|      ADV     |    0.94   |  0.99  |   0.96   |    93   |
|      AUX     |    1.00   |  1.00  |   1.00   |   419   |
|     CCONJ    |    0.99   |  0.97  |   0.98   |    99   |
|      DET     |    0.99   |  1.00  |   0.99   |   485   |
|     INTJ     |    0.97   |  0.99  |   0.98   |    67   |
|     NOUN     |    0.99   |  1.00  |   1.00   |   1485  |
|      NUM     |    1.00   |  1.00  |   1.00   |    42   |
|     PART     |    1.00   |  1.00  |   1.00   |   875   |
|     PRON     |    1.00   |  1.00  |   1.00   |   2547  |
|     PROPN    |    1.00   |  0.99  |   0.99   |   236   |
|     PUNCT    |    1.00   |  1.00  |   1.00   |   1093  |
|     SCONJ    |    1.00   |  0.98  |   0.99   |   214   |
|     VERB     |    1.00   |  1.00  |   1.00   |   1552  |
|   accuracy   |           |        |   1.00   |  10004  |
|   macro avg  |    0.99   |  0.99  |   0.99   |  10004  |
| weighted avg |    1.00   |  1.00  |   1.00   |  10004  |
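
The summary rows can be cross-checked against the per-tag rows. The sketch below assumes the usual classification-report definitions: the macro average is the unweighted mean of the per-tag scores, and the weighted average weights each tag's score by its support. Because the inputs are the rounded values from the table, the results are approximate.

```python
# (tag, f1-score, support) copied from the benchmarking table above.
rows = [
    ("ADJ", 0.99, 116), ("ADP", 0.99, 681), ("ADV", 0.96, 93),
    ("AUX", 1.00, 419), ("CCONJ", 0.98, 99), ("DET", 0.99, 485),
    ("INTJ", 0.98, 67), ("NOUN", 1.00, 1485), ("NUM", 1.00, 42),
    ("PART", 1.00, 875), ("PRON", 1.00, 2547), ("PROPN", 0.99, 236),
    ("PUNCT", 1.00, 1093), ("SCONJ", 0.99, 214), ("VERB", 1.00, 1552),
]

# Total number of evaluated tokens across all tags.
total_support = sum(support for _, _, support in rows)

# Macro average: every tag counts equally, regardless of frequency.
macro_f1 = sum(f1 for _, f1, _ in rows) / len(rows)

# Weighted average: each tag contributes in proportion to its support.
weighted_f1 = sum(f1 * support for _, f1, support in rows) / total_support

print(total_support)          # 10004
print(round(macro_f1, 2))     # 0.99
print(round(weighted_f1, 2))  # 1.0
```

The gap between the two averages comes from the rarer tags (ADV, CCONJ, INTJ) scoring slightly lower while contributing little to the weighted figure.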