Part of Speech for Amharic (pos_ud_att)

Description

This model annotates each token in a text with its part of speech. The tag set follows Universal Dependencies and includes PRON (pronoun), CCONJ (coordinating conjunction), and the other tags listed under Predicted Entities. Part-of-speech tags are useful for automatically extracting the grammatical structure of a text.

Predicted Entities

POS tag	Description
ADJ	adjective
ADP	adposition
ADV	adverb
AUX	auxiliary
CCONJ	coordinating conjunction
DET	determiner
INTJ	interjection
NOUN	noun
NUM	numeral
PART	particle
PRON	pronoun
PROPN	proper noun
PUNCT	punctuation
SCONJ	subordinating conjunction
SYM	symbol
VERB	verb
X	other


How to use

Use this model as part of an NLP pipeline, after tokenization.

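import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, PerceptronModel
from pyspark.ml import Pipeline

# Start (or reuse) a Spark session with Spark NLP loaded
spark = sparknlp.start()

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# Download the pretrained Amharic POS tagger
pos = PerceptronModel.pretrained("pos_ud_att", "am") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("pos")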

pipeline = Pipeline(stages=[
        document_assembler,
        sentence_detector,
        tokenizer,
        pos
    ])

example = spark.createDataFrame([['ልጅ ኡ ን ሥራ ው ን አስጨርስ ኧው ኣል ኧሁ ።']], ["text"])

result = pipeline.fit(example).transform(example)

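import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._ // assumes an active SparkSession named spark; provides toDF

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")

// Download the pretrained Amharic POS tagger
val pos = PerceptronModel.pretrained("pos_ud_att", "am")
    .setInputCols(Array("document", "token"))
    .setOutputCol("pos")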

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))

val data = Seq("ልጅ ኡ ን ሥራ ው ን አስጨርስ ኧው ኣል ኧሁ ።").toDF("text")
val result = pipeline.fit(data).transform(data)

import nlu

text = ["ልጅ ኡ ን ሥራ ው ን አስጨርስ ኧው ኣል ኧሁ ።"]
pos_df = nlu.load('am.pos').predict(text)
pos_df

Results

+------------------------------+----------------------------------------------------------------+
|text                          |result                                                          |
+------------------------------+----------------------------------------------------------------+
|ልጅ ኡ ን ሥራ ው ን አስጨርስ ኧው ኣል ኧሁ ።|[NOUN, DET, PART, NOUN, DET, PART, VERB, PRON, AUX, PRON, PUNCT]|
+------------------------------+----------------------------------------------------------------+
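
The result column is a flat list of tags, one per token, so a common next step is to line the tags up with their tokens. The following is a minimal sketch in plain Python that uses the sample sentence and tags from the table above. The helper name `pair_tokens_with_tags` is hypothetical, and splitting on whitespace is an assumption that happens to hold for this pre-segmented sample; in a real pipeline the tokens would come from the `token` column instead.

```python
# Sample sentence and predicted tags, copied from the Results table.
text = "ልጅ ኡ ን ሥራ ው ን አስጨርስ ኧው ኣል ኧሁ ።"
tags = ["NOUN", "DET", "PART", "NOUN", "DET", "PART",
        "VERB", "PRON", "AUX", "PRON", "PUNCT"]


def pair_tokens_with_tags(tokens, tags):
    """Hypothetical helper: align tokens with tags, rejecting length mismatches."""
    if len(tokens) != len(tags):
        raise ValueError(
            f"got {len(tokens)} tokens but {len(tags)} tags; cannot align them"
        )
    return list(zip(tokens, tags))


# Assumption: the sample is already segmented, so whitespace splitting
# yields one token per predicted tag.
tokens = text.split()
pairs = pair_tokens_with_tags(tokens, tags)

# Example of using the tags: keep only the tokens tagged as nouns.
nouns = [token for token, tag in pairs if tag == "NOUN"]

for token, tag in pairs:
    print(f"{token}\t{tag}")
print("Nouns:", nouns)
```

The length check is there because a silent `zip` would truncate to the shorter list and hide a tokenization mismatch.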

Model Information

Model Name:	pos_ud_att
Compatibility:	Spark NLP 2.7.0+
License:	Open Source
Edition:	Official
Input Labels:	[sentence, token]
Output Labels:	[pos]
Language:	am

Data Source

The model was trained on Universal Dependencies version 2.7 data.

Reference:

Binyam Ephrem Seyoum, Yusuke Miyao, and Baye Yimam Mekonnen. 2018. Universal Dependencies for Amharic. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), pp. 2216–2222, Miyazaki, Japan. European Language Resources Association (ELRA).

Benchmarking

|              | precision | recall | f1-score | support |
|:------------:|:---------:|:------:|:--------:|:-------:|
|      ADJ     |    1.00   |  0.97  |   0.99   |   116   |
|      ADP     |    0.99   |  1.00  |   0.99   |   681   |
|      ADV     |    0.94   |  0.99  |   0.96   |    93   |
|      AUX     |    1.00   |  1.00  |   1.00   |   419   |
|     CCONJ    |    0.99   |  0.97  |   0.98   |    99   |
|      DET     |    0.99   |  1.00  |   0.99   |   485   |
|     INTJ     |    0.97   |  0.99  |   0.98   |    67   |
|     NOUN     |    0.99   |  1.00  |   1.00   |   1485  |
|      NUM     |    1.00   |  1.00  |   1.00   |    42   |
|     PART     |    1.00   |  1.00  |   1.00   |   875   |
|     PRON     |    1.00   |  1.00  |   1.00   |   2547  |
|     PROPN    |    1.00   |  0.99  |   0.99   |   236   |
|     PUNCT    |    1.00   |  1.00  |   1.00   |   1093  |
|     SCONJ    |    1.00   |  0.98  |   0.99   |   214   |
|     VERB     |    1.00   |  1.00  |   1.00   |   1552  |
|   accuracy   |           |        |   1.00   |  10004  |
|   macro avg  |    0.99   |  0.99  |   0.99   |  10004  |
| weighted avg |    1.00   |  1.00  |   1.00   |  10004  |
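
To make the difference between the macro and weighted averages concrete, here is a small sketch that recomputes both F1 averages from the per-class rows of the table. It works from the rounded values shown above, so it can only approximate the figures the original evaluation produced; the variable names are illustrative.

```python
# (f1-score, support) per tag, copied from the benchmark table.
per_class = {
    "ADJ": (0.99, 116),
    "ADP": (0.99, 681),
    "ADV": (0.96, 93),
    "AUX": (1.00, 419),
    "CCONJ": (0.98, 99),
    "DET": (0.99, 485),
    "INTJ": (0.98, 67),
    "NOUN": (1.00, 1485),
    "NUM": (1.00, 42),
    "PART": (1.00, 875),
    "PRON": (1.00, 2547),
    "PROPN": (0.99, 236),
    "PUNCT": (1.00, 1093),
    "SCONJ": (0.99, 214),
    "VERB": (1.00, 1552),
}

total_support = sum(support for _, support in per_class.values())

# Macro average: every tag counts equally, regardless of frequency.
macro_f1 = sum(f1 for f1, _ in per_class.values()) / len(per_class)

# Weighted average: each tag's F1 is weighted by its number of test tokens.
weighted_f1 = (
    sum(f1 * support for f1, support in per_class.values()) / total_support
)

print(f"total support:   {total_support}")
print(f"macro avg f1:    {macro_f1:.2f}")
print(f"weighted avg f1: {weighted_f1:.2f}")
```

The macro average comes out slightly lower than the weighted one because the weaker scores belong to infrequent tags such as ADV and INTJ, which carry little weight once support is taken into account.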
