Part of Speech for Arabic

Description

This model annotates the part of speech of tokens in a text. The parts of speech annotated include PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically.

Open in Colab Download Copy S3 URI

How to use

...
pos = PerceptronModel.pretrained("pos_ud_padt", "ar") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("pos")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate(["كرستيانو رونالدو لاعب برتغالي محترف يلعب في صفوف منتخب البرتغال لكرة القدم"])
...
val pos = PerceptronModel.pretrained("pos_ud_padt", "ar")
    .setInputCols(Array("document", "token"))
    .setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
val data = Seq("كرستيانو رونالدو لاعب برتغالي محترف يلعب في صفوف منتخب البرتغال لكرة القدم").toDF("text")
val result = pipeline.fit(data).transform(data)
import nlu

text = ["""كرستيانو رونالدو لاعب برتغالي محترف يلعب في صفوف منتخب البرتغال لكرة القدم"""]
pos_df = nlu.load('ar.pos').predict(text)
pos_df

Results

{'pos': [Annotation(pos, 0, 7, X, {'word': 'كرستيانو'}),
   Annotation(pos, 9, 15, X, {'word': 'رونالدو'}),
   Annotation(pos, 17, 20, NOUN, {'word': 'لاعب'}),
   Annotation(pos, 22, 28, X, {'word': 'برتغالي'}),
   Annotation(pos, 30, 34, X, {'word': 'محترف'}),
   Annotation(pos, 36, 39, VERB, {'word': 'يلعب'}),
   Annotation(pos, 41, 42, ADP, {'word': 'في'}),
   Annotation(pos, 44, 47, NOUN, {'word': 'صفوف'}),
   Annotation(pos, 49, 53, NOUN, {'word': 'منتخب'}),
   Annotation(pos, 55, 62, X, {'word': 'البرتغال'}),
   Annotation(pos, 64, 67, CCONJ, {'word': 'لكرة'}),
   Annotation(pos, 69, 73, NOUN, {'word': 'القدم'})],

Model Information

Model Name: pos_ud_padt
Compatibility: Spark NLP 2.7.0+
Edition: Official
Input Labels: [tags, document]
Output Labels: [pos]
Language: ar

Data Source

The model is trained on data obtained from https://universaldependencies.org

Benchmarking

|    |              | precision   | recall   |   f1-score |   support |
|---:|:-------------|:------------|:---------|-----------:|----------:|
|  0 | ADJ          | 0.90        | 0.91     |       0.91 |      2937 |
|  1 | ADP          | 0.99        | 1.00     |       0.99 |      4528 |
|  2 | ADV          | 0.96        | 0.93     |       0.95 |       104 |
|  3 | AUX          | 0.88        | 0.85     |       0.87 |       197 |
|  4 | CCONJ        | 1.00        | 0.99     |       0.99 |      1963 |
|  5 | DET          | 0.95        | 0.96     |       0.96 |       623 |
|  6 | NOUN         | 0.94        | 0.96     |       0.95 |      9547 |
|  7 | NUM          | 0.98        | 0.97     |       0.98 |       779 |
|  8 | None         | 1.00        | 1.00     |       1    |      3868 |
|  9 | PART         | 0.92        | 0.93     |       0.93 |       226 |
| 10 | PRON         | 0.99        | 1.00     |       1    |      1133 |
| 11 | PROPN        | 1.00        | 0.48     |       0.65 |        31 |
| 12 | PUNCT        | 1.00        | 1.00     |       1    |      2052 |
| 13 | SCONJ        | 0.99        | 0.98     |       0.98 |       534 |
| 14 | SYM          | 1.00        | 0.98     |       0.99 |        41 |
| 15 | VERB         | 0.94        | 0.93     |       0.94 |      2189 |
| 16 | X            | 0.80        | 0.64     |       0.71 |      1380 |
| 17 | accuracy     |             |          |       0.96 |     32132 |
| 18 | macro avg    | 0.95        | 0.91     |       0.93 |     32132 |
| 19 | weighted avg | 0.95        | 0.96     |       0.95 |     32132 |