Part of Speech for Persian

Description

This model tags each token in a Persian text with its part of speech. The tag set follows Universal Dependencies and includes PRON (pronoun), CCONJ (coordinating conjunction), and 15 others. Part-of-speech tags are useful for automatically extracting the grammatical structure of a text.


How to use

...
pos = PerceptronModel.pretrained("pos_ud_perdt", "fa") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("pos")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos])
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([['']]).toDF("text")))
results = light_pipeline.fullAnnotate(["جان اسنو جدا از سلطنت شمال ، یک پزشک انگلیسی و رهبر توسعه بیهوشی و بهداشت پزشکی است."])
...
val pos = PerceptronModel.pretrained("pos_ud_perdt", "fa")
    .setInputCols(Array("document", "token"))
    .setOutputCol("pos")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))
import spark.implicits._  // needed for .toDF on a Seq
val data = Seq("جان اسنو جدا از سلطنت شمال ، یک پزشک انگلیسی و رهبر توسعه بیهوشی و بهداشت پزشکی است.").toDF("text")
val result = pipeline.fit(data).transform(data)

Results

{'pos': [Annotation(pos, 0, 2, NOUN, {'word': 'جان'}),
   Annotation(pos, 4, 7, NOUN, {'word': 'اسنو'}),
   Annotation(pos, 9, 11, ADJ, {'word': 'جدا'}),
   Annotation(pos, 13, 14, ADP, {'word': 'از'}),
   Annotation(pos, 16, 20, NOUN, {'word': 'سلطنت'}),
   Annotation(pos, 22, 25, NOUN, {'word': 'شمال'}),
   Annotation(pos, 27, 27, PUNCT, {'word': '،'}),
   Annotation(pos, 29, 30, NUM, {'word': 'یک'}),
   Annotation(pos, 32, 35, NOUN, {'word': 'پزشک'}),
   Annotation(pos, 37, 43, ADJ, {'word': 'انگلیسی'}),
   Annotation(pos, 45, 45, CCONJ, {'word': 'و'}),
   Annotation(pos, 47, 50, NOUN, {'word': 'رهبر'}),
   Annotation(pos, 52, 56, NOUN, {'word': 'توسعه'}),
   Annotation(pos, 58, 63, VERB, {'word': 'بیهوشی'}),
   Annotation(pos, 65, 65, CCONJ, {'word': 'و'}),
   Annotation(pos, 67, 72, NOUN, {'word': 'بهداشت'}),
   Annotation(pos, 74, 78, ADJ, {'word': 'پزشکی'}),
   Annotation(pos, 80, 82, AUX, {'word': 'است'}),
   Annotation(pos, 83, 83, PUNCT, {'word': '.'})]}
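
The annotations above can be flattened into (word, tag) pairs, and their offsets checked against the input text. The following is a minimal sketch, not part of the model's API: `Ann` is a hypothetical stand-in that carries only the `begin`, `end`, `result` and `metadata` fields shown in the output, and `word_tag_pairs` and `offsets_match` are illustrative helper names. It uses the first four tokens of the example sentence and assumes, as the output above suggests, that `end` is an inclusive character index.

```python
from collections import namedtuple

# Hypothetical stand-in for a Spark NLP annotation; it holds only the
# fields visible in the printed results (begin, end, result, metadata).
Ann = namedtuple("Ann", ["begin", "end", "result", "metadata"])

# First four tokens of the example sentence, with the tags shown above.
text = "جان اسنو جدا از"
sample = [
    Ann(0, 2, "NOUN", {"word": "جان"}),
    Ann(4, 7, "NOUN", {"word": "اسنو"}),
    Ann(9, 11, "ADJ", {"word": "جدا"}),
    Ann(13, 14, "ADP", {"word": "از"}),
]

def word_tag_pairs(annotations):
    """Return (word, tag) pairs from a list of POS annotations."""
    return [(a.metadata["word"], a.result) for a in annotations]

def offsets_match(text, annotations):
    """Check that text[begin:end + 1] reproduces each annotated word."""
    return all(
        text[a.begin:a.end + 1] == a.metadata["word"] for a in annotations
    )

pairs = word_tag_pairs(sample)
print(pairs)
print(offsets_match(text, sample))
```

With the pipeline from the usage section, the same helpers would be applied to `results[0]['pos']` instead of `sample`.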

Model Information

Model Name: pos_ud_perdt
Compatibility: Spark NLP 2.7.0+
Edition: Official
Input Labels: [document, token]
Output Labels: [pos]
Language: fa

Data Source

The model was trained on data from Universal Dependencies (https://universaldependencies.org).

Benchmarking

| label        | precision | recall | f1-score | support |
|:-------------|----------:|-------:|---------:|--------:|
| ADJ          |      0.88 |   0.88 |     0.88 |    1647 |
| ADP          |      0.99 |   0.99 |     0.99 |    3402 |
| ADV          |      0.94 |   0.91 |     0.92 |     383 |
| AUX          |      0.99 |   0.99 |     0.99 |    1000 |
| CCONJ        |      1.00 |   1.00 |     1.00 |    1022 |
| DET          |      0.94 |   0.96 |     0.95 |     490 |
| INTJ         |      0.88 |   0.81 |     0.85 |      27 |
| NOUN         |      0.95 |   0.96 |     0.95 |    8201 |
| NUM          |      0.94 |   0.97 |     0.96 |     293 |
| None         |      1.00 |   0.99 |     0.99 |     289 |
| PART         |      1.00 |   0.86 |     0.92 |      28 |
| PRON         |      0.98 |   0.97 |     0.98 |    1117 |
| PROPN        |      0.84 |   0.78 |     0.81 |    1107 |
| PUNCT        |      1.00 |   1.00 |     1.00 |    2134 |
| SCONJ        |      0.98 |   0.98 |     0.98 |     630 |
| VERB         |      0.99 |   0.99 |     0.99 |    2581 |
| accuracy     |           |        |     0.96 |   24351 |
| macro avg    |      0.96 |   0.94 |     0.95 |   24351 |
| weighted avg |      0.96 |   0.96 |     0.96 |   24351 |
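
The two average rows can be reproduced from the per-label rows. The sketch below assumes the table follows the usual classification-report convention: "macro avg" is the unweighted mean over the 16 labels, and "weighted avg" weights each label by its support. The f1 and support values are copied from the table, so they are already rounded to two decimals and the recomputed averages only approximate the reported ones. The function names are illustrative.

```python
# (f1-score, support) per label, copied from the benchmark table.
rows = {
    "ADJ": (0.88, 1647), "ADP": (0.99, 3402), "ADV": (0.92, 383),
    "AUX": (0.99, 1000), "CCONJ": (1.00, 1022), "DET": (0.95, 490),
    "INTJ": (0.85, 27), "NOUN": (0.95, 8201), "NUM": (0.96, 293),
    "None": (0.99, 289), "PART": (0.92, 28), "PRON": (0.98, 1117),
    "PROPN": (0.81, 1107), "PUNCT": (1.00, 2134), "SCONJ": (0.98, 630),
    "VERB": (0.99, 2581),
}

def macro_avg(values):
    """Unweighted mean: every label counts equally."""
    return sum(values) / len(values)

def weighted_avg(values, weights):
    """Mean weighted by support: frequent labels count more."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

f1_scores = [f1 for f1, _ in rows.values()]
supports = [support for _, support in rows.values()]

total_support = sum(supports)
macro_f1 = macro_avg(f1_scores)
weighted_f1 = weighted_avg(f1_scores, supports)

print(total_support, round(macro_f1, 3), round(weighted_f1, 3))
```

The macro average sits slightly below the weighted one because low-support labels with weaker scores, such as INTJ and PART, count as much as NOUN in the macro mean but contribute little to the weighted mean.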