Part of Speech for Bengali (pos_msri)

Description

This model annotates each token in a text with its part of speech. The tagset includes NN (noun), CC (coordinating and subordinating conjuncts), and 26 others. Part-of-speech tags are useful for automatically extracting the grammatical structure of a piece of text.

Predicted Entities

BM (Not Documented), CC (Conjuncts, Coordinating and Subordinating), CL (Clitics), DEM (Demonstratives), INJ (Interjection), INTF (Intensifier), JJ (Adjective), NEG (Negative), NN (Noun), NNC (Compound Nouns), NNP (Proper Noun), NST (Preposition of Direction), PPR (Postposition), PRP (Pronoun), PSP (Preposition), QC (Cardinal Number), QF (Quantifiers), QO (Ordinal Numbers), RB (Adverb), RDP (Not Documented), RP (Particle), SYM (Special Symbol), UT (Not Documented), VAUX (Verb Auxiliary), VM (Verb), WQ (wh- qualifier)


How to use

Use as part of an NLP pipeline, after sentence detection and tokenization.

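# Python
import pandas as pd
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, PerceptronModel
from pyspark.ml import Pipeline

# Start (or reuse) a Spark session with Spark NLP loaded
spark = sparknlp.start()

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# Pretrained Bengali POS tagger; inputs match the model's input labels
pos = PerceptronModel.pretrained("pos_msri", "bn") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("pos")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    pos
])

example = spark.createDataFrame(pd.DataFrame({'text': ["বাসস্থান-ঘরগৃহস্থালি তোড়া ভাষায় গ্রামকেও বলে ` মোদ ' ৷"]}))

result = pipeline.fit(example).transform(example)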
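
// Scala
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._  // needed for .toDS on a Seq

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val sentence_detector = new SentenceDetector()
    .setInputCols(Array("document"))
    .setOutputCol("sentence")

val tokenizer = new Tokenizer()
    .setInputCols(Array("sentence"))
    .setOutputCol("token")

// Pretrained Bengali POS tagger; inputs match the model's input labels
val pos = PerceptronModel.pretrained("pos_msri", "bn")
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("pos")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))

val data = Seq("বাসস্থান-ঘরগৃহস্থালি তোড়া ভাষায় গ্রামকেও বলে ` মোদ ' ৷").toDS.toDF("text")

val result = pipeline.fit(data).transform(data)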

# NLU
import nlu

text = ["বাসস্থান-ঘরগৃহস্থালি তোড়া ভাষায় গ্রামকেও বলে ` মোদ ' ৷"]
# One output row per token, with its predicted tag
pos_df = nlu.load('bn.pos').predict(text, output_level = "token")
pos_df

Results

+------------------------------------------------------+----------------------------------------+
|text                                                  |result                                  |
+------------------------------------------------------+----------------------------------------+
|বাসস্থান-ঘরগৃহস্থালি তোড়া ভাষায় গ্রামকেও বলে ` মোদ ' ৷|[NN, NNP, NN, NN, VM, SYM, NN, SYM, SYM]|
+------------------------------------------------------+----------------------------------------+
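
The tags in the result column line up with the tokens by position. As a rough illustration in plain Python (no Spark session needed), the sketch below pairs each tag from the row above with its token. It assumes that splitting the example sentence on whitespace reproduces the pipeline's tokenization, which holds for this pre-spaced sentence (9 tokens, 9 tags) but not in general. The helper name `pair_tokens_with_tags` is hypothetical and not part of Spark NLP.

```python
# Sentence and tags copied from the Results table above.
text = "বাসস্থান-ঘরগৃহস্থালি তোড়া ভাষায় গ্রামকেও বলে ` মোদ ' ৷"
tags = ["NN", "NNP", "NN", "NN", "VM", "SYM", "NN", "SYM", "SYM"]


def pair_tokens_with_tags(sentence, pos_tags):
    """Pair whitespace-separated tokens with their tags (hypothetical helper).

    Assumption: whitespace splitting matches the pipeline's tokenization.
    """
    tokens = sentence.split()
    if len(tokens) != len(pos_tags):
        raise ValueError("token and tag counts differ")
    return list(zip(tokens, pos_tags))


pairs = pair_tokens_with_tags(text, tags)
for token, tag in pairs:
    print(f"{token}\t{tag}")
```

In the Spark pipeline itself, the same information lives in the `token` and `pos` annotation columns of `result`, so no re-tokenization is needed there.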

Model Information

Model Name: pos_msri
Compatibility: Spark NLP 2.7.0+
License: Open Source
Edition: Official
Input Labels: [sentence, token]
Output Labels: [pos]
Language: bn

Data Source

The model was trained on the Indian Language POS-Tagged Corpus from NLTK, collected by A Kumaran (Microsoft Research, India).

Benchmarking

|              | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| BM           | 1.00      | 1.00   | 1.00     | 1       |
| CC           | 0.99      | 0.99   | 0.99     | 390     |
| CL           | 1.00      | 1.00   | 1.00     | 2       |
| DEM          | 0.98      | 0.99   | 0.98     | 139     |
| INJ          | 0.92      | 0.85   | 0.88     | 13      |
| INTF         | 1.00      | 1.00   | 1.00     | 55      |
| JJ           | 0.99      | 0.99   | 0.99     | 688     |
| NEG          | 0.99      | 0.98   | 0.99     | 135     |
| NN           | 0.99      | 0.99   | 0.99     | 2996    |
| NNC          | 1.00      | 1.00   | 1.00     | 4       |
| NNP          | 0.97      | 0.98   | 0.97     | 528     |
| NST          | 1.00      | 1.00   | 1.00     | 156     |
| PPR          | 1.00      | 1.00   | 1.00     | 1       |
| PRP          | 0.98      | 0.98   | 0.98     | 685     |
| PSP          | 0.99      | 0.99   | 0.99     | 250     |
| QC           | 0.99      | 0.99   | 0.99     | 193     |
| QF           | 0.98      | 0.98   | 0.98     | 187     |
| QO           | 1.00      | 1.00   | 1.00     | 22      |
| RB           | 0.99      | 0.99   | 0.99     | 187     |
| RDP          | 1.00      | 0.98   | 0.99     | 44      |
| RP           | 0.99      | 0.96   | 0.97     | 79      |
| SYM          | 0.97      | 0.98   | 0.98     | 1413    |
| UNK          | 1.00      | 1.00   | 1.00     | 1       |
| UT           | 1.00      | 1.00   | 1.00     | 18      |
| VAUX         | 0.97      | 0.97   | 0.97     | 400     |
| VM           | 0.99      | 0.98   | 0.98     | 1393    |
| WQ           | 1.00      | 0.99   | 0.99     | 71      |
| XC           | 0.98      | 0.97   | 0.97     | 219     |
| accuracy     |           |        | 0.98     | 10270   |
| macro avg    | 0.99      | 0.98   | 0.99     | 10270   |
| weighted avg | 0.98      | 0.98   | 0.98     | 10270   |
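
As a sanity check on the summary rows, the total support and the support-weighted average F1 can be recomputed from the per-label rows. The sketch below copies the f1-score and support columns from the table; because those values are already rounded to two decimals, the result only approximates the reported figure. The variable names are illustrative.

```python
# (f1-score, support) per label, copied from the benchmark table above.
rows = {
    "BM": (1.00, 1), "CC": (0.99, 390), "CL": (1.00, 2), "DEM": (0.98, 139),
    "INJ": (0.88, 13), "INTF": (1.00, 55), "JJ": (0.99, 688), "NEG": (0.99, 135),
    "NN": (0.99, 2996), "NNC": (1.00, 4), "NNP": (0.97, 528), "NST": (1.00, 156),
    "PPR": (1.00, 1), "PRP": (0.98, 685), "PSP": (0.99, 250), "QC": (0.99, 193),
    "QF": (0.98, 187), "QO": (1.00, 22), "RB": (0.99, 187), "RDP": (0.99, 44),
    "RP": (0.97, 79), "SYM": (0.98, 1413), "UNK": (1.00, 1), "UT": (1.00, 18),
    "VAUX": (0.97, 400), "VM": (0.98, 1393), "WQ": (0.99, 71), "XC": (0.97, 219),
}

# Total number of evaluated tokens across all labels.
total_support = sum(support for _, support in rows.values())

# Each label's F1 contributes in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in rows.values()) / total_support

print(total_support)          # 10270
print(round(weighted_f1, 2))  # 0.98
```

Both values agree with the `weighted avg` row. Frequent labels such as NN, SYM, and VM dominate this average, so the scores for very rare labels (for example BM or PPR, with a support of 1) say little about how the model handles them.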