Bhojpuri Lemmatizer

Description

This model uses context and language knowledge to assign all forms and inflections of a word to a single root. This enables the pipeline to treat the past and present tense of a verb, for example, as the same word instead of two completely different words. The lemmatizer takes into consideration the context surrounding a word to determine which root is correct when the word form alone is ambiguous.
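The idea of collapsing inflected forms onto one root can be illustrated with a toy lookup. This is only a sketch and not this model's implementation: the English table `TOY_LEMMAS` and the helper `toy_lemmatize` are hypothetical, and unlike the description above the sketch ignores context entirely.

```python
# Toy lookup table (hypothetical English data, for illustration only)
TOY_LEMMAS = {
    "ran": "run",
    "running": "run",
    "runs": "run",
    "went": "go",
    "goes": "go",
}


def toy_lemmatize(tokens):
    """Map each token to its root; unknown tokens pass through unchanged."""
    return [TOY_LEMMAS.get(tok.lower(), tok) for tok in tokens]


# Past and present tense of the same verb end up as the same word
print(toy_lemmatize(["She", "ran", "and", "runs"]))
# → ['She', 'run', 'and', 'run']
```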


How to use

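# Python
import sparknlp
from sparknlp.base import DocumentAssembler, LightPipeline
from sparknlp.annotator import Tokenizer, LemmatizerModel
from pyspark.ml import Pipeline

# Start (or reuse) a Spark session with Spark NLP loaded
spark = sparknlp.start()

# Wrap the raw text column into Spark NLP documents
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Split each document into tokens
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Download the pretrained Bhojpuri lemmatizer and map tokens to lemmas
lemmatizer = LemmatizerModel.pretrained("lemma", "bh") \
    .setInputCols(["token"]) \
    .setOutputCol("lemma")

nlp_pipeline = Pipeline(stages=[document_assembler, tokenizer, lemmatizer])
# Fit on an empty DataFrame: all stages are pretrained or rule-based
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))

results = light_pipeline.fullAnnotate(["एह आयोजन में विश्व भोजपुरी सम्मेलन , पूर्वांचल एकता मंच , वीर कुँवर सिंह फाउन्डेशन , पूर्वांचल भोजपुरी महासभा , अउर हर्फ - मीडिया के सहभागिता बा ।"])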
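
// Scala
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._ // needed for Seq(...).toDF

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val lemmatizer = LemmatizerModel.pretrained("lemma", "bh")
    .setInputCols("token")
    .setOutputCol("lemma")

val nlp_pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, lemmatizer))

// Build a one-row DataFrame with the sample sentence, then fit and transform it
val data = Seq("एह आयोजन में विश्व भोजपुरी सम्मेलन , पूर्वांचल एकता मंच , वीर कुँवर सिंह फाउन्डेशन , पूर्वांचल भोजपुरी महासभा , अउर हर्फ - मीडिया के सहभागिता बा ।").toDF("text")
val result = nlp_pipeline.fit(data).transform(data)
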
import nlu

text = ["एह आयोजन में विश्व भोजपुरी सम्मेलन , पूर्वांचल एकता मंच , वीर कुँवर सिंह फाउन्डेशन , पूर्वांचल भोजपुरी महासभा , अउर हर्फ - मीडिया के सहभागिता बा ।"]
lemma_df = nlu.load('bh.lemma').predict(text, output_level = "document")
lemma_df.lemma.values[0]

Results

{'lemma': [Annotation(token, 0, 1, एह, {'sentence': '0'}),
  Annotation(token, 3, 7, आयोजन, {'sentence': '0'}),
  Annotation(token, 9, 11, में, {'sentence': '0'}),
  Annotation(token, 13, 17, विश्व, {'sentence': '0'}),
  Annotation(token, 19, 25, भोजपुरी, {'sentence': '0'}),
  Annotation(token, 27, 33, सम्मेलन, {'sentence': '0'}),
  Annotation(token, 35, 35, COMMA, {'sentence': '0'}),
  Annotation(token, 37, 45, पूर्वांचल, {'sentence': '0'}),
  Annotation(token, 47, 50, एकता, {'sentence': '0'}),
  Annotation(token, 52, 54, मंच, {'sentence': '0'}),
  Annotation(token, 56, 56, COMMA, {'sentence': '0'}),
  Annotation(token, 58, 60, वीर, {'sentence': '0'}),
  Annotation(token, 62, 66, कुँवर, {'sentence': '0'}),
  Annotation(token, 68, 71, सिंह, {'sentence': '0'}),
  Annotation(token, 73, 81, फाउन्डेशन, {'sentence': '0'}),
  Annotation(token, 83, 83, COMMA, {'sentence': '0'}),
  Annotation(token, 85, 93, पूर्वांचल, {'sentence': '0'}),
  Annotation(token, 95, 101, भोजपुरी, {'sentence': '0'}),
  Annotation(token, 103, 108, महासभा, {'sentence': '0'}),
  Annotation(token, 110, 110, COMMA, {'sentence': '0'}),
  Annotation(token, 112, 114, अउर, {'sentence': '0'}),
  Annotation(token, 116, 119, हर्फ, {'sentence': '0'}),
  Annotation(token, 121, 121, -, {'sentence': '0'}),
  Annotation(token, 123, 128, मीडिया, {'sentence': '0'}),
  Annotation(token, 130, 131, को, {'sentence': '0'}),
  Annotation(token, 133, 140, सहभागिता, {'sentence': '0'}),
  Annotation(token, 142, 143, बा, {'sentence': '0'}),
  Annotation(token, 145, 145, ।, {'sentence': '0'})]}
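
Most tokens in the output above are returned unchanged, so it can help to list only the tokens whose lemma differs from the surface form. The following is a minimal sketch, assuming each annotation exposes `begin`, `end` (inclusive) and `result` as shown in the output above. `FakeAnnotation` and `changed_lemmas` are hypothetical names introduced here so the example runs without Spark; they are not part of Spark NLP.

```python
from collections import namedtuple

# Hypothetical stand-in for a Spark NLP annotation, so the sketch runs without Spark
FakeAnnotation = namedtuple("FakeAnnotation", "annotator_type begin end result metadata")


def changed_lemmas(text, annotations):
    """Return (surface form, lemma) pairs where the lemma differs from the text."""
    pairs = []
    for ann in annotations:
        surface = text[ann.begin:ann.end + 1]  # the end offset is inclusive
        if surface != ann.result:
            pairs.append((surface, ann.result))
    return pairs


# Three tokens taken from the results above, with offsets shifted to start at 0
sample_text = "मीडिया के सहभागिता"
sample_annotations = [
    FakeAnnotation("token", 0, 5, "मीडिया", {"sentence": "0"}),
    FakeAnnotation("token", 7, 8, "को", {"sentence": "0"}),
    FakeAnnotation("token", 10, 17, "सहभागिता", {"sentence": "0"}),
]
print(changed_lemmas(sample_text, sample_annotations))
```

With real output, the same function would presumably be called as `changed_lemmas(text, results[0]["lemma"])`, where `text` is the sentence passed to `fullAnnotate`.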

Model Information

Model Name: lemma
Compatibility: Spark NLP 2.7.0+
Edition: Official
Input Labels: [token]
Output Labels: [lemma]
Language: bh

Data Source

The model was trained on the Universal Dependencies data set version 2.7.

Reference:

  • Ojha, A. K., & Zeman, D. (2020). Universal Dependency Treebanks for Low-Resource Indian Languages: The Case of Bhojpuri. Proceedings of the WILDRE5 – 5th Workshop on Indian Language Data: Resources and Evaluation.