Korean Lemmatizer

Description

This model uses context and linguistic knowledge to map the inflected forms of a word to a single root (its lemma). This lets the pipeline treat the past and present tense of a verb, for example, as the same word rather than as two unrelated words. When a word form alone is ambiguous, the lemmatizer uses the surrounding context to decide which root is correct.
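To make the effect of mapping forms to one root concrete, here is a minimal sketch in plain Python. The lookup table `TOY_LEMMAS` and the helper `toy_lemmatize` are hypothetical names invented for this illustration; they are not part of Spark NLP and do not reflect how the pretrained model resolves lemmas or uses context.

```python
from collections import Counter

# Toy lookup table, invented for illustration; not the pretrained model's data.
TOY_LEMMAS = {"walked": "walk", "walks": "walk", "walking": "walk"}


def toy_lemmatize(tokens):
    """Map each token to its lemma, leaving unknown tokens unchanged."""
    return [TOY_LEMMAS.get(token, token) for token in tokens]


tokens = ["walked", "walks", "walk", "home"]

# Without lemmatization every surface form is counted separately.
surface_counts = Counter(tokens)

# After lemmatization the three verb forms collapse into one entry.
lemma_counts = Counter(toy_lemmatize(tokens))

print(surface_counts["walk"], lemma_counts["walk"])
```

Counting before and after shows why downstream stages such as search or term statistics tend to benefit: the inflected forms are pooled under a single key.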


How to use

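import sparknlp
from sparknlp.base import DocumentAssembler, LightPipeline
from sparknlp.annotator import WordSegmenterModel, LemmatizerModel
from pyspark.ml import Pipeline

spark = sparknlp.start()

# Wrap the raw text column in a document annotation.
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
# Segment the unspaced text into tokens before lemmatizing.
word_segmenter = WordSegmenterModel.pretrained("wordseg_kaist_ud", "ko") \
    .setInputCols(["document"]) \
    .setOutputCol("token")
lemmatizer = LemmatizerModel.pretrained("lemma", "ko") \
    .setInputCols(["token"]) \
    .setOutputCol("lemma")
nlp_pipeline = Pipeline(stages=[document_assembler, word_segmenter, lemmatizer])
# Fit on an empty DataFrame: every stage is pretrained, so no training data is needed.
light_pipeline = LightPipeline(nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text")))
results = light_pipeline.fullAnnotate(["이렇게되면이러한인간형을다투어본받으려할것이틀림없다."])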

import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
// Assumes an active SparkSession named spark; needed for toDF on a Seq.
import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")
val word_segmenter = WordSegmenterModel.pretrained("wordseg_kaist_ud", "ko")
    .setInputCols("document")
    .setOutputCol("token")
val lemmatizer = LemmatizerModel.pretrained("lemma", "ko")
    .setInputCols("token")
    .setOutputCol("lemma")
val pipeline = new Pipeline().setStages(Array(document_assembler, word_segmenter, lemmatizer))
val data = Seq("이렇게되면이러한인간형을다투어본받으려할것이틀림없다.").toDF("text")
val result = pipeline.fit(data).transform(data)

import nlu

text = ["이렇게되면이러한인간형을다투어본받으려할것이틀림없다."]
lemma_df = nlu.load('ko.lemma').predict(text, output_level = "document")
lemma_df.lemma.values[0]

Results

{'lemma': [Annotation(token, 0, 2, 이렇게, {'sentence': '0'}),
   Annotation(token, 3, 4, 되+면, {'sentence': '0'}),
   Annotation(token, 5, 7, 이러한+ㄴ, {'sentence': '0'}),
   Annotation(token, 8, 11, 인간형+을, {'sentence': '0'}),
   Annotation(token, 12, 15, 다투어본, {'sentence': '0'}),
   Annotation(token, 16, 18, 받으할, {'sentence': '0'}),
   Annotation(token, 18, 18, 려, {'sentence': '0'}),
   Annotation(token, 20, 21, 것+이, {'sentence': '0'}),
   Annotation(token, 22, 25, 틀림없+다, {'sentence': '0'}),
   Annotation(token, 26, 26, ., {'sentence': '0'})]}
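The output above can be post-processed into plain Python values. The sketch below assumes that "+" separates the morphemes inside a lemma, which is how it appears in the Results but is not confirmed by the model documentation. `FakeAnnotation` and `lemma_morphemes` are hypothetical names introduced here; the stand-in class only mimics the `begin`, `end`, and `result` attributes of a Spark NLP `Annotation` so the example runs without a Spark session.

```python
from collections import namedtuple

# Stand-in for sparknlp's Annotation so this sketch runs without Spark.
# Only the begin, end, and result attributes are mimicked.
FakeAnnotation = namedtuple("FakeAnnotation", ["begin", "end", "result"])


def lemma_morphemes(annotations):
    """Return (begin, end, [morphemes]) for each lemma annotation.

    Assumes "+" separates morphemes, as it appears to in the Results above.
    """
    return [(a.begin, a.end, a.result.split("+")) for a in annotations]


# A few entries copied from the Results above.
sample = [
    FakeAnnotation(0, 2, "이렇게"),
    FakeAnnotation(3, 4, "되+면"),
    FakeAnnotation(8, 11, "인간형+을"),
]

rows = lemma_morphemes(sample)
for begin, end, morphemes in rows:
    print(begin, end, morphemes)
```

With the real pipeline output from the Python example, the same helper would be called as `lemma_morphemes(results[0]["lemma"])`, since `fullAnnotate` returns one dictionary per input string.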

Model Information

Model Name:	lemma
Compatibility:	Spark NLP 2.7.0+
Edition:	Official
Input Labels:	[token]
Output Labels:	[lemma]
Language:	ko

Data Source

The model was trained on the Universal Dependencies treebank from the Korea Advanced Institute of Science and Technology (KAIST).

Reference:

Building Universal Dependency Treebanks in Korean, Jayeol Chun, Na-Rae Han, Jena D. Hwang, and Jinho D. Choi. In Proceedings of the 11th International Conference on Language Resources and Evaluation, LREC’18, Miyazaki, Japan, 2018.
