Word Segmenter for Korean


WordSegmenterModel (WSM) uses a maximum entropy probability model to detect word boundaries in Korean text. Korean whitespace delimits phrase units (eojeol) rather than individual words, so a computer-based application cannot know a priori which sequences of characters form a word. Many natural language processing tasks, such as part-of-speech (POS) tagging and named entity recognition (NER), require word segmentation as an initial step.
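To make the idea of boundary detection concrete, here is a minimal sketch of how per-character decisions could be turned into words. It assumes a simple binary scheme in which each character is marked as either starting a new word or continuing the previous one. The function name `decode_boundaries` and the flags are hypothetical, chosen only for illustration; they are not WSM's actual label set or its predictions.

```python
def decode_boundaries(chars, starts_word):
    """Group characters into words, given one boolean per character.

    starts_word[i] is True when chars[i] begins a new word (assumed scheme).
    """
    if len(chars) != len(starts_word):
        raise ValueError("need exactly one boundary flag per character")
    words = []
    for ch, is_start in zip(chars, starts_word):
        if is_start or not words:
            # Open a new word at a boundary (or at the very first character).
            words.append(ch)
        else:
            # Otherwise extend the word currently being built.
            words[-1] += ch
    return words


text = "안녕하세요"
# Hand-written flags for illustration only, not classifier output.
flags = [True, True, True, False, False]
words = decode_boundaries(text, flags)
```

In a real segmenter the flags would come from the trained classifier, which scores each character position using features of the surrounding context.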


How to use

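from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import WordSegmenterModel

# Assumes an active Spark NLP session bound to `spark`.
document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
word_segmenter = WordSegmenterModel.pretrained("wordseg_kaist_ud", "ko") \
    .setInputCols(["document"]) \
    .setOutputCol("token")
pipeline = Pipeline(stages=[document_assembler, word_segmenter])
ws_model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
example = spark.createDataFrame([["John Snow Labs에서 안녕하세요! "]]).toDF("text")
result = ws_model.transform(example)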
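Once the pipeline has run, the `token` column holds one annotation record per segmented word. The sketch below shows one way to pull the token text and character offsets out of such records. The helper name `token_spans` is hypothetical, and the sample records are hand-written stand-ins shaped like Spark NLP annotations (with inclusive `end` offsets), not actual model output.

```python
def token_spans(annotations):
    """Return (text, begin, end) for each annotation record.

    Works with pyspark Row objects or plain dicts, since both support
    lookup by field name.
    """
    return [(a["result"], a["begin"], a["end"]) for a in annotations]


# Stand-in records for illustration; values are assumptions, not model output.
sample = [
    {"annotatorType": "token", "begin": 0, "end": 0, "result": "안", "metadata": {}},
    {"annotatorType": "token", "begin": 1, "end": 1, "result": "녕", "metadata": {}},
    {"annotatorType": "token", "begin": 2, "end": 4, "result": "하세요", "metadata": {}},
]
spans = token_spans(sample)

# With a live session, the same helper could be applied to a collected row:
#   row = result.select("token").first()
#   spans = token_spans(row["token"])
```

If only the token strings are needed, selecting `token.result` on the DataFrame avoids collecting the full annotation structs.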

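// Assumes a SparkSession `spark`, `import spark.implicits._`, and a DocumentAssembler `document_assembler` that outputs "document".
val word_segmenter = WordSegmenterModel.pretrained("wordseg_kaist_ud", "ko")
    .setInputCols(Array("document")).setOutputCol("token")
val pipeline = new Pipeline().setStages(Array(document_assembler, word_segmenter))
val data = Seq("John Snow Labs에서 안녕하세요! ").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)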

import nlu
text = ["John Snow Labs에서 안녕하세요! "]
token_df = nlu.load('ko.segment_words').predict(text)


0       J
1       o
2       h
3       n
4       S
5       n
6       o
7       w
8       L
9       a
10      b
11      s
12      에
13      서
14      안
15      녕
16    하세요
17      !
Name: token, dtype: object

Model Information

Model Name: wordseg_kaist_ud
Compatibility: Spark NLP 3.0.0+
License: Open Source
Edition: Official
Input Labels: [document]
Output Labels: [words_segmented]
Language: ko