Part of Speech for Thai


A Part of Speech classifier predicts a grammatical label for every token in the input text. This model is implemented with an averaged perceptron architecture.
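To make the term "averaged perceptron" concrete, here is a minimal sketch of the general technique in plain Python. It is not Spark NLP's implementation: the class `AveragedPerceptron`, the helper `features_for`, the feature templates, and the four-token training set are all hypothetical and chosen only for illustration. The idea is that each mistake nudges the weights toward the correct tag, and the final weights are averaged over all training steps to reduce overfitting to late examples.

```python
from collections import defaultdict


class AveragedPerceptron:
    """Toy multi-class averaged perceptron (illustrative only)."""

    def __init__(self, classes):
        self.classes = sorted(classes)
        self.weights = defaultdict(float)  # (feature, tag) -> current weight
        self._totals = defaultdict(float)  # (feature, tag) -> weight summed over steps
        self._stamps = defaultdict(int)    # (feature, tag) -> step of last change
        self._step = 0

    def predict(self, features):
        scores = {tag: 0.0 for tag in self.classes}
        for feat in features:
            for tag in self.classes:
                scores[tag] += self.weights.get((feat, tag), 0.0)
        # max() keeps the first maximum, so ties go to the alphabetically first tag.
        return max(self.classes, key=lambda tag: scores[tag])

    def update(self, truth, guess, features):
        self._step += 1
        if truth == guess:
            return
        for feat in features:
            self._bump((feat, truth), +1.0)
            self._bump((feat, guess), -1.0)

    def _bump(self, key, delta):
        # Credit the old weight for every step it stayed unchanged, then change it.
        self._totals[key] += (self._step - self._stamps[key]) * self.weights[key]
        self._stamps[key] = self._step
        self.weights[key] += delta

    def average(self):
        # Replace each weight by its mean value over all training steps.
        for key in list(self.weights):
            total = self._totals[key] + (self._step - self._stamps[key]) * self.weights[key]
            self.weights[key] = total / self._step if self._step else 0.0


def features_for(tokens, i):
    """Hypothetical feature templates: current word, previous word, punctuation flag."""
    word = tokens[i]
    prev_word = tokens[i - 1] if i > 0 else "<s>"
    return [f"word={word}", f"prev={prev_word}", f"is_punct={not word.isalnum()}"]


# Tiny made-up training set, reusing tokens from the sample sentence on this page.
tokens = ["John", "Snow", "Labs", "!"]
tags = ["NN", "NN", "NN", "PU"]

model = AveragedPerceptron(classes={"NN", "PU"})
for _ in range(5):  # five passes over the data
    for i, gold in enumerate(tags):
        feats = features_for(tokens, i)
        model.update(gold, model.predict(feats), feats)
model.average()

predicted = [model.predict(features_for(tokens, i)) for i in range(len(tokens))]
print(predicted)
```

A real tagger would use far richer features (suffixes, neighbouring tags, word shapes) and a large annotated corpus; the sketch only shows the update and averaging steps.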

Predicted Entities

  • NN
  • VV
  • PS
  • NG
  • NU
  • CL
  • PU
  • CC
  • AX
  • AV
  • FX
  • AJ
  • PR
  • PA
  • IJ
  • XX
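The two-letter tags above are compact, so a lookup table can help when reading predictions. The glosses below are an assumption: they follow the tag names as commonly described for the LST20 corpus (which the model name `pos_lst20` suggests) and are not stated on this page, so verify them against the corpus guideline before relying on them. The names `LST20_POS_TAGS` and `describe` are hypothetical.

```python
# Assumed glosses for the 16 predicted tags (not taken from this page; please verify).
LST20_POS_TAGS = {
    "NN": "noun",
    "VV": "verb",
    "PS": "preposition",
    "NG": "negator",
    "NU": "number",
    "CL": "classifier",
    "PU": "punctuation",
    "CC": "conjunction",
    "AX": "auxiliary",
    "AV": "adverb",
    "FX": "prefix",
    "AJ": "adjective",
    "PR": "pronoun",
    "PA": "particle",
    "IJ": "interjection",
    "XX": "other",
}


def describe(tag):
    """Return a human-readable gloss for a tag, or a fallback for unknown tags."""
    return LST20_POS_TAGS.get(tag, "unknown tag")


print(describe("NN"))
print(describe("PU"))
```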

How to use

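import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, PerceptronModel
from pyspark.ml import Pipeline

spark = sparknlp.start()

document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

sentence_detector = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")

# Assumption: a Tokenizer stage supplies the "token" column that the tagger reads.
tokenizer = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")

pos = PerceptronModel.pretrained("pos_lst20", "th") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("pos")

pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, pos])

example = spark.createDataFrame([["สวัสดีจาก John Snow Labs! "]]).toDF("text")

result = pipeline.fit(example).transform(example)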

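val document_assembler = new DocumentAssembler()
        .setInputCol("text")
        .setOutputCol("document")

val sentence_detector = new SentenceDetector()
        .setInputCols("document")
        .setOutputCol("sentence")

// Assumption: a Tokenizer stage supplies the "token" column that the tagger reads.
val tokenizer = new Tokenizer().setInputCols("sentence").setOutputCol("token")

val pos = PerceptronModel.pretrained("pos_lst20", "th")
        .setInputCols(Array("document", "token"))
        .setOutputCol("pos")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, pos))

import spark.implicits._  // provides toDF on Seq
val data = Seq("สวัสดีจาก John Snow Labs! ").toDF("text")
val result = pipeline.fit(data).transform(data)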

import nlu
text = ["สวัสดีจาก John Snow Labs! "]
token_df = nlu.load('th.pos').predict(text)


       token pos
0  สวัสดีจาก  CC
1       John  NN
2       Snow  NN
3       Labs  NN
4          !  PU
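Once predictions are available as token/tag pairs, simple post-processing is often the next step. The sketch below assumes the pairs from the table above have already been copied into a plain Python list named `rows` (a hypothetical name), and shows how to count tags and group tokens by tag using only the standard library.

```python
from collections import Counter, defaultdict

# Token/tag pairs copied from the sample output above.
rows = [
    ("สวัสดีจาก", "CC"),
    ("John", "NN"),
    ("Snow", "NN"),
    ("Labs", "NN"),
    ("!", "PU"),
]

# How often each tag was predicted.
tag_counts = Counter(tag for _, tag in rows)

# Tokens grouped under their predicted tag, in sentence order.
tokens_by_tag = defaultdict(list)
for token, tag in rows:
    tokens_by_tag[tag].append(token)

print(tag_counts.most_common())
print(dict(tokens_by_tag))
```

The same grouping could be done directly on a DataFrame; plain lists are used here only to keep the example self-contained.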

Model Information

Model Name: pos_lst20
Compatibility: Spark NLP 3.0.0+
License: Open Source
Edition: Official
Input Labels: [document, token]
Output Labels: [pos]
Language: th