Part of Speech for Chinese

Description

This model annotates the part of speech of tokens in a text. The parts of speech annotated include PN (pronoun), CC (coordinating conjunction), and 39 others. The part of speech model is useful for extracting the grammatical structure of a piece of text automatically.

Predicted Entities

AD, AS, BA, CC, CD, CS, DEC, DEG, DER, DEV, DT, EM, ETC, FW, IC, IJ, JJ, LB, LC, M, MSP, MSP-2, NN, NN-SHORT, NOI, NR, NR-SHORT, NT, NT-SHORT, OD, ON, P, PN, PU, SB, SP, URL, VA, VC, VE, and VV

Live Demo Open in Colab Download

How to use

Use as part of an nlp pipeline after tokenization.

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

word_segmenter = WordSegmenterModel.pretrained("wordseg_large", "zh")\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

pos = PerceptronModel.pretrained("pos_ctb9", zh") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("pos")

pipeline = Pipeline(stages=[document_assembler, sentence_detector, word_segmenter, pos])

example = spark.createDataFrame(pd.DataFrame({'text': ["""然而,这样的处理也衍生了一些问题。"""]}))

result = pipeline.fit(example).transform(example)

val document_assembler = DocumentAssembler()
        .setInputCol("text")
        .setOutputCol("document")

val sentence_detector = SentenceDetector()
        .setInputCols(["document"])
        .setOutputCol("sentence")

val word_segmenter = WordSegmenterModel.pretrained("wordseg_large", "zh")
        .setInputCols(["sentence"])
        .setOutputCol("token")
        
val pos = PerceptronModel.pretrained("pos_ctb9", "zh")
    .setInputCols(Array("document", "token"))
    .setOutputCol("pos")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, word_segmenter, pos))

val result = pipeline.fit(Seq.empty["然而这样的处理也衍生了一些问题。"].toDS.toDF("text")).transform(data)
import nlu

text = ["""然而,这样的处理也衍生了一些问题。"""]
pos_df = nlu.load('zh.pos.ctb9').predict(text, output_level='token')
pos_df

Results

+-----+---+
|token|pos_tag|
+-----+---+
|然而 |AD |
|,    |PU |
|这样 |PN |
|的   |DEG|
|处理 |NN |
|也   |AD |
|衍生 |VV |
|了   |AS |
|一些 |CD |
|问题 |NN |
|。   |PU |
+-----+---+

Model Information

Model Name: pos_ctb9
Compatibility: Spark NLP 2.7.0+
Edition: Official
Input Labels: [sentence, token]
Output Labels: [pos]
Language: zh

Data Source

The model was trained on the Microsoft Research Asia (MSRA) data set available on the Second International Chinese Word Segmentation Bakeoff SIGHAN 2005

Benchmarking

| Tag          | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| AD           | 0.92      | 0.94   | 0.93     | 23017   |
| AS           | 0.92      | 0.93   | 0.92     | 2415    |
| BA           | 0.95      | 0.94   | 0.95     | 376     |
| CC           | 0.85      | 0.82   | 0.83     | 2388    |
| CD           | 0.95      | 0.95   | 0.95     | 6758    |
| CS           | 0.93      | 0.92   | 0.92     | 525     |
| DEC          | 0.75      | 0.78   | 0.76     | 4422    |
| DEG          | 0.86      | 0.85   | 0.85     | 6094    |
| DER          | 0.87      | 0.79   | 0.83     | 251     |
| DEV          | 0.86      | 0.73   | 0.79     | 291     |
| DT           | 0.94      | 0.91   | 0.92     | 4409    |
| EM           | 0.74      | 0.72   | 0.73     | 32      |
| ETC          | 0.97      | 0.98   | 0.97     | 294     |
| FW           | 0.00      | 0.00   | 0.00     | 2       |
| IC           | 0.11      | 0.01   | 0.02     | 117     |
| IJ           | 0.95      | 0.95   | 0.95     | 2153    |
| JJ           | 0.80      | 0.77   | 0.78     | 4693    |
| LB           | 0.78      | 0.70   | 0.73     | 105     |
| LC           | 0.95      | 0.97   | 0.96     | 2660    |
| M            | 0.97      | 0.97   | 0.97     | 6467    |
| MSP          | 0.83      | 0.82   | 0.83     | 428     |
| MSP-2        | 0.00      | 0.00   | 0.00     | 1       |
| NN           | 0.93      | 0.94   | 0.93     | 49159   |
| NN-SHORT     | 0.00      | 0.00   | 0.00     | 1       |
| NOI          | 0.17      | 0.06   | 0.09     | 48      |
| NR           | 0.95      | 0.90   | 0.93     | 13220   |
| NR-SHORT     | 0.00      | 0.00   | 0.00     | 8       |
| NT           | 0.95      | 0.95   | 0.95     | 3723    |
| NT-SHORT     | 1.00      | 0.20   | 0.33     | 5       |
| OD           | 0.91      | 0.86   | 0.88     | 587     |
| ON           | 0.33      | 0.23   | 0.27     | 13      |
| P            | 0.90      | 0.92   | 0.91     | 7442    |
| PN           | 0.95      | 0.95   | 0.95     | 11011   |
| PU           | 0.99      | 0.99   | 0.99     | 36717   |
| SB           | 0.89      | 0.90   | 0.90     | 197     |
| SP           | 0.91      | 0.90   | 0.91     | 5466    |
| URL          | 0.75      | 0.75   | 0.75     | 8       |
| VA           | 0.84      | 0.83   | 0.83     | 4866    |
| VC           | 0.98      | 0.98   | 0.98     | 4434    |
| VE           | 0.95      | 0.95   | 0.95     | 2274    |
| VV           | 0.91      | 0.91   | 0.91     | 35240   |
|              |           |        |          |         |
| accuracy     | 0.93      | 242317 |          |         |
| macro avg    | 0.76      | 0.72   | 0.73     | 242317  |
| weighted avg | 0.93      | 0.93   | 0.93     | 242317  |