Chinese NER Model

Description

This model was imported from Hugging Face and has been fine-tuned for traditional Chinese, leveraging BERT embeddings and BertForTokenClassification for NER purposes.

Predicted Entities

CARDINAL, DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, WORK_OF_ART

How to use

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, BertForTokenClassification, NerConverter
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler()\
      .setInputCol("text")\
      .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
       .setInputCols(["document"])\
       .setOutputCol("sentence")

tokenizer = Tokenizer()\
      .setInputCols(["sentence"])\
      .setOutputCol("token")

tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_chinese_ner", "zh")\
  .setInputCols(["sentence", "token"])\
  .setOutputCol("ner")

ner_converter = NerConverter()\
      .setInputCols(["sentence", "token", "ner"])\
      .setOutputCol("ner_chunk")
      
nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)
text = """我是莎拉,我从 1999 年 11 月 2 日。开始在斯图加特的梅赛德斯-奔驰公司工作。"""
result = model.transform(spark.createDataFrame([[text]]).toDF("text"))
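
The NerConverter stage at the end of the pipeline groups token-level IOB tags (e.g. B-PERSON, I-PERSON) emitted by the classifier into entity chunks such as those shown in the Results table. A minimal pure-Python sketch of that grouping logic, using hypothetical character-level tags for illustration (this is not the Spark NLP implementation):

```python
def iob_to_chunks(tokens, tags):
    """Group (token, IOB tag) pairs into (chunk, label) entities."""
    chunks, current, label = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag starts a new chunk, closing any open one.
            if current:
                chunks.append(("".join(current), label))
            current, label = [tok], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            # An I- tag with a matching label continues the open chunk.
            current.append(tok)
        else:
            # "O" or an inconsistent tag closes any open chunk.
            if current:
                chunks.append(("".join(current), label))
            current, label = [], None
    if current:
        chunks.append(("".join(current), label))
    return chunks

# Hypothetical tags for "我是莎拉" ("I am Sarah")
print(iob_to_chunks(list("我是莎拉"), ["O", "O", "B-PERSON", "I-PERSON"]))
# → [('莎拉', 'PERSON')]
```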
val documentAssembler = DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
       .setInputCols(Array("document"))
       .setOutputCol("sentence")

val tokenizer = Tokenizer()
      .setInputCols(Array("sentence"))
      .setOutputCol("token")

val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_chinese_ner", "zh")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("ner")

val ner_converter = NerConverter()
      .setInputCols(Array("sentence", "token", "ner"))
      .setOutputCol("ner_chunk")
      
val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter))

val example = Seq("我是莎拉,我从 1999 年 11 月 2 日。开始在斯图加特的梅赛德斯-奔驰公司工作。").toDS.toDF("text")

val result = pipeline.fit(example).transform(example)
import nlu
nlu.load("zh.ner.bert_token").predict("""我是莎拉,我从 1999 年 11 月 2 日。开始在斯图加特的梅赛德斯-奔驰公司工作。""")

Results

+-----------------+---------+
|chunk            |ner_label|
+-----------------+---------+
|莎拉             |PERSON   |
|1999 年 11 月 2  |DATE     |
|斯图加特          |GPE      |
|梅赛德斯-奔驰公司  |ORG      |
+-----------------+---------+

Model Information

Model Name: bert_token_classifier_chinese_ner
Compatibility: Spark NLP 3.3.2+
License: Open Source
Edition: Official
Input Labels: [sentence, token]
Output Labels: [ner]
Language: zh
Case sensitive: true
Max sentence length: 256

Data Source

https://huggingface.co/ckiplab/bert-base-chinese-ner

Benchmarking

label   score
f1      0.8118
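
The F1 score is the harmonic mean of precision and recall. As a quick reference, a minimal sketch of the formula (the 0.8118 figure above comes from the source model's evaluation, not from this computation; the precision/recall values below are hypothetical):

```python
def f1(precision, recall):
    """F1 = harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical precision/recall values for illustration
print(round(f1(0.80, 0.82), 4))
# → 0.8099
```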