Named Entity Recognition for Japanese (XLM-RoBERTa)

Description

This model annotates named entities in a text, which can be used to find features such as names of people, places, and organizations. The model does not read words directly; instead it reads word embeddings, which represent words as points in a vector space such that more semantically similar words are closer together.

This model uses the pretrained XlmRoBertaEmbeddings model “xlm_roberta_base” as input, so be sure to use the same embeddings in the pipeline.
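
For example, the embeddings stage can be declared explicitly rather than relying on the default. A minimal Python sketch, assuming the standard "xlm_roberta_base" model name under the multilingual "xx" language code (the same model a no-argument pretrained() call resolves to):

from sparknlp.annotator import XlmRoBertaEmbeddings

# Load the same base embeddings the NER model was trained with.
embeddings = XlmRoBertaEmbeddings.pretrained("xlm_roberta_base", "xx") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")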

Predicted Entities

ORDINAL, PERSON, LAW, MOVEMENT, LOC, WORK_OF_ART, DATE, NORP, TITLE_AFFIX, QUANTITY, FAC, TIME, MONEY, LANGUAGE, GPE, EVENT, ORG, PERCENT, PRODUCT

How to use

Python

import sparknlp
from pyspark.ml import Pipeline
from sparknlp.annotator import *
from sparknlp.base import *
from sparknlp.training import *

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud", "ja") \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

embeddings = XlmRoBertaEmbeddings.pretrained() \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings")

nerTagger = NerDLModel.pretrained("ner_ud_gsd_xlm_roberta_base", "ja") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

pipeline = Pipeline().setStages(
    [
        documentAssembler,
        sentence,
        word_segmenter,
        embeddings,
        nerTagger,
    ]
)

data = spark.createDataFrame([["宮本茂氏は、日本の任天堂のゲームプロデューサーです。"]]).toDF("text")
model = pipeline.fit(data)
result = model.transform(data)
result.selectExpr("explode(arrays_zip(token.result, ner.result))").show()

Scala

import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.{NerDLModel, SentenceDetector, WordSegmenterModel}
import com.johnsnowlabs.nlp.embeddings.XlmRoBertaEmbeddings
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud", "ja")
  .setInputCols("sentence")
  .setOutputCol("token")

val embeddings = XlmRoBertaEmbeddings.pretrained("xlm_roberta_base", "xx")
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")

val nerTagger = NerDLModel.pretrained("ner_ud_gsd_xlm_roberta_base", "ja")
  .setInputCols("sentence", "token", "embeddings")
  .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentence,
  word_segmenter,
  embeddings,
  nerTagger
))

val data = Seq("宮本茂氏は、日本の任天堂のゲームプロデューサーです。").toDF("text")
val model = pipeline.fit(data)
val result = model.transform(data)

result.selectExpr("explode(arrays_zip(token.result, ner.result))").show()
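
For quick inference on raw strings without building a DataFrame, the fitted pipeline can also be wrapped in a LightPipeline. A minimal Python sketch, assuming the fitted model from the Python example above:

from sparknlp.base import LightPipeline

# LightPipeline runs the fitted pipeline directly on plain strings.
light_model = LightPipeline(model)
annotations = light_model.annotate("宮本茂氏は、日本の任天堂のゲームプロデューサーです。")
print(list(zip(annotations["token"], annotations["ner"])))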

Results

+-------------------+                                                           
|                col|
+-------------------+
|   {宮本, B-PERSON}|
|     {茂, I-PERSON}|
|            {氏, O}|
|            {は, O}|
|            {、, O}|
|      {日本, B-GPE}|
|            {の, O}|
|      {任天, B-ORG}|
|        {堂, I-ORG}|
|            {の, O}|
|        {ゲーム, O}|
|{プロデューサー, O}|
|          {です, O}|
|            {。, O}|
+-------------------+
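
The B- and I- prefixes mark the beginning and continuation of multi-token entities (here 任天 + 堂 form a single ORG mention). To merge these tags into full entity chunks, a NerConverter stage can be appended to the pipeline. A minimal Python sketch, assuming the stages defined in the Python example above:

from sparknlp.annotator import NerConverter

# Groups consecutive B-/I- tagged tokens into single entity chunks.
ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("ner_chunk")

chunk_pipeline = Pipeline().setStages(
    [documentAssembler, sentence, word_segmenter, embeddings, nerTagger, ner_converter]
)

chunk_result = chunk_pipeline.fit(data).transform(data)
chunk_result.selectExpr("explode(ner_chunk.result)").show(truncate=False)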

Model Information

Model Name: ner_ud_gsd_xlm_roberta_base
Type: ner
Compatibility: Spark NLP 3.2.2+
License: Open Source
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: ja
Dependencies: xlm_roberta_base

Data Source

The model was trained on the Universal Dependencies Japanese GSD treebank, curated by Google. An NER-annotated version was created by megagonlabs:

https://github.com/megagonlabs/UD_Japanese-GSD

Reference:

Asahara, M., Kanayama, H., Tanaka, T., Miyao, Y., Uematsu, S., Mori, S., Matsumoto, Y., Omura, M., & Murawaki, Y. (2018). Universal Dependencies Version 2 for Japanese. In LREC-2018.

Benchmarking

              precision    recall  f1-score   support

        DATE       0.93      0.97      0.95       206
       EVENT       0.78      0.48      0.60        52
         FAC       0.80      0.68      0.73        59
         GPE       0.88      0.81      0.85       102
    LANGUAGE       1.00      1.00      1.00         8
         LAW       0.82      0.69      0.75        13
         LOC       0.87      0.83      0.85        41
       MONEY       1.00      1.00      1.00        20
    MOVEMENT       0.67      0.55      0.60        11
        NORP       0.84      0.86      0.85        57
           O       0.99      0.99      0.99     11785
     ORDINAL       0.94      0.94      0.94        32
         ORG       0.71      0.78      0.74       179
     PERCENT       1.00      1.00      1.00        16
      PERSON       0.89      0.90      0.89       127
     PRODUCT       0.56      0.68      0.61        50
    QUANTITY       0.92      0.96      0.94       172
        TIME       0.91      1.00      0.96        32
 TITLE_AFFIX       0.86      0.75      0.80        24
 WORK_OF_ART       0.87      0.85      0.86        48

    accuracy                           0.98     13034
   macro avg       0.86      0.84      0.85     13034
weighted avg       0.98      0.98      0.98     13034