Named Entity Recognition for Japanese (GloVe 840B 300d)

Description

This model annotates named entities in text and can be used to find features such as the names of people, places, and organizations. The model does not read words directly; it reads word embeddings, which represent words as points in a vector space such that semantically similar words lie closer together.
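To make the "closer together" idea concrete, here is a minimal sketch that compares word vectors with cosine similarity. The three-dimensional vectors are invented for illustration only; they are not taken from glove_840B_300, whose vectors have 300 dimensions, and the names `cosine_similarity` and `toy_embeddings` are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; values near 1.0 mean similar direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors (assumption: invented values, not real GloVe output)
toy_embeddings = {
    "東京": [0.9, 0.1, 0.0],
    "大阪": [0.8, 0.2, 0.1],
    "音楽": [0.0, 0.2, 0.9],
}

sim_cities = cosine_similarity(toy_embeddings["東京"], toy_embeddings["大阪"])
sim_city_music = cosine_similarity(toy_embeddings["東京"], toy_embeddings["音楽"])

# The two city names point in nearly the same direction; the unrelated word does not
print(sim_cities > sim_city_music)
```

With real embeddings the same comparison applies, only over many more dimensions.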

This model takes the pre-trained glove_840B_300 embeddings from the WordEmbeddings annotator as input, so be sure to use the same embeddings in the pipeline.

Predicted Entities

DATE, EVENT, FAC, GPE, LANGUAGE, LAW, LOC, MONEY, MOVEMENT, NORP, ORDINAL, ORG, PERCENT, PERSON, PRODUCT, QUANTITY, TIME, TITLE_AFFIX, and WORK_OF_ART.


How to use

...
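# Split each sentence into tokens (Japanese has no whitespace between words)
word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud", "ja") \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")
# Look up a GloVe vector for every token; the NER model expects these exact embeddings
embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", "xx") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")
# Tag each token with an entity label
ner = NerDLModel.pretrained("ner_ud_gsd_glove_840B_300d", "ja") \
    .setInputCols(["document", "token", "embeddings"]) \
    .setOutputCol("ner")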
...
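# Chain the stages in order; document_assembler, sentence_detector, and ner_converter come from the elided parts
pipeline = Pipeline(stages=[document_assembler, sentence_detector, word_segmenter, embeddings, ner, ner_converter])
# Build a one-row DataFrame with a single "text" column (no pandas needed)
example = spark.createDataFrame([["""5月13日に放送されるフジテレビ系「僕らの音楽」にて、福原美穂とAIという豪華共演が決定した。"""]]).toDF("text")
result = pipeline.fit(example).transform(example)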
...
val word_segmenter = WordSegmenterModel.pretrained("wordseg_gsd_ud", "ja")
        .setInputCols(Array("sentence"))
        .setOutputCol("token")
val embeddings = WordEmbeddingsModel.pretrained("glove_840B_300", "xx")
          .setInputCols(Array("document", "token"))
          .setOutputCol("embeddings")
val ner = NerDLModel.pretrained("ner_ud_gsd_glove_840B_300d", "ja")
        .setInputCols(Array("document", "token", "embeddings"))
        .setOutputCol("ner")
...
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, word_segmenter, embeddings, ner, ner_converter))
// Build a one-row DataFrame with a single "text" column
val data = Seq("5月13日に放送されるフジテレビ系「僕らの音楽」にて、福原美穂とAIという豪華共演が決定した。").toDS.toDF("text")
val result = pipeline.fit(data).transform(data)

Results

+----------+------+
|token     |ner   |
+----------+------+
|5月       |DATE  |
|13日      |DATE  |
|に        |O     |
|放送      |O     |
|さ        |O     |
|れる      |O     |
|フジテレビ|O     |
|系        |O     |
|「        |O     |
|僕らの音楽|O     |
|」        |O     |
|にて      |O     |
|、        |O     |
|福原美穂  |PERSON|
|と        |O     |
|AI        |O     |
|と        |O     |
|いう      |O     |
|豪華      |O     |
|共演      |O     |
|が        |O     |
|決定      |O     |
|し        |O     |
|た        |O     |
|。        |O     |
+----------+------+
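
The table above lists one label per token, so a multi-token entity such as 5月 and 13日 appears as two rows. As a rough illustration of how such rows can be merged into entity spans, the sketch below joins runs of adjacent tokens that share the same non-O label. It is plain Python over a shortened copy of the table, not Spark NLP code; the function name `group_entities` is hypothetical, and in the actual pipeline the ner_converter stage presumably takes care of this step.

```python
from itertools import groupby

def group_entities(tokens, tags):
    """Merge runs of adjacent tokens that share the same non-O tag.

    Assumption: tags carry no B-/I- prefixes (as in the table above), so two
    different entities of the same type standing side by side would be merged.
    """
    entities = []
    pairs = zip(tokens, tags)
    for tag, run in groupby(pairs, key=lambda pair: pair[1]):
        if tag == "O":
            continue
        # Japanese text has no spaces between words, so join without a separator
        text = "".join(token for token, _ in run)
        entities.append((text, tag))
    return entities

# Shortened copy of the rows shown in the Results table
tokens = ["5月", "13日", "に", "放送", "福原美穂", "と", "AI"]
tags = ["DATE", "DATE", "O", "O", "PERSON", "O", "O"]

print(group_entities(tokens, tags))
```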

Model Information

Model Name: ner_ud_gsd_glove_840B_300d
Type: ner
Compatibility: Spark NLP 2.7.0+
License: Open Source
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: ja

Data Source

The model was trained on the Universal Dependencies dataset curated by Google.

Reference:

Asahara, M., Kanayama, H., Tanaka, T., Miyao, Y., Uematsu, S., Mori, S., Matsumoto, Y., Omura, M., & Murawaki, Y. (2018). Universal Dependencies Version 2 for Japanese. In LREC-2018.

Benchmarking

|    ner_tag   | precision | recall | f1-score | support |
|:------------:|:---------:|:------:|:--------:|:-------:|
|     DATE     |    1.00   |  0.86  |   0.92   |    84   |
|     EVENT    |    1.00   |  0.14  |   0.25   |    14   |
|      FAC     |    1.00   |  0.15  |   0.26   |    20   |
|      GPE     |    1.00   |  0.01  |   0.02   |    82   |
|   LANGUAGE   |    0.00   |  0.00  |   0.00   |    6    |
|      LAW     |    0.00   |  0.00  |   0.00   |    3    |
|      LOC     |    0.00   |  0.00  |   0.00   |    25   |
|     MONEY    |    0.86   |  0.86  |   0.86   |    7    |
|   MOVEMENT   |    0.00   |  0.00  |   0.00   |    4    |
|     NORP     |    1.00   |  0.11  |   0.19   |    28   |
|    ORDINAL   |    0.92   |  0.85  |   0.88   |    13   |
|      ORG     |    0.44   |  0.35  |   0.39   |    75   |
|    PERCENT   |    1.00   |  1.00  |   1.00   |    7    |
|    PERSON    |    0.71   |  0.06  |   0.10   |    89   |
|    PRODUCT   |    0.42   |  0.48  |   0.45   |    23   |
|   QUANTITY   |    0.98   |  0.78  |   0.87   |    78   |
|     TIME     |    1.00   |  1.00  |   1.00   |    13   |
|  TITLE_AFFIX |    0.00   |  0.00  |   0.00   |    20   |
|  WORK_OF_ART |    1.00   |  0.22  |   0.36   |    18   |
|   accuracy   |           |        |   0.97   |  12419  |
|   macro avg  |    0.67   |  0.39  |   0.43   |  12419  |
| weighted avg |    0.96   |  0.97  |   0.96   |  12419  |
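
For readers checking the table, the f1-score column is the harmonic mean of precision and recall. The sketch below recomputes it for a few rows; the helper name `f1_score` is hypothetical, and because the inputs are the rounded values printed above, a recomputed score can differ from the table by about 0.01 for some tags.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs copied from the benchmark table
rows = {
    "DATE": (1.00, 0.86),
    "MONEY": (0.86, 0.86),
    "LOC": (0.00, 0.00),
}

for tag, (precision, recall) in rows.items():
    print(tag, round(f1_score(precision, recall), 2))
```

The low f1 of tags such as GPE (precision 1.00, recall 0.01) follows directly from this formula: the harmonic mean is pulled toward the smaller of the two values.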