Detect Entities in 8 languages - WIKINER (ner_wikiner_xlm_roberta_base)

Description

This NER model was trained on the WikiNER datasets covering 8 languages: English, French, German, Italian, Polish, Portuguese, Russian, and Spanish. The XlmRoBertaEmbeddings (xlm_roberta_base) model was used to produce the embeddings for training this NER model.

Predicted Entities

B-LOC, I-LOC, B-ORG, I-ORG, B-PER, I-PER

How to use

import pandas as pd
from pyspark.ml import Pipeline
from sparknlp.base import *
from sparknlp.annotator import *

document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

embeddings = XlmRoBertaEmbeddings\
      .pretrained('xlm_roberta_base', 'xx')\
      .setInputCols(["token", "document"])\
      .setOutputCol("embeddings")

ner_model = NerDLModel.pretrained('ner_wikiner_xlm_roberta_base', 'xx') \
    .setInputCols(['document', 'token', 'embeddings']) \
    .setOutputCol('ner')

ner_converter = NerConverter() \
    .setInputCols(['document', 'token', 'ner']) \
    .setOutputCol('entities')

pipeline = Pipeline(stages=[
    document_assembler, 
    tokenizer,
    embeddings,
    ner_model,
    ner_converter
])

example = spark.createDataFrame(pd.DataFrame({'text': ['My name is John!']}))
result = pipeline.fit(example).transform(example)
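To inspect the predictions, the entity chunks produced by the converter can be pulled out of the result DataFrame. A minimal sketch, assuming the standard Spark NLP annotation schema for the entities column:

result.selectExpr("explode(entities) as entity") \
    .selectExpr("entity.result as chunk", "entity.metadata['entity'] as label") \
    .show(truncate=False)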
import spark.implicits._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

val document_assembler = new DocumentAssembler() 
    .setInputCol("text") 
    .setOutputCol("document")

val tokenizer = new Tokenizer() 
    .setInputCols("document") 
    .setOutputCol("token")

val embeddings = XlmRoBertaEmbeddings.pretrained("xlm_roberta_base", "xx")
    .setInputCols("document", "token") 
    .setOutputCol("embeddings")

val ner_model = NerDLModel.pretrained("ner_wikiner_xlm_roberta_base", "xx") 
    .setInputCols("document"', "token", "embeddings") 
    .setOutputCol("ner")

val ner_converter = new NerConverter() 
    .setInputCols("document", "token", "ner") 
    .setOutputCol("entities")

val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, embeddings, ner_model, ner_converter))
val data = Seq("My name is John!").toDF("text")
val result = pipeline.fit(data).transform(data)
import nlu

text = ["My name is John!"]

ner_df = nlu.load('xx.ner.ner_wikiner_xlm_roberta_base').predict(text, output_level='token')
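Since predict returns a pandas DataFrame with one row per token, the output can be inspected directly. A minimal sketch (the exact column names vary between NLU versions, so checking ner_df.columns first is a safe assumption):

print(ner_df.columns)
print(ner_df.head())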

Model Information

Model Name: ner_wikiner_xlm_roberta_base
Type: ner
Compatibility: Spark NLP 3.1.3+
License: Open Source
Edition: Official
Input Labels: [sentence, token, embeddings]
Output Labels: [ner]
Language: xx

Data Source

https://figshare.com/articles/dataset/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500

Benchmarking

Benchmark averaged over all languages (token-level classification report and CoNLL eval):

processed 1267026 tokens with 134558 phrases; found: 132447 phrases; correct: 114590.
accuracy:  85.26%; (non-O)
accuracy:  97.23%; precision:  86.52%; recall:  85.16%; FB1:  85.83
              LOC: precision:  87.26%; recall:  88.62%; FB1:  87.94  58155
             MISC: precision:  80.06%; recall:  70.23%; FB1:  74.82  20432
              ORG: precision:  80.02%; recall:  75.09%; FB1:  77.48  14003
              PER: precision:  91.03%; recall:  92.83%; FB1:  91.92  39857
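
For reference, the aggregate precision, recall, and FB1 above can be reproduced directly from the reported CoNLL counts (correct, found, and gold phrases). A quick sanity check in Python:

correct, found, gold = 114590, 132447, 134558
precision = correct / found                           # ~0.8652
recall = correct / gold                               # ~0.8516
fb1 = 2 * precision * recall / (precision + recall)   # ~0.8583
print(f"precision={precision:.4f} recall={recall:.4f} FB1={fb1:.4f}")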



Language-by-language benchmarks (token-level classification report and CoNLL eval):


lang:  english
              precision    recall  f1-score   support

       B-LOC       0.84      0.91      0.87      8600
       I-ORG       0.82      0.81      0.82      4249
       I-LOC       0.84      0.82      0.83      3960
       I-PER       0.95      0.94      0.95      4472
       B-ORG       0.83      0.77      0.80      4882
       B-PER       0.93      0.94      0.94      9639

   micro avg       0.88      0.88      0.88     35802
   macro avg       0.87      0.87      0.87     35802
weighted avg       0.88      0.88      0.88     35802

processed 349485 tokens with 30471 phrases; found: 30143 phrases; correct: 25648.
accuracy:  85.02%; (non-O)
accuracy:  97.68%; precision:  85.09%; recall:  84.17%; FB1:  84.63
              LOC: precision:  82.69%; recall:  88.64%; FB1:  85.56  9219
             MISC: precision:  80.26%; recall:  71.82%; FB1:  75.81  6577
              ORG: precision:  81.13%; recall:  75.91%; FB1:  78.43  4568
              PER: precision:  92.44%; recall:  93.79%; FB1:  93.11  9779




###############################
lang:  french
              precision    recall  f1-score   support

       B-LOC       0.84      0.86      0.85     11482
       I-ORG       0.80      0.77      0.78      2143
       I-LOC       0.81      0.60      0.69      4495
       I-PER       0.97      0.94      0.95      5339
       B-ORG       0.84      0.81      0.82      2556
       B-PER       0.93      0.93      0.93      7524

   micro avg       0.87      0.85      0.86     33539
   macro avg       0.86      0.82      0.84     33539
weighted avg       0.87      0.85      0.86     33539

processed 348522 tokens with 25499 phrases; found: 25270 phrases; correct: 21525.
accuracy:  82.17%; (non-O)
accuracy:  97.62%; precision:  85.18%; recall:  84.42%; FB1:  84.80
              LOC: precision:  82.63%; recall:  85.18%; FB1:  83.88  11836
             MISC: precision:  80.10%; recall:  69.62%; FB1:  74.49  3422
              ORG: precision:  82.91%; recall:  79.34%; FB1:  81.09  2446
              PER: precision:  92.20%; recall:  92.72%; FB1:  92.46  7566




###############################
lang:  german
              precision    recall  f1-score   support

       B-LOC       0.88      0.90      0.89     20709
       I-ORG       0.82      0.87      0.85      5933
       I-LOC       0.81      0.82      0.82      6405
       I-PER       0.96      0.97      0.97      8365
       B-ORG       0.82      0.80      0.81      6759
       B-PER       0.93      0.94      0.94     10647

   micro avg       0.88      0.89      0.89     58818
   macro avg       0.87      0.88      0.88     58818
weighted avg       0.88      0.89      0.89     58818

processed 349393 tokens with 46006 phrases; found: 45517 phrases; correct: 39247.
accuracy:  86.81%; (non-O)
accuracy:  96.44%; precision:  86.22%; recall:  85.31%; FB1:  85.76
              LOC: precision:  86.58%; recall:  88.33%; FB1:  87.45  21128
             MISC: precision:  81.40%; recall:  72.67%; FB1:  76.79  7044
              ORG: precision:  80.11%; recall:  77.76%; FB1:  78.92  6561
              PER: precision:  92.40%; recall:  93.59%; FB1:  92.99  10784




###############################
lang:  italian
              precision    recall  f1-score   support

       B-LOC       0.91      0.92      0.91     13050
       I-ORG       0.75      0.81      0.78      1211
       I-LOC       0.92      0.86      0.89      7454
       I-PER       0.96      0.95      0.95      4539
       B-ORG       0.84      0.84      0.84      2222
       B-PER       0.93      0.94      0.93      7206

   micro avg       0.91      0.90      0.91     35682
   macro avg       0.88      0.88      0.88     35682
weighted avg       0.91      0.90      0.91     35682

processed 349242 tokens with 26227 phrases; found: 25982 phrases; correct: 23079.
accuracy:  88.06%; (non-O)
accuracy:  98.35%; precision:  88.83%; recall:  88.00%; FB1:  88.41
              LOC: precision:  89.87%; recall:  90.22%; FB1:  90.05  13101
             MISC: precision:  81.60%; recall:  74.05%; FB1:  77.64  3402
              ORG: precision:  83.26%; recall:  82.36%; FB1:  82.81  2198
              PER: precision:  92.01%; recall:  92.96%; FB1:  92.48  7281




###############################
lang:  polish
              precision    recall  f1-score   support

       B-LOC       0.91      0.93      0.92     17757
       I-ORG       0.79      0.85      0.82      2105
       I-LOC       0.89      0.85      0.87      5242
       I-PER       0.97      0.95      0.96      6672
       B-ORG       0.86      0.84      0.85      3700
       B-PER       0.93      0.94      0.94      9670

   micro avg       0.91      0.91      0.91     45146
   macro avg       0.89      0.89      0.89     45146
weighted avg       0.91      0.91      0.91     45146

processed 350132 tokens with 36235 phrases; found: 35886 phrases; correct: 32107.
accuracy:  88.31%; (non-O)
accuracy:  97.98%; precision:  89.47%; recall:  88.61%; FB1:  89.04
              LOC: precision:  90.17%; recall:  92.52%; FB1:  91.33  18221
             MISC: precision:  82.48%; recall:  70.71%; FB1:  76.15  4379
              ORG: precision:  85.37%; recall:  82.51%; FB1:  83.92  3576
              PER: precision:  92.82%; recall:  93.21%; FB1:  93.01  9710




###############################
lang:  portuguese
              precision    recall  f1-score   support

       B-LOC       0.93      0.93      0.93     14818
       I-ORG       0.80      0.85      0.83      1705
       I-LOC       0.92      0.88      0.90      8354
       I-PER       0.96      0.94      0.95      4338
       B-ORG       0.84      0.86      0.85      2351
       B-PER       0.94      0.94      0.94      6398

   micro avg       0.92      0.91      0.92     37964
   macro avg       0.90      0.90      0.90     37964
weighted avg       0.92      0.91      0.92     37964

processed 348966 tokens with 26513 phrases; found: 26349 phrases; correct: 23958.
accuracy:  90.10%; (non-O)
accuracy:  98.60%; precision:  90.93%; recall:  90.36%; FB1:  90.64
              LOC: precision:  92.13%; recall:  91.97%; FB1:  92.05  14792
             MISC: precision:  84.09%; recall:  79.46%; FB1:  81.71  2784
              ORG: precision:  83.51%; recall:  85.28%; FB1:  84.39  2401
              PER: precision:  93.91%; recall:  93.53%; FB1:  93.72  6372




###############################
lang:  russian
              precision    recall  f1-score   support

       B-LOC       0.93      0.95      0.94     14707
       I-ORG       0.84      0.73      0.78      2594
       I-LOC       0.86      0.87      0.87      5047
       I-PER       0.98      0.96      0.97      6366
       B-ORG       0.86      0.84      0.85      3697
       B-PER       0.94      0.95      0.94      7119

   micro avg       0.92      0.92      0.92     39530
   macro avg       0.90      0.88      0.89     39530
weighted avg       0.92      0.92      0.92     39530




###############################
lang:  spanish
              precision    recall  f1-score   support

       B-LOC       0.89      0.90      0.89     11963
       I-ORG       0.83      0.79      0.81      1950
       I-LOC       0.89      0.80      0.84      6162
       I-PER       0.97      0.94      0.95      4678
       B-ORG       0.84      0.80      0.82      2084
       B-PER       0.94      0.94      0.94      7215

   micro avg       0.90      0.88      0.89     34052
   macro avg       0.89      0.86      0.88     34052
weighted avg       0.90      0.88      0.89     34052

processed 348209 tokens with 24505 phrases; found: 24360 phrases; correct: 21446.
accuracy:  86.63%; (non-O)
accuracy:  98.24%; precision:  88.04%; recall:  87.52%; FB1:  87.78
              LOC: precision:  87.90%; recall:  88.74%; FB1:  88.32  12078
             MISC: precision:  79.02%; recall:  74.68%; FB1:  76.79  3065
              ORG: precision:  82.57%; recall:  78.89%; FB1:  80.69  1991
              PER: precision:  93.61%; recall:  93.75%; FB1:  93.68  7226