BERT Token Classification - Few-NERD (bert_base_token_classifier_few_nerd)

Description

BERT Model with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.

This model is fine-tuned on the Few-NERD dataset. Few-NERD is a large-scale, fine-grained manually annotated named entity recognition dataset, which contains 8 coarse-grained types, 66 fine-grained types, 188,200 sentences, 491,711 entities, and 4,601,223 tokens. Three benchmark tasks are built, one is supervised (Few-NERD (SUP)) and the other two are few-shot (Few-NERD (INTRA) and Few-NERD (INTER)). Few-NERD is collected by researchers from Tsinghua University and DAMO Academy, Alibaba Group.

Predicted Entities

art-broadcastprogram art-film art-music art-other art-painting art-writtenart building-airport building-hospital building-hotel building-library building-other building-restaurant building-sportsfacility building-theater event-attack/battle/war/militaryconflict event-disaster event-election event-other event-protest event-sportsevent location-GPE location-bodiesofwater location-island location-mountain location-other location-park location-road/railway/highway/transit organization-company organization-education organization-government/governmentagency organization-media/newspaper organization-other organization-politicalparty organization-religion organization-showorganization organization-sportsleague organization-sportsteam other-astronomything other-award other-biologything other-chemicalthing other-currency other-disease other-educationaldegree other-god other-language other-law other-livingthing other-medical person-actor person-artist/author person-athlete person-director person-other person-politician person-scholar person-soldier product-airplane product-car product-food product-game product-other product-ship product-software product-train product-weapon

Download

How to use

document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

tokenClassifier = BertForTokenClassification \
      .pretrained('bert_base_token_classifier_few_nerd', 'en') \
      .setInputCols(['token', 'document']) \
      .setOutputCol('ner') \
      .setCaseSensitive(True) \
      .setMaxSentenceLength(512)

pipeline = Pipeline(stages=[
    document_assembler, 
    tokenizer,
    tokenClassifier
])

example = spark.createDataFrame([['My name is John!']]).toDF("text")
result = pipeline.fit(example).transform(example)
val document_assembler = DocumentAssembler() 
    .setInputCol("text") 
    .setOutputCol("document")

val tokenizer = Tokenizer() 
    .setInputCols("document") 
    .setOutputCol("token")

val tokenClassifier = BertForTokenClassification.pretrained("bert_base_token_classifier_few_nerd", "en")
      .setInputCols("document", "token")
      .setOutputCol("ner")
      .setCaseSensitive(true)
      .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, tokenClassifier))

val example = Seq.empty["My name is John!"].toDS.toDF("text")

val result = pipeline.fit(example).transform(example)

Model Information

Model Name: bert_base_token_classifier_few_nerd
Compatibility: Spark NLP 3.2.0+
License: Open Source
Edition: Official
Input Labels: [token, document]
Output Labels: [ner]
Language: en
Case sensitive: true
Max sentense length: 512

Data Source

https://github.com/thunlp/Few-NERD

Benchmarking

Test:

                        precision    recall  f1-score   support

                                       O       0.98      0.98      0.98    365750
                    art-broadcastprogram       0.66      0.66      0.66       890
                                art-film       0.78      0.78      0.78      1039
                               art-music       0.85      0.81      0.83      1773
                               art-other       0.40      0.40      0.40       729
                            art-painting       0.51      0.43      0.47        91
                          art-writtenart       0.69      0.70      0.70      1570
                        building-airport       0.83      0.88      0.85       391
                       building-hospital       0.80      0.89      0.84       577
                          building-hotel       0.87      0.80      0.83       526
                        building-library       0.81      0.86      0.83       715
                          building-other       0.64      0.67      0.65      3448
                     building-restaurant       0.72      0.57      0.64       283
                 building-sportsfacility       0.65      0.82      0.72       495
                        building-theater       0.78      0.90      0.83       529
event-attack/battle/war/militaryconflict       0.82      0.87      0.85      1583
                          event-disaster       0.67      0.73      0.70       317
                          event-election       0.56      0.46      0.51       282
                             event-other       0.65      0.57      0.60      1634
                           event-protest       0.41      0.48      0.44       227
                       event-sportsevent       0.74      0.80      0.77      1975
                            location-GPE       0.82      0.86      0.84     13112
                  location-bodiesofwater       0.83      0.82      0.83      1210
                         location-island       0.81      0.81      0.81       666
                       location-mountain       0.82      0.78      0.80       734
                          location-other       0.45      0.36      0.40      2207
                           location-park       0.71      0.81      0.76       634
   location-road/railway/highway/transit       0.76      0.79      0.77      1861
                    organization-company       0.75      0.77      0.76      3982
                  organization-education       0.87      0.88      0.88      3432
organization-government/governmentagency       0.65      0.60      0.62      2178
            organization-media/newspaper       0.63      0.67      0.65      1291
                      organization-other       0.63      0.64      0.64      5989
             organization-politicalparty       0.75      0.81      0.78      1199
                   organization-religion       0.65      0.74      0.69       830
           organization-showorganization       0.74      0.78      0.76       933
               organization-sportsleague       0.75      0.60      0.67      1088
                 organization-sportsteam       0.79      0.84      0.81      2374
                    other-astronomything       0.80      0.82      0.81       625
                             other-award       0.80      0.73      0.77      1873
                      other-biologything       0.69      0.70      0.69      1282
                     other-chemicalthing       0.70      0.56      0.62       881
                          other-currency       0.75      0.85      0.80       608
                           other-disease       0.71      0.73      0.72       825
                 other-educationaldegree       0.73      0.80      0.76       599
                               other-god       0.70      0.67      0.69       316
                          other-language       0.75      0.83      0.78       539
                               other-law       0.82      0.82      0.82       966
                       other-livingthing       0.64      0.71      0.67       696
                           other-medical       0.53      0.45      0.49       293
                            person-actor       0.85      0.82      0.83      1510
                    person-artist/author       0.74      0.77      0.76      3083
                          person-athlete       0.84      0.86      0.85      2519
                         person-director       0.73      0.73      0.73       535
                            person-other       0.71      0.68      0.70      7601
                       person-politician       0.72      0.72      0.72      2588
                          person-scholar       0.54      0.59      0.56       657
                          person-soldier       0.63      0.67      0.65       573
                        product-airplane       0.79      0.69      0.73       781
                             product-car       0.84      0.79      0.81       779
                            product-food       0.53      0.56      0.54       345
                            product-game       0.81      0.81      0.81       534
                           product-other       0.60      0.45      0.51      1751
                            product-ship       0.65      0.71      0.68       333
                        product-software       0.62      0.66      0.64       693
                           product-train       0.50      0.72      0.59       274
                          product-weapon       0.74      0.70      0.72       611

                                accuracy                           0.93    463214
                               macro avg       0.71      0.72      0.71    463214
                            weighted avg       0.93      0.93      0.93    463214



processed 463214 tokens with 48764 phrases; found: 51017 phrases; correct: 34149.
accuracy:  73.78%; (non-O)
accuracy:  92.88%; precision:  66.94%; recall:  70.03%; FB1:  68.45
              GPE: precision:  79.57%; recall:  84.68%; FB1:  82.05  11001
            actor: precision:  81.64%; recall:  78.81%; FB1:  80.20  779
         airplane: precision:  65.69%; recall:  52.48%; FB1:  58.35  306
          airport: precision:  74.17%; recall:  78.87%; FB1:  76.45  151
    artist/author: precision:  69.20%; recall:  74.45%; FB1:  71.73  1857
   astronomything: precision:  70.49%; recall:  73.30%; FB1:  71.87  366
          athlete: precision:  80.10%; recall:  83.94%; FB1:  81.98  1553
attack/battle/war/militaryconflict: precision:  72.32%; recall:  81.75%; FB1:  76.75  607
            award: precision:  58.38%; recall:  59.30%; FB1:  58.83  519
     biologything: precision:  61.19%; recall:  63.36%; FB1:  62.25  907
    bodiesofwater: precision:  76.54%; recall:  77.16%; FB1:  76.85  618
 broadcastprogram: precision:  57.97%; recall:  60.98%; FB1:  59.44  345
              car: precision:  68.66%; recall:  67.74%; FB1:  68.20  367
    chemicalthing: precision:  57.74%; recall:  50.92%; FB1:  54.12  478
          company: precision:  66.25%; recall:  68.84%; FB1:  67.52  1991
         currency: precision:  66.60%; recall:  76.23%; FB1:  71.09  467
         director: precision:  68.20%; recall:  68.93%; FB1:  68.56  283
         disaster: precision:  46.54%; recall:  57.36%; FB1:  51.39  159
          disease: precision:  58.45%; recall:  65.62%; FB1:  61.83  503
        education: precision:  77.32%; recall:  80.18%; FB1:  78.73  1151
educationaldegree: precision:  55.30%; recall:  62.50%; FB1:  58.68  217
         election: precision:  26.83%; recall:  26.51%; FB1:  26.67  82
             film: precision:  73.77%; recall:  74.32%; FB1:  74.05  408
             food: precision:  43.35%; recall:  44.67%; FB1:  44.00  203
             game: precision:  67.61%; recall:  73.47%; FB1:  70.42  213
              god: precision:  67.04%; recall:  70.98%; FB1:  68.95  270
government/governmentagency: precision:  47.22%; recall:  45.37%; FB1:  46.28  737
         hospital: precision:  67.01%; recall:  77.38%; FB1:  71.82  194
            hotel: precision:  68.93%; recall:  66.67%; FB1:  67.78  177
           island: precision:  72.58%; recall:  72.58%; FB1:  72.58  361
         language: precision:  68.77%; recall:  80.56%; FB1:  74.20  506
              law: precision:  56.51%; recall:  62.50%; FB1:  59.35  292
          library: precision:  66.37%; recall:  73.89%; FB1:  69.93  226
      livingthing: precision:  59.27%; recall:  63.12%; FB1:  61.13  491
  media/newspaper: precision:  52.84%; recall:  62.66%; FB1:  57.34  721
          medical: precision:  52.25%; recall:  52.97%; FB1:  52.61  222
         mountain: precision:  73.95%; recall:  71.93%; FB1:  72.93  357
            music: precision:  76.52%; recall:  74.13%; FB1:  75.31  558
            other: precision:  59.20%; recall:  59.14%; FB1:  59.17  10514
         painting: precision:  37.04%; recall:  40.00%; FB1:  38.46  27
             park: precision:  61.15%; recall:  73.61%; FB1:  66.81  260
   politicalparty: precision:  61.72%; recall:  74.45%; FB1:  67.49  661
       politician: precision:  66.98%; recall:  68.20%; FB1:  67.58  1508
          protest: precision:  28.00%; recall:  39.77%; FB1:  32.86  125
         religion: precision:  51.49%; recall:  59.02%; FB1:  55.00  470
       restaurant: precision:  60.19%; recall:  51.18%; FB1:  55.32  108
road/railway/highway/transit: precision:  64.51%; recall:  69.78%; FB1:  67.04  834
          scholar: precision:  50.13%; recall:  54.50%; FB1:  52.22  399
             ship: precision:  49.20%; recall:  50.83%; FB1:  50.00  187
 showorganization: precision:  63.75%; recall:  71.70%; FB1:  67.49  469
         software: precision:  56.28%; recall:  62.53%; FB1:  59.24  430
          soldier: precision:  55.59%; recall:  61.63%; FB1:  58.45  367
      sportsevent: precision:  55.30%; recall:  63.48%; FB1:  59.11  792
   sportsfacility: precision:  59.68%; recall:  75.50%; FB1:  66.67  253
     sportsleague: precision:  65.14%; recall:  58.91%; FB1:  61.87  416
       sportsteam: precision:  70.38%; recall:  79.51%; FB1:  74.66  1384
          theater: precision:  67.23%; recall:  80.20%; FB1:  73.15  235
            train: precision:  39.51%; recall:  54.70%; FB1:  45.88  162
           weapon: precision:  55.96%; recall:  50.99%; FB1:  53.36  277
       writtenart: precision:  56.65%; recall:  60.95%; FB1:  58.73  496