BERT Token Classification - Few-NERD (bert_base_token_classifier_few_nerd)

Description

BERT model with a token classification head on top (a linear layer over the hidden-states output), e.g. for named entity recognition (NER) tasks.

This model is fine-tuned on the Few-NERD dataset. Few-NERD is a large-scale, fine-grained, manually annotated named entity recognition dataset containing 8 coarse-grained types, 66 fine-grained types, 188,200 sentences, 491,711 entities, and 4,601,223 tokens. Three benchmark tasks are built on it: one supervised (Few-NERD (SUP)) and two few-shot (Few-NERD (INTRA) and Few-NERD (INTER)). Few-NERD was collected by researchers from Tsinghua University and DAMO Academy, Alibaba Group.

Predicted Entities

  • art-broadcastprogram
  • art-film
  • art-music
  • art-other
  • art-painting
  • art-writtenart
  • building-airport
  • building-hospital
  • building-hotel
  • building-library
  • building-other
  • building-restaurant
  • building-sportsfacility
  • building-theater
  • event-attack/battle/war/militaryconflict
  • event-disaster
  • event-election
  • event-other
  • event-protest
  • event-sportsevent
  • location-GPE
  • location-bodiesofwater
  • location-island
  • location-mountain
  • location-other
  • location-park
  • location-road/railway/highway/transit
  • organization-company
  • organization-education
  • organization-government/governmentagency
  • organization-media/newspaper
  • organization-other
  • organization-politicalparty
  • organization-religion
  • organization-showorganization
  • organization-sportsleague
  • organization-sportsteam
  • other-astronomything
  • other-award
  • other-biologything
  • other-chemicalthing
  • other-currency
  • other-disease
  • other-educationaldegree
  • other-god
  • other-language
  • other-law
  • other-livingthing
  • other-medical
  • person-actor
  • person-artist/author
  • person-athlete
  • person-director
  • person-other
  • person-politician
  • person-scholar
  • person-soldier
  • product-airplane
  • product-car
  • product-food
  • product-game
  • product-other
  • product-ship
  • product-software
  • product-train
  • product-weapon
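
Each label above joins a coarse-grained type and a fine-grained type with a hyphen. A minimal sketch of separating the two, assuming the first hyphen is always the separator; the helper name `split_label` is hypothetical and not part of Spark NLP:

```python
from collections import defaultdict

def split_label(label):
    """Split 'coarse-fine' at the first hyphen; a label without a hyphen has no fine type."""
    coarse, _, fine = label.partition("-")
    return coarse, (fine or None)

# A handful of labels copied from the list above.
labels = [
    "art-film",
    "event-attack/battle/war/militaryconflict",
    "location-GPE",
    "location-island",
    "person-artist/author",
]

# Group the fine-grained types under their coarse-grained type.
by_coarse = defaultdict(list)
for label in labels:
    coarse, fine = split_label(label)
    by_coarse[coarse].append(fine)

for coarse, fines in sorted(by_coarse.items()):
    print(coarse, fines)
```

Splitting only at the first hyphen keeps fine-grained names that contain slashes, such as `attack/battle/war/militaryconflict`, intact.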


How to use

Python:

import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, BertForTokenClassification
from pyspark.ml import Pipeline

# Start (or reuse) a Spark session with Spark NLP available
spark = sparknlp.start()

# Wrap the raw text column in document annotations
document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

# Split each document into tokens
tokenizer = Tokenizer() \
    .setInputCols(['document']) \
    .setOutputCol('token')

# Load the pretrained model and label every token
tokenClassifier = BertForTokenClassification \
    .pretrained('bert_base_token_classifier_few_nerd', 'en') \
    .setInputCols(['token', 'document']) \
    .setOutputCol('ner') \
    .setCaseSensitive(True) \
    .setMaxSentenceLength(512)

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    tokenClassifier
])

example = spark.createDataFrame([['My name is John!']]).toDF("text")
result = pipeline.fit(example).transform(example)

Scala:

import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
// Needed for .toDS; assumes an active SparkSession named `spark`
import spark.implicits._

val document_assembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

val tokenizer = new Tokenizer()
    .setInputCols("document")
    .setOutputCol("token")

val tokenClassifier = BertForTokenClassification.pretrained("bert_base_token_classifier_few_nerd", "en")
    .setInputCols("document", "token")
    .setOutputCol("ner")
    .setCaseSensitive(true)
    .setMaxSentenceLength(512)

val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, tokenClassifier))

val example = Seq("My name is John!").toDS.toDF("text")

val result = pipeline.fit(example).transform(example)

NLU:

import nlu
nlu.load("en.classify.token_bert.few_nerd").predict("""My name is John!""")
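
After the pipeline runs, the `token` and `ner` columns hold one annotation per token. The labels listed under Predicted Entities carry no B-/I- prefix, so one way to recover multi-token entities is to merge adjacent tokens that share a label. A minimal sketch in plain Python, assuming the tokens and labels have already been collected into two parallel lists; the helper name `group_entities` is hypothetical, and the sample labels are hand-written for illustration, not actual model output:

```python
def group_entities(tokens, labels, outside="O"):
    """Merge runs of adjacent tokens with the same non-O label into (text, label) spans."""
    spans = []
    current_tokens, current_label = [], None
    for token, label in zip(tokens, labels):
        if label == current_label and label != outside:
            # Same entity label as the previous token: extend the open span.
            current_tokens.append(token)
            continue
        if current_tokens:
            # Label changed: close the span that was open.
            spans.append((" ".join(current_tokens), current_label))
        current_tokens = [token] if label != outside else []
        current_label = label if label != outside else None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_label))
    return spans

# Hand-written labels for illustration only; not produced by the model.
tokens = ["Ada", "Lovelace", "visited", "Lake", "Geneva", "."]
labels = ["person-scholar", "person-scholar", "O",
          "location-bodiesofwater", "location-bodiesofwater", "O"]
entities = group_entities(tokens, labels)
print(entities)
```

One limitation of this approach: two distinct entities of the same type that sit directly next to each other are merged into a single span.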

Model Information

Model Name: bert_base_token_classifier_few_nerd
Compatibility: Spark NLP 3.2.0+
License: Open Source
Edition: Official
Input Labels: [token, document]
Output Labels: [ner]
Language: en
Case sensitive: true
Max sentence length: 512

Data Source

https://github.com/thunlp/Few-NERD

Benchmarking

Test:

label       precision    recall  f1-score   support

O       0.98      0.98      0.98    365750
art-broadcastprogram       0.66      0.66      0.66       890
art-film       0.78      0.78      0.78      1039
art-music       0.85      0.81      0.83      1773
art-other       0.40      0.40      0.40       729
art-painting       0.51      0.43      0.47        91
art-writtenart       0.69      0.70      0.70      1570
building-airport       0.83      0.88      0.85       391
building-hospital       0.80      0.89      0.84       577
building-hotel       0.87      0.80      0.83       526
building-library       0.81      0.86      0.83       715
building-other       0.64      0.67      0.65      3448
building-restaurant       0.72      0.57      0.64       283
building-sportsfacility       0.65      0.82      0.72       495
building-theater       0.78      0.90      0.83       529
event-attack/battle/war/militaryconflict       0.82      0.87      0.85      1583
event-disaster       0.67      0.73      0.70       317
event-election       0.56      0.46      0.51       282
event-other       0.65      0.57      0.60      1634
event-protest       0.41      0.48      0.44       227
event-sportsevent       0.74      0.80      0.77      1975
location-GPE       0.82      0.86      0.84     13112
location-bodiesofwater       0.83      0.82      0.83      1210
location-island       0.81      0.81      0.81       666
location-mountain       0.82      0.78      0.80       734
location-other       0.45      0.36      0.40      2207
location-park       0.71      0.81      0.76       634
location-road/railway/highway/transit       0.76      0.79      0.77      1861
organization-company       0.75      0.77      0.76      3982
organization-education       0.87      0.88      0.88      3432
organization-government/governmentagency       0.65      0.60      0.62      2178
organization-media/newspaper       0.63      0.67      0.65      1291
organization-other       0.63      0.64      0.64      5989
organization-politicalparty       0.75      0.81      0.78      1199
organization-religion       0.65      0.74      0.69       830
organization-showorganization       0.74      0.78      0.76       933
organization-sportsleague       0.75      0.60      0.67      1088
organization-sportsteam       0.79      0.84      0.81      2374
other-astronomything       0.80      0.82      0.81       625
other-award       0.80      0.73      0.77      1873
other-biologything       0.69      0.70      0.69      1282
other-chemicalthing       0.70      0.56      0.62       881
other-currency       0.75      0.85      0.80       608
other-disease       0.71      0.73      0.72       825
other-educationaldegree       0.73      0.80      0.76       599
other-god       0.70      0.67      0.69       316
other-language       0.75      0.83      0.78       539
other-law       0.82      0.82      0.82       966
other-livingthing       0.64      0.71      0.67       696
other-medical       0.53      0.45      0.49       293
person-actor       0.85      0.82      0.83      1510
person-artist/author       0.74      0.77      0.76      3083
person-athlete       0.84      0.86      0.85      2519
person-director       0.73      0.73      0.73       535
person-other       0.71      0.68      0.70      7601
person-politician       0.72      0.72      0.72      2588
person-scholar       0.54      0.59      0.56       657
person-soldier       0.63      0.67      0.65       573
product-airplane       0.79      0.69      0.73       781
product-car       0.84      0.79      0.81       779
product-food       0.53      0.56      0.54       345
product-game       0.81      0.81      0.81       534
product-other       0.60      0.45      0.51      1751
product-ship       0.65      0.71      0.68       333
product-software       0.62      0.66      0.64       693
product-train       0.50      0.72      0.59       274
product-weapon       0.74      0.70      0.72       611

accuracy                           0.93    463214
macro avg       0.71      0.72      0.71    463214
weighted avg       0.93      0.93      0.93    463214
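
The f1-score column is the harmonic mean of precision and recall. The macro average gives every label equal weight, while the weighted average weights each label by its support. A small sketch that recomputes a few values from rows of the table above; the function and variable names are illustrative, and because the table is rounded to two decimals, recomputed values can differ in the last digit:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, support) copied from three rows of the table above.
rows = {
    "art-painting": (0.51, 0.43, 91),
    "building-restaurant": (0.72, 0.57, 283),
    "location-GPE": (0.82, 0.86, 13112),
}

f1 = {label: f1_score(p, r) for label, (p, r, _) in rows.items()}

# Macro average: plain mean over labels.
macro = sum(f1.values()) / len(f1)

# Weighted average: mean weighted by each label's support.
total_support = sum(s for _, _, s in rows.values())
weighted = sum(f1[label] * s for label, (_, _, s) in rows.items()) / total_support

for label, value in f1.items():
    print(f"{label}: {value:.2f}")
print(f"macro over these rows: {macro:.2f}, weighted: {weighted:.2f}")
```

The same effect explains why the table's weighted average (0.93) sits well above its macro average (0.71): the O label has by far the largest support and a high F1.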



processed 463214 tokens with 48764 phrases; found: 51017 phrases; correct: 34149.
accuracy:  73.78%; (non-O)
accuracy:  92.88%; precision:  66.94%; recall:  70.03%; FB1:  68.45
GPE: precision:  79.57%; recall:  84.68%; FB1:  82.05  11001
actor: precision:  81.64%; recall:  78.81%; FB1:  80.20  779
airplane: precision:  65.69%; recall:  52.48%; FB1:  58.35  306
airport: precision:  74.17%; recall:  78.87%; FB1:  76.45  151
artist/author: precision:  69.20%; recall:  74.45%; FB1:  71.73  1857
astronomything: precision:  70.49%; recall:  73.30%; FB1:  71.87  366
athlete: precision:  80.10%; recall:  83.94%; FB1:  81.98  1553
attack/battle/war/militaryconflict: precision:  72.32%; recall:  81.75%; FB1:  76.75  607
award: precision:  58.38%; recall:  59.30%; FB1:  58.83  519
biologything: precision:  61.19%; recall:  63.36%; FB1:  62.25  907
bodiesofwater: precision:  76.54%; recall:  77.16%; FB1:  76.85  618
broadcastprogram: precision:  57.97%; recall:  60.98%; FB1:  59.44  345
car: precision:  68.66%; recall:  67.74%; FB1:  68.20  367
chemicalthing: precision:  57.74%; recall:  50.92%; FB1:  54.12  478
company: precision:  66.25%; recall:  68.84%; FB1:  67.52  1991
currency: precision:  66.60%; recall:  76.23%; FB1:  71.09  467
director: precision:  68.20%; recall:  68.93%; FB1:  68.56  283
disaster: precision:  46.54%; recall:  57.36%; FB1:  51.39  159
disease: precision:  58.45%; recall:  65.62%; FB1:  61.83  503
education: precision:  77.32%; recall:  80.18%; FB1:  78.73  1151
educationaldegree: precision:  55.30%; recall:  62.50%; FB1:  58.68  217
election: precision:  26.83%; recall:  26.51%; FB1:  26.67  82
film: precision:  73.77%; recall:  74.32%; FB1:  74.05  408
food: precision:  43.35%; recall:  44.67%; FB1:  44.00  203
game: precision:  67.61%; recall:  73.47%; FB1:  70.42  213
god: precision:  67.04%; recall:  70.98%; FB1:  68.95  270
government/governmentagency: precision:  47.22%; recall:  45.37%; FB1:  46.28  737
hospital: precision:  67.01%; recall:  77.38%; FB1:  71.82  194
hotel: precision:  68.93%; recall:  66.67%; FB1:  67.78  177
island: precision:  72.58%; recall:  72.58%; FB1:  72.58  361
language: precision:  68.77%; recall:  80.56%; FB1:  74.20  506
law: precision:  56.51%; recall:  62.50%; FB1:  59.35  292
library: precision:  66.37%; recall:  73.89%; FB1:  69.93  226
livingthing: precision:  59.27%; recall:  63.12%; FB1:  61.13  491
media/newspaper: precision:  52.84%; recall:  62.66%; FB1:  57.34  721
medical: precision:  52.25%; recall:  52.97%; FB1:  52.61  222
mountain: precision:  73.95%; recall:  71.93%; FB1:  72.93  357
music: precision:  76.52%; recall:  74.13%; FB1:  75.31  558
other: precision:  59.20%; recall:  59.14%; FB1:  59.17  10514
painting: precision:  37.04%; recall:  40.00%; FB1:  38.46  27
park: precision:  61.15%; recall:  73.61%; FB1:  66.81  260
politicalparty: precision:  61.72%; recall:  74.45%; FB1:  67.49  661
politician: precision:  66.98%; recall:  68.20%; FB1:  67.58  1508
protest: precision:  28.00%; recall:  39.77%; FB1:  32.86  125
religion: precision:  51.49%; recall:  59.02%; FB1:  55.00  470
restaurant: precision:  60.19%; recall:  51.18%; FB1:  55.32  108
road/railway/highway/transit: precision:  64.51%; recall:  69.78%; FB1:  67.04  834
scholar: precision:  50.13%; recall:  54.50%; FB1:  52.22  399
ship: precision:  49.20%; recall:  50.83%; FB1:  50.00  187
showorganization: precision:  63.75%; recall:  71.70%; FB1:  67.49  469
software: precision:  56.28%; recall:  62.53%; FB1:  59.24  430
soldier: precision:  55.59%; recall:  61.63%; FB1:  58.45  367
sportsevent: precision:  55.30%; recall:  63.48%; FB1:  59.11  792
sportsfacility: precision:  59.68%; recall:  75.50%; FB1:  66.67  253
sportsleague: precision:  65.14%; recall:  58.91%; FB1:  61.87  416
sportsteam: precision:  70.38%; recall:  79.51%; FB1:  74.66  1384
theater: precision:  67.23%; recall:  80.20%; FB1:  73.15  235
train: precision:  39.51%; recall:  54.70%; FB1:  45.88  162
weapon: precision:  55.96%; recall:  50.99%; FB1:  53.36  277
writtenart: precision:  56.65%; recall:  60.95%; FB1:  58.73  496
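
The overall precision, recall, and FB1 in this entity-level report follow from the three phrase counts on its first line: precision is correct/found, recall is correct/total phrases, and FB1 is their harmonic mean. A short check using only the numbers printed above (variable names are illustrative):

```python
# Counts copied from the first line of the report above.
gold_phrases = 48764     # phrases in the reference annotations
found_phrases = 51017    # phrases predicted by the model
correct_phrases = 34149  # predicted phrases that match the reference

precision = correct_phrases / found_phrases
recall = correct_phrases / gold_phrases
fb1 = 2 * precision * recall / (precision + recall)

print(f"precision: {100 * precision:.2f}%")
print(f"recall: {100 * recall:.2f}%")
print(f"FB1: {100 * fb1:.2f}")
```

These phrase-level scores are lower than the token-level scores in the first table because a phrase counts as correct only when its full span and type both match.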