BERT Token Classification - Swedish Language Understanding (bert_token_classifier_swedish_ner)

Description

BERT model with a token classification head on top (a linear layer on top of the hidden-states output), e.g. for Named Entity Recognition (NER) tasks.

Recorded Future, together with AI Sweden, releases a Named Entity Recognition (NER) model for entity detection in Swedish. The model is based on KB/bert-base-swedish-cased and fine-tuned on data collected from various internet sources and forums.

The model has been trained on Swedish data and only supports inference on Swedish input texts. Inference metrics are not defined for non-Swedish inputs; such inputs are considered out-of-domain data.

Predicted Entities

  • Location
  • Organization
  • Person
  • Religion
  • Title

How to use

from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, BertForTokenClassification, NerConverter
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler() \
.setInputCol('text') \
.setOutputCol('document')

tokenizer = Tokenizer() \
.setInputCols(['document']) \
.setOutputCol('token')

tokenClassifier = BertForTokenClassification \
.pretrained('bert_token_classifier_swedish_ner', 'sv') \
.setInputCols(['token', 'document']) \
.setOutputCol('ner') \
.setCaseSensitive(True) \
.setMaxSentenceLength(512)

# since output column is IOB/IOB2 style, NerConverter can extract entities
ner_converter = NerConverter() \
.setInputCols(['document', 'token', 'ner']) \
.setOutputCol('entities')

pipeline = Pipeline(stages=[
document_assembler, 
tokenizer,
tokenClassifier,
ner_converter
])

example = spark.createDataFrame([["Engelbert tar Volvon till Tele2 Arena för att titta på Djurgården som spelar fotboll i VM klockan två på kvällen."]]).toDF("text")
result = pipeline.fit(example).transform(example)
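
Once the pipeline has run, the extracted chunks are available in the entities column as an array of annotation structs. The following is a minimal sketch for pulling out each chunk and its label from the result DataFrame produced above (the exact contents of the metadata map may vary slightly across Spark NLP versions):

result.selectExpr("explode(entities) AS entity") \
    .selectExpr("entity.result AS chunk", "entity.metadata['entity'] AS ner_label") \
    .show(truncate=False)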

The same pipeline in Scala:

import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._

val document_assembler = new DocumentAssembler()
.setInputCol("text") 
.setOutputCol("document")

val tokenizer = new Tokenizer()
.setInputCols("document") 
.setOutputCol("token")

val tokenClassifier = BertForTokenClassification.pretrained("bert_token_classifier_swedish_ner", "sv")
.setInputCols("document", "token")
.setOutputCol("ner")
.setCaseSensitive(true)
.setMaxSentenceLength(512)

// since output column is IOB/IOB2 style, NerConverter can extract entities
val ner_converter = new NerConverter()
.setInputCols("document", "token", "ner") 
.setOutputCol("entities")

val pipeline = new Pipeline().setStages(Array(document_assembler, tokenizer, tokenClassifier, ner_converter))

val example = Seq("Engelbert tar Volvon till Tele2 Arena för att titta på Djurgården som spelar fotboll i VM klockan två på kvällen.").toDS.toDF("text")

val result = pipeline.fit(example).transform(example)

The same model via the NLU one-liner:

import nlu
nlu.load("sv.classify.token_bert.swedish_ner").predict("""Engelbert tar Volvon till Tele2 Arena för att titta på Djurgården som spelar fotboll i VM klockan två på kvällen.""")

Model Information

Model Name: bert_token_classifier_swedish_ner
Compatibility: Spark NLP 3.2.0+
License: Open Source
Edition: Official
Input Labels: [token, document]
Output Labels: [ner]
Language: sv
Case sensitive: true
Max sentence length: 512
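
The Predicted Entities listed above can also be read off the loaded annotator itself. A minimal sketch, assuming the getClasses getter exposed by Spark NLP's token-classification annotators (it returns the raw IOB-style tags the classification head was trained with):

from sparknlp.annotator import BertForTokenClassification

model = BertForTokenClassification.pretrained('bert_token_classifier_swedish_ner', 'sv')
print(model.getClasses())   # e.g. O plus B-/I- variants of the entity labels above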

Data Source

https://huggingface.co/RecordedFuture/Swedish-NER

Benchmarking

The model achieved the following metrics when evaluated on test data originating from the same domain as the training data.

F1-score

| Loc  | Org  | Per  | Nat  | Rel  | Tit  | Total |
|------|------|------|------|------|------|-------|
| 0.91 | 0.88 | 0.96 | 0.95 | 0.91 | 0.84 | 0.92  |