Icelandic NER Model

Description

This model, imported from Hugging Face, performs named entity recognition (NER) on Icelandic text. It was fine-tuned on the MIM-GOLD-NER dataset, using RoBERTa embeddings with a RobertaForTokenClassification head.

Predicted Entities

Date, Location, Miscellaneous, Money, Organization, Percent, Person, Time


How to use

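# Python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetectorDLModel, Tokenizer, RoBertaForTokenClassification, NerConverter
from pyspark.ml import Pipeline

spark = sparknlp.start()

# Wrap the raw text column in Spark NLP's document annotation
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Split documents into sentences with the multilingual ("xx") detector
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Token-level NER tags from the Icelandic RoBERTa model
tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_token_classifier_icelandic_ner", "is")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("ner")

# Merge token-level tags into entity chunks
ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter])

# No stage needs training, so fitting on an empty DataFrame is enough
empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)
text = """Ég heiti Peter Fergusson. Ég hef búið í New York síðan í október 2011 og unnið hjá Tesla Motor og þénað 100K $ á ári."""
result = model.transform(spark.createDataFrame([[text]]).toDF("text"))
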
// Scala
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline
import spark.implicits._ // required for .toDS on a Seq

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val tokenClassifier = RoBertaForTokenClassification.pretrained("roberta_token_classifier_icelandic_ner", "is")
  .setInputCols(Array("sentence", "token"))
  .setOutputCol("ner")

val ner_converter = new NerConverter()
  .setInputCols(Array("sentence", "token", "ner"))
  .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter))

val example = Seq("Ég heiti Peter Fergusson. Ég hef búið í New York síðan í október 2011 og unnið hjá Tesla Motor og þénað 100K $ á ári.").toDS.toDF("text")

val result = pipeline.fit(example).transform(example)

# Python (NLU)
import nlu
nlu.load("is.ner").predict("""Ég heiti Peter Fergusson. Ég hef búið í New York síðan í október 2011 og unnið hjá Tesla Motor og þénað 100K $ á ári.""")

Results

+----------------+------------+
|chunk           |ner_label   |
+----------------+------------+
|Peter Fergusson |Person      |
|New York        |Location    |
|október 2011    |Date        |
|Tesla Motor     |Organization|
|100K $          |Money       |
+----------------+------------+
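
The chunks in the table above come from merging token-level tags into multi-token entities, which is the job of the converter stage in the pipeline. The following is a minimal pure-Python sketch of that idea, assuming the classifier emits IOB-style tags (`B-Person`, `I-Person`, `O`). The function name `merge_iob` and the sample tags are hypothetical and written for illustration only; this is not Spark NLP's implementation.

```python
from typing import List, Optional, Tuple


def merge_iob(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group tokens tagged B-X / I-X into (chunk_text, label) pairs.

    Assumption: tags follow the IOB scheme. An I- tag that does not
    continue an open chunk with the same label is ignored in this sketch.
    """
    chunks: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_label: Optional[str] = None

    def flush() -> None:
        # Emit the chunk collected so far (if any) and reset the state.
        nonlocal current_tokens, current_label
        if current_tokens and current_label is not None:
            chunks.append((" ".join(current_tokens), current_label))
        current_tokens, current_label = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new chunk.
            flush()
            current_tokens, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag with the same label extends the open chunk.
            current_tokens.append(token)
        else:
            # "O" or an inconsistent I- tag closes the open chunk.
            flush()
    flush()
    return chunks


# Hypothetical token-level tags for part of the example sentence.
tokens = ["Ég", "heiti", "Peter", "Fergusson", ".", "Ég", "hef", "búið", "í", "New", "York"]
tags = ["O", "O", "B-Person", "I-Person", "O", "O", "O", "O", "O", "B-Location", "I-Location"]
print(merge_iob(tokens, tags))
```

Joining tokens with a single space is a simplification; a real converter would use character offsets to recover the original text span.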

Model Information

Model Name: roberta_token_classifier_icelandic_ner
Compatibility: Spark NLP 3.3.2+
License: Open Source
Edition: Official
Input Labels: [sentence, token]
Output Labels: [ner]
Language: is
Case sensitive: true
Max sentence length: 256

Data Source

https://huggingface.co/m3hrdadfi/icelandic-ner-roberta

Benchmarking

label            score
Macro-F1-Score   0.957209
Micro-F1-Score   0.951866
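
The two scores differ in how they aggregate over entity labels: macro-F1 averages the per-label F1 scores, so every label counts equally, while micro-F1 pools true positives, false positives, and false negatives across labels before computing a single F1. The sketch below illustrates the difference in plain Python. The helper names and the counts are invented for illustration and are not taken from this model's evaluation.

```python
from typing import Dict, Tuple

Counts = Tuple[int, int, int]  # (true positives, false positives, false negatives)


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(counts: Dict[str, Counts]) -> Tuple[float, float]:
    """Return (macro-F1, micro-F1) for per-label counts."""
    # Macro: unweighted mean of the per-label F1 scores.
    per_label = [f1(*c) for c in counts.values()]
    macro = sum(per_label) / len(per_label)
    # Micro: sum the counts over all labels, then compute one F1.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return macro, f1(tp, fp, fn)


# Hypothetical counts: a frequent label that scores well and a rare one that does not.
example_counts = {"Person": (90, 10, 10), "Money": (5, 5, 5)}
macro, micro = macro_micro_f1(example_counts)
print(round(macro, 3), round(micro, 3))  # 0.7 0.864
```

In this invented example the rare, poorly predicted label pulls the macro score well below the micro score. The two reported scores for this model are close to each other.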