Typo Detector for Icelandic

Description

This model was imported from Hugging Face and was trained on synthetic Icelandic data to detect typos. It uses DistilBERT embeddings with DistilBertForTokenClassification, treating typo detection as an NER-style token classification task. Tokens identified as typos are labeled PO.

Predicted Entities

PO


How to use

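# Python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import (
    SentenceDetector,
    Tokenizer,
    DistilBertForTokenClassification,
    NerConverter,
)
from pyspark.ml import Pipeline

# Start (or reuse) a Spark session with Spark NLP available
spark = sparknlp.start()

# Wrap the raw text column in a document annotation
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Split documents into sentences
sentenceDetector = SentenceDetector()\
    .setInputCols(["document"])\
    .setOutputCol("sentence")

# Split sentences into tokens
tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")

# Label each token; typos are tagged PO
tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_typo_detector", "is")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("ner")

# Group tagged tokens into chunks
ner_converter = NerConverter()\
    .setInputCols(["sentence", "token", "ner"])\
    .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter])
text = """Það er miög auðvelt að draga marktækar álykanir af texta með Spark NLP."""
data = spark.createDataFrame([[text]]).toDF("text")

result = nlpPipeline.fit(data).transform(data)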

// Scala
import spark.implicits._
import com.johnsnowlabs.nlp.base._
import com.johnsnowlabs.nlp.annotator._
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
      .setInputCols(Array("document"))
      .setOutputCol("sentence")

val tokenizer = new Tokenizer()
      .setInputCols(Array("sentence"))
      .setOutputCol("token")

// Label each token; typos are tagged PO
val tokenClassifier = DistilBertForTokenClassification.pretrained("distilbert_token_classifier_typo_detector", "is")
      .setInputCols(Array("sentence", "token"))
      .setOutputCol("ner")

val ner_converter = new NerConverter()
      .setInputCols(Array("sentence", "token", "ner"))
      .setOutputCol("ner_chunk")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector, tokenizer, tokenClassifier, ner_converter))

// Build a one-row DataFrame holding the example sentence
val example = Seq("Það er miög auðvelt að draga marktækar álykanir af texta með Spark NLP.").toDS.toDF("text")

val result = pipeline.fit(example).transform(example)

# NLU
import nlu
nlu.load("is.ner.distil_bert").predict("""Það er miög auðvelt að draga marktækar álykanir af texta með Spark NLP.""")

Results

+--------+---------+
|chunk   |ner_label|
+--------+---------+
|miög    |PO       |
|álykanir|PO       |
+--------+---------+
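
One possible use of the flagged chunks is to mark them in the original sentence. The sketch below is plain Python and makes no Spark calls. `mark_typos` and the `rows` list are hypothetical names introduced for illustration, and `rows` is copied by hand from the table above rather than read from the pipeline output.

```python
import re


def mark_typos(text, flagged_chunks, open_mark="[", close_mark="]"):
    """Wrap every whole-word occurrence of a flagged chunk in markers.

    Hypothetical helper for illustration; not part of Spark NLP.
    """
    marked = text
    for chunk in flagged_chunks:
        # \b keeps a chunk from matching inside a longer word
        pattern = r"\b" + re.escape(chunk) + r"\b"
        marked = re.sub(
            pattern,
            lambda m: open_mark + m.group(0) + close_mark,
            marked,
        )
    return marked


text = "Það er miög auðvelt að draga marktækar álykanir af texta með Spark NLP."

# (chunk, ner_label) pairs copied by hand from the results table
rows = [("miög", "PO"), ("álykanir", "PO")]

# Keep only chunks carrying the typo label
flagged = [chunk for chunk, label in rows if label == "PO"]

print(mark_typos(text, flagged))
```

Matching by surface form marks every occurrence of a flagged word, which may be too broad if the same string is a typo in one place and correct in another. Using the character offsets of each chunk would be more precise.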

Model Information

Model Name: distilbert_token_classifier_typo_detector
Compatibility: Spark NLP 3.3.4+
License: Open Source
Edition: Official
Input Labels: [sentence, token]
Output Labels: [ner]
Language: is
Size: 505.7 MB
Case sensitive: true
Max sentence length: 256

Benchmarking

label         precision recall    f1-score  support
micro-avg     0.98954   0.967603  0.978448  43800.0
macro-avg     0.98954   0.967603  0.978448  43800.0
weighted-avg  0.98954   0.967603  0.978448  43800.0
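
As a quick sanity check on the table above, the F1 score should equal the harmonic mean of precision and recall. The sketch below recomputes it from the reported values. The variable names are illustrative, and because the reported precision and recall are already rounded, the result is only expected to agree with the table to about five decimal places.

```python
# Values copied from the benchmarking table
precision = 0.98954
recall = 0.967603
reported_f1 = 0.978448

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# Difference between the recomputed and reported scores
difference = abs(f1 - reported_f1)

print(f"recomputed F1: {f1:.6f}")
print(f"difference from reported F1: {difference:.2e}")
```

The micro, macro and weighted rows are identical, which is what one would expect when a single label (PO) is scored, although the source does not state how the averages were computed.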