Context Spell Checker for the Italian Language

Description

This is an Italian Context Spell Checker trained on the Paisà corpus.

How to use

The model works at the token level, so it must be placed after a tokenizer in the pipeline. Corrections can change the length of tokens, so keep this in mind when the spell checker runs before annotators that rely on absolute positions in the original document, such as NerConverter.
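To make the caveat about token length concrete, here is a minimal plain-Python sketch. It does not use Spark NLP and does not describe how the library stores offsets internally; the function name `offset_drift` and the sample correction ("bela" to "bella") are hypothetical illustrations, not output from the model.

```python
def offset_drift(original_tokens, corrected_tokens):
    """For each token pair, report how many characters the text before it
    has grown (positive) or shrunk (negative) because of earlier corrections."""
    drift = 0
    rows = []
    for orig, corr in zip(original_tokens, corrected_tokens):
        rows.append((orig, corr, drift))
        # A correction of different length shifts every later token
        drift += len(corr) - len(orig)
    return rows


# Hypothetical example: one token gains a character after correction
rows = offset_drift(["una", "bela", "giornata"], ["una", "bella", "giornata"])
for orig, corr, drift in rows:
    print(f"{orig!r} -> {corr!r}, preceding drift: {drift}")
```

Any token with a non-zero drift would sit at a different character position in the corrected text than in the original document, which is why position-dependent annotators need care.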

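from sparknlp.base import DocumentAssembler
from sparknlp.annotator import RecursiveTokenizer, ContextSpellCheckerModel

# Turn the raw text column into Spark NLP documents
assembler = DocumentAssembler()\
 .setInputCol("value")\
 .setOutputCol("document")

# Split quotes, brackets, punctuation and Italian elided forms (l', dell', ...) into separate tokens
tokenizer = RecursiveTokenizer()\
 .setInputCols("document")\
 .setOutputCol("token")\
 .setPrefixes(["\"", "“", "(", "[", "\n", ".", "l'", "dell'", "nell'", "sull'", "all'", "d'", "un'"])\
 .setSuffixes(["\"", "”", ".", ",", "?", ")", "]", "!", ";", ":"])

# Load the pretrained Italian model; it consumes the tokenizer output
spellChecker = ContextSpellCheckerModel.pretrained("spellcheck_dl", "it")\
 .setInputCols("token")\
 .setOutputCol("corrected")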
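The three stages still need to be assembled into a pipeline and applied to data. The following is a minimal sketch of one way to do that, assuming a running Spark session with Spark NLP available; the helper name `run_spellcheck` and its parameters are hypothetical and not part of the library. The input column is named `value` to match the DocumentAssembler configuration.

```python
def run_spellcheck(spark, stages, texts):
    """Fit a pipeline built from `stages` and apply it to a list of strings.

    `spark` is an active SparkSession and `stages` is the ordered list
    [assembler, tokenizer, spellChecker]. Both are supplied by the caller.
    """
    # Imported here so the helper can be defined without Spark installed
    from pyspark.ml import Pipeline

    # One row per input text, in the column the DocumentAssembler reads
    data = spark.createDataFrame([[t] for t in texts]).toDF("value")

    pipeline = Pipeline(stages=stages)
    # The stages are pretrained or rule-based, so fit() only wires them together
    return pipeline.fit(data).transform(data)


# Possible usage (not executed here; requires Spark NLP and the model download):
#   import sparknlp
#   spark = sparknlp.start()
#   result = run_spellcheck(spark, [assembler, tokenizer, spellChecker], ["..."])
#   result.select("corrected.result").show(truncate=False)
```

Selecting `corrected.result` returns the corrected tokens as an array of strings per row.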

Model Information

Model Name: spellcheck_dl
Compatibility: Spark NLP 2.7.4+
License: Open Source
Edition: Official
Input Labels: [token]
Output Labels: [corrected]
Language: it

Data Source

Paisà Italian Language Corpus.