sparknlp.annotator.spell_check.norvig_sweeting#

Contains classes for the NorvigSweeting spell checker.

Module Contents#

Classes#

NorvigSweetingApproach

Trains an annotator that retrieves tokens and makes corrections automatically if not found in an English dictionary

NorvigSweetingModel

This annotator retrieves tokens and makes corrections automatically if not found in an English dictionary

class NorvigSweetingApproach[source]#

Trains an annotator that retrieves tokens and makes corrections automatically if they are not found in an English dictionary.

The Symmetric Delete spelling correction algorithm reduces the complexity of edit candidate generation and dictionary lookup for a given Damerau-Levenshtein distance. It is six orders of magnitude faster (than the standard approach with deletes + transposes + replaces + inserts) and language independent. A dictionary of correct spellings must be provided with setDictionary() in the form of a text file, where each word is parsed by a regex pattern.
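The candidate-generation idea behind this family of spell checkers can be illustrated with a small pure-Python sketch in the spirit of Norvig's classic corrector (deletes, transposes, replaces, and inserts at edit distance 1). This is an illustration only, not Spark NLP's internal implementation:

```python
import string

# Illustrative sketch of Norvig-style candidate generation at edit
# distance 1 (deletes, transposes, replaces, inserts). This mirrors the
# classic Norvig spell-correction idea; it is not Spark NLP's internal code.
def edits1(word):
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correction(word, vocab):
    # Keep only candidates that appear in the dictionary; if none match,
    # return the word unchanged.
    candidates = edits1(word) & vocab
    return min(candidates) if candidates else word
```

For example, `correction("wrrite", {"write", "words"})` finds `"write"` by deleting one of the duplicated characters, which is the kind of mistake the dupsLimit parameter below bounds.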

For instantiated/pretrained models, see NorvigSweetingModel.

For extended examples of usage, see the Spark NLP Workshop.

Input Annotation types

Output Annotation type

TOKEN

TOKEN

Parameters:
dictionary

Path to the dictionary file; needs a 'tokenPattern' regex for separating words

caseSensitive

Whether the spell checker is case sensitive, by default False

doubleVariants

Whether to use the more expensive spell checker variant, by default False

Increases the search space at the cost of performance. Enables an extra check for word combinations.

shortCircuit

Whether to use the faster mode, by default False

Increases performance at the cost of accuracy.

frequencyPriority

Whether to prioritize word frequency over Hamming distance in intersections; when False, Hamming distance takes priority. By default True

wordSizeIgnore

Minimum word length required for correction; shorter words are ignored, by default 3

dupsLimit

Maximum number of duplicate characters in a word to consider, by default 2

reductLimit

Word reductions limit, by default 3

intersections

Hamming intersections to attempt, by default 10

vowelSwapLimit

Vowel swap attempts, by default 6
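To make the frequencyPriority trade-off concrete, here is a hypothetical sketch of ranking candidates either by word frequency first (ties broken by Hamming distance) or by Hamming distance first (ties broken by frequency). The helper names are illustrative, not Spark NLP internals:

```python
# Hypothetical sketch of the frequencyPriority idea; not Spark NLP's code.
def hamming(a, b):
    # Treat different-length words as maximally distant.
    if len(a) != len(b):
        return max(len(a), len(b))
    return sum(x != y for x, y in zip(a, b))

def pick(word, candidates, freq, frequency_priority=True):
    if frequency_priority:
        # Most frequent candidate wins; Hamming distance breaks ties.
        key = lambda c: (-freq.get(c, 0), hamming(word, c))
    else:
        # Closest candidate by Hamming distance wins; frequency breaks ties.
        key = lambda c: (hamming(word, c), -freq.get(c, 0))
    return min(candidates, key=key)
```

With `freq = {"form": 100, "from": 10}`, correcting `"frmo"` yields `"form"` under frequency priority (it is far more frequent) but `"from"` otherwise (it is closer by Hamming distance).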

See also

SymmetricDeleteApproach

for an alternative approach to spell checking

ContextSpellCheckerApproach

for a DL based approach

References

Inspired by Norvig model and SymSpell.

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline

In this example, the dictionary "words.txt" has the form of:

...
gummy
gummic
gummier
gummiest
gummiferous
...

This dictionary is then set to be the basis of the spell checker.

>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> spellChecker = NorvigSweetingApproach() \
...     .setInputCols(["token"]) \
...     .setOutputCol("spell") \
...     .setDictionary("src/test/resources/spell/words.txt")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     spellChecker
... ])
>>> pipelineModel = pipeline.fit(trainingData)
dictionary[source]#
caseSensitive[source]#
doubleVariants[source]#
shortCircuit[source]#
frequencyPriority[source]#
wordSizeIgnore[source]#
dupsLimit[source]#
reductLimit[source]#
intersections[source]#
vowelSwapLimit[source]#
setDictionary(self, path, token_pattern='\\S+', read_as=ReadAs.TEXT, options={'format': 'text'})[source]#

Sets the dictionary, which needs a 'tokenPattern' regex for separating words.

Parameters:
path : str

Path to the source file

token_pattern : str, optional

Pattern for token separation, by default \S+

read_as : str, optional

How to read the file, by default ReadAs.TEXT

options : dict, optional

Options to read the resource, by default {"format": "text"}
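As a hypothetical illustration of what the token_pattern regex does, the default "\S+" splits the dictionary file on whitespace, one word per token. This mirrors the documented behaviour; it is not Spark NLP's actual resource loader:

```python
import re

# Illustrative only: extract dictionary entries from raw text using a
# token regex, as the default token_pattern=r"\S+" would.
def parse_dictionary(text, token_pattern=r"\S+"):
    return re.findall(token_pattern, text)
```

A stricter pattern such as `r"[a-z]+"` would instead keep only lowercase alphabetic runs, splitting entries like `"word-list"` into `["word", "list"]`.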

setCaseSensitive(self, value)[source]#

Sets whether the spell checker is case sensitive, by default False.

Parameters:
value : bool

Whether the spell checker is case sensitive

setDoubleVariants(self, value)[source]#

Sets whether to use the more expensive spell checker variant, by default False.

Increases the search space at the cost of performance. Enables an extra check for word combinations.

Parameters:
value : bool

Whether to use the more expensive spell checker variant

setShortCircuit(self, value)[source]#

Sets whether to use the faster mode, by default False.

Increases performance at the cost of accuracy.

Parameters:
value : bool

Whether to use the faster mode

setFrequencyPriority(self, value)[source]#

Sets whether to prioritize word frequency over Hamming distance in intersections; when False, Hamming distance takes priority. By default True.

Parameters:
value : bool

Whether to prioritize word frequency over Hamming distance in intersections

class NorvigSweetingModel(classname='com.johnsnowlabs.nlp.annotators.spell.norvig.NorvigSweetingModel', java_model=None)[source]#

This annotator retrieves tokens and makes corrections automatically if they are not found in an English dictionary.

The Symmetric Delete spelling correction algorithm reduces the complexity of edit candidate generation and dictionary lookup for a given Damerau-Levenshtein distance. It is six orders of magnitude faster (than the standard approach with deletes + transposes + replaces + inserts) and language independent.

This is the instantiated model of the NorvigSweetingApproach. For training your own model, please see the documentation of that class.

Pretrained models can be loaded with pretrained() of the companion object:

>>> spellChecker = NorvigSweetingModel.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("spell")
The default model is "spellcheck_norvig", if no name is provided. For available pretrained models please see the Models Hub.

For extended examples of usage, see the Spark NLP Workshop.

Input Annotation types

Output Annotation type

TOKEN

TOKEN

Parameters:
None

See also

SymmetricDeleteModel

for an alternative approach to spell checking

ContextSpellCheckerModel

for a DL based approach

References

Inspired by Norvig model and SymSpell.

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> spellChecker = NorvigSweetingModel.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("spell")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     spellChecker
... ])
>>> data = spark.createDataFrame([["somtimes i wrrite wordz erong."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("spell.result").show(truncate=False)
+--------------------------------------+
|result                                |
+--------------------------------------+
|[sometimes, i, write, words, wrong, .]|
+--------------------------------------+
name = NorvigSweetingModel[source]#
static pretrained(name='spellcheck_norvig', lang='en', remote_loc=None)[source]#

Downloads and loads a pretrained model.

Parameters:
name : str, optional

Name of the pretrained model, by default “spellcheck_norvig”

lang : str, optional

Language of the pretrained model, by default “en”

remote_loc : str, optional

Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.

Returns:
NorvigSweetingModel

The restored model