sparknlp.annotator.spell_check.symmetric_delete#

Contains classes for SymmetricDelete.

Module Contents#

Classes#

SymmetricDeleteApproach

Trains a Symmetric Delete spelling correction algorithm. Retrieves tokens and utilizes distance metrics to compute possible derived words.

SymmetricDeleteModel

Symmetric Delete spelling correction algorithm.

class SymmetricDeleteApproach[source]#

Trains a Symmetric Delete spelling correction algorithm. Retrieves tokens and utilizes distance metrics to compute possible derived words.

For instantiated/pretrained models, see SymmetricDeleteModel.

Input Annotation types: TOKEN

Output Annotation type: TOKEN

Parameters:
dictionary

Folder or file with text that teaches about the language

maxEditDistance

Maximum edit distance (in characters) used to derive strings from a word, by default 3

frequencyThreshold

Minimum frequency of words to be considered during training, by default 0

deletesThreshold

Minimum frequency of corrections a word needs to be considered during training, by default 0
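
These parameters can be set fluently on the annotator; for example (a sketch, where the dictionary path is illustrative and the threshold values shown are the defaults):

>>> spellChecker = SymmetricDeleteApproach() \
...     .setInputCols(["token"]) \
...     .setOutputCol("spell") \
...     .setDictionary("words.txt") \
...     .setMaxEditDistance(3) \
...     .setFrequencyThreshold(0) \
...     .setDeletesThreshold(0)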

See also

NorvigSweetingApproach

for an alternative approach to spell checking

ContextSpellCheckerApproach

for a DL-based approach

References

Inspired by SymSpell.
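
Like SymSpell, the approach only ever generates deletions when deriving candidate strings. Below is a minimal sketch of that candidate generation (an illustration only; generate_deletes is a hypothetical helper, not part of Spark NLP's API):

def generate_deletes(word: str, max_edit_distance: int = 3) -> set:
    """All strings reachable from `word` by deleting up to
    `max_edit_distance` characters, one deletion per round."""
    deletes = set()
    frontier = {word}
    for _ in range(max_edit_distance):
        next_frontier = set()
        for w in frontier:
            for i in range(len(w)):
                candidate = w[:i] + w[i + 1:]
                if candidate not in deletes:
                    deletes.add(candidate)
                    next_frontier.add(candidate)
        frontier = next_frontier
    return deletes

>>> sorted(generate_deletes("gummy", 1))
['gmmy', 'gumm', 'gumy', 'ummy']

Because only deletions are generated, the candidate space grows far more slowly than it would if inserts, replaces, and transposes were enumerated as well.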

Examples

In this example, the dictionary "words.txt" has the following form:

...
gummy
gummic
gummier
gummiest
gummiferous
...

This dictionary is then set to be the basis of the spell checker.

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> spellChecker = SymmetricDeleteApproach() \
...     .setInputCols(["token"]) \
...     .setOutputCol("spell") \
...     .setDictionary("src/test/resources/spell/words.txt")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     spellChecker
... ])
>>> pipelineModel = pipeline.fit(trainingData)
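
The fitted PipelineModel can then correct new data with transform (a sketch, assuming trainingData above is a DataFrame with a "text" column):

>>> data = spark.createDataFrame([["the gummiest bears"]]).toDF("text")
>>> result = pipelineModel.transform(data)
>>> result.select("spell.result").show(truncate=False)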
corpus[source]#
dictionary[source]#
maxEditDistance[source]#
frequencyThreshold[source]#
deletesThreshold[source]#
dupsLimit[source]#
setDictionary(self, path, token_pattern='\\S+', read_as=ReadAs.TEXT, options={'format': 'text'})[source]#

Sets folder or file with text that teaches about the language.

Parameters:
path : str

Path to the resource

token_pattern : str, optional

Regex pattern to extract tokens, by default “\\S+”

read_as : str, optional

How to read the resource, by default ReadAs.TEXT

options : dict, optional

Options for reading the resource, by default {“format”: “text”}
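
For example, reading a plain-text dictionary with an explicit token pattern (a sketch; the path is illustrative):

>>> spellChecker = SymmetricDeleteApproach() \
...     .setInputCols(["token"]) \
...     .setOutputCol("spell") \
...     .setDictionary("words.txt", token_pattern="\\S+")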

setMaxEditDistance(self, v)[source]#

Sets the maximum edit distance (in characters) used to derive strings from a word, by default 3.

Parameters:
v : int

Maximum edit distance (in characters) used to derive strings from a word

setFrequencyThreshold(self, v)[source]#

Sets the minimum frequency of words to be considered during training, by default 0.

Parameters:
v : int

Minimum frequency of words to be considered during training

setDeletesThreshold(self, v)[source]#

Sets the minimum frequency of corrections a word needs to be considered during training, by default 0.

Parameters:
v : int

Minimum frequency of corrections a word needs to be considered during training

class SymmetricDeleteModel(classname='com.johnsnowlabs.nlp.annotators.spell.symmetric.SymmetricDeleteModel', java_model=None)[source]#

Symmetric Delete spelling correction algorithm.

The Symmetric Delete spelling correction algorithm reduces the complexity of edit candidate generation and dictionary lookup for a given Damerau-Levenshtein distance. It is six orders of magnitude faster than the standard approach (which generates deletes, transposes, replaces, and inserts) and is language independent.

Pretrained models can be loaded with pretrained() of the companion object:

>>> spell = SymmetricDeleteModel.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("spell")

The default model is "spellcheck_sd" if no name is provided. For available pretrained models please see the Models Hub.

Input Annotation types: TOKEN

Output Annotation type: TOKEN

Parameters:
None

See also

NorvigSweetingModel

for an alternative approach to spell checking

ContextSpellCheckerModel

for a DL-based approach

References

Inspired by SymSpell.
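
Continuing the hypothetical sketch from SymmetricDeleteApproach above (an illustration only, not Spark NLP's internal API): the delete variants of every dictionary word are indexed once, and a misspelled term is then matched by probing that index with the term and its own deletes, so no inserts, replaces, or transposes are ever generated at lookup time. A real implementation would additionally verify candidates with a Damerau-Levenshtein check and rank them by word frequency.

from collections import defaultdict

def build_index(words, max_edit_distance=3):
    # Map each dictionary word and each of its delete variants
    # (via generate_deletes from the sketch above) back to the word.
    index = defaultdict(set)
    for w in words:
        index[w].add(w)
        for d in generate_deletes(w, max_edit_distance):
            index[d].add(w)
    return index

def lookup(term, index, max_edit_distance=3):
    # Probe the index with the term and its delete variants only.
    candidates = set()
    for v in {term} | generate_deletes(term, max_edit_distance):
        candidates |= index.get(v, set())
    return candidates

>>> index = build_index(["gummy", "gummier"], max_edit_distance=2)
>>> lookup("gummmy", index, max_edit_distance=2)
{'gummy'}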

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> spellChecker = SymmetricDeleteModel.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("spell")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     spellChecker
... ])
>>> data = spark.createDataFrame([["spmetimes i wrrite wordz erong."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("spell.result").show(truncate=False)
+--------------------------------------+
|result                                |
+--------------------------------------+
|[sometimes, i, write, words, wrong, .]|
+--------------------------------------+
name = 'SymmetricDeleteModel'[source]#
static pretrained(name='spellcheck_sd', lang='en', remote_loc=None)[source]#

Downloads and loads a pretrained model.

Parameters:
name : str, optional

Name of the pretrained model, by default “spellcheck_sd”

lang : str, optional

Language of the pretrained model, by default “en”

remote_loc : str, optional

Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.

Returns:
SymmetricDeleteModel

The restored model
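
The name and language can also be given explicitly, which is equivalent to the defaults above:

>>> spell = SymmetricDeleteModel.pretrained("spellcheck_sd", "en") \
...     .setInputCols(["token"]) \
...     .setOutputCol("spell")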