sparknlp.annotator.spell_check.symmetric_delete#
Contains classes for SymmetricDelete.
Module Contents#
Classes#
SymmetricDeleteApproach
    Trains a Symmetric Delete spelling correction algorithm. Retrieves tokens and utilizes distance metrics to compute possible derived words.
SymmetricDeleteModel
    Symmetric Delete spelling correction algorithm.
- class SymmetricDeleteApproach[source]#
Trains a Symmetric Delete spelling correction algorithm. Retrieves tokens and utilizes distance metrics to compute possible derived words.
For instantiated/pretrained models, see SymmetricDeleteModel.

Input Annotation types: TOKEN
Output Annotation type: TOKEN
- Parameters:
  - dictionary
    Folder or file with text that teaches about the language
  - maxEditDistance
    Maximum edit distance (in characters) to derive strings from a word, by default 3
  - frequencyThreshold
    Minimum frequency of words to be considered from training, by default 0
  - deletesThreshold
    Minimum frequency of corrections a word needs to have to be considered from training, by default 0
See also
NorvigSweetingApproach
    For an alternative approach to spell checking
ContextSpellCheckerApproach
    For a DL-based approach
References
Inspired by SymSpell.
Examples
In this example, the dictionary "words.txt" has the form of:

...
gummy
gummic
gummier
gummiest
gummiferous
...

This dictionary is then set to be the basis of the spell checker.
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> spellChecker = SymmetricDeleteApproach() \
...     .setInputCols(["token"]) \
...     .setOutputCol("spell") \
...     .setDictionary("src/test/resources/spell/words.txt")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     spellChecker
... ])
>>> pipelineModel = pipeline.fit(trainingData)
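Here trainingData is assumed to be any DataFrame with a "text" column matching the DocumentAssembler input, for example:

>>> trainingData = spark.createDataFrame([["gummy bears"]]).toDF("text")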
- setDictionary(self, path, token_pattern='\\S+', read_as=ReadAs.TEXT, options={'format': 'text'})[source]#
Sets folder or file with text that teaches about the language.
- Parameters:
  - path : str
    Path to the resource
  - token_pattern : str, optional
    Regex pattern to extract tokens, by default "\S+"
  - read_as : str, optional
    How to read the resource, by default ReadAs.TEXT
  - options : dict, optional
    Options for reading the resource, by default {"format": "text"}
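As an illustrative sketch (the file path here is a placeholder, not a shipped resource), a dictionary could be set with an explicit token pattern like this:

>>> spellChecker = SymmetricDeleteApproach() \
...     .setInputCols(["token"]) \
...     .setOutputCol("spell") \
...     .setDictionary("words.txt", token_pattern="\\S+")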
- setMaxEditDistance(self, v)[source]#
Sets the maximum edit distance (in characters) used to derive strings from a word, by default 3.
- Parameters:
  - v : int
    Maximum edit distance to derive strings from a word
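For example, to restrict candidates to at most two edits away (a hypothetical setting, traded off against recall and runtime):

>>> spellChecker = SymmetricDeleteApproach() \
...     .setInputCols(["token"]) \
...     .setOutputCol("spell") \
...     .setMaxEditDistance(2)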
- class SymmetricDeleteModel(classname='com.johnsnowlabs.nlp.annotators.spell.symmetric.SymmetricDeleteModel', java_model=None)[source]#
Symmetric Delete spelling correction algorithm.
The Symmetric Delete spelling correction algorithm reduces the complexity of edit candidate generation and dictionary lookup for a given Damerau-Levenshtein distance. It is six orders of magnitude faster than the standard approach (deletes + transposes + replaces + inserts) and language independent.
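The central trick is that candidates are generated with deletions only, so a misspelling and its correction meet in a shared "delete space" via cheap hash lookups. A minimal Python sketch of delete-candidate generation (illustrative only, not the library's internal implementation):

>>> def deletes(word, max_edit_distance=1):
...     """All strings reachable from `word` by removing up to
...     `max_edit_distance` characters (sketch, not library code)."""
...     candidates = {word}
...     frontier = {word}
...     for _ in range(max_edit_distance):
...         # remove one character at every position of every frontier string
...         frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
...         candidates |= frontier
...     return candidates
>>> sorted(deletes("word"))
['ord', 'wod', 'wor', 'word', 'wrd']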
Pretrained models can be loaded with pretrained() of the companion object:

>>> spell = SymmetricDeleteModel.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("spell")

The default model is "spellcheck_sd", if no name is provided. For available pretrained models please see the Models Hub.

Input Annotation types: TOKEN
Output Annotation type: TOKEN
- Parameters:
- None
See also
NorvigSweetingModel
    For an alternative approach to spell checking
ContextSpellCheckerModel
    For a DL-based approach
References
Inspired by SymSpell.
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> spellChecker = SymmetricDeleteModel.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("spell")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     spellChecker
... ])
>>> data = spark.createDataFrame([["spmetimes i wrrite wordz erong."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.select("spell.result").show(truncate=False)
+--------------------------------------+
|result                                |
+--------------------------------------+
|[sometimes, i, write, words, wrong, .]|
+--------------------------------------+
- static pretrained(name='spellcheck_sd', lang='en', remote_loc=None)[source]#
Downloads and loads a pretrained model.
- Parameters:
  - name : str, optional
    Name of the pretrained model, by default "spellcheck_sd"
  - lang : str, optional
    Language of the pretrained model, by default "en"
  - remote_loc : str, optional
    Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.
- Returns:
- SymmetricDeleteModel
The restored model
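For example, the default English model can also be requested explicitly by name and language:

>>> spell = SymmetricDeleteModel.pretrained("spellcheck_sd", "en") \
...     .setInputCols(["token"]) \
...     .setOutputCol("spell")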