sparknlp.annotator.ContextSpellCheckerApproach

class sparknlp.annotator.ContextSpellCheckerApproach[source]

Bases: sparknlp.common.AnnotatorApproach

Trains a deep-learning based Noisy Channel Model Spell Algorithm.

Correction candidates are extracted combining context information and word information.

For instantiated/pretrained models, see ContextSpellCheckerModel.

Spell Checking is a sequence to sequence mapping problem. Given an input sequence, potentially containing a certain number of errors, ContextSpellChecker will rank correction sequences according to three things:

  1. Different correction candidates for each word — word level.

  2. The surrounding text of each word, i.e., its context — sentence level.

  3. The relative cost of each correction candidate, according to the character-level edit operations it requires — subword level.
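
The following is a minimal conceptual sketch of how a noisy channel model can combine these three signals; it is illustrative only and does not reflect Spark NLP internals. The helper callables lm_score and edit_cost are hypothetical, and the weighting mirrors the role of the tradeoff parameter documented below.

# Conceptual sketch only: combines the three ranking signals described above.
# lm_score and edit_cost are hypothetical helpers, not Spark NLP APIs.
def rank_candidates(word, context, candidates, lm_score, edit_cost, tradeoff=0.5):
    scored = []
    for cand in candidates:                    # word level: candidate corrections
        context_fit = lm_score(cand, context)  # sentence level: fit with surrounding text
        char_cost = edit_cost(word, cand)      # subword level: character edit operations
        scored.append((cand, context_fit - tradeoff * char_cost))
    return sorted(scored, key=lambda t: t[1], reverse=True)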

For extended examples of usage, see the article Training a Contextual Spell Checker for Italian Language and the Spark NLP Workshop.

Input Annotation types: TOKEN

Output Annotation type: TOKEN

Parameters
languageModelClasses

Number of classes to use during factorization of the softmax output in the LM.

wordMaxDistance

Maximum distance for the generated candidates for every word.

maxCandidates

Maximum number of candidates for every word.

caseStrategy

What case combinations to try when generating candidates, by default 2. Possible values are:

  • 0: All uppercase letters

  • 1: First letter capitalized

  • 2: All letters

errorThreshold

Threshold perplexity for a word to be considered as an error.

epochs

Number of epochs to train the language model.

batchSize

Batch size for training the neural language model.

initialRate

Initial learning rate for the LM.

finalRate

Final learning rate for the LM.

validationFraction

Percentage of datapoints to use for validation.

minCount

Min number of times a token should appear to be included in vocab.

compoundCount

Min number of times a compound word should appear to be included in vocab.

classCount

Min number of times a word needs to appear in the corpus to not be considered part of a special class.

tradeoff

Tradeoff between the cost of a word error and a transition in the language model.

weightedDistPath

The path to the file containing the weights for the Levenshtein distance.

maxWindowLen

Maximum size for the window used to remember history prior to every correction.

configProtoBytes

ConfigProto from tensorflow, serialized into byte array.

References

For an in-depth explanation of the module see the article Applying Context Aware Spell Checking in Spark NLP.

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline

For this example, we use the first Sherlock Holmes book as the training dataset.

>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols("document") \
...     .setOutputCol("token")
>>> spellChecker = ContextSpellCheckerApproach() \
...     .setInputCols("token") \
...     .setOutputCol("corrected") \
...     .setWordMaxDistance(3) \
...     .setBatchSize(24) \
...     .setEpochs(8) \
...     .setLanguageModelClasses(1650)  # dependent on vocabulary size
...     # .addVocabClass("_NAME_", names) # Extra classes for correction could be added like this
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     spellChecker
... ])
>>> path = "sherlockholmes.txt"
>>> dataset = spark.read.text(path) \
...     .toDF("text")
>>> pipelineModel = pipeline.fit(dataset)
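
Once fitted, the resulting pipeline model can be applied to new text. The input sentence below is only an illustration; the actual corrections depend on the trained model.

>>> data = spark.createDataFrame([["Plaese alliow me tao introdduce myhelf"]]).toDF("text")
>>> result = pipelineModel.transform(data)
>>> result.selectExpr("corrected.result").show(truncate=False)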

Methods

__init__()

addRegexClass(label, regex[, userdist])

Adds a new class of words to correct, based on regex.

addVocabClass(label, vocab[, userdist])

Adds a new class of words to correct, based on a vocabulary.

clear(param)

Clears a param from the param map if it has been explicitly set.

copy([extra])

Creates a copy of this instance with the same uid and some extra params.

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optionally default values and user-supplied values.

extractParamMap([extra])

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

fit(dataset[, params])

Fits a model to the input dataset with optional parameters.

fitMultiple(dataset, paramMaps)

Fits a model to the input dataset for each param map in paramMaps.

getInputCols()

Gets current column names of input annotations.

getLazyAnnotator()

Gets whether Annotator should be evaluated lazily in a RecursivePipeline.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value.

getOutputCol()

Gets output column name of annotations.

getParam(paramName)

Gets a param by its name.

getParamValue(paramName)

Gets the value of a parameter.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of 'write().save(path)'.

set(param, value)

Sets a parameter in the embedded param map.

setBatchSize(size)

Sets batch size.

setCaseStrategy(strategy)

Sets what case combinations to try when generating candidates.

setClassCount(count)

Sets min number of times a word needs to appear in the corpus to not be considered part of a special class.

setCompoundCount(count)

Sets min number of times a compound word should appear to be included in vocab.

setConfigProtoBytes(b)

Sets configProto from tensorflow, serialized into byte array.

setEpochs(count)

Sets number of epochs to train the language model.

setErrorThreshold(threshold)

Sets threshold perplexity for a word to be considered as an error.

setFinalRate(rate)

Sets final learning rate for the LM.

setInitialRate(rate)

Sets initial learning rate for the LM.

setInputCols(*value)

Sets column names of input annotations.

setLanguageModelClasses(count)

Sets number of classes to use during factorization of the softmax output in the Language Model.

setLazyAnnotator(value)

Sets whether Annotator should be evaluated lazily in a RecursivePipeline.

setMaxCandidates(candidates)

Sets maximum number of candidates for every word.

setMaxWindowLen(length)

Sets the maximum size for the window used to remember history prior to every correction.

setMinCount(count)

Sets min number of times a token should appear to be included in vocab.

setOutputCol(value)

Sets output column name of annotations.

setParamValue(paramName)

Sets the value of a parameter.

setTradeoff(alpha)

Sets tradeoff between the cost of a word error and a transition in the language model.

setValidationFraction(fraction)

Sets percentage of datapoints to use for validation.

setWeightedDistPath(path)

Sets the path to the file containing the weights for the Levenshtein distance.

setWordMaxDistance(dist)

Sets maximum distance for the generated candidates for every word.

write()

Returns an MLWriter instance for this ML instance.

Attributes

batchSize

caseStrategy

classCount

compoundCount

configProtoBytes

epochs

errorThreshold

finalRate

getter_attrs

initialRate

inputCols

languageModelClasses

lazyAnnotator

maxCandidates

maxWindowLen

minCount

name

outputCol

params

Returns all params ordered by name.

tradeoff

validationFraction

weightedDistPath

wordMaxDistance

addRegexClass(label, regex, userdist=3)[source]

Adds a new class of words to correct, based on regex.

Parameters
label : str

Name of the class

regex : str

Regex to add

userdist : int, optional

Maximal distance to the word, by default 3
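
For illustration, building on the spellChecker instance from the Examples section above, a regex class for date-like tokens might be registered before fitting; the label and pattern are placeholders, not values from the source.

>>> spellChecker.addRegexClass("_DATE_", "(\\d{2})/(\\d{2})/(\\d{4})")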

addVocabClass(label, vocab, userdist=3)[source]

Adds a new class of words to correct, based on a vocabulary.

Parameters
label : str

Name of the class

vocab : List[str]

Vocabulary as a list

userdist : int, optional

Maximal distance to the word, by default 3
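
For illustration, the `_NAME_` class hinted at in the commented line of the Examples section could be added with a small vocabulary; the word list here is a placeholder.

>>> names = ["Sherlock", "Holmes", "Watson", "Moriarty"]
>>> spellChecker.addVocabClass("_NAME_", names)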

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optionally default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

fit(dataset, params=None)

Fits a model to the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.

Returns

fitted model(s)

New in version 1.3.0.

fitMultiple(dataset, paramMaps)

Fits a model to the input dataset for each param map in paramMaps.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame.

  • paramMaps – A Sequence of param maps.

Returns

A thread safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.

New in version 2.3.0.

getInputCols()

Gets current column names of input annotations.

getLazyAnnotator()

Gets whether Annotator should be evaluated lazily in a RecursivePipeline.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets output column name of annotations.

getParam(paramName)

Gets a param by its name.

getParamValue(paramName)

Gets the value of a parameter.

Parameters
paramName : str

Name of the parameter

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param, value)

Sets a parameter in the embedded param map.

setBatchSize(size)[source]

Sets batch size.

Parameters
size : int

Batch size

setCaseStrategy(strategy)[source]

Sets what case combinations to try when generating candidates.

Possible values are:

  • 0: All uppercase letters

  • 1: First letter capitalized

  • 2: All letters

Parameters
strategy : int

Case combinations to try when generating candidates
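
For example, to keep the documented default of trying all letter case combinations:

>>> spellChecker.setCaseStrategy(2)  # 2 = all letters (the default)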

setClassCount(count)[source]

Sets min number of times a word needs to appear in the corpus to not be considered part of a special class.

Parameters
count : int

Min number of times a word needs to appear in the corpus to not be considered part of a special class.

setCompoundCount(count)[source]

Sets min number of times a compound word should appear to be included in vocab.

Parameters
count : int

Min number of times a compound word should appear to be included in vocab.

setConfigProtoBytes(b)[source]

Sets configProto from tensorflow, serialized into byte array.

Parameters
b : List[str]

ConfigProto from tensorflow, serialized into byte array

setEpochs(count)[source]

Sets number of epochs to train the language model.

Parameters
count : int

Number of epochs

setErrorThreshold(threshold)[source]

Sets threshold perplexity for a word to be considered as an error.

Parameters
threshold : float

Threshold perplexity for a word to be considered as an error

setFinalRate(rate)[source]

Sets final learning rate for the LM.

Parameters
rate : float

Final learning rate for the LM

setInitialRate(rate)[source]

Sets initial learning rate for the LM.

Parameters
rate : float

Initial learning rate for the LM

setInputCols(*value)

Sets column names of input annotations.

Parameters
*value : str

Input columns for the annotator

setLanguageModelClasses(count)[source]

Sets number of classes to use during factorization of the softmax output in the Language Model.

Parameters
count : int

Number of classes

setLazyAnnotator(value)

Sets whether Annotator should be evaluated lazily in a RecursivePipeline.

Parameters
value : bool

Whether Annotator should be evaluated lazily in a RecursivePipeline

setMaxCandidates(candidates)[source]

Sets maximum number of candidates for every word.

Parameters
candidates : int

Maximum number of candidates for every word.

setMaxWindowLen(length)[source]

Sets the maximum size for the window used to remember history prior to every correction.

Parameters
length : int

Maximum size for the window used to remember history prior to every correction

setMinCount(count)[source]

Sets min number of times a token should appear to be included in vocab.

Parameters
count : int

Min number of times a token should appear to be included in vocab

setOutputCol(value)

Sets output column name of annotations.

Parameters
value : str

Name of output column

setParamValue(paramName)

Sets the value of a parameter.

Parameters
paramName : str

Name of the parameter

setTradeoff(alpha)[source]

Sets tradeoff between the cost of a word error and a transition in the language model.

Parameters
alpha : float

Tradeoff between the cost of a word error and a transition in the language model

setValidationFraction(fraction)[source]

Sets percentage of datapoints to use for validation.

Parameters
fraction : float

Percentage of datapoints to use for validation

setWeightedDistPath(path)[source]

Sets the path to the file containing the weights for the Levenshtein distance.

Parameters
path : str

Path to the file containing the weights for the Levenshtein distance.

setWordMaxDistance(dist)[source]

Sets maximum distance for the generated candidates for every word.

Parameters
dist : int

Maximum distance for the generated candidates for every word

uid

A unique id for the object.

write()

Returns an MLWriter instance for this ML instance.