sparknlp.annotator.NerCrfApproach

class sparknlp.annotator.NerCrfApproach[source]

Bases: sparknlp.common.AnnotatorApproach, sparknlp.annotator.NerApproach

Algorithm for training a Named Entity Recognition Model

For instantiated/pretrained models, see NerCrfModel.

This Named Entity Recognition annotator allows a generic model to be trained using a CRF (Conditional Random Field) machine learning algorithm. The training data should be a labeled Spark Dataset, e.g. CoNLL 2003 IOB with Annotation type columns. The data should have columns of type DOCUMENT, TOKEN, POS, WORD_EMBEDDINGS and an additional label column of annotator type NAMED_ENTITY.

Excluding the label, these columns can be produced with, for example, a SentenceDetector, a Tokenizer, a PerceptronModel (for POS tags) and a WordEmbeddingsModel, as sketched below.
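
A minimal sketch of such a preprocessing pipeline (assuming the default pretrained English PerceptronModel and WordEmbeddingsModel fit your data) could look like this:

>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> posTagger = PerceptronModel.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("pos")
>>> embeddings = WordEmbeddingsModel.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("embeddings")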

Optionally the user can provide an entity dictionary file with setExternalFeatures() for better accuracy.

For extended examples of usage, see the Spark NLP Workshop.

Input Annotation types: DOCUMENT, TOKEN, POS, WORD_EMBEDDINGS

Output Annotation type: NAMED_ENTITY

Parameters
labelColumn

Column with the label for each token

entities

Entities to recognize

minEpochs

Minimum number of epochs to train, by default 0

maxEpochs

Maximum number of epochs to train, by default 1000

verbose

Level of verbosity during training, by default 4

randomSeed

Random seed

l2

L2 regularization coefficient, by default 1.0

c0

c0 value defining the decay speed for the gradient, by default 2250000

lossEps

Training stops if the relative improvement of an epoch is less than this epsilon, by default 0.001

minW

Features with a weight less than this value will be filtered out

includeConfidence

Whether to include confidence scores in annotation metadata, by default False

externalFeatures

Paths to additional dictionaries to use as features

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from sparknlp.training import *
>>> from pyspark.ml import Pipeline

This CoNLL dataset already includes the sentence, token, pos and label columns with their respective annotator types. If a custom dataset is used, these need to be defined.

>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> embeddings = WordEmbeddingsModel.pretrained() \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("embeddings") \
...     .setCaseSensitive(False)
>>> nerTagger = NerCrfApproach() \
...     .setInputCols(["sentence", "token", "pos", "embeddings"]) \
...     .setLabelColumn("label") \
...     .setMinEpochs(1) \
...     .setMaxEpochs(3) \
...     .setC0(34) \
...     .setL2(3.0) \
...     .setOutputCol("ner")
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     embeddings,
...     nerTagger
... ])
>>> conll = CoNLL()
>>> trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")
>>> pipelineModel = pipeline.fit(trainingData)
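
As a quick sanity check (a sketch, not part of the original example), the fitted pipeline can be applied back to the training data, since it already contains the required sentence, token and pos columns:

>>> result = pipelineModel.transform(trainingData)
>>> result.selectExpr("explode(ner.result) AS ner_tag").show(5, truncate=False)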

Methods

__init__()

clear(param)

Clears a param from the param map if it has been explicitly set.

copy([extra])

Creates a copy of this instance with the same uid and some extra params.

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optionally default values and user-supplied values.

extractParamMap([extra])

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

fit(dataset[, params])

Fits a model to the input dataset with optional parameters.

fitMultiple(dataset, paramMaps)

Fits a model to the input dataset for each param map in paramMaps.

getInputCols()

Gets current column names of input annotations.

getLabelColumn()

Gets the column with the label for each token.

getLazyAnnotator()

Gets whether Annotator should be evaluated lazily in a RecursivePipeline.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value.

getOutputCol()

Gets output column name of annotations.

getParam(paramName)

Gets a param by its name.

getParamValue(paramName)

Gets the value of a parameter.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of 'write().save(path)'.

set(param, value)

Sets a parameter in the embedded param map.

setC0(c0value)

Sets the c0 value defining decay speed for the gradient, by default 2250000.

setEntities(tags)

Sets entities to recognize.

setExternalFeatures(path, delimiter[, ...])

Sets paths to additional dictionaries to use as features.

setIncludeConfidence(b)

Sets whether to include confidence scores in annotation metadata, by default False.

setInputCols(*value)

Sets column names of input annotations.

setL2(l2value)

Sets L2 regularization coefficient, by default 1.0.

setLabelColumn(value)

Sets name of column for data labels.

setLazyAnnotator(value)

Sets whether Annotator should be evaluated lazily in a RecursivePipeline.

setLossEps(eps)

Sets the loss epsilon: training stops if the relative improvement of an epoch is less than this value, by default 0.001.

setMaxEpochs(epochs)

Sets maximum number of epochs to train.

setMinEpochs(epochs)

Sets minimum number of epochs to train.

setMinW(w)

Sets minimum weight value.

setOutputCol(value)

Sets output column name of annotations.

setParamValue(paramName)

Sets the value of a parameter.

setRandomSeed(seed)

Sets random seed for shuffling.

setVerbose(verboseValue)

Sets level of verbosity during training.

write()

Returns an MLWriter instance for this ML instance.

Attributes

c0

entities

externalFeatures

getter_attrs

includeConfidence

inputCols

l2

labelColumn

lazyAnnotator

lossEps

maxEpochs

minEpochs

minW

outputCol

params

Returns all params ordered by name.

randomSeed

verbose

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optionally default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

fit(dataset, params=None)

Fits a model to the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.

Returns

fitted model(s)

New in version 1.3.0.

fitMultiple(dataset, paramMaps)

Fits a model to the input dataset for each param map in paramMaps.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame.

  • paramMaps – A Sequence of param maps.

Returns

A thread safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.

New in version 2.3.0.

getInputCols()

Gets current column names of input annotations.

getLabelColumn()

Gets the column with the label for each token.

Returns
str

Column with the label for each token

getLazyAnnotator()

Gets whether Annotator should be evaluated lazily in a RecursivePipeline.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets output column name of annotations.

getParam(paramName)

Gets a param by its name.

getParamValue(paramName)

Gets the value of a parameter.

Parameters
paramName : str

Name of the parameter

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param, value)

Sets a parameter in the embedded param map.

setC0(c0value)[source]

Sets the c0 value defining decay speed for the gradient, by default 2250000.

Parameters
c0value : int

c0 value defining decay speed for the gradient

setEntities(tags)

Sets entities to recognize.

Parameters
tags : List[str]

List of entities

setExternalFeatures(path, delimiter, read_as='TEXT', options={'format': 'text'})[source]

Sets paths to additional dictionaries to use as features.

Parameters
path : str

Path to the source files

delimiter : str

Delimiter for the dictionary file. Can also be set in options.

read_as : str, optional

How to read the file, by default ReadAs.TEXT

options : dict, optional

Options to read the resource, by default {"format": "text"}
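
A sketch of a typical call (the dictionary path and delimiter below are hypothetical; the file is expected to hold one entry per line, with fields separated by the chosen delimiter):

>>> nerTagger = NerCrfApproach() \
...     .setInputCols(["sentence", "token", "pos", "embeddings"]) \
...     .setLabelColumn("label") \
...     .setOutputCol("ner") \
...     .setExternalFeatures("/path/to/ner_dictionary.txt", ",")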

setIncludeConfidence(b)[source]

Sets whether to include confidence scores in annotation metadata, by default False.

Parameters
b : bool

Whether to include the confidence value in the output.

setInputCols(*value)

Sets column names of input annotations.

Parameters
*value : str

Input columns for the annotator

setL2(l2value)[source]

Sets L2 regularization coefficient, by default 1.0.

Parameters
l2value : float

L2 regularization coefficient

setLabelColumn(value)

Sets name of column for data labels.

Parameters
value : str

Column for data labels

setLazyAnnotator(value)

Sets whether Annotator should be evaluated lazily in a RecursivePipeline.

Parameters
value : bool

Whether Annotator should be evaluated lazily in a RecursivePipeline

setLossEps(eps)[source]

Sets the loss epsilon: training stops if the relative improvement of an epoch is less than this value, by default 0.001.

Parameters
eps : float

The relative-improvement threshold

setMaxEpochs(epochs)

Sets maximum number of epochs to train.

Parameters
epochs : int

Maximum number of epochs to train

setMinEpochs(epochs)

Sets minimum number of epochs to train.

Parameters
epochs : int

Minimum number of epochs to train

setMinW(w)[source]

Sets minimum weight value.

Features with a weight less than this value will be filtered out.

Parameters
w : float

Minimum weight value

setOutputCol(value)

Sets output column name of annotations.

Parameters
value : str

Name of output column

setParamValue(paramName)

Sets the value of a parameter.

Parameters
paramName : str

Name of the parameter

setRandomSeed(seed)

Sets random seed for shuffling.

Parameters
seed : int

Random seed for shuffling

setVerbose(verboseValue)

Sets level of verbosity during training.

Parameters
verboseValue : int

Level of verbosity

uid

A unique id for the object.

write()

Returns an MLWriter instance for this ML instance.