sparknlp.annotator.MultiClassifierDLApproach

class sparknlp.annotator.MultiClassifierDLApproach[source]

Bases: sparknlp.common.AnnotatorApproach

Trains a MultiClassifierDL for Multi-label Text Classification.

MultiClassifierDL uses a Bidirectional GRU with a convolutional model built in TensorFlow and supports up to 100 classes.

In machine learning, multi-label classification and the strongly related problem of multi-output classification are variants of the classification problem where multiple labels may be assigned to each instance. Multi-label classification is a generalization of multiclass classification, which is the single-label problem of categorizing instances into precisely one of more than two classes; in the multi-label problem there is no constraint on how many of the classes the instance can be assigned to. Formally, multi-label classification is the problem of finding a model that maps inputs x to binary vectors y (assigning a value of 0 or 1 for each element (label) in y).
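
For instance, with the three possible labels toxic, obscene and insult, a document tagged toxic and insult corresponds to the binary vector y = (1, 0, 1). A tiny, purely illustrative encoding sketch (not part of the annotator's API):

>>> classes = ["toxic", "obscene", "insult"]
>>> doc_labels = ["toxic", "insult"]
>>> [1 if c in doc_labels else 0 for c in classes]
[1, 0, 1]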

For instantiated/pretrained models, see MultiClassifierDLModel.

The input to MultiClassifierDL is sentence embeddings, such as those from the state-of-the-art UniversalSentenceEncoder, BertSentenceEmbeddings, SentenceEmbeddings, or other sentence embedding annotators.

For extended examples of usage, see the Spark NLP Workshop.

Input Annotation types: SENTENCE_EMBEDDINGS

Output Annotation type: CATEGORY

Parameters
lr

Learning Rate, by default 0.001

batchSize

Batch size, by default 64

maxEpochs

Maximum number of epochs to train, by default 10

configProtoBytes

ConfigProto from tensorflow, serialized into byte array.

validationSplit

Proportion of the training dataset to be validated against the model on each epoch. The value should be between 0.0 and 1.0; by default 0.0 (off).

enableOutputLogs

Whether to use stdout in addition to Spark logs, by default False

outputLogsPath

Folder path to save training logs

labelColumn

Column with the array of labels for each document

verbose

Level of verbosity during training

randomSeed

Random seed, by default 44

shufflePerEpoch

Whether to shuffle the training data on each epoch, by default False

threshold

The minimum threshold for each label to be accepted, by default 0.5

Notes

  • This annotator requires the labels to be provided as an array of Strings.

  • UniversalSentenceEncoder, BertSentenceEmbeddings, SentenceEmbeddings or other sentence embeddings can be used for the inputCol.
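
For instance, the UniversalSentenceEncoder stage in the pipeline example below could be swapped for BERT sentence embeddings. A minimal sketch, assuming the default pretrained BertSentenceEmbeddings model can be downloaded:

>>> embeddings = BertSentenceEmbeddings.pretrained() \
...     .setInputCols("document") \
...     .setOutputCol("embeddings")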

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline

In this example, the training data has the form:

+----------------+--------------------+--------------------+
|              id|                text|              labels|
+----------------+--------------------+--------------------+
|ed58abb40640f983|PN NewsYou mean ... |             [toxic]|
|a1237f726b5f5d89|Dude.  Place the ...|   [obscene, insult]|
|24b0d6c8733c2abe|Thanks  - thanks ...|            [insult]|
|8c4478fb239bcfc0|" Gee, 5 minutes ...|[toxic, obscene, ...|
+----------------+--------------------+--------------------+

Process training data to create text with associated array of labels:

>>> trainDataset.printSchema()
root
|-- id: string (nullable = true)
|-- text: string (nullable = true)
|-- labels: array (nullable = true)
|    |-- element: string (containsNull = true)
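
If the raw data stores the labels as a single delimited string, a labels array column can be derived with standard Spark SQL functions. A minimal sketch, assuming spark is the active SparkSession (e.g. from sparknlp.start()) and a CSV file with a text column and a comma-separated labels column (the file path and column layout are placeholders):

>>> from pyspark.sql.functions import split
>>> # hypothetical CSV with columns "text" and a comma-separated "labels" string
>>> trainDataset = spark.read \
...     .option("header", True) \
...     .csv("path/to/multilabel_train.csv") \
...     .withColumn("labels", split("labels", ","))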

Then create pipeline for training:

>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document") \
...     .setCleanupMode("shrink")
>>> embeddings = UniversalSentenceEncoder.pretrained() \
...     .setInputCols("document") \
...     .setOutputCol("embeddings")
>>> docClassifier = MultiClassifierDLApproach() \
...     .setInputCols("embeddings") \
...     .setOutputCol("category") \
...     .setLabelColumn("labels") \
...     .setBatchSize(128) \
...     .setMaxEpochs(10) \
...     .setLr(1e-3) \
...     .setThreshold(0.5) \
...     .setValidationSplit(0.1)
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     embeddings,
...     docClassifier
... ])
>>> pipelineModel = pipeline.fit(trainDataset)
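
Once fitted, the pipeline can be applied to any DataFrame with a text column, and the predicted labels can be read from the result field of the output column. A brief usage sketch (trainDataset is reused here purely for illustration):

>>> result = pipelineModel.transform(trainDataset)
>>> result.select("text", "category.result").show(truncate=False)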

Methods

__init__()

clear(param)

Clears a param from the param map if it has been explicitly set.

copy([extra])

Creates a copy of this instance with the same uid and some extra params.

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optionally default values and user-supplied values.

extractParamMap([extra])

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

fit(dataset[, params])

Fits a model to the input dataset with optional parameters.

fitMultiple(dataset, paramMaps)

Fits a model to the input dataset for each param map in paramMaps.

getInputCols()

Gets current column names of input annotations.

getLazyAnnotator()

Gets whether Annotator should be evaluated lazily in a RecursivePipeline.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value.

getOutputCol()

Gets output column name of annotations.

getParam(paramName)

Gets a param by its name.

getParamValue(paramName)

Gets the value of a parameter.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of 'write().save(path)'.

set(param, value)

Sets a parameter in the embedded param map.

setBatchSize(v)

Sets batch size, by default 64.

setConfigProtoBytes(v)

Sets configProto from tensorflow, serialized into byte array.

setEnableOutputLogs(v)

Sets whether to use stdout in addition to Spark logs, by default False.

setInputCols(*value)

Sets column names of input annotations.

setLabelColumn(v)

Sets name of column for data labels.

setLazyAnnotator(value)

Sets whether Annotator should be evaluated lazily in a RecursivePipeline.

setLr(v)

Sets Learning Rate, by default 0.001.

setMaxEpochs(v)

Sets maximum number of epochs to train, by default 10.

setOutputCol(value)

Sets output column name of annotations.

setOutputLogsPath(v)

Sets folder path to save training logs.

setParamValue(paramName)

Sets the value of a parameter.

setRandomSeed(seed)

Sets random seed for shuffling.

setShufflePerEpoch(v)

Sets whether to shuffle the training data on each epoch, by default False.

setThreshold(v)

Sets minimum threshold for each label to be accepted, by default 0.5.

setValidationSplit(v)

Sets the proportion of the training dataset to be validated against the model on each epoch; by default 0.0 (off).

setVerbose(v)

Sets level of verbosity during training.

write()

Returns an MLWriter instance for this ML instance.

Attributes

batchSize

configProtoBytes

enableOutputLogs

getter_attrs

inputCols

labelColumn

lazyAnnotator

lr

maxEpochs

outputCol

outputLogsPath

params

Returns all params ordered by name.

randomSeed

shufflePerEpoch

threshold

validationSplit

verbose

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optionally default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

fit(dataset, params=None)

Fits a model to the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.

Returns

fitted model(s)

New in version 1.3.0.

fitMultiple(dataset, paramMaps)

Fits a model to the input dataset for each param map in paramMaps.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame.

  • paramMaps – A Sequence of param maps.

Returns

A thread safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.

New in version 2.3.0.

getInputCols()

Gets current column names of input annotations.

getLazyAnnotator()

Gets whether Annotator should be evaluated lazily in a RecursivePipeline.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets output column name of annotations.

getParam(paramName)

Gets a param by its name.

getParamValue(paramName)

Gets the value of a parameter.

Parameters
paramName : str

Name of the parameter

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of 'write().save(path)'.

set(param, value)

Sets a parameter in the embedded param map.

setBatchSize(v)[source]

Sets batch size, by default 64.

Parameters
v : int

Batch size

setConfigProtoBytes(v)[source]

Sets configProto from tensorflow, serialized into byte array.

Parameters
v : List[str]

ConfigProto from tensorflow, serialized into byte array

setEnableOutputLogs(v)[source]

Sets whether to use stdout in addition to Spark logs, by default False.

Parameters
v : bool

Whether to use stdout in addition to Spark logs

setInputCols(*value)

Sets column names of input annotations.

Parameters
*value : str

Input columns for the annotator

setLabelColumn(v)[source]

Sets name of column for data labels.

Parameters
v : str

Column for data labels

setLazyAnnotator(value)

Sets whether Annotator should be evaluated lazily in a RecursivePipeline.

Parameters
value : bool

Whether Annotator should be evaluated lazily in a RecursivePipeline

setLr(v)[source]

Sets Learning Rate, by default 0.001.

Parameters
v : float

Learning Rate

setMaxEpochs(v)[source]

Sets maximum number of epochs to train, by default 10.

Parameters
v : int

Maximum number of epochs to train

setOutputCol(value)

Sets output column name of annotations.

Parameters
value : str

Name of output column

setOutputLogsPath(v)[source]

Sets folder path to save training logs.

Parameters
v : str

Folder path to save training logs
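
setEnableOutputLogs and setOutputLogsPath are typically used together. A minimal sketch, assuming a writable local folder ("multiclassifierdl_logs" is a placeholder path):

>>> docClassifier = MultiClassifierDLApproach() \
...     .setInputCols("embeddings") \
...     .setOutputCol("category") \
...     .setLabelColumn("labels") \
...     .setEnableOutputLogs(True) \
...     .setOutputLogsPath("multiclassifierdl_logs")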

setParamValue(paramName)

Sets the value of a parameter.

Parameters
paramName : str

Name of the parameter

setRandomSeed(seed)[source]

Sets random seed for shuffling.

Parameters
seed : int

Random seed for shuffling

setThreshold(v)[source]

Sets minimum threshold for each label to be accepted, by default 0.5.

Parameters
v : float

The minimum threshold for each label to be accepted, by default 0.5
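
For illustration of the intent only: labels whose score meets the threshold are accepted. A plain-Python sketch of that selection rule, not the annotator's internal code (the scores are made-up values):

>>> scores = {"toxic": 0.91, "obscene": 0.47, "insult": 0.62}
>>> [label for label, score in scores.items() if score >= 0.5]
['toxic', 'insult']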

setValidationSplit(v)[source]

Sets the proportion of the training dataset to be validated against the model on each epoch; by default 0.0 (off). The value should be between 0.0 and 1.0.

Parameters
v : float

Proportion of training dataset to be validated

setVerbose(v)[source]

Sets level of verbosity during training.

Parameters
v : int

Level of verbosity

uid

A unique id for the object.

write()

Returns an MLWriter instance for this ML instance.