sparknlp.annotator.SentimentDLApproach

class sparknlp.annotator.SentimentDLApproach[source]

Bases: sparknlp.common.AnnotatorApproach

Trains a SentimentDL, an annotator for multi-class sentiment analysis.

In natural language processing, sentiment analysis is the task of classifying the affective state or subjective view of a text. A common example is determining whether a product review or a tweet expresses a positive or a negative opinion.

For the instantiated/pretrained models, see SentimentDLModel.

For extended examples of usage, see the Spark NLP Workshop.

Input Annotation types: SENTENCE_EMBEDDINGS

Output Annotation type: CATEGORY

Parameters
lr

Learning Rate, by default 0.005

batchSize

Batch size, by default 64

dropout

Dropout coefficient, by default 0.5

maxEpochs

Maximum number of epochs to train, by default 30

configProtoBytes

ConfigProto from tensorflow, serialized into byte array.

validationSplit

The proportion of the training dataset to be validated against the model on each epoch. The value should be between 0.0 and 1.0, by default 0.0 (disabled)

enableOutputLogs

Whether to use stdout in addition to Spark logs, by default False

outputLogsPath

Folder path to save training logs

labelColumn

Column with the label for each document

verbose

Level of verbosity during training

randomSeed

Random seed

threshold

The minimum threshold for the final result, otherwise the thresholdLabel is assigned, by default 0.6

thresholdLabel

The label assigned when the score is below threshold, by default "neutral"
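
The training-related parameters above can be combined when configuring the annotator. A minimal sketch, assuming an upstream stage produces a "sentence_embeddings" column; the log path and seed are illustrative values, not defaults:

>>> # Hold out 10% of the data for validation on each epoch and write
>>> # training logs to a local folder (path and seed are illustrative).
>>> sentimentApproach = SentimentDLApproach() \
...     .setInputCols(["sentence_embeddings"]) \
...     .setOutputCol("sentiment") \
...     .setLabelColumn("label") \
...     .setMaxEpochs(10) \
...     .setLr(5e-3) \
...     .setValidationSplit(0.1) \
...     .setEnableOutputLogs(True) \
...     .setOutputLogsPath("sentiment_logs") \
...     .setThreshold(0.6) \
...     .setThresholdLabel("neutral") \
...     .setRandomSeed(42)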

Notes

  • This annotator accepts a label column holding a single item of type String, Int, Float, or Double. So positive sentiment can be expressed as either "positive" or 0, and negative sentiment as either "negative" or 1.

  • UniversalSentenceEncoder, BertSentenceEmbeddings, or SentenceEmbeddings can be used for the inputCol.
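
For illustration, either representation of the label column is accepted (the data below is made up):

>>> # String labels:
>>> stringLabels = spark.createDataFrame(
...     [("Great movie!", "positive"), ("Terrible movie!", "negative")],
...     ["text", "label"])
>>> # Equivalent numeric labels:
>>> numericLabels = spark.createDataFrame(
...     [("Great movie!", 0), ("Terrible movie!", 1)],
...     ["text", "label"])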

Examples

In this example, sentiment.csv is in the form:

text,label
This movie is the best movie I have watched ever! In my opinion this movie can win an award.,0
This was a terrible movie! The acting was bad really bad!,1

The model can then be trained with

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> smallCorpus = spark.read.option("header", "True").csv("src/test/resources/classifier/sentiment.csv")
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> useEmbeddings = UniversalSentenceEncoder.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence_embeddings")
>>> docClassifier = SentimentDLApproach() \
...     .setInputCols(["sentence_embeddings"]) \
...     .setOutputCol("sentiment") \
...     .setLabelColumn("label") \
...     .setBatchSize(32) \
...     .setMaxEpochs(1) \
...     .setLr(5e-3) \
...     .setDropout(0.5)
>>> pipeline = Pipeline().setStages([
...         documentAssembler,
...         useEmbeddings,
...         docClassifier
... ])
>>> pipelineModel = pipeline.fit(smallCorpus)
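
Once fitted, the pipeline model can be applied like any Spark ML model. A short usage sketch reusing the names from the example above (the exact output is omitted here):

>>> result = pipelineModel.transform(smallCorpus)
>>> result.select("text", "sentiment.result").show(truncate=False)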

Methods

__init__()

clear(param)

Clears a param from the param map if it has been explicitly set.

copy([extra])

Creates a copy of this instance with the same uid and some extra params.

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optionally default values and user-supplied values.

extractParamMap([extra])

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

fit(dataset[, params])

Fits a model to the input dataset with optional parameters.

fitMultiple(dataset, paramMaps)

Fits a model to the input dataset for each param map in paramMaps.

getInputCols()

Gets current column names of input annotations.

getLazyAnnotator()

Gets whether Annotator should be evaluated lazily in a RecursivePipeline.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value.

getOutputCol()

Gets output column name of annotations.

getParam(paramName)

Gets a param by its name.

getParamValue(paramName)

Gets the value of a parameter.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of 'write().save(path)'.

set(param, value)

Sets a parameter in the embedded param map.

setBatchSize(v)

Sets batch size, by default 64.

setConfigProtoBytes(b)

Sets configProto from tensorflow, serialized into byte array.

setDropout(v)

Sets dropout coefficient, by default 0.5.

setEnableOutputLogs(value)

Sets whether to use stdout in addition to Spark logs, by default False.

setInputCols(*value)

Sets column names of input annotations.

setLabelColumn(value)

Sets name of column for data labels

setLazyAnnotator(value)

Sets whether Annotator should be evaluated lazily in a RecursivePipeline.

setLr(v)

Sets Learning Rate, by default 0.005

setMaxEpochs(epochs)

Sets maximum number of epochs to train, by default 30.

setOutputCol(value)

Sets output column name of annotations.

setOutputLogsPath(p)

Sets folder path to save training logs.

setParamValue(paramName)

Sets the value of a parameter.

setRandomSeed(seed)

Sets random seed for shuffling

setThreshold(v)

Sets the minimum threshold for the final result, otherwise the thresholdLabel is assigned, by default 0.6.

setThresholdLabel(p)

Sets the label to assign when the score is less than threshold, by default "neutral".

setValidationSplit(v)

Sets the proportion of the training dataset to be validated against the model on each epoch, by default 0.0 (disabled).

setVerbose(value)

Sets level of verbosity during training

write()

Returns an MLWriter instance for this ML instance.

Attributes

batchSize

configProtoBytes

dropout

enableOutputLogs

getter_attrs

inputCols

labelColumn

lazyAnnotator

lr

maxEpochs

outputCol

outputLogsPath

params

Returns all params ordered by name.

randomSeed

threshold

thresholdLabel

validationSplit

verbose

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optionally default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

fit(dataset, params=None)

Fits a model to the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.

Returns

fitted model(s)

New in version 1.3.0.
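
As with any Spark ML Estimator, the embedded params can be overridden at fit time. A sketch under the assumption that embeddedData is a hypothetical DataFrame already containing the required "sentence_embeddings" and "label" columns:

>>> # embeddedData is a hypothetical DataFrame with "sentence_embeddings"
>>> # and "label" columns; the override trains for 5 epochs instead of 30.
>>> model = docClassifier.fit(embeddedData, {docClassifier.maxEpochs: 5})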

fitMultiple(dataset, paramMaps)

Fits a model to the input dataset for each param map in paramMaps.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame.

  • paramMaps – A Sequence of param maps.

Returns

A thread safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.

New in version 2.3.0.
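
A sketch of collecting the models returned by fitMultiple, again assuming a hypothetical embeddedData DataFrame with the required columns:

>>> paramMaps = [{docClassifier.lr: 1e-3}, {docClassifier.lr: 5e-3}]
>>> models = [None] * len(paramMaps)
>>> for index, model in docClassifier.fitMultiple(embeddedData, paramMaps):
...     models[index] = model  # index refers to the position in paramMaps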

getInputCols()

Gets current column names of input annotations.

getLazyAnnotator()

Gets whether Annotator should be evaluated lazily in a RecursivePipeline.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets output column name of annotations.

getParam(paramName)

Gets a param by its name.

getParamValue(paramName)

Gets the value of a parameter.

Parameters
paramName : str

Name of the parameter

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
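
save() and load() can be combined into a round trip for an unfitted annotator; the path below is illustrative:

>>> docClassifier.save("/tmp/sentimentdl_approach")
>>> restored = SentimentDLApproach.load("/tmp/sentimentdl_approach")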

set(param, value)

Sets a parameter in the embedded param map.

setBatchSize(v)[source]

Sets batch size, by default 64.

Parameters
v : int

Batch size

setConfigProtoBytes(b)[source]

Sets configProto from tensorflow, serialized into byte array.

Parameters
b : List[str]

ConfigProto from tensorflow, serialized into byte array

setDropout(v)[source]

Sets dropout coefficient, by default 0.5.

Parameters
v : float

Dropout coefficient

setEnableOutputLogs(value)[source]

Sets whether to use stdout in addition to Spark logs, by default False.

Parameters
value : bool

Whether to use stdout in addition to Spark logs

setInputCols(*value)

Sets column names of input annotations.

Parameters
*value : str

Input columns for the annotator

setLabelColumn(value)[source]

Sets name of column for data labels

Parameters
value : str

Column for data labels

setLazyAnnotator(value)

Sets whether Annotator should be evaluated lazily in a RecursivePipeline.

Parameters
value : bool

Whether Annotator should be evaluated lazily in a RecursivePipeline

setLr(v)[source]

Sets Learning Rate, by default 0.005

Parameters
v : float

Learning Rate

setMaxEpochs(epochs)[source]

Sets maximum number of epochs to train, by default 30.

Parameters
epochs : int

Maximum number of epochs to train

setOutputCol(value)

Sets output column name of annotations.

Parameters
value : str

Name of output column

setOutputLogsPath(p)[source]

Sets folder path to save training logs.

Parameters
p : str

Folder path to save training logs

setParamValue(paramName)

Sets the value of a parameter.

Parameters
paramName : str

Name of the parameter

setRandomSeed(seed)[source]

Sets random seed for shuffling

Parameters
seed : int

Random seed for shuffling

setThreshold(v)[source]

Sets the minimum threshold for the final result, otherwise the thresholdLabel is assigned, by default 0.6.

Parameters
v : float

Minimum threshold for the final result

setThresholdLabel(p)[source]

Sets the label to assign when the score is less than threshold, by default "neutral".

Parameters
p : str

The label, if the score is less than threshold

setValidationSplit(v)[source]

Sets the proportion of the training dataset to be validated against the model on each epoch, by default 0.0 (disabled). The value should be between 0.0 and 1.0.

Parameters
v : float

Proportion of training dataset to be validated

setVerbose(value)[source]

Sets level of verbosity during training

Parameters
value : int

Level of verbosity

uid

A unique id for the object.

write()

Returns an MLWriter instance for this ML instance.