sparknlp.annotator.SentimentDLApproach#
- class sparknlp.annotator.SentimentDLApproach[source]#
Bases: sparknlp.common.AnnotatorApproach
Trains a SentimentDL, an annotator for multi-class sentiment analysis.
In natural language processing, sentiment analysis is the task of classifying the affective state or subjective view of a text. A common example is if either a product review or tweet can be interpreted positively or negatively.
For the instantiated/pretrained models, see SentimentDLModel.
For extended examples of usage, see the Spark NLP Workshop.
Input Annotation types: SENTENCE_EMBEDDINGS
Output Annotation type: CATEGORY
- Parameters
- lr
Learning Rate, by default 0.005
- batchSize
Batch size, by default 64
- dropout
Dropout coefficient, by default 0.5
- maxEpochs
Maximum number of epochs to train, by default 30
- configProtoBytes
ConfigProto from tensorflow, serialized into byte array.
- validationSplit
Proportion of the training dataset to be validated against the model on each epoch. The value should be between 0.0 and 1.0; by default it is 0.0, i.e. validation is off.
- enableOutputLogs
Whether to use stdout in addition to Spark logs, by default False
- outputLogsPath
Folder path to save training logs
- labelColumn
Column with the label for each document
- verbose
Level of verbosity during training
- randomSeed
Random seed
- threshold
The minimum threshold for the final result; scores below it are assigned the thresholdLabel, by default 0.6
- thresholdLabel
The label to assign when the score is below threshold, by default "neutral" (a short configuration sketch follows this parameter list)
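As a rough configuration sketch (the values are illustrative, mostly the documented defaults, not recommendations), the training-control parameters above map onto the setter methods documented further down this page:
>>> from sparknlp.annotator import SentimentDLApproach
>>> sentimentApproach = SentimentDLApproach() \
...     .setInputCols(["sentence_embeddings"]) \
...     .setOutputCol("sentiment") \
...     .setLabelColumn("label") \
...     .setLr(0.005) \
...     .setBatchSize(64) \
...     .setDropout(0.5) \
...     .setMaxEpochs(30) \
...     .setValidationSplit(0.1) \
...     .setEnableOutputLogs(True) \
...     .setOutputLogsPath("sentiment_logs") \
...     .setThreshold(0.6) \
...     .setThresholdLabel("neutral")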
Notes
This annotator accepts a label column with a single item of type String, Int, Float, or Double. So positive sentiment can be expressed as either "positive" or 0, and negative sentiment as "negative" or 1.
UniversalSentenceEncoder, BertSentenceEmbeddings, or SentenceEmbeddings can be used for the inputCol.
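For instance, a minimal sketch (assuming an active Spark session bound to spark) of a training DataFrame that uses string labels instead of the numeric labels shown in the example below:
>>> stringLabelCorpus = spark.createDataFrame([
...     ("This movie is the best movie I have watched ever! In my opinion this movie can win an award.", "positive"),
...     ("This was a terrible movie! The acting was bad really bad!", "negative")
... ], ["text", "label"])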
Examples
In this example, sentiment.csv is in the form:

text,label
This movie is the best movie I have watched ever! In my opinion this movie can win an award.,0
This was a terrible movie! The acting was bad really bad!,1

The model can then be trained with
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> smallCorpus = spark.read.option("header", "True").csv("src/test/resources/classifier/sentiment.csv")
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> useEmbeddings = UniversalSentenceEncoder.pretrained() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence_embeddings")
>>> docClassifier = SentimentDLApproach() \
...     .setInputCols(["sentence_embeddings"]) \
...     .setOutputCol("sentiment") \
...     .setLabelColumn("label") \
...     .setBatchSize(32) \
...     .setMaxEpochs(1) \
...     .setLr(5e-3) \
...     .setDropout(0.5)
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     useEmbeddings,
...     docClassifier
... ])
>>> pipelineModel = pipeline.fit(smallCorpus)
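As a follow-up sketch (the test DataFrame and its contents are hypothetical), the fitted pipeline can be applied to new text with the standard Spark ML transform, reading the predicted labels from the result field of the output annotations:
>>> testData = spark.createDataFrame([
...     ("I really enjoyed this film.",),
...     ("What a waste of two hours.",)
... ], ["text"])
>>> prediction = pipelineModel.transform(testData)
>>> prediction.select("text", "sentiment.result").show(truncate=False)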
Methods
- __init__()
- clear(param): Clears a param from the param map if it has been explicitly set.
- copy([extra]): Creates a copy of this instance with the same uid and some extra params.
- explainParam(param): Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams(): Returns the documentation of all params with their optionally default values and user-supplied values.
- extractParamMap([extra]): Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
- fit(dataset[, params]): Fits a model to the input dataset with optional parameters.
- fitMultiple(dataset, paramMaps): Fits a model to the input dataset for each param map in paramMaps.
- getInputCols(): Gets current column names of input annotations.
- getLazyAnnotator(): Gets whether Annotator should be evaluated lazily in a RecursivePipeline.
- getOrDefault(param): Gets the value of a param in the user-supplied param map or its default value.
- getOutputCol(): Gets output column name of annotations.
- getParam(paramName): Gets a param by its name.
- getParamValue(paramName): Gets the value of a parameter.
- hasDefault(param): Checks whether a param has a default value.
- hasParam(paramName): Tests whether this instance contains a param with a given (string) name.
- isDefined(param): Checks whether a param is explicitly set by user or has a default value.
- isSet(param): Checks whether a param is explicitly set by user.
- load(path): Reads an ML instance from the input path, a shortcut of read().load(path).
- read(): Returns an MLReader instance for this class.
- save(path): Save this ML instance to the given path, a shortcut of 'write().save(path)'.
- set(param, value): Sets a parameter in the embedded param map.
- setBatchSize(v): Sets batch size, by default 64.
- setConfigProtoBytes(b): Sets configProto from tensorflow, serialized into byte array.
- setDropout(v): Sets dropout coefficient, by default 0.5.
- setEnableOutputLogs(value): Sets whether to use stdout in addition to Spark logs, by default False.
- setInputCols(*value): Sets column names of input annotations.
- setLabelColumn(value): Sets name of column for data labels.
- setLazyAnnotator(value): Sets whether Annotator should be evaluated lazily in a RecursivePipeline.
- setLr(v): Sets Learning Rate, by default 0.005.
- setMaxEpochs(epochs): Sets maximum number of epochs to train, by default 30.
- setOutputCol(value): Sets output column name of annotations.
- setOutputLogsPath(p): Sets folder path to save training logs.
- setParamValue(paramName): Sets the value of a parameter.
- setRandomSeed(seed): Sets random seed for shuffling.
- setThreshold(v): Sets the minimum threshold for the final result, otherwise it will be neutral, by default 0.6.
- setThresholdLabel(p): Sets what the label should be, if the score is less than threshold, by default "neutral".
- setValidationSplit(v): Sets the proportion of training dataset to be validated against the model on each epoch, by default 0.0 (off).
- setVerbose(value): Sets level of verbosity during training.
- write(): Returns an MLWriter instance for this ML instance.
Attributes
- batchSize
- configProtoBytes
- dropout
- enableOutputLogs
- getter_attrs
- inputCols
- labelColumn
- lazyAnnotator
- lr
- maxEpochs
- outputCol
- outputLogsPath
- params: Returns all params ordered by name.
- randomSeed
- threshold
- thresholdLabel
- validationSplit
- verbose
- clear(param)#
Clears a param from the param map if it has been explicitly set.
- copy(extra=None)#
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
- Parameters
extra – Extra parameters to copy to the new instance
- Returns
Copy of this instance
- explainParam(param)#
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams()#
Returns the documentation of all params with their optionally default values and user-supplied values.
- extractParamMap(extra=None)#
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
- Parameters
extra – extra param values
- Returns
merged param map
- fit(dataset, params=None)#
Fits a model to the input dataset with optional parameters.
- Parameters
dataset – input dataset, which is an instance of pyspark.sql.DataFrame
params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models. (A short sketch follows this entry.)
- Returns
fitted model(s)
New in version 1.3.0.
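As an illustrative sketch (reusing the pipeline and docClassifier from the example above, and assuming the usual Spark ML behavior where extra params are propagated to the copied stages before fitting), embedded params can be overridden at fit time by passing a param map:
>>> overriddenModel = pipeline.fit(smallCorpus, params={
...     docClassifier.maxEpochs: 5,
...     docClassifier.lr: 0.001
... })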
- fitMultiple(dataset, paramMaps)#
Fits a model to the input dataset for each param map in paramMaps.
- Parameters
dataset – input dataset, which is an instance of pyspark.sql.DataFrame.
paramMaps – A Sequence of param maps.
- Returns
A thread safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.
New in version 2.3.0.
- getInputCols()#
Gets current column names of input annotations.
- getLazyAnnotator()#
Gets whether Annotator should be evaluated lazily in a RecursivePipeline.
- getOrDefault(param)#
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getOutputCol()#
Gets output column name of annotations.
- getParam(paramName)#
Gets a param by its name.
- getParamValue(paramName)#
Gets the value of a parameter.
- Parameters
- paramNamestr
Name of the parameter
- hasDefault(param)#
Checks whether a param has a default value.
- hasParam(paramName)#
Tests whether this instance contains a param with a given (string) name.
- isDefined(param)#
Checks whether a param is explicitly set by user or has a default value.
- isSet(param)#
Checks whether a param is explicitly set by user.
- classmethod load(path)#
Reads an ML instance from the input path, a shortcut of read().load(path).
- property params#
Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.
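As a small inspection sketch, the params property can be combined with getOrDefault to look at a fresh annotator's configuration:
>>> approach = SentimentDLApproach()
>>> paramNames = sorted(p.name for p in approach.params)
>>> defaultEpochs = approach.getOrDefault(approach.maxEpochs)  # documented default is 30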
- classmethod read()#
Returns an MLReader instance for this class.
- save(path)#
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
- set(param, value)#
Sets a parameter in the embedded param map.
- setConfigProtoBytes(b)[source]#
Sets configProto from tensorflow, serialized into byte array.
- Parameters
- bList[int]
ConfigProto from tensorflow, serialized into byte array
- setDropout(v)[source]#
Sets dropout coefficient, by default 0.5.
- Parameters
- vfloat
Dropout coefficient
- setEnableOutputLogs(value)[source]#
Sets whether to use stdout in addition to Spark logs, by default False.
- Parameters
- valuebool
Whether to use stdout in addition to Spark logs
- setInputCols(*value)#
Sets column names of input annotations.
- Parameters
- *valuestr
Input columns for the annotator
- setLabelColumn(value)[source]#
Sets name of column for data labels
- Parameters
- valuestr
Column for data labels
- setLazyAnnotator(value)#
Sets whether Annotator should be evaluated lazily in a RecursivePipeline.
- Parameters
- valuebool
Whether Annotator should be evaluated lazily in a RecursivePipeline
- setMaxEpochs(epochs)[source]#
Sets maximum number of epochs to train, by default 30.
- Parameters
- epochsint
Maximum number of epochs to train
- setOutputCol(value)#
Sets output column name of annotations.
- Parameters
- valuestr
Name of output column
- setOutputLogsPath(p)[source]#
Sets folder path to save training logs.
- Parameters
- pstr
Folder path to save training logs
- setParamValue(paramName)#
Sets the value of a parameter.
- Parameters
- paramNamestr
Name of the parameter
- setRandomSeed(seed)[source]#
Sets random seed for shuffling
- Parameters
- seedint
Random seed for shuffling
- setThreshold(v)[source]#
Sets the minimum threshold for the final result; scores below it are assigned the thresholdLabel, by default 0.6.
- Parameters
- vfloat
Minimum threshold for the final result
- setThresholdLabel(p)[source]#
Sets what the label should be, if the score is less than threshold, by default “neutral”.
- Parameters
- pstr
The label, if the score is less than threshold
- setValidationSplit(v)[source]#
Sets the proportion of the training dataset to be validated against the model on each epoch, by default 0.0 (off). The value should be between 0.0 and 1.0.
- Parameters
- vfloat
Proportion of training dataset to be validated
- setVerbose(value)[source]#
Sets level of verbosity during training
- Parameters
- valueint
Level of verbosity
- uid#
A unique id for the object.
- write()#
Returns an MLWriter instance for this ML instance.