sparknlp_jsl.finance.sequence_classification.finance_classifier_dl#

Module Contents#

Classes#

FinanceClassifierDLApproach

Trains a ClassifierDL for generic Multi-class Text Classification.

FinanceClassifierDLModel

FinanceClassifierDL for generic Multi-class Text Classification.

class FinanceClassifierDLApproach#

Bases: sparknlp.annotator.classifier_dl.classifier_dl.ClassifierDLApproach

Trains a ClassifierDL for generic Multi-class Text Classification.

ClassifierDL uses the state-of-the-art Universal Sentence Encoder as input for text classification. The ClassifierDL annotator uses a deep learning model (DNN) built with TensorFlow and supports up to 100 classes.

For instantiated/pretrained models, see FinanceClassifierDLModel.

For extended examples of usage, see the Spark NLP Workshop.

Input Annotation types

Output Annotation type

SENTENCE_EMBEDDINGS

CATEGORY

Parameters:
  • lr – Learning Rate, by default 0.005

  • batchSize – Batch size, by default 64

  • dropout – Dropout coefficient, by default 0.5

  • maxEpochs – Maximum number of epochs to train, by default 30

  • configProtoBytes – ConfigProto from tensorflow, serialized into byte array.

  • validationSplit – Proportion of the training dataset to be validated against the model on each epoch. The value should be between 0.0 and 1.0; by default 0.0 (disabled).

  • enableOutputLogs – Whether to use stdout in addition to Spark logs, by default False

  • outputLogsPath – Folder path to save training logs. If no path is specified, the logs won't be stored on disk. The path can be a local file path, a distributed file path (HDFS, DBFS), or cloud storage (S3).

  • labelColumn – Column with the label for each document

  • verbose – Level of verbosity during training

  • randomSeed – Random seed for shuffling

Notes

  • This annotator accepts a label column containing a single item of type String, Int, Float, or Double.

  • UniversalSentenceEncoder, Transformer-based embeddings, or SentenceEmbeddings can be used for the inputCol (see the sketch after these notes).
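
As noted in the second bullet, any annotator that produces SENTENCE_EMBEDDINGS can feed this classifier. A minimal sketch swapping in transformer-based sentence embeddings; the default pretrained BertSentenceEmbeddings model from open-source Spark NLP is assumed here:

>>> bertEmbeddings = BertSentenceEmbeddings.pretrained() \
...     .setInputCols("document") \
...     .setOutputCol("sentence_embeddings")
>>> docClassifier = FinanceClassifierDLApproach() \
...     .setInputCols("sentence_embeddings") \
...     .setOutputCol("category") \
...     .setLabelColumn("label")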

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline

In this example, the training data "sentiment.csv" has the form of:

text,label
This movie is the best movie I have watched ever! In my opinion this movie can win an award.,0
This was a terrible movie! The acting was bad really bad!,1
...

Then training can be done like so:

>>> smallCorpus = spark.read.option("header","True").csv("src/test/resources/classifier/sentiment.csv")
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> useEmbeddings = UniversalSentenceEncoder.pretrained() \
...     .setInputCols("document") \
...     .setOutputCol("sentence_embeddings")
>>> docClassifier = ClassifierDLApproach() \
...     .setInputCols("sentence_embeddings") \
...     .setOutputCol("category") \
...     .setLabelColumn("label") \
...     .setBatchSize(64) \
...     .setMaxEpochs(20) \
...     .setLr(5e-3) \
...     .setDropout(0.5)
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     useEmbeddings,
...     docClassifier
... ])
>>> pipelineModel = pipeline.fit(smallCorpus)
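
The example above is inherited from the open-source ClassifierDLApproach; the finance variant is used the same way. A hedged sketch, assuming a hypothetical finance_docs.csv with text and label columns and that FinanceClassifierDLApproach is importable from the licensed finance module:

>>> finCorpus = spark.read.option("header", "True").csv("finance_docs.csv")
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> useEmbeddings = UniversalSentenceEncoder.pretrained() \
...     .setInputCols("document") \
...     .setOutputCol("sentence_embeddings")
>>> finClassifier = FinanceClassifierDLApproach() \
...     .setInputCols("sentence_embeddings") \
...     .setOutputCol("category") \
...     .setLabelColumn("label") \
...     .setMaxEpochs(20) \
...     .setLr(5e-3) \
...     .setValidationSplit(0.1) \
...     .setEnableOutputLogs(True) \
...     .setOutputLogsPath("finance_training_logs")
>>> finPipeline = Pipeline().setStages([documentAssembler, useEmbeddings, finClassifier])
>>> finPipelineModel = finPipeline.fit(finCorpus)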

See also

MultiClassifierDLApproach

for multi-label classification

SentimentDLApproach

for sentiment analysis

batchSize#
configProtoBytes#
dropout#
enableOutputLogs#
evaluationLogExtended#
getter_attrs = []#
inputAnnotatorTypes#
inputCols#
labelColumn#
lazyAnnotator#
lr#
maxEpochs#
optionalInputAnnotatorTypes = []#
outputAnnotatorType = 'category'#
outputCol#
outputLogsPath#
randomSeed#
testDataset#
uid = ''#
validationSplit#
verbose#
clear(param: pyspark.ml.param.Param) None#

Clears a param from the param map if it has been explicitly set.

copy(extra: pyspark.ml._typing.ParamMap | None = None) JP#

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters:

extra (dict, optional) – Extra parameters to copy to the new instance

Returns:

Copy of this instance

Return type:

JavaParams

explainParam(param: str | Param) str#

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams() str#

Returns the documentation of all params with their optionally default values and user-supplied values.

extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap#

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:

extra (dict, optional) – extra param values

Returns:

merged param map

Return type:

dict

fit(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = ...) M#
fit(dataset: pyspark.sql.dataframe.DataFrame, params: List[pyspark.ml._typing.ParamMap] | Tuple[pyspark.ml._typing.ParamMap]) List[M]

Fits a model to the input dataset with optional parameters.

New in version 1.3.0.

Parameters:
  • dataset (pyspark.sql.DataFrame) – input dataset.

  • params (dict or list or tuple, optional) – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.

Returns:

fitted model(s)

Return type:

Transformer or a list of Transformer

fitMultiple(dataset: pyspark.sql.dataframe.DataFrame, paramMaps: Sequence[pyspark.ml._typing.ParamMap]) Iterator[Tuple[int, M]]#

Fits a model to the input dataset for each param map in paramMaps.

New in version 2.3.0.

Parameters:
  • dataset (pyspark.sql.DataFrame) – input dataset.

  • paramMaps (collections.abc.Sequence) – A Sequence of param maps.

Returns:

A thread safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.

Return type:

_FitMultipleIterator

getInputCols()#

Gets current column names of input annotations.

getLazyAnnotator()#

Gets whether Annotator should be evaluated lazily in a RecursivePipeline.

getOrDefault(param: str) Any#
getOrDefault(param: Param[T]) T

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()#

Gets output column name of annotations.

getParam(paramName: str) Param#

Gets a param by its name.

getParamValue(paramName)#

Gets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

hasDefault(param: str | Param[Any]) bool#

Checks whether a param has a default value.

hasParam(paramName: str) bool#

Tests whether this instance contains a param with a given (string) name.

inputColsValidation(value)#
isDefined(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user or has a default value.

isSet(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user.

classmethod load(path: str) RL#

Reads an ML instance from the input path, a shortcut of read().load(path).

classmethod read()#

Returns an MLReader instance for this class.

save(path: str) None#

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param: Param, value: Any) None#

Sets a parameter in the embedded param map.

setBatchSize(v)#

Sets batch size, by default 64.

Parameters:

v (int) – Batch size

setConfigProtoBytes(b)#

Sets configProto from tensorflow, serialized into byte array.

Parameters:

b (List[int]) – ConfigProto from tensorflow, serialized into byte array
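
A hedged sketch of building such a byte array with TensorFlow installed locally; the specific ConfigProto options shown are only illustrative assumptions:

>>> import tensorflow as tf
>>> config = tf.compat.v1.ConfigProto(allow_soft_placement=True)
>>> config.gpu_options.allow_growth = True
>>> docClassifier = FinanceClassifierDLApproach() \
...     .setConfigProtoBytes(list(config.SerializeToString()))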

setDropout(v)#

Sets dropout coefficient, by default 0.5

Parameters:

v (float) – Dropout coefficient

setEnableOutputLogs(value)#

Sets whether to use stdout in addition to Spark logs, by default False.

Parameters:

value (bool) – Whether to use stdout in addition to Spark logs

setEvaluationLogExtended(v)#

Sets whether validation logs should be extended, by default False. If enabled, displays the time and the evaluation of each label.

Parameters:

v (bool) – Whether validation logs should be extended

setInputCols(*value)#

Sets column names of input annotations.

Parameters:

*value (List[str]) – Input columns for the annotator

setLabelColumn(value)#

Sets name of column for data labels

Parameters:

value (str) – Column for data labels

setLazyAnnotator(value)#

Sets whether Annotator should be evaluated lazily in a RecursivePipeline.

Parameters:

value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline

setLr(v)#

Sets Learning Rate, by default 0.005

Parameters:

v (float) – Learning Rate

setMaxEpochs(epochs)#

Sets maximum number of epochs to train, by default 30

Parameters:

epochs (int) – Maximum number of epochs to train

setOutputCol(value)#

Sets output column name of annotations.

Parameters:

value (str) – Name of output column

setOutputLogsPath(p)#

Sets folder path to save training logs

Parameters:

p (str) – Folder path to save training logs

setParamValue(paramName)#

Sets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

setRandomSeed(seed)#

Sets random seed for shuffling

Parameters:

seed (int) – Random seed for shuffling

setTestDataset(path, read_as=ReadAs.SPARK, options={'format': 'parquet'})#

Path to a parquet file of a test dataset. If set, it is used to calculate statistics on it during training.

The parquet file must be a dataframe that has the same columns as the model that is being trained. For example, since this classifier takes SENTENCE_EMBEDDINGS as input together with a label column, those columns also need to be present when saving the dataframe. The pre-processing steps for the training dataframe should also be applied to the test dataframe.

An example on how to create such a parquet file could be:

>>> # assuming preProcessingPipeline
>>> (train, test) = data.randomSplit([0.8, 0.2])
>>> preProcessingPipeline \
...     .fit(test) \
...     .transform(test) \
...     .write \
...     .mode("overwrite") \
...     .parquet("test_data")
>>> annotator.setTestDataset("test_data")

Parameters:
  • path (str) – Path to test dataset

  • read_as (str, optional) – How to read the resource, by default ReadAs.SPARK

  • options (dict, optional) – Options for reading the resource, by default {"format": "parquet"}
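
Putting it together for this class, a hedged sketch that wires the prepared "test_data" parquet from the example above into training, with extended evaluation logging (all setters shown are documented on this page):

>>> finClassifier = FinanceClassifierDLApproach() \
...     .setInputCols("sentence_embeddings") \
...     .setOutputCol("category") \
...     .setLabelColumn("label") \
...     .setTestDataset("test_data") \
...     .setEvaluationLogExtended(True) \
...     .setEnableOutputLogs(True)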

setValidationSplit(v)#

Sets the proportion of the training dataset to be validated against the model on each epoch, by default 0.0 (disabled). The value should be between 0.0 and 1.0.

Parameters:

v (float) – Proportion of training dataset to be validated

setVerbose(value)#

Sets level of verbosity during training

Parameters:

value (int) – Level of verbosity

write() JavaMLWriter#

Returns an MLWriter instance for this ML instance.
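
A trained classifier can be persisted and restored with the standard Spark ML reader/writer methods documented above. A minimal sketch, assuming the finPipelineModel fitted in the earlier finance example and a hypothetical local output path:

>>> trainedClassifier = finPipelineModel.stages[-1]
>>> trainedClassifier.write().overwrite().save("finance_clf_dl_model")
>>> restoredClassifier = FinanceClassifierDLModel.load("finance_clf_dl_model")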

class FinanceClassifierDLModel(classname='com.johnsnowlabs.finance.sequence_classification.FinanceClassifierDLModel', java_model=None)#

Bases: sparknlp.annotator.classifier_dl.classifier_dl.ClassifierDLModel

FinanceClassifierDL for generic Multi-class Text Classification.

ClassifierDL uses the state-of-the-art Universal Sentence Encoder as input for text classification. The ClassifierDL annotator uses a deep learning model (DNN) built with TensorFlow and supports up to 100 classes.

This is the instantiated model of FinanceClassifierDLApproach. For training your own model, please see the documentation of that class.

Pretrained models can be loaded with pretrained() of the companion object:

>>> classifierDL = FinanceClassifierDLModel.pretrained() \
...     .setInputCols(["sentence_embeddings"]) \
...     .setOutputCol("classification")

The default model is "finclf_sec_filings" if no name is provided. It takes SENTENCE_EMBEDDINGS as input and is downloaded from the "finance/models" remote location.

For available pretrained models please see the Models Hub.

For extended examples of usage, see the Spark NLP Workshop.

Input Annotation types

Output Annotation type

SENTENCE_EMBEDDINGS

CATEGORY

Parameters:
  • configProtoBytes – ConfigProto from tensorflow, serialized into byte array.

  • classes – The tags used to train this ClassifierDLModel

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetector() \
...     .setInputCols("document") \
...     .setOutputCol("sentence")
>>> useEmbeddings = UniversalSentenceEncoder.pretrained() \
...     .setInputCols("document") \
...     .setOutputCol("sentence_embeddings")
>>> sarcasmDL = ClassifierDLModel.pretrained("classifierdl_use_sarcasm") \
...     .setInputCols("sentence_embeddings") \
...     .setOutputCol("sarcasm")
>>> pipeline = Pipeline() \
...     .setStages([
...       documentAssembler,
...       sentence,
...       useEmbeddings,
...       sarcasmDL
...     ])
>>> data = spark.createDataFrame([
...     ["I'm ready!"],
...     ["If I could put into words how much I love waking up at 6 am on Mondays I would."]
... ]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(arrays_zip(sentence, sarcasm)) as out") \
...     .selectExpr("out.sentence.result as sentence", "out.sarcasm.result as sarcasm") \
...     .show(truncate=False)
+-------------------------------------------------------------------------------+-------+
|sentence                                                                       |sarcasm|
+-------------------------------------------------------------------------------+-------+
|I'm ready!                                                                     |normal |
|If I could put into words how much I love waking up at 6 am on Mondays I would.|sarcasm|
+-------------------------------------------------------------------------------+-------+
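
The example above is inherited from the open-source ClassifierDLModel. An analogous hedged sketch for the finance model, using the default pretrained name documented under pretrained() below; the sentence-embeddings stage shown here is an assumption, so check the model's card on the Models Hub (and its storageRef) for the embeddings it was actually trained with:

>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> useEmbeddings = UniversalSentenceEncoder.pretrained() \
...     .setInputCols("document") \
...     .setOutputCol("sentence_embeddings")
>>> finClf = FinanceClassifierDLModel.pretrained("finclf_sec_filings", "en", "finance/models") \
...     .setInputCols("sentence_embeddings") \
...     .setOutputCol("category")
>>> data = spark.createDataFrame([["The company filed its annual report on Form 10-K."]]).toDF("text")
>>> finPipeline = Pipeline().setStages([documentAssembler, useEmbeddings, finClf])
>>> finResult = finPipeline.fit(data).transform(data)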

See also

MultiClassifierDLModel

for multi-label classification

SentimentDLModel

for sentiment analysis

classes#
configProtoBytes#
engine#
getter_attrs = []#
inputAnnotatorTypes#
inputCols#
lazyAnnotator#
name = 'FinanceClassifierDLModel'#
optionalInputAnnotatorTypes = []#
outputAnnotatorType = 'category'#
outputCol#
storageRef#
uid = ''#
clear(param: pyspark.ml.param.Param) None#

Clears a param from the param map if it has been explicitly set.

copy(extra: pyspark.ml._typing.ParamMap | None = None) JP#

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters:

extra (dict, optional) – Extra parameters to copy to the new instance

Returns:

Copy of this instance

Return type:

JavaParams

explainParam(param: str | Param) str#

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams() str#

Returns the documentation of all params with their optionally default values and user-supplied values.

extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap#

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:

extra (dict, optional) – extra param values

Returns:

merged param map

Return type:

dict

getEngine()#
Returns:

Deep Learning engine used for this model

Return type:

str

getInputCols()#

Gets current column names of input annotations.

getLazyAnnotator()#

Gets whether Annotator should be evaluated lazily in a RecursivePipeline.

getOrDefault(param: str) Any#
getOrDefault(param: Param[T]) T

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()#

Gets output column name of annotations.

getParam(paramName: str) Param#

Gets a param by its name.

getParamValue(paramName)#

Gets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

getStorageRef()#

Gets unique reference name for identification.

Returns:

Unique reference name for identification

Return type:

str
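
For example, a quick hedged check that a downloaded model and the embeddings stage agree (finClf is the pretrained model from the sketch above):

>>> ref = finClf.getStorageRef()  # compare with the storageRef of the sentence-embeddings annotator in the pipeline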

hasDefault(param: str | Param[Any]) bool#

Checks whether a param has a default value.

hasParam(paramName: str) bool#

Tests whether this instance contains a param with a given (string) name.

inputColsValidation(value)#
isDefined(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user or has a default value.

isSet(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user.

classmethod load(path: str) RL#

Reads an ML instance from the input path, a shortcut of read().load(path).

static pretrained(name='finclf_sec_filings', lang='en', remote_loc='finance/models')#

Download a pre-trained FinanceClassifierDLModel.

Parameters:
  • name (str) – Name of the pre-trained model, by default “finclf_sec_filings”

  • lang (str) – Language of the pre-trained model, by default “en”

  • remote_loc (str) – Remote location of the pre-trained model. If None, use the open-source location. Other values are "clinical/models", "finance/models", or "legal/models".

Returns:

A pre-trained FinanceClassifierDLModel.

Return type:

FinanceClassifierDLModel
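
For instance, a call with no arguments is equivalent to passing the documented defaults explicitly:

>>> finClf = FinanceClassifierDLModel.pretrained()  # same as pretrained("finclf_sec_filings", "en", "finance/models")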

classmethod read()#

Returns an MLReader instance for this class.

save(path: str) None#

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param: Param, value: Any) None#

Sets a parameter in the embedded param map.

setConfigProtoBytes(b)#

Sets configProto from tensorflow, serialized into byte array.

Parameters:

b (List[int]) – ConfigProto from tensorflow, serialized into byte array

setInputCols(*value)#

Sets column names of input annotations.

Parameters:

*value (List[str]) – Input columns for the annotator

setLazyAnnotator(value)#

Sets whether Annotator should be evaluated lazily in a RecursivePipeline.

Parameters:

value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline

setOutputCol(value)#

Sets output column name of annotations.

Parameters:

value (str) – Name of output column

setParamValue(paramName)#

Sets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

setParams()#
setStorageRef(value)#

Sets unique reference name for identification.

Parameters:

value (str) – Unique reference name for identification

transform(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = None) pyspark.sql.dataframe.DataFrame#

Transforms the input dataset with optional parameters.

New in version 1.3.0.

Parameters:
  • dataset (pyspark.sql.DataFrame) – input dataset

  • params (dict, optional) – an optional param map that overrides embedded params.

Returns:

transformed dataset

Return type:

pyspark.sql.DataFrame

write() JavaMLWriter#

Returns an MLWriter instance for this ML instance.