sparknlp_jsl.legal.sequence_classification.legal_classifier_dl
Module Contents#
Classes#
LegalClassifierDLApproach: Trains a ClassifierDL for generic Multi-class Text Classification.
LegalClassifierDLModel: LegalClassifierDL for generic Multi-class Text Classification.
- class LegalClassifierDLApproach#
Bases:
sparknlp.annotator.classifier_dl.classifier_dl.ClassifierDLApproach
Trains a ClassifierDL for generic Multi-class Text Classification.
ClassifierDL uses the state-of-the-art Universal Sentence Encoder as an input for text classification. The ClassifierDL annotator uses a deep learning model (DNN) built with TensorFlow and supports up to 100 classes.
For instantiated/pretrained models, see ClassifierDLModel. For extended examples of usage, see the Spark NLP Workshop.
Input Annotation types: SENTENCE_EMBEDDINGS
Output Annotation type: CATEGORY
- Parameters:
lr – Learning Rate, by default 0.005
batchSize – Batch size, by default 64
dropout – Dropout coefficient, by default 0.5
maxEpochs – Maximum number of epochs to train, by default 30
configProtoBytes – ConfigProto from tensorflow, serialized into byte array.
validationSplit – Proportion of the training dataset to be validated against the model on each epoch. The value should be between 0.0 and 1.0; by default 0.0, i.e. disabled.
enableOutputLogs – Whether to use stdout in addition to Spark logs, by default False
outputLogsPath – Folder path to save training logs. If no path is specified, the logs won't be stored on disk. The path can be a local file path, a distributed file path (HDFS, DBFS), or cloud storage (S3).
labelColumn – Column with the label for each document
verbose – Level of verbosity during training
randomSeed – Random seed for shuffling
Notes
This annotator accepts a label column of a single item of type String, Int, Float, or Double.
UniversalSentenceEncoder, Transformer-based embeddings, or SentenceEmbeddings can be used for the inputCol.
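For instance, a minimal sketch of producing SENTENCE_EMBEDDINGS with a Transformer-based annotator instead of the UniversalSentenceEncoder; the pretrained name "sent_small_bert_L2_768" is an assumption for illustration, and any annotator that outputs SENTENCE_EMBEDDINGS works:
>>> from sparknlp.annotator import BertSentenceEmbeddings
>>> # Produces sentence-level embeddings that can feed the classifier's inputCol
>>> bertSentenceEmbeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L2_768", "en") \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence_embeddings")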
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
In this example, the training data "sentiment.csv" has the form of:

text,label
This movie is the best movie I have watched ever! In my opinion this movie can win an award.,0
This was a terrible movie! The acting was bad really bad!,1
...
Then training can be done like so:
>>> smallCorpus = spark.read.option("header", "True").csv("src/test/resources/classifier/sentiment.csv")
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> useEmbeddings = UniversalSentenceEncoder.pretrained() \
...     .setInputCols("document") \
...     .setOutputCol("sentence_embeddings")
>>> docClassifier = ClassifierDLApproach() \
...     .setInputCols("sentence_embeddings") \
...     .setOutputCol("category") \
...     .setLabelColumn("label") \
...     .setBatchSize(64) \
...     .setMaxEpochs(20) \
...     .setLr(5e-3) \
...     .setDropout(0.5)
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     useEmbeddings,
...     docClassifier
... ])
>>> pipelineModel = pipeline.fit(smallCorpus)
See also
MultiClassifierDLApproach
for multi-label classification
SentimentDLApproach
for sentiment analysis
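As a rough sketch (not from the original docs), the same pipeline can be built with the legal subclass itself, reusing documentAssembler and useEmbeddings from the example above. This assumes LegalClassifierDLApproach exposes the same setters as its parent ClassifierDLApproach, that the import path follows this module's location, and that "legal_texts.csv" (with columns text and label) is a placeholder file:
>>> from sparknlp_jsl.legal.sequence_classification.legal_classifier_dl import LegalClassifierDLApproach
>>> legalCorpus = spark.read.option("header", "True").csv("legal_texts.csv")  # placeholder path
>>> legalClassifier = LegalClassifierDLApproach() \
...     .setInputCols(["sentence_embeddings"]) \
...     .setOutputCol("category") \
...     .setLabelColumn("label") \
...     .setMaxEpochs(20) \
...     .setLr(5e-3) \
...     .setValidationSplit(0.1) \
...     .setEnableOutputLogs(True)
>>> legalPipeline = Pipeline().setStages([documentAssembler, useEmbeddings, legalClassifier])
>>> legalModel = legalPipeline.fit(legalCorpus)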
- batchSize#
- configProtoBytes#
- dropout#
- enableOutputLogs#
- evaluationLogExtended#
- getter_attrs = []#
- inputAnnotatorTypes#
- inputCols#
- labelColumn#
- lazyAnnotator#
- lr#
- maxEpochs#
- optionalInputAnnotatorTypes = []#
- outputAnnotatorType#
- outputCol#
- outputLogsPath#
- randomSeed#
- testDataset#
- uid#
- validationSplit#
- verbose#
- clear(param: pyspark.ml.param.Param) None #
Clears a param from the param map if it has been explicitly set.
- copy(extra: pyspark.ml._typing.ParamMap | None = None) JP #
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
- Parameters:
extra (dict, optional) – Extra parameters to copy to the new instance
- Returns:
Copy of this instance
- Return type:
JavaParams
- explainParam(param: str | Param) str #
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams() str #
Returns the documentation of all params with their optionally default values and user-supplied values.
- extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap #
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
- Parameters:
extra (dict, optional) – extra param values
- Returns:
merged param map
- Return type:
dict
- fit(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = ...) M #
- fit(dataset: pyspark.sql.dataframe.DataFrame, params: List[pyspark.ml._typing.ParamMap] | Tuple[pyspark.ml._typing.ParamMap]) List[M]
Fits a model to the input dataset with optional parameters.
New in version 1.3.0.
- Parameters:
dataset (pyspark.sql.DataFrame) – input dataset.
params (dict or list or tuple, optional) – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
- Returns:
fitted model(s)
- Return type:
Transformer or a list of Transformer
- fitMultiple(dataset: pyspark.sql.dataframe.DataFrame, paramMaps: Sequence[pyspark.ml._typing.ParamMap]) Iterator[Tuple[int, M]] #
Fits a model to the input dataset for each param map in paramMaps.
New in version 2.3.0.
- Parameters:
dataset (pyspark.sql.DataFrame) – input dataset.
paramMaps (collections.abc.Sequence) – A Sequence of param maps.
- Returns:
A thread safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.
- Return type:
_FitMultipleIterator
- getInputCols()#
Gets current column names of input annotations.
- getLazyAnnotator()#
Gets whether Annotator should be evaluated lazily in a RecursivePipeline.
- getOrDefault(param: str) Any #
- getOrDefault(param: Param[T]) T
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getOutputCol()#
Gets output column name of annotations.
- getParam(paramName: str) Param #
Gets a param by its name.
- getParamValue(paramName)#
Gets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- hasDefault(param: str | Param[Any]) bool #
Checks whether a param has a default value.
- hasParam(paramName: str) bool #
Tests whether this instance contains a param with a given (string) name.
- inputColsValidation(value)#
- isDefined(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user or has a default value.
- isSet(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user.
- classmethod load(path: str) RL #
Reads an ML instance from the input path, a shortcut of read().load(path).
- classmethod read()#
Returns an MLReader instance for this class.
- save(path: str) None #
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
- set(param: Param, value: Any) None #
Sets a parameter in the embedded param map.
- setBatchSize(v)#
Sets batch size, by default 64.
- Parameters:
v (int) – Batch size
- setConfigProtoBytes(b)#
Sets configProto from tensorflow, serialized into byte array.
- Parameters:
b (List[int]) – ConfigProto from tensorflow, serialized into byte array
- setDropout(v)#
Sets dropout coefficient, by default 0.5
- Parameters:
v (float) – Dropout coefficient
- setEnableOutputLogs(value)#
Sets whether to use stdout in addition to Spark logs, by default False.
- Parameters:
value (bool) – Whether to use stdout in addition to Spark logs
- setEvaluationLogExtended(v)#
Sets whether logs for validation should be extended, by default False. If enabled, displays the time and evaluation of each label.
- Parameters:
v (bool) – Whether logs for validation should be extended
- setInputCols(*value)#
Sets column names of input annotations.
- Parameters:
*value (List[str]) – Input columns for the annotator
- setLabelColumn(value)#
Sets name of column for data labels
- Parameters:
value (str) – Column for data labels
- setLazyAnnotator(value)#
Sets whether Annotator should be evaluated lazily in a RecursivePipeline.
- Parameters:
value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline
- setLr(v)#
Sets Learning Rate, by default 0.005
- Parameters:
v (float) – Learning Rate
- setMaxEpochs(epochs)#
Sets maximum number of epochs to train, by default 30
- Parameters:
epochs (int) – Maximum number of epochs to train
- setOutputCol(value)#
Sets output column name of annotations.
- Parameters:
value (str) – Name of output column
- setOutputLogsPath(p)#
Sets folder path to save training logs
- Parameters:
p (str) – Folder path to save training logs
- setParamValue(paramName)#
Sets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- setRandomSeed(seed)#
Sets random seed for shuffling
- Parameters:
seed (int) – Random seed for shuffling
- setTestDataset(path, read_as=ReadAs.SPARK, options={'format': 'parquet'})#
Path to a parquet file of a test dataset. If set, it is used to calculate statistics on it during training.
The parquet file must be a dataframe that has the same columns as the model that is being trained. For example, if the model needs as input DOCUMENT, TOKEN, WORD_EMBEDDINGS (Features) and NAMED_ENTITY (label) then these columns also need to be present while saving the dataframe. The pre-processing steps for the training dataframe should also be applied to the test dataframe.
An example on how to create such a parquet file could be:
>>> # assuming preProcessingPipeline
>>> (train, test) = data.randomSplit([0.8, 0.2])
>>> preProcessingPipeline \
...     .fit(test) \
...     .transform(test) \
...     .write \
...     .mode("overwrite") \
...     .parquet("test_data")
>>> annotator.setTestDataset("test_data")
- Parameters:
path (str) – Path to test dataset
read_as (str, optional) – How to read the resource, by default ReadAs.SPARK
options (dict, optional) – Options for reading the resource, by default {"format": "parquet"}
- setValidationSplit(v)#
Sets the proportion of the training dataset to be validated against the model on each epoch. The value should be between 0.0 and 1.0; by default 0.0, i.e. disabled.
- Parameters:
v (float) – Proportion of training dataset to be validated
- setVerbose(value)#
Sets level of verbosity during training
- Parameters:
value (int) – Level of verbosity
- write() JavaMLWriter #
Returns an MLWriter instance for this ML instance.
- class LegalClassifierDLModel(classname='com.johnsnowlabs.legal.sequence_classification.LegalClassifierDLModel', java_model=None)#
Bases:
sparknlp.annotator.classifier_dl.classifier_dl.ClassifierDLModel
LegalClassifierDL for generic Multi-class Text Classification.
ClassifierDL uses the state-of-the-art Universal Sentence Encoder as an input for text classification. The ClassifierDL annotator uses a deep learning model (DNN) built with TensorFlow and supports up to 100 classes.
This is the instantiated model of the ClassifierDLApproach. For training your own model, please see the documentation of that class.
Pretrained models can be loaded with pretrained() of the companion object:
>>> classifierDL = ClassifierDLModel.pretrained() \
...     .setInputCols(["sentence_embeddings"]) \
...     .setOutputCol("classification")
The default model is "classifierdl_use_trec6", if no name is provided. It uses embeddings from the UniversalSentenceEncoder and is trained on the TREC-6 dataset.
For available pretrained models, please see the Models Hub.
For extended examples of usage, see the Spark NLP Workshop.
Input Annotation types: SENTENCE_EMBEDDINGS
Output Annotation type: CATEGORY
- Parameters:
configProtoBytes – ConfigProto from tensorflow, serialized into byte array.
classes – Gets the tags used to train this ClassifierDLModel
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetector() \
...     .setInputCols("document") \
...     .setOutputCol("sentence")
>>> useEmbeddings = UniversalSentenceEncoder.pretrained() \
...     .setInputCols("document") \
...     .setOutputCol("sentence_embeddings")
>>> sarcasmDL = ClassifierDLModel.pretrained("classifierdl_use_sarcasm") \
...     .setInputCols("sentence_embeddings") \
...     .setOutputCol("sarcasm")
>>> pipeline = Pipeline() \
...     .setStages([
...         documentAssembler,
...         sentence,
...         useEmbeddings,
...         sarcasmDL
...     ])
>>> data = spark.createDataFrame([
...     ["I'm ready!"],
...     ["If I could put into words how much I love waking up at 6 am on Mondays I would."]
... ]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(arrays_zip(sentence, sarcasm)) as out") \
...     .selectExpr("out.sentence.result as sentence", "out.sarcasm.result as sarcasm") \
...     .show(truncate=False)
+--------------------------------------------------------------------------------+-------+
|sentence                                                                        |sarcasm|
+--------------------------------------------------------------------------------+-------+
|I'm ready!                                                                      |normal |
|If I could put into words how much I love waking up at 6 am on Mondays I would. |sarcasm|
+--------------------------------------------------------------------------------+-------+
See also
MultiClassifierDLModel
for multi-label classification
SentimentDLModel
for sentiment analysis
- classes#
- configProtoBytes#
- engine#
- getter_attrs = []#
- inputAnnotatorTypes#
- inputCols#
- lazyAnnotator#
- name = 'LegalClassifierDLModel'#
- optionalInputAnnotatorTypes = []#
- outputAnnotatorType#
- outputCol#
- storageRef#
- uid#
- clear(param: pyspark.ml.param.Param) None #
Clears a param from the param map if it has been explicitly set.
- copy(extra: pyspark.ml._typing.ParamMap | None = None) JP #
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
- Parameters:
extra (dict, optional) – Extra parameters to copy to the new instance
- Returns:
Copy of this instance
- Return type:
JavaParams
- explainParam(param: str | Param) str #
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams() str #
Returns the documentation of all params with their optionally default values and user-supplied values.
- extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap #
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
- Parameters:
extra (dict, optional) – extra param values
- Returns:
merged param map
- Return type:
dict
- getEngine()#
- Returns:
Deep Learning engine used for this model
- Return type:
str
- getInputCols()#
Gets current column names of input annotations.
- getLazyAnnotator()#
Gets whether Annotator should be evaluated lazily in a RecursivePipeline.
- getOrDefault(param: str) Any #
- getOrDefault(param: Param[T]) T
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getOutputCol()#
Gets output column name of annotations.
- getParam(paramName: str) Param #
Gets a param by its name.
- getParamValue(paramName)#
Gets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- getStorageRef()#
Gets unique reference name for identification.
- Returns:
Unique reference name for identification
- Return type:
str
- hasDefault(param: str | Param[Any]) bool #
Checks whether a param has a default value.
- hasParam(paramName: str) bool #
Tests whether this instance contains a param with a given (string) name.
- inputColsValidation(value)#
- isDefined(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user or has a default value.
- isSet(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user.
- classmethod load(path: str) RL #
Reads an ML instance from the input path, a shortcut of read().load(path).
- static pretrained(name='legclf_argument_mining', lang='en', remote_loc='legal/models')#
Download a pre-trained LegalClassifierDLModel.
- Parameters:
name (str) – Name of the pre-trained model, by default “legclf_argument_mining”
lang (str) – Language of the pre-trained model, by default “en”
remote_loc (str) – Remote location of the pre-trained model. If None, use the "clinical/models" (open-source) location. Other values are "finance/models" or "legal/models".
- Returns:
A pre-trained LegalClassifierDLModel.
- Return type:
LegalClassifierDLModel
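A hedged usage sketch follows (not from the original docs): the default legal model plugged into the same kind of pipeline as in the examples above, reusing documentAssembler and useEmbeddings and assuming data is a DataFrame with a text column. Note that the sentence-embedding annotator must match the storage reference the classifier was trained with, so UniversalSentenceEncoder is only an assumption for illustration here, not a guarantee for "legclf_argument_mining":
>>> legClassifier = LegalClassifierDLModel.pretrained("legclf_argument_mining", "en", "legal/models") \
...     .setInputCols(["sentence_embeddings"]) \
...     .setOutputCol("class")
>>> legalPipeline = Pipeline().setStages([documentAssembler, useEmbeddings, legClassifier])
>>> result = legalPipeline.fit(data).transform(data)
>>> result.select("class.result").show(truncate=False)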
- classmethod read()#
Returns an MLReader instance for this class.
- save(path: str) None #
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
- set(param: Param, value: Any) None #
Sets a parameter in the embedded param map.
- setConfigProtoBytes(b)#
Sets configProto from tensorflow, serialized into byte array.
- Parameters:
b (List[int]) – ConfigProto from tensorflow, serialized into byte array
- setInputCols(*value)#
Sets column names of input annotations.
- Parameters:
*value (List[str]) – Input columns for the annotator
- setLazyAnnotator(value)#
Sets whether Annotator should be evaluated lazily in a RecursivePipeline.
- Parameters:
value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline
- setOutputCol(value)#
Sets output column name of annotations.
- Parameters:
value (str) – Name of output column
- setParamValue(paramName)#
Sets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- setParams()#
- setStorageRef(value)#
Sets unique reference name for identification.
- Parameters:
value (str) – Unique reference name for identification
- transform(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = None) pyspark.sql.dataframe.DataFrame #
Transforms the input dataset with optional parameters.
New in version 1.3.0.
- Parameters:
dataset (
pyspark.sql.DataFrame
) – input datasetparams (dict, optional) – an optional param map that overrides embedded params.
- Returns:
transformed dataset
- Return type:
pyspark.sql.DataFrame
- write() JavaMLWriter #
Returns an MLWriter instance for this ML instance.