sparknlp_jsl.annotator.classification.document_ml_classifier
Module Contents#
Classes#
DocumentMLClassifierApproach
Trains a model to classify documents with a Logistic Regression algorithm. Training data requires columns for the text and its label.
DocumentMLClassifierModel
Classifies documents with a Logistic Regression algorithm.
- class DocumentMLClassifierApproach#
Bases: sparknlp_jsl.common.AnnotatorApproachInternal, DocumentMLClassifierParams
Trains a model to classify documents with a Logistic Regression algorithm. Training data requires columns for the text and its label. The result is a trained DocumentMLClassifierModel.
Input Annotation types
Output Annotation type
TOKEN
CATEGORY
- Parameters:
labelCol – Column containing the label we are trying to predict.
maxIter – Maximum number of iterations.
tol – Convergence tolerance after each iteration.
fitIntercept – Whether to fit an intercept term (default: true).
labels – Array used to output the label in its original form.
vectorizationModelPath – Path to the vectorization model, if it has already been trained.
classificationModelPath – Path to the classification model, if it has already been trained.
classificationModelClass – The SparkML classification class to use; possible values are: logreg, svm.
maxTokenNgram – Maximum number of tokens for Ngrams.
minTokenNgram – Minimum number of tokens for Ngrams.
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp_jsl.common import *
>>> from sparknlp.annotator import *
>>> from sparknlp.training import *
>>> import sparknlp_jsl
>>> from sparknlp_jsl.base import *
>>> from sparknlp_jsl.annotator import *
>>> from pyspark.ml import Pipeline
An example pipeline could then be defined like this:
>>> document_assembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols("document") \
...     .setOutputCol("token")
>>> normalizer = Normalizer() \
...     .setInputCols("token") \
...     .setOutputCol("normalized")
>>> stopwords_cleaner = StopWordsCleaner() \
...     .setInputCols("normalized") \
...     .setOutputCol("cleanTokens") \
...     .setCaseSensitive(False)
>>> stemmer = Stemmer() \
...     .setInputCols("cleanTokens") \
...     .setOutputCol("stem")
>>> gen_clf = DocumentMLClassifierApproach() \
...     .setLabelCol("category") \
...     .setInputCols("stem") \
...     .setOutputCol("prediction")
>>> pipeline = Pipeline().setStages([
...     document_assembler,
...     tokenizer,
...     normalizer,
...     stopwords_cleaner,
...     stemmer,
...     gen_clf])
>>> clf_model = pipeline.fit(data)
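Once fitted, the resulting model can be applied with transform to annotate new documents. A minimal usage sketch, assuming the same data DataFrame with a text column (the column names are illustrative):
>>> result = clf_model.transform(data)
>>> result.select("text", "prediction.result").show(truncate=False)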
- classificationModelClass#
- classificationModelPath#
- fitIntercept#
- getter_attrs = []#
- inputAnnotatorTypes#
- inputCols#
- labelCol#
- labels#
- lazyAnnotator#
- maxIter#
- maxTokenNgram#
- mergeChunks#
- minTokenNgram#
- optionalInputAnnotatorTypes = []#
- outputAnnotatorType#
- outputCol#
- skipLPInputColsValidation = True#
- tol#
- uid#
- vectorizationModelPath#
- clear(param: pyspark.ml.param.Param) None #
Clears a param from the param map if it has been explicitly set.
- copy(extra: pyspark.ml._typing.ParamMap | None = None) JP #
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
- Parameters:
extra (dict, optional) – Extra parameters to copy to the new instance
- Returns:
Copy of this instance
- Return type:
JavaParams
- explainParam(param: str | Param) str #
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams() str #
Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap #
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
- Parameters:
extra (dict, optional) – extra param values
- Returns:
merged param map
- Return type:
dict
- fit(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = ...) M #
- fit(dataset: pyspark.sql.dataframe.DataFrame, params: List[pyspark.ml._typing.ParamMap] | Tuple[pyspark.ml._typing.ParamMap]) List[M]
Fits a model to the input dataset with optional parameters.
New in version 1.3.0.
- Parameters:
dataset (pyspark.sql.DataFrame) – input dataset.
params (dict or list or tuple, optional) – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
- Returns:
fitted model(s)
- Return type:
Transformer or a list of Transformer
- fitMultiple(dataset: pyspark.sql.dataframe.DataFrame, paramMaps: Sequence[pyspark.ml._typing.ParamMap]) Iterator[Tuple[int, M]] #
Fits a model to the input dataset for each param map in paramMaps.
New in version 2.3.0.
- Parameters:
dataset (pyspark.sql.DataFrame) – input dataset.
paramMaps (collections.abc.Sequence) – A Sequence of param maps.
- Returns:
A thread safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.
- Return type:
_FitMultipleIterator
- getInputCols()#
Gets current column names of input annotations.
- getLazyAnnotator()#
Gets whether Annotator should be evaluated lazily in a RecursivePipeline.
- getOrDefault(param: str) Any #
- getOrDefault(param: Param[T]) T
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getOutputCol()#
Gets output column name of annotations.
- getParam(paramName: str) Param #
Gets a param by its name.
- getParamValue(paramName)#
Gets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- hasDefault(param: str | Param[Any]) bool #
Checks whether a param has a default value.
- hasParam(paramName: str) bool #
Tests whether this instance contains a param with a given (string) name.
- inputColsValidation(value)#
- isDefined(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user or has a default value.
- isSet(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user.
- classmethod load(path: str) RL #
Reads an ML instance from the input path, a shortcut of read().load(path).
- classmethod read()#
Returns an MLReader instance for this class.
- save(path: str) None #
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
- set(param: Param, value: Any) None #
Sets a parameter in the embedded param map.
- setClassificationModelClass(value)#
Sets the SparkML classification model class to use; possible values are: logreg, svm.
- Parameters:
value (str) – The SparkML classification class to use; possible values are: logreg, svm.
- setClassificationModelPath(value)#
Sets the path to the classification model, if it has already been trained.
- Parameters:
value (str) – Path to the classification model, if it has already been trained.
- setFitIntercept(merge)#
Sets whether to fit an intercept term (default: true).
- Parameters:
merge (bool) – Whether to fit an intercept term (default: true).
- setForceInputTypeValidation(etfm)#
- setInputCols(*value)#
Sets column names of input annotations.
- Parameters:
*value (List[str]) – Input columns for the annotator
- setLabelCol(label)#
Sets the column containing the label we are trying to predict.
- Parameters:
label (str) – Column containing the label we are trying to predict.
- setLabels(value)#
Sets the array used to output the label in its original form.
- Parameters:
value (list) – Array used to output the label in its original form.
- setLazyAnnotator(value)#
Sets whether Annotator should be evaluated lazily in a RecursivePipeline.
- Parameters:
value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline
- setMaxIter(k)#
Sets maximum number of iterations.
- Parameters:
k (int) – Maximum number of iterations.
- setMaxTokenNgram(k)#
Sets maximum number of tokens for Ngrams.
- Parameters:
k (int) – Maximum number of tokens for Ngrams.
- setMergeChunks(merge)#
Sets whether to merge all chunks in a document or not (default: false).
- Parameters:
merge (bool) – Whether to merge all chunks in a document or not (default: false).
- setMinTokenNgram(k)#
Sets minimum number of tokens for Ngrams.
- Parameters:
k (int) – Minimum number of tokens for Ngrams.
- setOutputCol(value)#
Sets output column name of annotations.
- Parameters:
value (str) – Name of output column
- setParamValue(paramName)#
Sets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- setTol(dist)#
Sets convergence tolerance after each iteration.
- Parameters:
dist (float) – Convergence tolerance after each iteration.
- setVectorizationModelPath(value)#
Sets the path to the vectorization model, if it has already been trained.
- Parameters:
value (str) – Path to the vectorization model, if it has already been trained.
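When a vectorization or classification model has already been trained and saved, the approach can reuse it instead of training those parts from scratch. A minimal sketch, assuming the paths below point to previously saved models (the paths are placeholders, not real resources):
>>> clf_reuse = DocumentMLClassifierApproach() \
...     .setInputCols("stem") \
...     .setOutputCol("prediction") \
...     .setLabelCol("category") \
...     .setVectorizationModelPath("file:///tmp/vectorization_model") \
...     .setClassificationModelPath("file:///tmp/classification_model")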
- write() JavaMLWriter #
Returns an MLWriter instance for this ML instance.
- class DocumentMLClassifierModel(classname='com.johnsnowlabs.nlp.annotators.classification.DocumentMLClassifierModel', java_model=None)#
Bases: sparknlp_jsl.common.AnnotatorModelInternal, DocumentMLClassifierParams
Classifies documents with a Logistic Regression algorithm.
Input Annotation types
Output Annotation type
TOKEN
CATEGORY
- Parameters:
mergeChunks – Whether to merge all chunks in a document or not (Default: false)
labels – Array to output the label in the original form.
vectorizationModel – Vectorization model, if it has already been trained.
classificationModel – Classification model, if it has already been trained.
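A minimal inference sketch with the pretrained classifierml_ade model, assuming a DataFrame data with a text column (the pipeline stages and column names are illustrative):
>>> document_assembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols("document") \
...     .setOutputCol("token")
>>> classifier_ml = DocumentMLClassifierModel.pretrained("classifierml_ade", "en", "clinical/models") \
...     .setInputCols("token") \
...     .setOutputCol("prediction")
>>> pipeline = Pipeline().setStages([document_assembler, tokenizer, classifier_ml])
>>> result = pipeline.fit(data).transform(data)
>>> result.select("text", "prediction.result").show(truncate=False)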
- classificationModel#
- getter_attrs = []#
- inputAnnotatorTypes#
- inputCols#
- labels#
- lazyAnnotator#
- maxTokenNgram#
- mergeChunks#
- minTokenNgram#
- name = 'DocumentMLClassifierModel'#
- optionalInputAnnotatorTypes = []#
- outputAnnotatorType#
- outputCol#
- skipLPInputColsValidation = True#
- uid#
- vectorizationModel#
- clear(param: pyspark.ml.param.Param) None #
Clears a param from the param map if it has been explicitly set.
- copy(extra: pyspark.ml._typing.ParamMap | None = None) JP #
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
- Parameters:
extra (dict, optional) – Extra parameters to copy to the new instance
- Returns:
Copy of this instance
- Return type:
JavaParams
- explainParam(param: str | Param) str #
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams() str #
Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap #
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
- Parameters:
extra (dict, optional) – extra param values
- Returns:
merged param map
- Return type:
dict
- getInputCols()#
Gets current column names of input annotations.
- getLazyAnnotator()#
Gets whether Annotator should be evaluated lazily in a RecursivePipeline.
- getOrDefault(param: str) Any #
- getOrDefault(param: Param[T]) T
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getOutputCol()#
Gets output column name of annotations.
- getParam(paramName: str) Param #
Gets a param by its name.
- getParamValue(paramName)#
Gets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- hasDefault(param: str | Param[Any]) bool #
Checks whether a param has a default value.
- hasParam(paramName: str) bool #
Tests whether this instance contains a param with a given (string) name.
- inputColsValidation(value)#
- isDefined(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user or has a default value.
- isSet(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user.
- classmethod load(path: str) RL #
Reads an ML instance from the input path, a shortcut of read().load(path).
- static pretrained(name='classifierml_ade', lang='en', remote_loc='clinical/models')#
Downloads and loads a pretrained model.
- Parameters:
name (str, optional) – Name of the pretrained model, by default “classifierml_ade”
lang (str, optional) – Language of the pretrained model, by default “en”
remote_loc (str, optional) – Optional remote address of the resource, by default “clinical/models”. Will use Spark NLP's repositories otherwise.
- Returns:
The restored model
- Return type:
DocumentMLClassifierModel
- classmethod read()#
Returns an MLReader instance for this class.
- save(path: str) None #
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
- set(param: Param, value: Any) None #
Sets a parameter in the embedded param map.
- setClassificationModel(model)#
Sets the classification model, if it has already been trained.
- Parameters:
model (pyspark.ml.PipelineModel) – Classification model, if it has already been trained.
- setForceInputTypeValidation(etfm)#
- setInputCols(*value)#
Sets column names of input annotations.
- Parameters:
*value (List[str]) – Input columns for the annotator
- setLabels(value)#
Sets the array used to output the label in its original form.
- Parameters:
value (list) – Array used to output the label in its original form.
- setLazyAnnotator(value)#
Sets whether Annotator should be evaluated lazily in a RecursivePipeline.
- Parameters:
value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline
- setMergeChunks(merge)#
Sets whether to merge all chunks in a document or not (default: false).
- Parameters:
merge (bool) – Whether to merge all chunks in a document or not (default: false).
- setOutputCol(value)#
Sets output column name of annotations.
- Parameters:
value (str) – Name of output column
- setParamValue(paramName)#
Sets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- setParams()#
- setVectorizationModel(model)#
Sets the vectorization model, if it has already been trained.
- Parameters:
model (pyspark.ml.PipelineModel) – Vectorization model, if it has already been trained.
- transform(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = None) pyspark.sql.dataframe.DataFrame #
Transforms the input dataset with optional parameters.
New in version 1.3.0.
- Parameters:
dataset (pyspark.sql.DataFrame) – input dataset.
params (dict, optional) – an optional param map that overrides embedded params.
- Returns:
transformed dataset
- Return type:
pyspark.sql.DataFrame
- write() JavaMLWriter #
Returns an MLWriter instance for this ML instance.
- class DocumentMLClassifierParams#
- labels#
- maxTokenNgram#
- mergeChunks#
- minTokenNgram#
- setLabels(value)#
Sets the array used to output the label in its original form.
- Parameters:
value (list) – Array used to output the label in its original form.
- setMergeChunks(merge)#
Sets whether to merge all chunks in a document or not (default: false).
- Parameters:
merge (bool) – Whether to merge all chunks in a document or not (default: false).