sparknlp_jsl.annotator.tf_graph_builder#

Module Contents#

Classes#

TFGraphBuilder

Creates TensorFlow graphs.

TFGraphBuilderModel

Creates TensorFlow graphs.

class TFGraphBuilder#

Bases: pyspark.ml.Estimator, pyspark.ml.util.DefaultParamsWritable, pyspark.ml.util.DefaultParamsReadable

Creates TensorFlow graphs. This class is used to build a TensorFlow graph from a given model name and a set of input columns.

A TensorFlow graph is a data structure that represents a computation. It contains a set of tf.Operation objects, which represent units of computation, and tf.Tensor objects, which represent the data that flows between those operations. Graphs are defined in a tf.Graph context and, because they are data structures, they can be saved, run, and restored without the original Python code.

In deep learning models, the nodes of the graph represent the layers of the model and the edges represent the connections between the layers. The central idea of TensorFlow is that all numerical computation is expressed as such a computational graph: anything that happens in your model is represented by it.

Examples

>>> graph_folder = "./tf_graphs"
>>> re_graph_folder = TFGraphBuilder() \
...     .setModelName("relation_extraction") \
...     .setInputCols(["embeddings", "pos_tags", "train_ner_chunks", "dependencies"]) \
...     .setLabelColumn("rel") \
...     .setGraphFolder(graph_folder) \
...     .setGraphFile("re_graph.pb") \
...     .setHiddenLayers([300,200]) \
...     .setHiddenAct("relu") \
...     .setHiddenActL2(True) \
...     .setHiddenWeightsL2(False) \
...     .setBatchNorm(False)
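
A common usage pattern, sketched below under assumptions (the MedicalNerApproach stage, its column names, and the "ner_dl" model name are illustrative, not part of this reference), is to place the builder in a Pipeline immediately before the training annotator that will consume the generated graph, pointing both stages at the same graph folder:

>>> from pyspark.ml import Pipeline
>>> from sparknlp_jsl.annotator import TFGraphBuilder, MedicalNerApproach
>>> graph_folder = "./medical_ner_graphs"
>>> ner_graph_builder = TFGraphBuilder() \
...     .setModelName("ner_dl") \
...     .setInputCols(["sentence", "token", "embeddings"]) \
...     .setLabelColumn("label") \
...     .setGraphFolder(graph_folder) \
...     .setGraphFile("auto") \
...     .setIsLicensed(True)
>>> ner = MedicalNerApproach() \
...     .setInputCols(["sentence", "token", "embeddings"]) \
...     .setLabelColumn("label") \
...     .setOutputCol("ner") \
...     .setGraphFolder(graph_folder)
>>> pipeline = Pipeline(stages=[ner_graph_builder, ner])
>>> # pipeline.fit(training_data) builds the graph first, then trains the NER model with it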
batchNorm#
graphFile#
graphFolder#
hiddenAct#
hiddenActL2#
hiddenLayers#
hiddenUnitsNumber#
hiddenWeightsL2#
inputCols#
isLicensed#
labelColumn#
maxSequenceLength#
modelName#
outputAct#
uid#
useRelationDirection#
clear(param: Param) None#

Clears a param from the param map if it has been explicitly set.

copy(extra: pyspark.ml._typing.ParamMap | None = None) P#

Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.

Parameters:

extra (dict, optional) – Extra parameters to copy to the new instance

Returns:

Copy of this instance

Return type:

Params

explainParam(param: str | Param) str#

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams() str#

Returns the documentation of all params with their optionally default values and user-supplied values.

extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap#

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:

extra (dict, optional) – extra param values

Returns:

merged param map

Return type:

dict

fit(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = ...) M#
fit(dataset: pyspark.sql.dataframe.DataFrame, params: List[pyspark.ml._typing.ParamMap] | Tuple[pyspark.ml._typing.ParamMap]) List[M]

Fits a model to the input dataset with optional parameters.

New in version 1.3.0.

Parameters:
  • dataset (pyspark.sql.DataFrame) – input dataset.

  • params (dict or list or tuple, optional) – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.

Returns:

fitted model(s)

Return type:

Transformer or a list of Transformer
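
For the relation extraction builder shown in the example above, a single fit call on the training DataFrame generates the graph and writes it into the configured folder (a minimal sketch; training_data is assumed to already contain the configured input and label columns):

>>> graph_model = re_graph_folder.fit(training_data)
>>> # the graph is written to ./tf_graphs/re_graph.pb during fit; graph_model is a TFGraphBuilderModel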

fitMultiple(dataset: pyspark.sql.dataframe.DataFrame, paramMaps: Sequence[pyspark.ml._typing.ParamMap]) Iterator[Tuple[int, M]]#

Fits a model to the input dataset for each param map in paramMaps.

New in version 2.3.0.

Parameters:
  • dataset (pyspark.sql.DataFrame) – input dataset.

  • paramMaps (collections.abc.Sequence) – A Sequence of param maps.

Returns:

A thread safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.

Return type:

_FitMultipleIterator

getBatchNorm()#

Gets whether batch normalization is enabled, used in RelationExtractionApproach.

getGraphFile()#

Gets the graph file name.

getGraphFolder()#

Gets the graph folder.

getHiddenAct()#

Gets the activation function for hidden layers, used in RelationExtractionApproach.

getHiddenActL2()#

Gets whether L2 regularization of hidden layer activations is enabled, used in RelationExtractionApproach.

getHiddenLayers()#

Gets the list of hidden layer sizes for RelationExtractionApproach.

getHiddenUnitsNumber()#

Gets the number of hidden units for AssertionDLApproach and MedicalNerApproach.

getHiddenWeightsL2()#

Gets whether L2 regularization of hidden layer weights is enabled, used in RelationExtractionApproach.

getInputCols()#

Gets current column names of input annotations.

getIsLicensed()#

Gets whether a licensed medical model is used, as in MedicalNerApproach and NerDLApproach.

getLabelColumn()#

Gets the name of the label column.

getMaxSequenceLength()#

Gets the maximum sequence length for AssertionDLApproach.

getModelName()#

Gets the name of the model.

getOrDefault(param: str) Any#
getOrDefault(param: Param[T]) T

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputAct()#

Gets the output activation function, used in GenericClassifier.

getParam(paramName: str) Param#

Gets a param by its name.

getUseRelationDirection()#

Checks whether relation direction is encoded in RelationExtractionApproach.

hasDefault(param: str | Param[Any]) bool#

Checks whether a param has a default value.

hasParam(paramName: str) bool#

Tests whether this instance contains a param with a given (string) name.

isDefined(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user or has a default value.

isSet(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user.

classmethod load(path: str) RL#

Reads an ML instance from the input path, a shortcut of read().load(path).

classmethod read() DefaultParamsReader[RL]#

Returns a DefaultParamsReader instance for this class.

save(path: str) None#

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param: Param, value: Any) None#

Sets a parameter in the embedded param map.

setBatchNorm(value)#

Sets whether to use batch normalization, used in RelationExtractionApproach. Batch normalization is a technique that helps improve the stability and performance of deep neural networks by normalizing the activations of each layer, which prevents the network from becoming too sensitive to changes in the input data and can lead to faster training and better accuracy.

Parameters:

value (boolean) – Batch normalization for RelationExtractionApproach

setGraphFile(value)#

Sets the graph file name.

Parameters:

value (str) – Graph file name. If set to “auto”, the graph builder will use the model-specific default graph file name.

setGraphFolder(value)#

Sets the folder path that contains the external graph files.

Parameters:

value (str) – Folder path that contains the external graph files.

setHiddenAct(value)#

Activation function for hidden layers, used in RelationExtractionApproach.

Parameters:

value (string) – Activation function for hidden layers, used in RelationExtractionApproach. Possible values are: relu, sigmoid, tanh, linear

setHiddenActL2(value)#

Sets whether to apply L2 regularization to hidden layer activations, used in RelationExtractionApproach. L2 regularization is a technique used to prevent overfitting in machine learning models, including deep neural networks. It adds a penalty term to the loss function that encourages the regularized quantities, here the hidden layer activations, to take smaller values.

Parameters:

value (boolean) – L2 regularization of hidden layer activations, used in RelationExtractionApproach
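
As a reminder of what this flag controls: L2 regularization adds a squared penalty to the training loss. For hidden activations \(a_i\) and regularization strength \(\lambda\), the penalized loss has the general form (the notation is illustrative and not tied to the generated graph):

\[\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \lambda \sum_i a_i^2\]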

setHiddenLayers(value)#

Sets the list of hidden layer sizes for RelationExtractionApproach. Used to set the number and size of the hidden layers in the model: the number of entries determines how many hidden layers the model has (and hence its complexity), and each entry gives the number of neurons in that layer.

Parameters:

value (list[int]) – A list of hidden layer sizes for RelationExtractionApproach
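
For instance (the layer sizes are arbitrary illustrations), the call below configures two hidden layers with 300 and 200 neurons respectively:

>>> re_builder = TFGraphBuilder() \
...     .setModelName("relation_extraction") \
...     .setHiddenLayers([300, 200])  # first hidden layer: 300 units, second: 200 units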

setHiddenUnitsNumber(value)#

Sets the number of hidden units for AssertionDLApproach and MedicalNerApproach, i.e., the number of neurons in the hidden layers of the model. Increase or decrease this value depending on the complexity of the problem the network is being trained to solve.

Parameters:

value (int) – Number of hidden units for AssertionDLApproach and MedicalNerApproach

setHiddenWeightsL2(value)#

Sets whether to apply L2 regularization to hidden layer weights, used in RelationExtractionApproach.

Parameters:

value (boolean) – L2 regularization of hidden layer weights, used in RelationExtractionApproach

setInputCols(*value)#

Sets column names of input annotations.

Parameters:

*value (str) – Input columns for the annotator

setIsLicensed(value)#

Sets whether the model is a licensed medical model, used in MedicalNerApproach and NerDLApproach.

Parameters:

value (boolean) – Whether the model is a licensed medical model (default is True)

setLabelColumn(value)#

Sets the name of the column for data labels.

Parameters:

value (str) – Column for data labels

setMaxSequenceLength(value)#

Sets the maximum sequence length for AssertionDLApproach. Used to set the maximum length of a sequence (of tokens) that the model can handle. If the value is set too low, longer sequences may be truncated, potentially losing valuable information. On the other hand, setting the value too high may consume excessive memory and slow down the training or inference process.

Parameters:

value (int) – Maximum sequence length for AssertionDLApproach
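
For example (the value and the "assertion_dl" model name are illustrative assumptions), limiting the assertion graph to sequences of at most 250 tokens:

>>> assertion_graph_builder = TFGraphBuilder() \
...     .setModelName("assertion_dl") \
...     .setMaxSequenceLength(250)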

setModelName(value)#

Sets the model name.

Parameters:

value (str) – Model name

setUseRelationDirection(value)#

Sets whether relation direction will be encoded in RelationExtractionApproach.

Parameters:

value (bool) – Whether relation direction is encoded in RelationExtractionApproach

settOutputAct(value)#

Sets the output activation function, used in GenericClassifier.

Parameters:

value (str) – Output activation function: softmax or sigmoid

write() MLWriter#

Returns a DefaultParamsWriter instance for this class.

class TFGraphBuilderModel#

Bases: pyspark.ml.Model, pyspark.ml.util.DefaultParamsWritable, pyspark.ml.util.DefaultParamsReadable

Creates TensorFlow graphs.

This class is used to build a TensorFlow graph from a given model name and a set of input columns.

uid#
clear(param: Param) None#

Clears a param from the param map if it has been explicitly set.

copy(extra: pyspark.ml._typing.ParamMap | None = None) P#

Creates a copy of this instance with the same uid and some extra params. The default implementation creates a shallow copy using copy.copy(), and then copies the embedded and extra parameters over and returns the copy. Subclasses should override this method if the default approach is not sufficient.

Parameters:

extra (dict, optional) – Extra parameters to copy to the new instance

Returns:

Copy of this instance

Return type:

Params

explainParam(param: str | Param) str#

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams() str#

Returns the documentation of all params with their optionally default values and user-supplied values.

extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap#

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:

extra (dict, optional) – extra param values

Returns:

merged param map

Return type:

dict

getOrDefault(param: str) Any#
getOrDefault(param: Param[T]) T

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getParam(paramName: str) Param#

Gets a param by its name.

hasDefault(param: str | Param[Any]) bool#

Checks whether a param has a default value.

hasParam(paramName: str) bool#

Tests whether this instance contains a param with a given (string) name.

isDefined(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user or has a default value.

isSet(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user.

classmethod load(path: str) RL#

Reads an ML instance from the input path, a shortcut of read().load(path).

classmethod read() DefaultParamsReader[RL]#

Returns a DefaultParamsReader instance for this class.

save(path: str) None#

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param: Param, value: Any) None#

Sets a parameter in the embedded param map.

transform(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = None) pyspark.sql.dataframe.DataFrame#

Transforms the input dataset with optional parameters.

New in version 1.3.0.

Parameters:
  • dataset (pyspark.sql.DataFrame) – input dataset

  • params (dict, optional) – an optional param map that overrides embedded params.

Returns:

transformed dataset

Return type:

pyspark.sql.DataFrame

write() MLWriter#

Returns a DefaultParamsWriter instance for this class.