sparknlp_jsl.annotator.re.relation_extraction#

Module Contents#

Classes#

GenericREModel

Generic Relation Extraction Model.

PosologyREModel

Model for Posology Relation Extraction.

RelationExtractionApproach

Trains a TensorFlow model for relation extraction.

RelationExtractionModel

Extracts and classifies instances of relations between named entities.

class GenericREModel(classname='com.johnsnowlabs.nlp.annotators.re.GenericREModel', java_model=None)#

Bases: RelationExtractionModel

Generic Relation Extraction Model.

Instantiated RelationExtractionModel for extracting relationships between any entities. This class is not intended to be used directly; use RelationExtractionModel instead. Pairs of entities should be specified using setRelationPairs.

classes#
customLabels#
datasetInfo#
directionSensitive#
doExceptionHandling#
featureScaling#
filterByTokenDistance#
getter_attrs = []#
inputAnnotatorTypes#
inputCols#
lazyAnnotator#
maxSyntacticDistance#
multiClass#
name = 'RelationExtractionModel'#
optionalInputAnnotatorTypes = []#
outputAnnotatorType#
outputCol#
predictionThreshold#
relationPairs#
relationPairsCaseSensitive#
scopeWindow#
skipLPInputColsValidation = True#
uid#
clear(param: pyspark.ml.param.Param) None#

Clears a param from the param map if it has been explicitly set.

copy(extra: pyspark.ml._typing.ParamMap | None = None) JP#

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters:

extra (dict, optional) – Extra parameters to copy to the new instance

Returns:

Copy of this instance

Return type:

JavaParams

explainParam(param: str | Param) str#

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams() str#

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap#

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:

extra (dict, optional) – extra param values

Returns:

merged param map

Return type:

dict
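The precedence described for extractParamMap can be pictured with plain dictionaries. This is a minimal sketch, not pyspark's implementation: `merge_param_maps` is a hypothetical helper, and real param maps are keyed by `Param` objects rather than strings.

```python
def merge_param_maps(defaults, user_supplied, extra=None):
    """Merge three param dictionaries; later sources win on conflicts."""
    merged = dict(defaults)        # lowest priority: default values
    merged.update(user_supplied)   # user-supplied values override defaults
    merged.update(extra or {})     # extra values override everything else
    return merged


# Hypothetical values, keyed by param name for readability.
defaults = {"maxSyntacticDistance": 0, "multiClass": False}
user_supplied = {"maxSyntacticDistance": 4}
extra = {"multiClass": True}

merged = merge_param_maps(defaults, user_supplied, extra)
print(merged)
```

The same ordering applies to the `params` argument of transform, which overrides the embedded params for that call only.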

getClasses()#

Returns labels used to train this model.

getInputCols()#

Gets current column names of input annotations.

getLazyAnnotator()#

Gets whether Annotator should be evaluated lazily in a RecursivePipeline.

getOrDefault(param: str) Any#
getOrDefault(param: Param[T]) T

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()#

Gets output column name of annotations.

getParam(paramName: str) Param#

Gets a param by its name.

getParamValue(paramName)#

Gets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

getRelationTypePerPair()#

Return the list of entity pairs allowed for a given relation.

hasDefault(param: str | Param[Any]) bool#

Checks whether a param has a default value.

hasParam(paramName: str) bool#

Tests whether this instance contains a param with a given (string) name.

inputColsValidation(value)#
isDefined(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user or has a default value.

isSet(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user.

classmethod load(path: str) RL#

Reads an ML instance from the input path, a shortcut of read().load(path).

static pretrained(name='posology_re', lang='en', remote_loc='clinical/models')#

Download a pre-trained RelationExtractionModel.

Parameters:
  • name (str) – Name of the pre-trained model, by default “posology_re”

  • lang (str) – Language of the pre-trained model, by default “en”

  • remote_loc (str) – Remote location of the pre-trained model. If None, use the open-source location. Other values are “clinical/models”, “finance/models”, or “legal/models”.

Returns:

A pre-trained RelationExtractionModel.

Return type:

RelationExtractionModel

classmethod read()#

Returns an MLReader instance for this class.

save(path: str) None#

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param: Param, value: Any) None#

Sets a parameter in the embedded param map.

setCustomLabels(labels: dict)#

Sets custom relation labels

Parameters:

labels (dict[str, str]) – Dictionary which maps old to new labels
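As a rough illustration of what an old-to-new label dictionary does, the sketch below renames predicted relation labels in plain Python. It is not the annotator's code: `apply_custom_labels` is a hypothetical helper, the new label names are made up, and leaving unmapped labels unchanged is an assumption.

```python
def apply_custom_labels(predicted_labels, custom_labels):
    """Rename labels using an old -> new mapping.

    Assumption: labels without an entry in the mapping are kept as they are.
    """
    return [custom_labels.get(label, label) for label in predicted_labels]


# Hypothetical mapping from original labels to custom ones.
custom_labels = {"DRUG-DOSAGE": "dosage_of", "DRUG-ADE": "adverse_event_of"}
predictions = ["DRUG-DOSAGE", "DRUG-ROUTE", "DRUG-ADE"]

renamed = apply_custom_labels(predictions, custom_labels)
print(renamed)
```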

setDatasetInfo(info: str)#

Sets descriptive information about the dataset being used.

Parameters:

info (str) – Descriptive information about the dataset being used.

setDirectionSensitive(value: bool)#

Sets whether relation pairs are direction sensitive. If True, only relations in the form “ENTITY1-ENTITY2” are considered. If False, both “ENTITY1-ENTITY2” and “ENTITY2-ENTITY1” relations are considered.

Parameters:

value (bool) – Whether relation pairs are direction sensitive. Default is ‘False’
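The effect of direction sensitivity on pair matching can be sketched in plain Python. This is an illustration under stated assumptions, not the annotator's implementation: `pair_is_allowed` is a hypothetical helper, and the `case_sensitive` flag (with its default) only mimics what relationPairsCaseSensitive is described as controlling.

```python
def pair_is_allowed(first_label, second_label, relation_pairs,
                    direction_sensitive=False, case_sensitive=True):
    """Check whether two entity labels match a dash-separated relation pair."""
    def normalize(text):
        return text if case_sensitive else text.lower()

    allowed = {normalize(pair) for pair in relation_pairs}
    forward = normalize(f"{first_label}-{second_label}")
    backward = normalize(f"{second_label}-{first_label}")

    if forward in allowed:
        return True
    # The reversed order only counts when direction is ignored.
    return (not direction_sensitive) and backward in allowed


pairs = ["DRUG-DOSAGE"]
print(pair_is_allowed("DOSAGE", "DRUG", pairs))                            # direction ignored
print(pair_is_allowed("DOSAGE", "DRUG", pairs, direction_sensitive=True))  # direction enforced
```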

setDoExceptionHandling(value: bool)#

If True, exceptions are handled. If data that causes an exception is passed to the model, an error annotation containing the exception message is emitted and processing continues with the next item. This comes with a performance penalty.

Parameters:

value (bool) – If True, exceptions are handled.

setFeatureScaling(feature_scaling: str)#

Sets Feature scaling method.

Possible values are ‘zscore’, ‘minmax’ or empty (no scaling).

Parameters:

feature_scaling (str) – Feature scaling method. Possible values are ‘zscore’, ‘minmax’ or empty (no scaling).

setFilterByTokenDistance(distance: int)#
Sets the number of tokens between entities used as a filtering criterion (Default: 0).

The model only finds relations that have fewer than the specified number of tokens between them.

Parameters:

distance (int) – Filtering criterion based on the number of tokens between entities (Default: 0)
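The filtering rule can be pictured as follows. This is a sketch only: the helper names are hypothetical, entity spans are assumed to be inclusive token indices, and treating the default of 0 as "no filtering" is an assumption that this page does not state.

```python
def tokens_between(first_span, second_span):
    """Count tokens strictly between two (start_token, end_token) spans."""
    left, right = sorted([first_span, second_span])
    return max(0, right[0] - left[1] - 1)


def filter_by_token_distance(candidate_pairs, distance):
    """Keep pairs with fewer than `distance` tokens between the entities."""
    if distance <= 0:  # assumption: 0 disables the filter
        return list(candidate_pairs)
    return [
        (first, second)
        for first, second in candidate_pairs
        if tokens_between(first, second) < distance
    ]


# Two candidate pairs: one token apart and eight tokens apart.
candidates = [((0, 0), (2, 2)), ((0, 0), (9, 10))]
kept = filter_by_token_distance(candidates, distance=5)
print(kept)
```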

setForceInputTypeValidation(etfm)#
setInputCols(*value)#

Sets column names of input annotations.

Parameters:

*value (List[str]) – Input columns for the annotator

setLazyAnnotator(value)#

Sets whether Annotator should be evaluated lazily in a RecursivePipeline.

Parameters:

value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline

setMaxSyntacticDistance(distance)#

Sets maximal syntactic distance, as threshold (Default: 0).

Determines how far the “from entity” can be from the “to entity” in the text. Increasing this value will increase recall, but also increase the number of false positives.

Parameters:

distance (int) – Maximal syntactic distance, as threshold (Default: 0)

setMultiClass(value: bool)#

Sets the model in multi class prediction mode (Default: false).

Parameters:

value (bool) – Whether to return only the label with the highest confidence score or all labels
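The difference between the two prediction modes can be sketched with a dictionary of confidence scores. This is an illustration, not the model's code: `select_labels` is a hypothetical helper, the scores are invented, and how the annotator exposes additional labels in its output is not specified on this page.

```python
def select_labels(scores, multi_class=False):
    """Return (label, score) tuples ranked by confidence.

    With multi_class=False only the top-scoring label is returned;
    with multi_class=True every label is returned.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked if multi_class else ranked[:1]


# Invented confidence scores for a single candidate relation.
scores = {"DRUG-STRENGTH": 0.15, "DRUG-DOSAGE": 0.81, "O": 0.04}

print(select_labels(scores))
print(select_labels(scores, multi_class=True))
```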

setOutputCol(value)#

Sets output column name of annotations.

Parameters:

value (str) – Name of output column

setParamValue(paramName)#

Sets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

setParams()#
setPredictionThreshold(threshold: float)#

Sets the minimal activation of the target unit to encode a new relation instance.

Parameters:

threshold (float) – Minimal activation of the target unit to encode a new relation instance

setRelationPairs(pairs)#

Sets the list of dash-separated pairs of named entities to be processed.

Parameters:

pairs (str or list[str]) – List of dash-separated pairs of named entities to be processed.

setRelationPairsCaseSensitive(value: bool)#

Sets the case sensitivity of relation pairs

Parameters:

value (bool) – whether relation pairs are case sensitive

setRelationTypePerPair(relationTypePairs)#

Set the list of entity pairs allowed for a given relation.

Parameters:

relationTypePairs (dict[str, list[str]]) – Dictionary which maps relation types to a list of entity pairs.

setScopeWindow(value)#

Sets the scope of the window of the feature generation for relations.

Parameters:

value ([int, int]) – Left and right offset of the scope window. Offsets must be non-negative values.

transform(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = None) pyspark.sql.dataframe.DataFrame#

Transforms the input dataset with optional parameters.

New in version 1.3.0.

Parameters:
  • dataset (pyspark.sql.DataFrame) – input dataset

  • params (dict, optional) – an optional param map that overrides embedded params.

Returns:

transformed dataset

Return type:

pyspark.sql.DataFrame

write() JavaMLWriter#

Returns an MLWriter instance for this ML instance.

class PosologyREModel(classname='com.johnsnowlabs.nlp.annotators.re.PosologyREModel', java_model=None)#

Bases: RelationExtractionModel

Model for Posology Relation Extraction.

Instantiated RelationExtractionModel for extracting relationships between different recognized drug entities. This class is not intended to be used directly; use RelationExtractionModel instead. Possible relation pairs are “DRUG-DOSAGE”, “DRUG-ADE”, “DRUG-FORM”, “DRUG-FREQUENCY”, “DRUG-ROUTE”, “DRUG-REASON”, “DRUG-STRENGTH”, “DRUG-DURATION”.

classes#
customLabels#
datasetInfo#
directionSensitive#
doExceptionHandling#
featureScaling#
filterByTokenDistance#
getter_attrs = []#
inputAnnotatorTypes#
inputCols#
lazyAnnotator#
maxSyntacticDistance#
multiClass#
name = 'RelationExtractionModel'#
optionalInputAnnotatorTypes = []#
outputAnnotatorType#
outputCol#
predictionThreshold#
relationPairs#
relationPairsCaseSensitive#
scopeWindow#
skipLPInputColsValidation = True#
uid#
clear(param: pyspark.ml.param.Param) None#

Clears a param from the param map if it has been explicitly set.

copy(extra: pyspark.ml._typing.ParamMap | None = None) JP#

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters:

extra (dict, optional) – Extra parameters to copy to the new instance

Returns:

Copy of this instance

Return type:

JavaParams

explainParam(param: str | Param) str#

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams() str#

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap#

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:

extra (dict, optional) – extra param values

Returns:

merged param map

Return type:

dict

getClasses()#

Returns labels used to train this model.

getInputCols()#

Gets current column names of input annotations.

getLazyAnnotator()#

Gets whether Annotator should be evaluated lazily in a RecursivePipeline.

getOrDefault(param: str) Any#
getOrDefault(param: Param[T]) T

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()#

Gets output column name of annotations.

getParam(paramName: str) Param#

Gets a param by its name.

getParamValue(paramName)#

Gets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

getRelationTypePerPair()#

Return the list of entity pairs allowed for a given relation.

hasDefault(param: str | Param[Any]) bool#

Checks whether a param has a default value.

hasParam(paramName: str) bool#

Tests whether this instance contains a param with a given (string) name.

inputColsValidation(value)#
isDefined(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user or has a default value.

isSet(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user.

classmethod load(path: str) RL#

Reads an ML instance from the input path, a shortcut of read().load(path).

static pretrained(name='posology_re', lang='en', remote_loc='clinical/models')#

Download a pre-trained RelationExtractionModel.

Parameters:
  • name (str) – Name of the pre-trained model, by default “posology_re”

  • lang (str) – Language of the pre-trained model, by default “en”

  • remote_loc (str) – Remote location of the pre-trained model. If None, use the open-source location. Other values are “clinical/models”, “finance/models”, or “legal/models”.

Returns:

A pre-trained RelationExtractionModel.

Return type:

RelationExtractionModel

classmethod read()#

Returns an MLReader instance for this class.

save(path: str) None#

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param: Param, value: Any) None#

Sets a parameter in the embedded param map.

setCustomLabels(labels: dict)#

Sets custom relation labels

Parameters:

labels (dict[str, str]) – Dictionary which maps old to new labels

setDatasetInfo(info: str)#

Sets descriptive information about the dataset being used.

Parameters:

info (str) – Descriptive information about the dataset being used.

setDirectionSensitive(value: bool)#

Sets whether relation pairs are direction sensitive. If True, only relations in the form “ENTITY1-ENTITY2” are considered. If False, both “ENTITY1-ENTITY2” and “ENTITY2-ENTITY1” relations are considered.

Parameters:

value (bool) – Whether relation pairs are direction sensitive. Default is ‘False’

setDoExceptionHandling(value: bool)#

If True, exceptions are handled. If data that causes an exception is passed to the model, an error annotation containing the exception message is emitted and processing continues with the next item. This comes with a performance penalty.

Parameters:

value (bool) – If True, exceptions are handled.

setFeatureScaling(feature_scaling: str)#

Sets Feature scaling method.

Possible values are ‘zscore’, ‘minmax’ or empty (no scaling).

Parameters:

feature_scaling (str) – Feature scaling method. Possible values are ‘zscore’, ‘minmax’ or empty (no scaling).

setFilterByTokenDistance(distance: int)#
Sets the number of tokens between entities used as a filtering criterion (Default: 0).

The model only finds relations that have fewer than the specified number of tokens between them.

Parameters:

distance (int) – Filtering criterion based on the number of tokens between entities (Default: 0)

setForceInputTypeValidation(etfm)#
setInputCols(*value)#

Sets column names of input annotations.

Parameters:

*value (List[str]) – Input columns for the annotator

setLazyAnnotator(value)#

Sets whether Annotator should be evaluated lazily in a RecursivePipeline.

Parameters:

value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline

setMaxSyntacticDistance(distance)#

Sets maximal syntactic distance, as threshold (Default: 0).

Determines how far the “from entity” can be from the “to entity” in the text. Increasing this value will increase recall, but also increase the number of false positives.

Parameters:

distance (int) – Maximal syntactic distance, as threshold (Default: 0)

setMultiClass(value: bool)#

Sets the model in multi class prediction mode (Default: false).

Parameters:

value (bool) – Whether to return only the label with the highest confidence score or all labels

setOutputCol(value)#

Sets output column name of annotations.

Parameters:

value (str) – Name of output column

setParamValue(paramName)#

Sets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

setParams()#
setPredictionThreshold(threshold: float)#

Sets the minimal activation of the target unit to encode a new relation instance.

Parameters:

threshold (float) – Minimal activation of the target unit to encode a new relation instance

setRelationPairs(pairs)#

Sets the list of dash-separated pairs of named entities to be processed.

Parameters:

pairs (str or list[str]) – List of dash-separated pairs of named entities to be processed.

setRelationPairsCaseSensitive(value: bool)#

Sets the case sensitivity of relation pairs

Parameters:

value (bool) – whether relation pairs are case sensitive

setRelationTypePerPair(relationTypePairs)#

Set the list of entity pairs allowed for a given relation.

Parameters:

relationTypePairs (dict[str, list[str]]) – Dictionary which maps relation types to a list of entity pairs.

setScopeWindow(value)#

Sets the scope of the window of the feature generation for relations.

Parameters:

value ([int, int]) – Left and right offset of the scope window. Offsets must be non-negative values.

transform(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = None) pyspark.sql.dataframe.DataFrame#

Transforms the input dataset with optional parameters.

New in version 1.3.0.

Parameters:
  • dataset (pyspark.sql.DataFrame) – input dataset

  • params (dict, optional) – an optional param map that overrides embedded params.

Returns:

transformed dataset

Return type:

pyspark.sql.DataFrame

write() JavaMLWriter#

Returns an MLWriter instance for this ML instance.

class RelationExtractionApproach(classname='com.johnsnowlabs.nlp.annotators.re.RelationExtractionApproach')#

Bases: sparknlp_jsl.annotator.generic_classifier.generic_classifier.GenericClassifierApproach, sparknlp_jsl.annotator.handle_exception_params.HandleExceptionParams

Trains a TensorFlow model for relation extraction.

For pretrained models, see the documentation of RelationExtractionModel.

To train a custom relation extraction model, you need to first create a Tensorflow graph using either the TfGraphBuilder annotator or the tf_graph module. Then, set the path to the Tensorflow graph using the method .setModelFile(“path/to/tensorflow_graph.pb”).

If the parameter relationDirectionCol is set, the model will be trained using the direction information (see the parameter description for details). Otherwise, the model will not capture direction in the relations between entities.

After training a model (using the .fit() method), the resulting object is of class RelationExtractionModel.

Input Annotation types: WORD_EMBEDDINGS, POS, CHUNK, DEPENDENCY

Output Annotation type: CATEGORY

Parameters:
  • fromEntityBeginCol – From Entity Beginning Column

  • fromEntityEndCol – From Entity End Column

  • fromEntityLabelCol – From Entity Label Column

  • toEntityBeginCol – To Entity Beginning Column

  • toEntityEndCol – To Entity End Column

  • toEntityLabelCol – To Entity Label Column

  • relationDirectionCol – Relation direction column, which contains the direction of the relation between the “from entity” and the “to entity”. Possible values in this column are:
      - “leftwards”: the relation arguments are ordered from right to left (<-) in the text.
      - “rightwards”: the relation arguments are ordered from left to right (->) in the text.
      - “both”: order doesn’t matter, the relation is symmetric (<->).
    If this parameter is not set, the model will not have direction between the relation of the entities.

  • pretrainedModelPath – Path to an already trained model, which is used as a starting point for training the new model. The path can be a local file path, a distributed file path (HDFS, DBFS), or a cloud storage location (S3).

  • directionSensitive – Specifies direction sensitivity in processing relation pairs. Default is ‘false’.

  • filterByTokenDistance – Filtering criterion based on the number of tokens between entities (Default: 0).

  • scopeWindow – The scope window of the feature generation. Default: [-1, -1].
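To make the column-based parameters more concrete, the sketch below shows what training rows might look like and checks the direction values against the three documented options. It is plain Python rather than a Spark DataFrame; the column names reuse those from the example on this page, except `rel_direction`, which is a hypothetical name, and the row values are invented.

```python
VALID_DIRECTIONS = {"leftwards", "rightwards", "both"}


def validate_direction(rows, direction_col="rel_direction"):
    """Raise ValueError if any row has an undocumented direction value."""
    bad = [row for row in rows if row.get(direction_col) not in VALID_DIRECTIONS]
    if bad:
        raise ValueError(f"{len(bad)} row(s) have an invalid '{direction_col}' value")
    return rows


# Invented training rows: entity offsets, entity labels, target relation, direction.
rows = [
    {"from_begin": 0, "from_end": 6, "from_label": "DRUG",
     "to_begin": 8, "to_end": 12, "to_label": "DOSAGE",
     "target_rel": "DRUG-DOSAGE", "rel_direction": "rightwards"},
    {"from_begin": 20, "from_end": 27, "from_label": "DRUG",
     "to_begin": 14, "to_end": 18, "to_label": "FREQUENCY",
     "target_rel": "DRUG-FREQUENCY", "rel_direction": "leftwards"},
]

checked = validate_direction(rows)
print(len(checked), "rows passed the direction check")
```

In a real pipeline these would be columns of the training DataFrame, referenced through setFromEntity, setToEntity, the label column setter, and setRelationDirectionCol.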

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp_jsl.common import *
>>> from sparknlp.annotator import *
>>> from sparknlp.training import *
>>> import sparknlp_jsl
>>> from sparknlp_jsl.base import *
>>> from sparknlp_jsl.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...   .setInputCol("text") \
...   .setOutputCol("document")
...
>>> tokenizer = Tokenizer() \
...   .setInputCols(["document"]) \
...   .setOutputCol("tokens")
...
>>> embedder = WordEmbeddingsModel \
...   .pretrained("embeddings_clinical", "en", "clinical/models") \
...   .setInputCols(["document", "tokens"]) \
...   .setOutputCol("embeddings")
...
>>> posTagger = PerceptronModel \
...   .pretrained("pos_clinical", "en", "clinical/models") \
...   .setInputCols(["document", "tokens"]) \
...   .setOutputCol("posTags")
...
>>> nerTagger = MedicalNerModel \
...   .pretrained("ner_events_clinical", "en", "clinical/models") \
...   .setInputCols(["document", "tokens", "embeddings"]) \
...   .setOutputCol("ner_tags")
...
>>> nerConverter = NerConverter() \
...   .setInputCols(["document", "tokens", "ner_tags"]) \
...   .setOutputCol("nerChunks")
...
>>> dependencyParser = DependencyParserModel \
...   .pretrained("dependency_conllu", "en") \
...   .setInputCols(["document", "posTags", "tokens"]) \
...   .setOutputCol("dependencies")
...
>>> re = RelationExtractionApproach() \
...   .setInputCols(["embeddings", "posTags", "nerChunks", "dependencies"]) \
...   .setOutputCol("relations_t") \
...   .setLabelColumn("target_rel") \
...   .setEpochsNumber(300) \
...   .setBatchSize(200) \
...   .setLearningRate(0.001) \
...   .setModelFile("path/to/graph_file.pb") \
...   .setFixImbalance(True) \
...   .setValidationSplit(0.05) \
...   .setFromEntity("from_begin", "from_end", "from_label") \
...   .setToEntity("to_begin", "to_end", "to_label")
...
>>> pipeline = Pipeline(stages=[
...     documentAssembler,
...     tokenizer,
...     embedder,
...     posTagger,
...     nerTagger,
...     nerConverter,
...     dependencyParser,
...     re])
>>> model = pipeline.fit(trainData)
batchSize#
customLabels#
datasetInfo#
directionSensitive#
doExceptionHandling#
dropout#
engine#
epochsN#
featureScaling#
filterByTokenDistance#
fixImbalance#
fromEntityBeginCol#
fromEntityEndCol#
fromEntityLabelCol#
getter_attrs = []#
inputAnnotatorTypes#
inputCols#
labelColumn#
lazyAnnotator#
learningRate#
maxSyntacticDistance#
modelFile#
multiClass#
name = 'RelationExtractionApproach'#
optionalInputAnnotatorTypes = []#
outputAnnotatorType#
outputCol#
outputLogsPath#
overrideExistingLabels#
pretrainedModelPath#
relationDirectionCol#
scopeWindow#
skipLPInputColsValidation = True#
toEntityBeginCol#
toEntityEndCol#
toEntityLabelCol#
uid#
validationSplit#
clear(param: pyspark.ml.param.Param) None#

Clears a param from the param map if it has been explicitly set.

copy(extra: pyspark.ml._typing.ParamMap | None = None) JP#

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters:

extra (dict, optional) – Extra parameters to copy to the new instance

Returns:

Copy of this instance

Return type:

JavaParams

explainParam(param: str | Param) str#

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams() str#

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap#

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:

extra (dict, optional) – extra param values

Returns:

merged param map

Return type:

dict

fit(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = ...) M#
fit(dataset: pyspark.sql.dataframe.DataFrame, params: List[pyspark.ml._typing.ParamMap] | Tuple[pyspark.ml._typing.ParamMap]) List[M]

Fits a model to the input dataset with optional parameters.

New in version 1.3.0.

Parameters:
  • dataset (pyspark.sql.DataFrame) – input dataset.

  • params (dict or list or tuple, optional) – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.

Returns:

fitted model(s)

Return type:

Transformer or a list of Transformer

fitMultiple(dataset: pyspark.sql.dataframe.DataFrame, paramMaps: Sequence[pyspark.ml._typing.ParamMap]) Iterator[Tuple[int, M]]#

Fits a model to the input dataset for each param map in paramMaps.

New in version 2.3.0.

Parameters:
  • dataset (pyspark.sql.DataFrame) – input dataset.

  • paramMaps (collections.abc.Sequence) – A Sequence of param maps.

Returns:

A thread safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.

Return type:

_FitMultipleIterator
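Because the index values may arrive out of order, results from fitMultiple are usually placed by index rather than appended. The sketch below shows that pattern with a stand-in iterator, so it runs without Spark; `collect_models` is a hypothetical helper and the strings stand in for fitted models.

```python
def collect_models(model_iterator, num_param_maps):
    """Place each (index, model) result at its param-map position."""
    models = [None] * num_param_maps
    for index, model in model_iterator:
        models[index] = model  # index refers to paramMaps[index]
    return models


# Stand-in for the iterator returned by fitMultiple; results arrive out of order.
fake_iterator = iter([(2, "model_c"), (0, "model_a"), (1, "model_b")])

models = collect_models(fake_iterator, num_param_maps=3)
print(models)
```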

getEngine()#
Returns:

Deep Learning engine used for this model

Return type:

str

getInputCols()#

Gets current column names of input annotations.

getLazyAnnotator()#

Gets whether Annotator should be evaluated lazily in a RecursivePipeline.

getOrDefault(param: str) Any#
getOrDefault(param: Param[T]) T

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()#

Gets output column name of annotations.

getParam(paramName: str) Param#

Gets a param by its name.

getParamValue(paramName)#

Gets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

hasDefault(param: str | Param[Any]) bool#

Checks whether a param has a default value.

hasParam(paramName: str) bool#

Tests whether this instance contains a param with a given (string) name.

inputColsValidation(value)#
isDefined(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user or has a default value.

isSet(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user.

classmethod load(path: str) RL#

Reads an ML instance from the input path, a shortcut of read().load(path).

classmethod read()#

Returns an MLReader instance for this class.

save(path: str) None#

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param: Param, value: Any) None#

Sets a parameter in the embedded param map.

setBatchSize(size: int)#

Size for each batch in the optimization process

Parameters:

size (int) – Size for each batch in the optimization process

setCustomLabels(labels: dict)#

Sets custom relation labels

Parameters:

labels (dict[str, str]) – Dictionary which maps old to new labels

setDatasetInfo(info: str)#

Sets descriptive information about the dataset being used.

Parameters:

info (str) – Descriptive information about the dataset being used.

setDirectionSensitive(value: bool)#

Sets whether relation pairs are direction sensitive. If True, only relations in the form “ENTITY1-ENTITY2” are considered. If False, both “ENTITY1-ENTITY2” and “ENTITY2-ENTITY1” relations are considered.

Parameters:

value (bool) – Whether relation pairs are direction sensitive. Default is ‘False’

setDoExceptionHandling(value: bool)#

If True, exceptions are handled. If data that causes an exception is passed to the model, an error annotation containing the exception message is emitted and processing continues with the next item. This comes with a performance penalty.

Parameters:

value (bool) – If True, exceptions are handled.

setDropout(dropout: float)#

Sets the dropout.

Parameters:

dropout (float) – Dropout at the output of each layer

setEpochsNumber(epochs: int)#

Sets number of epochs for the optimization process

Parameters:

epochs (int) – Number of epochs for the optimization process

setFeatureScaling(feature_scaling: str)#

Sets the feature scaling method. Possible values are ‘zscore’, ‘minmax’ or empty (no scaling).

Parameters:

feature_scaling (str) – Feature scaling method. Possible values are ‘zscore’, ‘minmax’ or empty (no scaling).

setFilterByTokenDistance(distance: int)#
Sets the number of tokens between entities used as a filtering criterion (Default: 0).

The model only finds relations that have fewer than the specified number of tokens between them.

Parameters:

distance (int) – Filtering criterion based on the number of tokens between entities (Default: 0)

setFixImbalance(fix_imbalance: bool)#

Sets a flag indicating whether to balance the training set.

Parameters:

fix_imbalance (bool) – A flag indicating whether to balance the training set.

setForceInputTypeValidation(etfm)#
setFromEntity(begin_col: str, end_col: str, label_col: str)#

Sets from entity

Parameters:
  • begin_col (str) – Column that has a reference of where the chunk begins

  • end_col (str) – Column that has a reference of where the chunk ends

  • label_col (str) – Column that holds the entity type (label) of the chunk

setInputCols(*value)#

Sets column names of input annotations.

Parameters:

*value (List[str]) – Input columns for the annotator

setLabelCol(label_column: str)#

Sets the column with the value we are trying to predict.

Parameters:

label_column (str) – Column with the value result we are trying to predict.

setLazyAnnotator(value)#

Sets whether Annotator should be evaluated lazily in a RecursivePipeline.

Parameters:

value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline

setLearningRate(learning_rate: float)#

Sets learning rate for the optimization process

Parameters:

learning_rate (float) – Learning rate for the optimization process

setMaxSyntacticDistance(distance: int)#

Sets maximal syntactic distance, as threshold (Default: 0)

Parameters:

distance (int) – Maximal syntactic distance, as threshold (Default: 0)

setModelFile(mode_file: str)#

Sets the file name to load the model from.

Parameters:

mode_file (str) – File name to load the model from

setMultiClass(value: bool)#

Sets the model in multi class prediction mode (Default: false)

Parameters:

value (bool) – Whether to return only the label with the highest confidence score or all labels

setOutputCol(value)#

Sets output column name of annotations.

Parameters:

value (str) – Name of output column

setOutputLogsPath(output_logs_path: str)#

Sets path to folder where logs will be saved. If no path is specified, no logs are generated

Parameters:

output_logs_path (str) – Path to folder where logs will be saved. If no path is specified, no logs are generated

setParamValue(paramName)#

Sets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

setPretrainedModelPath(value: str)#

Sets location of pretrained model.

Parameters:

value (str) – Path to an already trained model saved to disk, which is used as a starting point for training the new model.

setRelationDirectionCol(col: str)#

Sets relation direction column

Parameters:

col (str) – Column contains the relation direction values

setScopeWindow(value)#

Sets the scope of the window of the feature generation for relations.

Parameters:

value ([int, int]) – Left and right offset of the scope window. Offsets must be non-negative values.

setToEntity(begin_col: str, end_col: str, label_col: str)#

Sets to entity

Parameters:
  • begin_col (str) – Column that has a reference of where the chunk begins

  • end_col (str) – Column that has a reference of where the chunk ends

  • label_col (str) – Column that holds the entity type (label) of the chunk

setValidationSplit(validation_split: float)#

Sets validation split - how much data to use for validation

Parameters:

validation_split (float) – Validation split - how much data to use for validation
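
To make the idea of a validation split concrete, the sketch below holds out a fraction of rows from a plain Python list. It is only an illustration: `split_for_validation` is a hypothetical helper, and treating the value as a fraction in [0, 1) is an assumption rather than documented library behavior.

```python
import random


def split_for_validation(rows, validation_split, seed=0):
    """Shuffle rows and hold out a fraction of them for validation.

    Assumption: validation_split is a fraction in [0, 1).
    """
    if not 0.0 <= validation_split < 1.0:
        raise ValueError("validation_split must be in [0, 1)")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # deterministic shuffle for the sketch
    n_val = int(len(shuffled) * validation_split)
    # First n_val shuffled rows go to validation, the rest to training.
    return shuffled[n_val:], shuffled[:n_val]


train_rows, val_rows = split_for_validation(range(10), 0.2)
print(len(train_rows), len(val_rows))
```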

setОverrideExistingLabels(value: bool)#

Sets whether to override already learned tags. Default is ‘true’.

Parameters:

value (bool) – Whether to override already learned labels when using a pretrained model to initialize the new model. Default is ‘true’

write() JavaMLWriter#

Returns an MLWriter instance for this ML instance.

class RelationExtractionModel(classname='com.johnsnowlabs.nlp.annotators.re.RelationExtractionModel', java_model=None)#

Bases: sparknlp_jsl.annotator.generic_classifier.generic_classifier.GenericClassifierModel, sparknlp_jsl.annotator.handle_exception_params.HandleExceptionParams

Extracts and classifies instances of relations between named entities.

This is the AnnotatorModel version. To train a model, use RelationExtractionApproach.

Visit NLP Models Hub for existing pretrained models.

Input Annotation types: WORD_EMBEDDINGS, POS, CHUNK, DEPENDENCY

Output Annotation type: CATEGORY

Parameters:
  • predictionThreshold – Minimal activation of the target unit to encode a new relation instance.

  • relationPairs – List of dash-separated pairs of named entities. For example, [“Biomarker-RelativeDay”] will process all relations between entities of type “Biomarker” and “RelativeDay”.

  • relationPairsCaseSensitive – Determines whether relation pairs are case sensitive.

  • relationTypePerPair – List of entity pairs per relation, which limits the entities that can form a relation. For example, {“CAUSE”: [“PROBLEM”, “SYMPTOM”]} only allows a “CAUSE” relation to hold between a problem (“PROBLEM”) and a symptom (“SYMPTOM”).

  • maxSyntacticDistance – Maximal syntactic distance, as threshold (Default: 0)

  • customLabels – Custom relation labels.

  • directionSensitive – Specifies direction sensitivity in processing relation pairs. Default is ‘false’.

  • filterByTokenDistance – Filtering criterion based on the number of tokens between entities (Default: 0).

  • scopeWindow – The scope window of the feature generation for relations. Default: [-1, -1].

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp_jsl.common import *
>>> from sparknlp.annotator import *
>>> from sparknlp.training import *
>>> import sparknlp_jsl
>>> from sparknlp_jsl.base import *
>>> from sparknlp_jsl.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...   .setInputCol("text") \
...   .setOutputCol("document")
...
>>> tokenizer = Tokenizer() \
...   .setInputCols(["document"]) \
...   .setOutputCol("tokens")
...
>>> embedder = WordEmbeddingsModel \
...   .pretrained("embeddings_clinical", "en", "clinical/models") \
...   .setInputCols(["document", "tokens"]) \
...   .setOutputCol("embeddings")
...
>>> posTagger = PerceptronModel \
...   .pretrained("pos_clinical", "en", "clinical/models") \
...   .setInputCols(["document", "tokens"]) \
...   .setOutputCol("posTags")
...
>>> nerTagger = MedicalNerModel \
...   .pretrained("ner_events_clinical", "en", "clinical/models") \
...   .setInputCols(["document", "tokens", "embeddings"]) \
...   .setOutputCol("ner_tags")
...
>>> nerConverter = NerConverter() \
...   .setInputCols(["document", "tokens", "ner_tags"]) \
...   .setOutputCol("nerChunks")
...
>>> dependencyParser = DependencyParserModel \
...   .pretrained("dependency_conllu", "en") \
...   .setInputCols(["document", "posTags", "tokens"]) \
...   .setOutputCol("dependencies")
...
>>> relationPairs = [
...   "direction-external_body_part_or_region",
...   "external_body_part_or_region-direction",
...   "direction-internal_organ_or_component",
...   "internal_organ_or_component-direction"
... ]
...
>>> # Input columns must match the output columns of the upstream stages.
>>> re_model = RelationExtractionModel.pretrained("re_bodypart_directions", "en", "clinical/models") \
...     .setInputCols(["embeddings", "posTags", "nerChunks", "dependencies"]) \
...     .setOutputCol("relations") \
...     .setRelationPairs(relationPairs) \
...     .setMaxSyntacticDistance(4) \
...     .setPredictionThreshold(0.9)
...
>>> pipeline = Pipeline(stages=[
...     documentAssembler,
...     tokenizer,
...     embedder,
...     posTagger,
...     nerTagger,
...     nerConverter,
...     dependencyParser,
...     re_model])
>>> data = spark.createDataFrame([["MRI demonstrated infarction in the upper brain stem , left cerebellum and  right basil ganglia"]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
...
>>> # Show results
>>> result.selectExpr("explode(relations) as relations") \
...   .select(
...      "relations.metadata.chunk1",
...      "relations.metadata.entity1",
...      "relations.metadata.chunk2",
...      "relations.metadata.entity2",
...      "relations.result"
...   ).where("result != 0") \
...   .show(truncate=False)
+------+---------+-------------+---------------------------+------+
|chunk1|entity1  |chunk2       |entity2                    |result|
+------+---------+-------------+---------------------------+------+
|upper |Direction|brain stem   |Internal_organ_or_component|1     |
|left  |Direction|cerebellum   |Internal_organ_or_component|1     |
|right |Direction|basil ganglia|Internal_organ_or_component|1     |
+------+---------+-------------+---------------------------+------+
classes#
customLabels#
datasetInfo#
directionSensitive#
doExceptionHandling#
featureScaling#
filterByTokenDistance#
getter_attrs = []#
inputAnnotatorTypes#
inputCols#
lazyAnnotator#
maxSyntacticDistance#
multiClass#
name = 'RelationExtractionModel'#
optionalInputAnnotatorTypes = []#
outputAnnotatorType#
outputCol#
predictionThreshold#
relationPairs#
relationPairsCaseSensitive#
scopeWindow#
skipLPInputColsValidation = True#
uid#
clear(param: pyspark.ml.param.Param) None#

Clears a param from the param map if it has been explicitly set.

copy(extra: pyspark.ml._typing.ParamMap | None = None) JP#

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters:

extra (dict, optional) – Extra parameters to copy to the new instance

Returns:

Copy of this instance

Return type:

JavaParams

explainParam(param: str | Param) str#

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams() str#

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap#

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:

extra (dict, optional) – extra param values

Returns:

merged param map

Return type:

dict
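
The precedence rule can be pictured with plain dictionaries. This is a minimal sketch of the ordering only: `merge_param_maps` and its string keys are hypothetical stand-ins for the Param-keyed maps that PySpark uses, and the values are illustrative rather than library defaults.

```python
def merge_param_maps(defaults, user_supplied, extra=None):
    """Merge three maps so that later sources win:
    defaults < user_supplied < extra."""
    merged = dict(defaults)        # lowest precedence
    merged.update(user_supplied)   # overrides defaults
    merged.update(extra or {})     # highest precedence
    return merged


# Illustrative values only.
defaults = {"predictionThreshold": 0.5, "maxSyntacticDistance": 0}
user_supplied = {"predictionThreshold": 0.9}
extra = {"maxSyntacticDistance": 4}

merged = merge_param_maps(defaults, user_supplied, extra)
print(merged)
```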

getClasses()#

Returns labels used to train this model.

getInputCols()#

Gets current column names of input annotations.

getLazyAnnotator()#

Gets whether Annotator should be evaluated lazily in a RecursivePipeline.

getOrDefault(param: str) Any#
getOrDefault(param: Param[T]) T

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()#

Gets output column name of annotations.

getParam(paramName: str) Param#

Gets a param by its name.

getParamValue(paramName)#

Gets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

getRelationTypePerPair()#

Return the list of entity pairs allowed for a given relation.

hasDefault(param: str | Param[Any]) bool#

Checks whether a param has a default value.

hasParam(paramName: str) bool#

Tests whether this instance contains a param with a given (string) name.

inputColsValidation(value)#
isDefined(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user or has a default value.

isSet(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user.

classmethod load(path: str) RL#

Reads an ML instance from the input path, a shortcut of read().load(path).

static pretrained(name='posology_re', lang='en', remote_loc='clinical/models')#

Download a pre-trained RelationExtractionModel.

Parameters:
  • name (str) – Name of the pre-trained model, by default “posology_re”

  • lang (str) – Language of the pre-trained model, by default “en”

  • remote_loc (str) – Remote location of the pre-trained model. If None, use the open-source location. Other values are “clinical/models”, “finance/models”, or “legal/models”.

Returns:

A pre-trained RelationExtractionModel.

Return type:

RelationExtractionModel

classmethod read()#

Returns an MLReader instance for this class.

save(path: str) None#

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param: Param, value: Any) None#

Sets a parameter in the embedded param map.

setCustomLabels(labels: dict)#

Sets custom relation labels

Parameters:

labels (dict[str, str]) – Dictionary which maps old to new labels
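
As a rough illustration of what such a mapping expresses, the sketch below renames predicted labels with a dictionary lookup. It is not the library's implementation: `apply_custom_labels` is a hypothetical helper, the label names are made up, and leaving unmapped labels unchanged is an assumption.

```python
def apply_custom_labels(predicted, custom_labels):
    """Rename each predicted label using an old -> new mapping.

    Assumption: labels missing from the mapping are kept as they are.
    """
    return [custom_labels.get(label, label) for label in predicted]


# Hypothetical mapping from raw labels to readable names.
custom_labels = {"1": "is_related", "0": "not_related"}
relabeled = apply_custom_labels(["1", "0", "1", "2"], custom_labels)
print(relabeled)
```

With a loaded model, the same dictionary would be passed as `setCustomLabels(custom_labels)`.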

setDatasetInfo(info: str)#

Sets descriptive information about the dataset being used.

Parameters:

info (str) – Descriptive information about the dataset being used.

setDirectionSensitive(value: bool)#

Sets whether relation pairs are processed in a direction-sensitive way. If true, only relations in the form “ENTITY1-ENTITY2” are considered. If false, both “ENTITY1-ENTITY2” and “ENTITY2-ENTITY1” relations are considered.

Parameters:

value (bool) – Whether to be sensitive to direction when processing relation pairs. Default is ‘False’
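
The effect of this flag on a list of dash-separated relation pairs can be sketched in plain Python. This only mirrors the rule described above and is not the library's code: `pair_is_allowed` is a hypothetical helper, and splitting on the first dash assumes entity names do not themselves contain dashes.

```python
def pair_is_allowed(entity1, entity2, relation_pairs, direction_sensitive=False):
    """Check a candidate (entity1, entity2) against dash-separated pair strings."""
    # Assumption: entity names contain no dash, so one split is enough.
    allowed = {tuple(pair.split("-", 1)) for pair in relation_pairs}
    if (entity1, entity2) in allowed:
        return True
    # Without direction sensitivity the reversed order is accepted as well.
    return (not direction_sensitive) and (entity2, entity1) in allowed


pairs = ["direction-internal_organ_or_component"]
print(pair_is_allowed("internal_organ_or_component", "direction", pairs, True))
print(pair_is_allowed("internal_organ_or_component", "direction", pairs, False))
```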

setDoExceptionHandling(value: bool)#

If True, exceptions are handled. If data that causes an exception is passed to the model, an error annotation containing the exception message is emitted, and processing continues with the next item. This comes with a performance penalty.

Parameters:

value (bool) – If True, exceptions are handled.

setFeatureScaling(feature_scaling: str)#

Sets Feature scaling method.

Possible values are ‘zscore’, ‘minmax’ or empty (no scaling).

Parameters:

feature_scaling (str) – Feature scaling method. Possible values are ‘zscore’, ‘minmax’ or empty (no scaling).

setFilterByTokenDistance(distance: int)#

Sets the maximum number of tokens between entities. (Default: 0)

Model only finds relations that have fewer than the specified number of tokens between them.

Parameters:

distance (int) – Filtering criterion based on the number of tokens between entities. (Default: 0)
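
The filtering rule can be sketched as a simple predicate. This is an illustration under stated assumptions, not the library's implementation: `passes_token_distance` is a hypothetical helper, and treating the default of 0 as "no filtering" is an assumption the documentation does not spell out.

```python
def passes_token_distance(tokens_between, max_distance=0):
    """Return True if a candidate pair survives the token-distance filter."""
    # Assumption: 0 (the documented default) disables the filter.
    if max_distance <= 0:
        return True
    # Keep only pairs with fewer than max_distance tokens between them.
    return tokens_between < max_distance


print(passes_token_distance(2, max_distance=3))
print(passes_token_distance(3, max_distance=3))
```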

setForceInputTypeValidation(etfm)#
setInputCols(*value)#

Sets column names of input annotations.

Parameters:

*value (List[str]) – Input columns for the annotator

setLazyAnnotator(value)#

Sets whether Annotator should be evaluated lazily in a RecursivePipeline.

Parameters:

value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline

setMaxSyntacticDistance(distance)#

Sets maximal syntactic distance, as threshold (Default: 0).

Determines how far the “from entity” can be from the “to entity” in the text. Increasing this value will increase recall, but also increase the number of false positives.

Parameters:

distance (int) – Maximal syntactic distance, as threshold (Default: 0)
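
One common reading of syntactic distance is the number of dependency edges on the path between two tokens. The sketch below computes that value under this assumption. `syntactic_distance` is a hypothetical helper, and the head indices are a hand-written parse for illustration, not output from DependencyParserModel.

```python
def syntactic_distance(heads, a, b):
    """Number of dependency edges on the path between tokens a and b.

    heads[i] is the index of token i's head; the root has head -1.
    """
    def path_to_root(i):
        path = [i]
        while heads[path[-1]] != -1:
            path.append(heads[path[-1]])
        return path

    path_a, path_b = path_to_root(a), path_to_root(b)
    depth_in_a = {node: depth for depth, node in enumerate(path_a)}
    # The first node of b's path that also lies on a's path is the
    # lowest common ancestor; add the two climbs to reach it.
    for depth_b, node in enumerate(path_b):
        if node in depth_in_a:
            return depth_in_a[node] + depth_b
    raise ValueError("tokens do not belong to the same tree")


# Tokens: MRI demonstrated infarction in the upper brain stem
# Hand-written heads (illustrative), "demonstrated" is the root.
heads = [1, -1, 1, 7, 7, 7, 7, 2]
print(syntactic_distance(heads, 5, 7))  # "upper" to "stem"
print(syntactic_distance(heads, 0, 5))  # "MRI" to "upper"
```

Under this reading, a larger threshold admits entity pairs that are further apart in the tree, which is why recall and false positives rise together.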

setMultiClass(value: bool)#

Sets the model in multi class prediction mode (Default: false).

Parameters:

value (bool) – Whether to return only the label with the highest confidence score or all labels
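
The difference between the two modes can be pictured as follows. This is a sketch, not the library's implementation: `select_labels` is a hypothetical helper, the scores are invented, and it assumes that multi class mode is the one that returns all labels.

```python
def select_labels(scores, multi_class=False):
    """Pick labels from a {label: confidence} mapping.

    Assumption: multi_class=True returns every label (highest score first),
    multi_class=False returns only the best one.
    """
    if multi_class:
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best = max(scores, key=scores.get)
    return [(best, scores[best])]


scores = {"0": 0.1, "1": 0.7, "2": 0.2}  # invented confidences
print(select_labels(scores))
print(select_labels(scores, multi_class=True))
```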

setOutputCol(value)#

Sets output column name of annotations.

Parameters:

value (str) – Name of output column

setParamValue(paramName)#

Sets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

setParams()#
setPredictionThreshold(threshold: float)#

Sets the minimal activation of the target unit to encode a new relation instance.

Parameters:

threshold (float) – Minimal activation of the target unit to encode a new relation instance

setRelationPairs(pairs)#

Sets the list of dash-separated pairs of named entities to be processed.

Parameters:

pairs (str or list[str]) – List of dash-separated pairs of named entities to be processed.

setRelationPairsCaseSensitive(value: bool)#

Sets the case sensitivity of relation pairs

Parameters:

value (bool) – whether relation pairs are case sensitive

setRelationTypePerPair(relationTypePairs)#

Set the list of entity pairs allowed for a given relation.

Parameters:

relationTypePairs (dict[str, list[str]]) – Dictionary which maps relation types to a list of entity pairs.

setScopeWindow(value)#

Sets the scope of the window of the feature generation for relations.

Parameters:

value ([int, int]) – Left and right offset of the scope window. Offsets must be non-negative values

transform(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = None) pyspark.sql.dataframe.DataFrame#

Transforms the input dataset with optional parameters.

New in version 1.3.0.

Parameters:
  • dataset (pyspark.sql.DataFrame) – input dataset

  • params (dict, optional) – an optional param map that overrides embedded params.

Returns:

transformed dataset

Return type:

pyspark.sql.DataFrame

write() JavaMLWriter#

Returns an MLWriter instance for this ML instance.