sparknlp_jsl.annotator.re.relation_extraction
Module Contents#
Classes#
GenericREModel | Generic Relation Extraction Model.
PosologyREModel | Model for Posology Relation Extraction.
RelationExtractionApproach | Trains a TensorFlow model for relation extraction.
RelationExtractionModel | Extracts and classifies instances of relations between named entities.
- class GenericREModel(classname='com.johnsnowlabs.nlp.annotators.re.GenericREModel', java_model=None)#
Bases:
RelationExtractionModel
Generic Relation Extraction Model.
An instantiated RelationExtractionModel for extracting relationships between any entities. This class is not intended to be used directly; use RelationExtractionModel instead. Specify the pairs of entities with setRelationPairs.
- classes#
- customLabels#
- datasetInfo#
- directionSensitive#
- doExceptionHandling#
- featureScaling#
- filterByTokenDistance#
- getter_attrs = []#
- inputAnnotatorTypes#
- inputCols#
- lazyAnnotator#
- maxSyntacticDistance#
- multiClass#
- name = 'RelationExtractionModel'#
- optionalInputAnnotatorTypes = []#
- outputAnnotatorType = 'category'#
- outputCol#
- predictionThreshold#
- relationPairs#
- relationPairsCaseSensitive#
- scopeWindow#
- skipLPInputColsValidation = True#
- uid = ''#
- clear(param: pyspark.ml.param.Param) None #
Clears a param from the param map if it has been explicitly set.
- copy(extra: pyspark.ml._typing.ParamMap | None = None) JP #
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
- Parameters:
extra (dict, optional) – Extra parameters to copy to the new instance
- Returns:
Copy of this instance
- Return type:
JavaParams
- explainParam(param: str | Param) str #
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams() str #
Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap #
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
- Parameters:
extra (dict, optional) – extra param values
- Returns:
merged param map
- Return type:
dict
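The precedence described for extractParamMap (default param values < user-supplied values < extra) can be illustrated with plain dictionaries. This is a minimal sketch of the merge rule only, not the PySpark implementation; the helper name merge_param_maps and the sample values are hypothetical.

```python
def merge_param_maps(defaults, user_supplied, extra=None):
    """Merge three param maps; later maps win on conflicting keys."""
    merged = dict(defaults)        # lowest priority: default values
    merged.update(user_supplied)   # user-supplied values override defaults
    merged.update(extra or {})     # extra values override everything else
    return merged


# Hypothetical values, chosen only to show the ordering.
defaults = {"predictionThreshold": 0.5, "multiClass": False}
user_supplied = {"predictionThreshold": 0.9}
extra = {"multiClass": True}

merged = merge_param_maps(defaults, user_supplied, extra)
print(merged)
```

Each key takes the value from the highest-priority map that defines it.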
- getClasses()#
Returns labels used to train this model.
- getInputCols()#
Gets current column names of input annotations.
- getLazyAnnotator()#
Gets whether Annotator should be evaluated lazily in a RecursivePipeline.
- getOrDefault(param: str) Any #
- getOrDefault(param: Param[T]) T
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getOutputCol()#
Gets output column name of annotations.
- getParam(paramName: str) Param #
Gets a param by its name.
- getParamValue(paramName)#
Gets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- getRelationTypePerPair()#
Return the list of entity pairs allowed for a given relation.
- hasDefault(param: str | Param[Any]) bool #
Checks whether a param has a default value.
- hasParam(paramName: str) bool #
Tests whether this instance contains a param with a given (string) name.
- inputColsValidation(value)#
- isDefined(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user or has a default value.
- isSet(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user.
- classmethod load(path: str) RL #
Reads an ML instance from the input path, a shortcut of read().load(path).
- static pretrained(name='posology_re', lang='en', remote_loc='clinical/models')#
Download a pre-trained RelationExtractionModel.
- Parameters:
name (str) – Name of the pre-trained model, by default “posology_re”
lang (str) – Language of the pre-trained model, by default “en”
remote_loc (str) – Remote location of the pre-trained model. If None, use the open-source location. Other values are “clinical/models”, “finance/models”, or “legal/models”.
- Returns:
A pre-trained RelationExtractionModel.
- Return type:
RelationExtractionModel
- classmethod read()#
Returns an MLReader instance for this class.
- save(path: str) None #
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
- set(param: Param, value: Any) None #
Sets a parameter in the embedded param map.
- setCustomLabels(labels: dict)#
Sets custom relation labels
- Parameters:
labels (dict[str, str]) – Dictionary which maps old to new labels
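As a rough illustration of what an old-to-new label dictionary does, the sketch below renames predicted relation labels in plain Python. It assumes that labels missing from the dictionary are left unchanged, which the text above does not state; the helper apply_custom_labels and the new label names are hypothetical.

```python
def apply_custom_labels(predicted_labels, custom_labels):
    """Rename labels using an old -> new mapping.

    Assumption: labels without an entry in the mapping are kept as-is.
    """
    return [custom_labels.get(label, label) for label in predicted_labels]


# Hypothetical mapping from model labels to friendlier names.
custom_labels = {"DRUG-DOSAGE": "dosage_of", "DRUG-ADE": "adverse_event_of"}

renamed = apply_custom_labels(["DRUG-DOSAGE", "DRUG-FORM", "DRUG-ADE"], custom_labels)
print(renamed)
```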
- setDatasetInfo(info: str)#
Sets descriptive information about the dataset being used.
- Parameters:
info (str) – Descriptive information about the dataset being used.
- setDirectionSensitive(value: bool)#
Sets whether relation pairs are processed in a direction-sensitive way. If True, only relations in the form “ENTITY1-ENTITY2” are considered. If False, both “ENTITY1-ENTITY2” and “ENTITY2-ENTITY1” relations are considered.
- Parameters:
value (bool) – Whether to be direction-sensitive when processing relation pairs. Default is ‘False’
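The effect of direction sensitivity on pair matching can be sketched in a few lines of plain Python. This only mirrors the rule stated above; the helper pair_is_allowed is hypothetical and is not part of the library.

```python
def pair_is_allowed(from_label, to_label, relation_pairs, direction_sensitive=False):
    """Check an entity pair against a set of dash-separated relation pairs."""
    if f"{from_label}-{to_label}" in relation_pairs:
        return True
    if not direction_sensitive:
        # The reversed pair also counts when direction is ignored.
        return f"{to_label}-{from_label}" in relation_pairs
    return False


pairs = {"DRUG-DOSAGE"}
print(pair_is_allowed("DOSAGE", "DRUG", pairs, direction_sensitive=True))
print(pair_is_allowed("DOSAGE", "DRUG", pairs, direction_sensitive=False))
```

With direction sensitivity enabled, each pair has to be listed in the exact order in which it should be matched.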
- setDoExceptionHandling(value: bool)#
If True, exceptions are handled. If data that causes an exception is passed to the model, an error annotation containing the exception message is emitted, and processing continues with the next item. This comes with a performance penalty.
- Parameters:
value (bool) – If True, exceptions are handled.
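The behaviour described above (emit an error annotation and move on to the next item) can be sketched as a plain Python loop. The shape of the error annotation, the helper annotate_all, and the toy annotator are all assumptions made for illustration; the library's internal representation is not documented here.

```python
def annotate_all(rows, annotate, do_exception_handling=False):
    """Apply `annotate` to every row, optionally trapping per-row failures."""
    results = []
    for row in rows:
        if not do_exception_handling:
            results.append(annotate(row))  # any exception propagates to the caller
            continue
        try:
            results.append(annotate(row))
        except Exception as exc:
            # Hypothetical error annotation carrying the exception message.
            results.append({"annotatorType": "error", "result": str(exc)})
    return results


def toy_annotate(text):
    """Stand-in annotator that fails on empty input."""
    if not text:
        raise ValueError("empty input")
    return {"annotatorType": "category", "result": text.upper()}


handled = annotate_all(["aspirin", "", "ibuprofen"], toy_annotate, do_exception_handling=True)
print(handled)
```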
- setFeatureScaling(feature_scaling: str)#
Sets Feature scaling method.
Possible values are ‘zscore’, ‘minmax’ or empty (no scaling).
- Parameters:
feature_scaling (str) – Feature scaling method. Possible values are ‘zscore’, ‘minmax’ or empty (no scaling).
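For readers unfamiliar with the two scaling names, the sketch below shows the textbook definitions of z-score and min-max scaling in plain Python. It is not the library's implementation; in particular, the use of the population standard deviation and the handling of constant features are assumptions.

```python
from statistics import mean, pstdev


def zscore(values):
    """Center on the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # assumption: constant features map to 0
    return [(v - mu) / sigma for v in values]


def minmax(values):
    """Rescale linearly so the smallest value is 0 and the largest is 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # assumption: constant features map to 0
    return [(v - lo) / (hi - lo) for v in values]


features = [2.0, 4.0, 6.0]
print(zscore(features))
print(minmax(features))
```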
- setFilterByTokenDistance(distance: int)#
Sets the number of tokens between entities (Default: 0).
The model only finds relations whose entities have fewer than the specified number of tokens between them.
- Parameters:
distance (int) – Filtering criterion based on the number of tokens between entities (Default: 0).
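A token-distance filter of the kind described above could look like the following plain Python sketch. It assumes chunks carry inclusive token indices and that a value of 0 disables the filter; neither detail is stated in the text, and the helper names are hypothetical.

```python
def tokens_between(first_end_token, second_begin_token):
    """Number of tokens strictly between two chunks (inclusive token indices)."""
    return max(0, second_begin_token - first_end_token - 1)


def keep_pair(chunk_a, chunk_b, max_distance=0):
    """Keep a pair only if fewer than `max_distance` tokens separate its chunks."""
    if max_distance <= 0:
        return True  # assumption: the default of 0 means "no filtering"
    first, second = sorted([chunk_a, chunk_b], key=lambda c: c["begin_token"])
    return tokens_between(first["end_token"], second["begin_token"]) < max_distance


drug = {"begin_token": 0, "end_token": 1}
dosage = {"begin_token": 5, "end_token": 5}
print(keep_pair(drug, dosage, max_distance=4))
```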
- setForceInputTypeValidation(etfm)#
- setInputCols(*value)#
Sets column names of input annotations.
- Parameters:
*value (List[str]) – Input columns for the annotator
- setLazyAnnotator(value)#
Sets whether Annotator should be evaluated lazily in a RecursivePipeline.
- Parameters:
value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline
- setMaxSyntacticDistance(distance)#
Sets maximal syntactic distance, as threshold (Default: 0).
Determine how far the “from entity” can be from the “to entity” in the text. Increasing this value will increase recall, but also increase the number of false positives.
- Parameters:
distance (int) – Maximal syntactic distance, as threshold (Default: 0)
- setMultiClass(value: bool)#
Sets the model in multi class prediction mode (Default: false).
- Parameters:
value (bool) – Whether to return only the label with the highest confidence score or all labels
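The difference between the two prediction modes can be sketched with a dictionary of per-label confidence scores. The helper select_labels and the scores are hypothetical; the actual output format of the annotator is not shown here.

```python
def select_labels(scores, multi_class=False):
    """Return all labels ranked by confidence, or only the top one."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked if multi_class else ranked[:1]


# Hypothetical confidence scores for a single entity pair.
scores = {"DRUG-FORM": 0.2, "DRUG-DOSAGE": 0.7, "O": 0.1}

print(select_labels(scores))
print(select_labels(scores, multi_class=True))
```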
- setOutputCol(value)#
Sets output column name of annotations.
- Parameters:
value (str) – Name of output column
- setParamValue(paramName)#
Sets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- setParams()#
- setPredictionThreshold(threshold: float)#
Sets the minimal activation of the target unit required to encode a new relation instance.
- Parameters:
threshold (float) – Minimal activation of the target unit to encode a new relation instance
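Read as a confidence cut-off, the prediction threshold can be sketched as a simple filter over candidate relations. Whether the library compares with >= or > is not stated, so the inclusive comparison below is an assumption, and the candidate scores are made up for illustration.

```python
def passes_threshold(confidence, threshold):
    """Assumption: a candidate is kept when its confidence reaches the threshold."""
    return confidence >= threshold


# Hypothetical (label, confidence) candidates.
candidates = [("DRUG-DOSAGE", 0.95), ("DRUG-FORM", 0.62)]

kept = [c for c in candidates if passes_threshold(c[1], threshold=0.9)]
print(kept)
```

Raising the threshold trades recall for precision, since fewer low-confidence relations are emitted.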
- setRelationPairs(pairs)#
Sets the list of dash-separated pairs of named entities to be processed.
- Parameters:
pairs (str or list[str]) – List of dash-separated pairs of named entities to be processed.
- setRelationPairsCaseSensitive(value: bool)#
Sets the case sensitivity of relation pairs
- Parameters:
value (bool) – whether relation pairs are case sensitive
- setRelationTypePerPair(relationTypePairs)#
Set the list of entity pairs allowed for a given relation.
- Parameters:
relationTypePairs (dict[str, list[str]]) – Dictionary which maps relation types to a list of entity pairs.
- setScopeWindow(value)#
Sets the scope window of the feature generation for relations.
- Parameters:
value ([int, int]) – Left and right offset of the scope window. Offsets must be non-negative values.
- transform(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = None) pyspark.sql.dataframe.DataFrame #
Transforms the input dataset with optional parameters.
New in version 1.3.0.
- Parameters:
dataset (pyspark.sql.DataFrame) – input dataset
params (dict, optional) – an optional param map that overrides embedded params.
- Returns:
transformed dataset
- Return type:
pyspark.sql.DataFrame
- write() JavaMLWriter #
Returns an MLWriter instance for this ML instance.
- class PosologyREModel(classname='com.johnsnowlabs.nlp.annotators.re.PosologyREModel', java_model=None)#
Bases:
RelationExtractionModel
Model for Posology Relation Extraction.
An instantiated RelationExtractionModel for extracting relationships between different recognized drug entities. This class is not intended to be used directly; use RelationExtractionModel instead. Possible values are “DRUG-DOSAGE”, “DRUG-ADE”, “DRUG-FORM”, “DRUG-FREQUENCY”, “DRUG-ROUTE”, “DRUG-REASON”, “DRUG-STRENGTH”, “DRUG-DURATION”.
- classes#
- customLabels#
- datasetInfo#
- directionSensitive#
- doExceptionHandling#
- featureScaling#
- filterByTokenDistance#
- getter_attrs = []#
- inputAnnotatorTypes#
- inputCols#
- lazyAnnotator#
- maxSyntacticDistance#
- multiClass#
- name = 'RelationExtractionModel'#
- optionalInputAnnotatorTypes = []#
- outputAnnotatorType = 'category'#
- outputCol#
- predictionThreshold#
- relationPairs#
- relationPairsCaseSensitive#
- scopeWindow#
- skipLPInputColsValidation = True#
- uid = ''#
- clear(param: pyspark.ml.param.Param) None #
Clears a param from the param map if it has been explicitly set.
- copy(extra: pyspark.ml._typing.ParamMap | None = None) JP #
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
- Parameters:
extra (dict, optional) – Extra parameters to copy to the new instance
- Returns:
Copy of this instance
- Return type:
JavaParams
- explainParam(param: str | Param) str #
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams() str #
Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap #
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
- Parameters:
extra (dict, optional) – extra param values
- Returns:
merged param map
- Return type:
dict
- getClasses()#
Returns labels used to train this model.
- getInputCols()#
Gets current column names of input annotations.
- getLazyAnnotator()#
Gets whether Annotator should be evaluated lazily in a RecursivePipeline.
- getOrDefault(param: str) Any #
- getOrDefault(param: Param[T]) T
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getOutputCol()#
Gets output column name of annotations.
- getParam(paramName: str) Param #
Gets a param by its name.
- getParamValue(paramName)#
Gets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- getRelationTypePerPair()#
Return the list of entity pairs allowed for a given relation.
- hasDefault(param: str | Param[Any]) bool #
Checks whether a param has a default value.
- hasParam(paramName: str) bool #
Tests whether this instance contains a param with a given (string) name.
- inputColsValidation(value)#
- isDefined(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user or has a default value.
- isSet(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user.
- classmethod load(path: str) RL #
Reads an ML instance from the input path, a shortcut of read().load(path).
- static pretrained(name='posology_re', lang='en', remote_loc='clinical/models')#
Download a pre-trained RelationExtractionModel.
- Parameters:
name (str) – Name of the pre-trained model, by default “posology_re”
lang (str) – Language of the pre-trained model, by default “en”
remote_loc (str) – Remote location of the pre-trained model. If None, use the open-source location. Other values are “clinical/models”, “finance/models”, or “legal/models”.
- Returns:
A pre-trained RelationExtractionModel.
- Return type:
RelationExtractionModel
- classmethod read()#
Returns an MLReader instance for this class.
- save(path: str) None #
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
- set(param: Param, value: Any) None #
Sets a parameter in the embedded param map.
- setCustomLabels(labels: dict)#
Sets custom relation labels
- Parameters:
labels (dict[str, str]) – Dictionary which maps old to new labels
- setDatasetInfo(info: str)#
Sets descriptive information about the dataset being used.
- Parameters:
info (str) – Descriptive information about the dataset being used.
- setDirectionSensitive(value: bool)#
Sets whether relation pairs are processed in a direction-sensitive way. If True, only relations in the form “ENTITY1-ENTITY2” are considered. If False, both “ENTITY1-ENTITY2” and “ENTITY2-ENTITY1” relations are considered.
- Parameters:
value (bool) – Whether to be direction-sensitive when processing relation pairs. Default is ‘False’
- setDoExceptionHandling(value: bool)#
If True, exceptions are handled. If data that causes an exception is passed to the model, an error annotation containing the exception message is emitted, and processing continues with the next item. This comes with a performance penalty.
- Parameters:
value (bool) – If True, exceptions are handled.
- setFeatureScaling(feature_scaling: str)#
Sets Feature scaling method.
Possible values are ‘zscore’, ‘minmax’ or empty (no scaling).
- Parameters:
feature_scaling (str) – Feature scaling method. Possible values are ‘zscore’, ‘minmax’ or empty (no scaling).
- setFilterByTokenDistance(distance: int)#
Sets the number of tokens between entities (Default: 0).
The model only finds relations whose entities have fewer than the specified number of tokens between them.
- Parameters:
distance (int) – Filtering criterion based on the number of tokens between entities (Default: 0).
- setForceInputTypeValidation(etfm)#
- setInputCols(*value)#
Sets column names of input annotations.
- Parameters:
*value (List[str]) – Input columns for the annotator
- setLazyAnnotator(value)#
Sets whether Annotator should be evaluated lazily in a RecursivePipeline.
- Parameters:
value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline
- setMaxSyntacticDistance(distance)#
Sets maximal syntactic distance, as threshold (Default: 0).
Determine how far the “from entity” can be from the “to entity” in the text. Increasing this value will increase recall, but also increase the number of false positives.
- Parameters:
distance (int) – Maximal syntactic distance, as threshold (Default: 0)
- setMultiClass(value: bool)#
Sets the model in multi class prediction mode (Default: false).
- Parameters:
value (bool) – Whether to return only the label with the highest confidence score or all labels
- setOutputCol(value)#
Sets output column name of annotations.
- Parameters:
value (str) – Name of output column
- setParamValue(paramName)#
Sets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- setParams()#
- setPredictionThreshold(threshold: float)#
Sets the minimal activation of the target unit required to encode a new relation instance.
- Parameters:
threshold (float) – Minimal activation of the target unit to encode a new relation instance
- setRelationPairs(pairs)#
Sets the list of dash-separated pairs of named entities to be processed.
- Parameters:
pairs (str or list[str]) – List of dash-separated pairs of named entities to be processed.
- setRelationPairsCaseSensitive(value: bool)#
Sets the case sensitivity of relation pairs
- Parameters:
value (bool) – whether relation pairs are case sensitive
- setRelationTypePerPair(relationTypePairs)#
Set the list of entity pairs allowed for a given relation.
- Parameters:
relationTypePairs (dict[str, list[str]]) – Dictionary which maps relation types to a list of entity pairs.
- setScopeWindow(value)#
Sets the scope window of the feature generation for relations.
- Parameters:
value ([int, int]) – Left and right offset of the scope window. Offsets must be non-negative values.
- transform(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = None) pyspark.sql.dataframe.DataFrame #
Transforms the input dataset with optional parameters.
New in version 1.3.0.
- Parameters:
dataset (pyspark.sql.DataFrame) – input dataset
params (dict, optional) – an optional param map that overrides embedded params.
- Returns:
transformed dataset
- Return type:
pyspark.sql.DataFrame
- write() JavaMLWriter #
Returns an MLWriter instance for this ML instance.
- class RelationExtractionApproach(classname='com.johnsnowlabs.nlp.annotators.re.RelationExtractionApproach')#
Bases:
sparknlp_jsl.annotator.generic_classifier.generic_classifier.GenericClassifierApproach, sparknlp_jsl.annotator.handle_exception_params.HandleExceptionParams
Trains a TensorFlow model for relation extraction.
For pretrained models, see the documentation of RelationExtractionModel.
To train a custom relation extraction model, you first need to create a TensorFlow graph using either the TfGraphBuilder annotator or the tf_graph module. Then, set the path to the TensorFlow graph using the method .setModelFile("path/to/tensorflow_graph.pb").
If the parameter relationDirectionCol is set, the model will be trained using the direction information (see the parameter description for details). Otherwise, the model won’t have direction between the relation of the entities.
After training a model (using the .fit() method), the resulting object is of class RelationExtractionModel.
Input Annotation types | Output Annotation type
WORD_EMBEDDINGS, POS, CHUNK, DEPENDENCY | CATEGORY
- Parameters:
fromEntityBeginCol – From Entity Beginning Column
fromEntityEndCol – From Entity End Column
fromEntityLabelCol – From Entity Label Column
toEntityBeginCol – To Entity Beginning Column
toEntityEndCol – To Entity End Column
toEntityLabelCol – To Entity Label Column
relationDirectionCol – Relation direction column, which contains the direction of the relation between the “from entity” and the “to entity”. Possible values in this column are:
- “leftwards”: the relation arguments are ordered from right to left (<-) in the text.
- “rightwards”: the relation arguments are ordered from left to right (->) in the text.
- “both”: the order doesn’t matter; the relation is symmetric (<->).
If this parameter is not set, the model will not have direction between the relation of the entities.
pretrainedModelPath – Path to an already trained MedicalNerModel, which is used as a starting point for training the new model. The path can be a local file path, a distributed file path (HDFS, DBFS), or a cloud storage path (S3).
directionSensitive – Specifies direction sensitivity in processing relation pairs. Default is ‘false’.
filterByTokenDistance – Filtering criterion based on the number of tokens between entities (Default: 0).
scopeWindow – The scope window of the feature generation. Default: [-1, -1].
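One way to read the three values documented for relationDirectionCol is as a rule for ordering the two relation arguments relative to their position in the text. The sketch below encodes that reading in plain Python; it is an interpretation of the parameter description, not the training code, and the helper order_arguments is hypothetical.

```python
def order_arguments(first_in_text, second_in_text, direction):
    """Return the (from, to) argument orderings implied by a direction value."""
    if direction == "rightwards":   # left to right (->)
        return [(first_in_text, second_in_text)]
    if direction == "leftwards":    # right to left (<-)
        return [(second_in_text, first_in_text)]
    if direction == "both":         # symmetric (<->)
        return [(first_in_text, second_in_text), (second_in_text, first_in_text)]
    raise ValueError(f"unknown direction: {direction!r}")


print(order_arguments("aspirin", "50 mg", "leftwards"))
```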
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp_jsl.common import *
>>> from sparknlp.annotator import *
>>> from sparknlp.training import *
>>> import sparknlp_jsl
>>> from sparknlp_jsl.base import *
>>> from sparknlp_jsl.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
...
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("tokens")
...
>>> embedder = WordEmbeddingsModel \
...     .pretrained("embeddings_clinical", "en", "clinical/models") \
...     .setInputCols(["document", "tokens"]) \
...     .setOutputCol("embeddings")
...
>>> posTagger = PerceptronModel \
...     .pretrained("pos_clinical", "en", "clinical/models") \
...     .setInputCols(["document", "tokens"]) \
...     .setOutputCol("posTags")
...
>>> nerTagger = MedicalNerModel \
...     .pretrained("ner_events_clinical", "en", "clinical/models") \
...     .setInputCols(["document", "tokens", "embeddings"]) \
...     .setOutputCol("ner_tags")
...
>>> nerConverter = NerConverter() \
...     .setInputCols(["document", "tokens", "ner_tags"]) \
...     .setOutputCol("nerChunks")
...
>>> dependencyParser = DependencyParserModel \
...     .pretrained("dependency_conllu", "en") \
...     .setInputCols(["document", "posTags", "tokens"]) \
...     .setOutputCol("dependencies")
...
>>> # Assumption: "train_ner_chunks" is a CHUNK column already present in trainData.
>>> re = RelationExtractionApproach() \
...     .setInputCols(["embeddings", "posTags", "train_ner_chunks", "dependencies"]) \
...     .setOutputCol("relations_t") \
...     .setLabelCol("target_rel") \
...     .setEpochsNumber(300) \
...     .setBatchSize(200) \
...     .setLearningRate(0.001) \
...     .setModelFile("path/to/graph_file.pb") \
...     .setFixImbalance(True) \
...     .setValidationSplit(0.05) \
...     .setFromEntity("from_begin", "from_end", "from_label") \
...     .setToEntity("to_begin", "to_end", "to_label")
...
>>> pipeline = Pipeline(stages=[
...     documentAssembler,
...     tokenizer,
...     embedder,
...     posTagger,
...     nerTagger,
...     nerConverter,
...     dependencyParser,
...     re])
>>> model = pipeline.fit(trainData)
- batchSize#
- customLabels#
- datasetInfo#
- directionSensitive#
- doExceptionHandling#
- dropout#
- engine#
- epochsN#
- featureScaling#
- filterByTokenDistance#
- fixImbalance#
- fromEntityBeginCol#
- fromEntityEndCol#
- fromEntityLabelCol#
- getter_attrs = []#
- inputAnnotatorTypes#
- inputCols#
- labelColumn#
- lazyAnnotator#
- learningRate#
- maxSyntacticDistance#
- modelFile#
- multiClass#
- name = 'RelationExtractionApproach'#
- optionalInputAnnotatorTypes = []#
- outputAnnotatorType = 'category'#
- outputCol#
- outputLogsPath#
- overrideExistingLabels#
- pretrainedModelPath#
- relationDirectionCol#
- scopeWindow#
- skipLPInputColsValidation = True#
- toEntityBeginCol#
- toEntityEndCol#
- toEntityLabelCol#
- uid = ''#
- validationSplit#
- clear(param: pyspark.ml.param.Param) None #
Clears a param from the param map if it has been explicitly set.
- copy(extra: pyspark.ml._typing.ParamMap | None = None) JP #
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
- Parameters:
extra (dict, optional) – Extra parameters to copy to the new instance
- Returns:
Copy of this instance
- Return type:
JavaParams
- explainParam(param: str | Param) str #
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams() str #
Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap #
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
- Parameters:
extra (dict, optional) – extra param values
- Returns:
merged param map
- Return type:
dict
- fit(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = ...) M #
- fit(dataset: pyspark.sql.dataframe.DataFrame, params: List[pyspark.ml._typing.ParamMap] | Tuple[pyspark.ml._typing.ParamMap]) List[M]
Fits a model to the input dataset with optional parameters.
New in version 1.3.0.
- Parameters:
dataset (pyspark.sql.DataFrame) – input dataset.
params (dict or list or tuple, optional) – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
- Returns:
fitted model(s)
- Return type:
Transformer or a list of Transformer
- fitMultiple(dataset: pyspark.sql.dataframe.DataFrame, paramMaps: Sequence[pyspark.ml._typing.ParamMap]) Iterator[Tuple[int, M]] #
Fits a model to the input dataset for each param map in paramMaps.
New in version 2.3.0.
- Parameters:
dataset (pyspark.sql.DataFrame) – input dataset.
paramMaps (collections.abc.Sequence) – A Sequence of param maps.
- Returns:
A thread safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.
- Return type:
_FitMultipleIterator
- getEngine()#
- Returns:
Deep Learning engine used for this model
- Return type:
str
- getInputCols()#
Gets current column names of input annotations.
- getLazyAnnotator()#
Gets whether Annotator should be evaluated lazily in a RecursivePipeline.
- getOrDefault(param: str) Any #
- getOrDefault(param: Param[T]) T
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getOutputCol()#
Gets output column name of annotations.
- getParam(paramName: str) Param #
Gets a param by its name.
- getParamValue(paramName)#
Gets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- hasDefault(param: str | Param[Any]) bool #
Checks whether a param has a default value.
- hasParam(paramName: str) bool #
Tests whether this instance contains a param with a given (string) name.
- inputColsValidation(value)#
- isDefined(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user or has a default value.
- isSet(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user.
- classmethod load(path: str) RL #
Reads an ML instance from the input path, a shortcut of read().load(path).
- classmethod read()#
Returns an MLReader instance for this class.
- save(path: str) None #
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
- set(param: Param, value: Any) None #
Sets a parameter in the embedded param map.
- setBatchSize(size: int)#
Size for each batch in the optimization process
- Parameters:
size (int) – Size for each batch in the optimization process
- setCustomLabels(labels: dict)#
Sets custom relation labels
- Parameters:
labels (dict[str, str]) – Dictionary which maps old to new labels
- setDatasetInfo(info: str)#
Sets descriptive information about the dataset being used.
- Parameters:
info (str) – Descriptive information about the dataset being used.
- setDirectionSensitive(value: bool)#
Sets whether relation pairs are processed in a direction-sensitive way. If True, only relations in the form “ENTITY1-ENTITY2” are considered. If False, both “ENTITY1-ENTITY2” and “ENTITY2-ENTITY1” relations are considered.
- Parameters:
value (bool) – Whether to be direction-sensitive when processing relation pairs. Default is ‘False’
- setDoExceptionHandling(value: bool)#
If True, exceptions are handled. If data that causes an exception is passed to the model, an error annotation containing the exception message is emitted, and processing continues with the next item. This comes with a performance penalty.
- Parameters:
value (bool) – If True, exceptions are handled.
- setDropout(dropout: float)#
Sets dropout
- Parameters:
dropout (float) – Dropout at the output of each layer
- setEpochsNumber(epochs: int)#
Sets number of epochs for the optimization process
- Parameters:
epochs (int) – Number of epochs for the optimization process
- setFeatureScaling(feature_scaling: str)#
Sets the feature scaling method. Possible values are ‘zscore’, ‘minmax’ or empty (no scaling).
- Parameters:
feature_scaling (str) – Feature scaling method. Possible values are ‘zscore’, ‘minmax’ or empty (no scaling).
- setFilterByTokenDistance(distance: int)#
Sets the number of tokens between entities (Default: 0).
The model only finds relations whose entities have fewer than the specified number of tokens between them.
- Parameters:
distance (int) – Filtering criterion based on the number of tokens between entities (Default: 0).
- setFixImbalance(fix_imbalance: bool)#
Sets a flag indicating whether to balance the training set.
- Parameters:
fix_imbalance (bool) – A flag indicating whether to balance the training set.
- setForceInputTypeValidation(etfm)#
- setFromEntity(begin_col: str, end_col: str, label_col: str)#
Sets from entity
- Parameters:
begin_col (str) – Column that has a reference of where the chunk begins
end_col (str) – Column that has a reference of where the chunk ends
label_col (str) – Column that has a reference to the type of the chunk
- setInputCols(*value)#
Sets column names of input annotations.
- Parameters:
*value (List[str]) – Input columns for the annotator
- setLabelCol(label_column: str)#
Sets the column with the value we are trying to predict.
- Parameters:
label_column (str) – Column with the value result we are trying to predict.
- setLazyAnnotator(value)#
Sets whether Annotator should be evaluated lazily in a RecursivePipeline.
- Parameters:
value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline
- setLearningRate(learning_rate: float)#
Sets learning rate for the optimization process
- Parameters:
learning_rate (float) – Learning rate for the optimization process
- setMaxSyntacticDistance(distance: int)#
Sets maximal syntactic distance, as threshold (Default: 0)
- Parameters:
distance (int) – Maximal syntactic distance, as threshold (Default: 0)
- setModelFile(mode_file: str)#
Sets the file name to load the model from.
- Parameters:
mode_file (str) – File name to load the model from.
- setMultiClass(value: bool)#
Sets the model in multi class prediction mode (Default: false)
- Parameters:
value (bool) – Whether to return only the label with the highest confidence score or all labels
- setOutputCol(value)#
Sets output column name of annotations.
- Parameters:
value (str) – Name of output column
- setOutputLogsPath(output_logs_path: str)#
Sets path to folder where logs will be saved. If no path is specified, no logs are generated
- Parameters:
output_logs_path (str) – Path to folder where logs will be saved. If no path is specified, no logs are generated
- setParamValue(paramName)#
Sets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- setPretrainedModelPath(value: str)#
Sets location of pretrained model.
- Parameters:
value (str) – Path to an already trained model saved to disk, which is used as a starting point for training the new model.
- setRelationDirectionCol(col: str)#
Sets relation direction column
- Parameters:
col (str) – Column contains the relation direction values
- setScopeWindow(value)#
Sets the scope window of the feature generation for relations.
- Parameters:
value ([int, int]) – Left and right offset of the scope window. Offsets must be non-negative values.
- setToEntity(begin_col: str, end_col: str, label_col: str)#
Sets to entity
- Parameters:
begin_col (str) – Column that has a reference of where the chunk begins
end_col (str) – Column that has a reference of where the chunk ends
label_col (str) – Column that has a reference to the type of the chunk
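To make the column roles of setFromEntity and setToEntity concrete, the sketch below shows one hypothetical training row as a plain dictionary. The column names match the ones used in the class example (from_begin, to_label, target_rel, and so on); the sentence, the offsets, and the assumption that end offsets are inclusive character positions are illustrative only.

```python
def span_text(text, begin, end):
    """Slice a span out of the text, treating `end` as inclusive (assumption)."""
    return text[begin:end + 1]


# Hypothetical training row; offsets are character positions in "text".
row = {
    "text": "She takes 50 mg of aspirin daily.",
    "from_begin": 19, "from_end": 25, "from_label": "DRUG",
    "to_begin": 10, "to_end": 14, "to_label": "STRENGTH",
    "target_rel": "DRUG-STRENGTH",
}

from_chunk = span_text(row["text"], row["from_begin"], row["from_end"])
to_chunk = span_text(row["text"], row["to_begin"], row["to_end"])
print(from_chunk, "->", to_chunk, ":", row["target_rel"])
```

Checking that the offsets reproduce the intended chunk text is a cheap sanity test before training.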
- setValidationSplit(validation_split: float)#
Sets the validation split, i.e. how much data to use for validation
- Parameters:
validation_split (float) – Validation split, i.e. how much data to use for validation
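As a reminder of what a fractional validation split means, the sketch below holds out a share of the rows in plain Python. How the library actually samples the validation rows is not described here, so the shuffling, the seed, and the helper split_rows are assumptions.

```python
import random


def split_rows(rows, validation_split, seed=0):
    """Shuffle rows and hold out a fraction of them for validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_split)
    return shuffled[n_validation:], shuffled[:n_validation]


# With 200 rows and a split of 0.05, 10 rows are held out.
train, validation = split_rows(range(200), validation_split=0.05)
print(len(train), len(validation))
```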
- setОverrideExistingLabels(value: bool)#
Sets whether to override already learned tags. Default is ‘true’.
- Parameters:
value (bool) – Whether to override already learned labels when using a pretrained model to initialize the new model. Default is ‘true’
- write() JavaMLWriter #
Returns an MLWriter instance for this ML instance.
- class RelationExtractionModel(classname='com.johnsnowlabs.nlp.annotators.re.RelationExtractionModel', java_model=None)#
Bases:
sparknlp_jsl.annotator.generic_classifier.generic_classifier.GenericClassifierModel
,sparknlp_jsl.annotator.handle_exception_params.HandleExceptionParams
Extracts and classifies instances of relations between named entities.
This is the AnnotatorModel version. To train a model, use RelationExtractionApproach. Visit the NLP Models Hub for existing pretrained models.
Input Annotation types: WORD_EMBEDDINGS, POS, CHUNK, DEPENDENCY
Output Annotation type: CATEGORY
- Parameters:
predictionThreshold – Minimal activation of the target unit to encode a new relation instance.
relationPairs – List of dash-separated pairs of named entities. For example, [“Biomarker-RelativeDay”] will process all relations between entities of type “Biomarker” and “RelativeDay”.
relationPairsCaseSensitive – Determines whether relation pairs are case sensitive.
relationTypePerPair – List of entity pairs per relation, which limits the entities that can form a relation. For example, {“CAUSE”: [“PROBLEM”, “SYMPTOM”]} only lets a “CAUSE” relation hold between a problem (“PROBLEM”) and a symptom (“SYMPTOM”).
maxSyntacticDistance – Maximal syntactic distance, as threshold (Default: 0)
customLabels – Custom relation labels.
directionSensitive – Specifies direction sensitivity in processing relation pairs. Default is ‘false’.
filterByTokenDistance – Filtering criterion based on the number of tokens between entities (Default: 0).
scopeWindow – The scope window of the feature generation for relations. Default: [-1, -1].
Examples
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp_jsl.common import *
>>> from sparknlp.annotator import *
>>> from sparknlp.training import *
>>> import sparknlp_jsl
>>> from sparknlp_jsl.base import *
>>> from sparknlp_jsl.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
...
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("tokens")
...
>>> embedder = WordEmbeddingsModel \
...     .pretrained("embeddings_clinical", "en", "clinical/models") \
...     .setInputCols(["document", "tokens"]) \
...     .setOutputCol("embeddings")
...
>>> posTagger = PerceptronModel \
...     .pretrained("pos_clinical", "en", "clinical/models") \
...     .setInputCols(["document", "tokens"]) \
...     .setOutputCol("posTags")
...
>>> nerTagger = MedicalNerModel \
...     .pretrained("ner_events_clinical", "en", "clinical/models") \
...     .setInputCols(["document", "tokens", "embeddings"]) \
...     .setOutputCol("ner_tags")
...
>>> nerConverter = NerConverter() \
...     .setInputCols(["document", "tokens", "ner_tags"]) \
...     .setOutputCol("nerChunks")
...
>>> dependencyParser = DependencyParserModel \
...     .pretrained("dependency_conllu", "en") \
...     .setInputCols(["document", "posTags", "tokens"]) \
...     .setOutputCol("dependencies")
...
>>> relationPairs = [
...     "direction-external_body_part_or_region",
...     "external_body_part_or_region-direction",
...     "direction-internal_organ_or_component",
...     "internal_organ_or_component-direction"
... ]
...
>>> # The input columns must match the output columns of the stages above.
>>> re_model = RelationExtractionModel.pretrained("re_bodypart_directions", "en", "clinical/models") \
...     .setInputCols(["embeddings", "posTags", "nerChunks", "dependencies"]) \
...     .setOutputCol("relations") \
...     .setRelationPairs(relationPairs) \
...     .setMaxSyntacticDistance(4) \
...     .setPredictionThreshold(0.9)
...
>>> pipeline = Pipeline(stages=[
...     documentAssembler,
...     tokenizer,
...     embedder,
...     posTagger,
...     nerTagger,
...     nerConverter,
...     dependencyParser,
...     re_model])
...
>>> data = spark.createDataFrame([["MRI demonstrated infarction in the upper brain stem , left cerebellum and right basil ganglia"]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
...
>>> # Show results
>>> result.selectExpr("explode(relations) as relations") \
...     .select(
...         "relations.metadata.chunk1",
...         "relations.metadata.entity1",
...         "relations.metadata.chunk2",
...         "relations.metadata.entity2",
...         "relations.result"
...     ).where("result != 0") \
...     .show(truncate=False)
+------+---------+-------------+---------------------------+------+
|chunk1|entity1  |chunk2       |entity2                    |result|
+------+---------+-------------+---------------------------+------+
|upper |Direction|brain stem   |Internal_organ_or_component|1     |
|left  |Direction|cerebellum   |Internal_organ_or_component|1     |
|right |Direction|basil ganglia|Internal_organ_or_component|1     |
+------+---------+-------------+---------------------------+------+
- classes#
- customLabels#
- datasetInfo#
- directionSensitive#
- doExceptionHandling#
- featureScaling#
- filterByTokenDistance#
- getter_attrs = []#
- inputAnnotatorTypes#
- inputCols#
- lazyAnnotator#
- maxSyntacticDistance#
- multiClass#
- name = 'RelationExtractionModel'#
- optionalInputAnnotatorTypes = []#
- outputAnnotatorType = 'category'#
- outputCol#
- predictionThreshold#
- relationPairs#
- relationPairsCaseSensitive#
- scopeWindow#
- skipLPInputColsValidation = True#
- uid = ''#
- clear(param: pyspark.ml.param.Param) None #
Clears a param from the param map if it has been explicitly set.
- copy(extra: pyspark.ml._typing.ParamMap | None = None) JP #
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
- Parameters:
extra (dict, optional) – Extra parameters to copy to the new instance
- Returns:
Copy of this instance
- Return type:
JavaParams
- explainParam(param: str | Param) str #
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams() str #
Returns the documentation of all params with their optional default values and user-supplied values.
- extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap #
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
- Parameters:
extra (dict, optional) – extra param values
- Returns:
merged param map
- Return type:
dict
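The precedence described above (default param values < user-supplied values < extra) can be pictured with ordinary dictionaries. This is only a sketch of the merge order: `merge_param_maps` is a hypothetical helper, and real param maps are keyed by Param objects rather than strings.

```python
def merge_param_maps(defaults, user_supplied, extra=None):
    """Hypothetical helper mirroring the documented precedence order."""
    merged = dict(defaults)        # lowest priority
    merged.update(user_supplied)   # overrides defaults
    merged.update(extra or {})     # highest priority
    return merged

defaults = {"predictionThreshold": 0.5, "multiClass": False}
user_supplied = {"predictionThreshold": 0.9}
extra = {"multiClass": True}

print(merge_param_maps(defaults, user_supplied, extra))
# {'predictionThreshold': 0.9, 'multiClass': True}
```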
- getClasses()#
Returns labels used to train this model.
- getInputCols()#
Gets current column names of input annotations.
- getLazyAnnotator()#
Gets whether Annotator should be evaluated lazily in a RecursivePipeline.
- getOrDefault(param: str) Any #
- getOrDefault(param: Param[T]) T
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getOutputCol()#
Gets output column name of annotations.
- getParam(paramName: str) Param #
Gets a param by its name.
- getParamValue(paramName)#
Gets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- getRelationTypePerPair()#
Return the list of entity pairs allowed for a given relation.
- hasDefault(param: str | Param[Any]) bool #
Checks whether a param has a default value.
- hasParam(paramName: str) bool #
Tests whether this instance contains a param with a given (string) name.
- inputColsValidation(value)#
- isDefined(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user or has a default value.
- isSet(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user.
- classmethod load(path: str) RL #
Reads an ML instance from the input path, a shortcut of read().load(path).
- static pretrained(name='posology_re', lang='en', remote_loc='clinical/models')#
Download a pre-trained RelationExtractionModel.
- Parameters:
name (str) – Name of the pre-trained model, by default “posology_re”
lang (str) – Language of the pre-trained model, by default “en”
remote_loc (str) – Remote location of the pre-trained model. If None, use the open-source location. Other values are “clinical/models”, “finance/models”, or “legal/models”.
- Returns:
A pre-trained RelationExtractionModel.
- Return type:
RelationExtractionModel
- classmethod read()#
Returns an MLReader instance for this class.
- save(path: str) None #
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
- set(param: Param, value: Any) None #
Sets a parameter in the embedded param map.
- setCustomLabels(labels: dict)#
Sets custom relation labels
- Parameters:
labels (dict[str, str]) – Dictionary which maps old to new labels
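As an illustration of an old-to-new label mapping, the sketch below relabels a list of predicted results in plain Python. `apply_custom_labels` is a hypothetical helper, the label names are made up, and leaving unmapped labels unchanged is an assumption rather than documented behaviour.

```python
def apply_custom_labels(predicted, custom_labels):
    """Hypothetical helper: rename labels, keeping unmapped ones as they are."""
    return [custom_labels.get(label, label) for label in predicted]

custom_labels = {"1": "is_related", "0": "not_related"}
print(apply_custom_labels(["1", "0", "1", "O"], custom_labels))
# ['is_related', 'not_related', 'is_related', 'O']
```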
- setDatasetInfo(info: str)#
Sets descriptive information about the dataset being used.
- Parameters:
info (str) – Descriptive information about the dataset being used.
- setDirectionSensitive(value: bool)#
Sets whether the direction of relation pairs is taken into account. If it is true, only relations in the form of “ENTITY1-ENTITY2” will be considered. If it is false, both “ENTITY1-ENTITY2” and “ENTITY2-ENTITY1” relations will be considered.
- Parameters:
value (bool) – Whether to be sensitive to direction in processing relation pairs. Default is ‘False’
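The effect of direction sensitivity on a list of dash-separated relation pairs can be sketched as follows. `pair_allowed` is a hypothetical helper written for illustration, and splitting each pair on its first dash is an assumption.

```python
def pair_allowed(entity1, entity2, relation_pairs, direction_sensitive=False):
    """Hypothetical helper: is (entity1, entity2) covered by relation_pairs?"""
    # Assumption: each pair is "ENTITY1-ENTITY2", split on the first dash.
    allowed = {tuple(pair.split("-", 1)) for pair in relation_pairs}
    if (entity1, entity2) in allowed:
        return True
    # Without direction sensitivity the reversed order is accepted as well.
    return not direction_sensitive and (entity2, entity1) in allowed

pairs = ["direction-internal_organ_or_component"]
print(pair_allowed("internal_organ_or_component", "direction", pairs))        # True
print(pair_allowed("internal_organ_or_component", "direction", pairs, True))  # False
```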
- setDoExceptionHandling(value: bool)#
If True, exceptions are handled. If data that causes an exception is passed to the model, an error annotation containing the exception message is emitted and processing continues with the next item. This comes with a performance penalty.
- Parameters:
value (bool) – If True, exceptions are handled.
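The behaviour described above follows a common per-item pattern: catch the exception, emit an error record, and move on. The sketch below shows that pattern in plain Python. `annotate_all`, `toy_annotate`, and the shape of the error record are hypothetical and do not reflect the library's internal annotation format.

```python
def annotate_all(rows, annotate, do_exception_handling=False):
    """Hypothetical helper: apply annotate() to each row."""
    results = []
    for row in rows:
        try:
            results.append(annotate(row))
        except Exception as exc:
            if not do_exception_handling:
                raise
            # Illustrative error record carrying the exception message.
            results.append({"annotatorType": "error", "result": str(exc)})
    return results

def toy_annotate(text):
    if not text:
        raise ValueError("empty input")
    return {"annotatorType": "category", "result": text.upper()}

out = annotate_all(["ok", "", "fine"], toy_annotate, do_exception_handling=True)
print([r["annotatorType"] for r in out])  # ['category', 'error', 'category']
```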
- setFeatureScaling(feature_scaling: str)#
Sets the feature scaling method.
Possible values are ‘zscore’, ‘minmax’ or empty (no scaling).
- Parameters:
feature_scaling (str) – Feature scaling method. Possible values are ‘zscore’, ‘minmax’ or empty (no scaling).
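For reference, the two named scaling methods are commonly defined as shown in this plain-Python sketch. `scale` is a hypothetical helper. Whether the library uses the population or the sample standard deviation for ‘zscore’ is not stated here, so the population form is an assumption.

```python
from statistics import mean, pstdev

def scale(values, method=""):
    """Hypothetical helper: scale a list of feature values."""
    if method == "zscore":
        mu, sigma = mean(values), pstdev(values)  # population std (assumption)
        return [(v - mu) / sigma if sigma else 0.0 for v in values]
    if method == "minmax":
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]
    if method == "":
        return list(values)  # empty string: no scaling
    raise ValueError(f"unknown feature scaling method: {method!r}")

print(scale([0, 5, 10], "minmax"))  # [0.0, 0.5, 1.0]
```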
- setFilterByTokenDistance(distance: int)#
Sets the maximum number of tokens between entities (Default: 0).
The model only finds relations that have fewer than the specified number of tokens between them.
- Parameters:
distance (int) – Filtering criterion based on the number of tokens between entities (Default: 0)
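The token-distance criterion can be sketched with token indices. `within_token_distance` is a hypothetical helper, and treating the default value 0 as "no filtering" is an assumption, since the text above does not spell out what 0 means.

```python
def within_token_distance(end_token_1, begin_token_2, max_distance):
    """Hypothetical helper: keep a pair with fewer than max_distance tokens between."""
    if max_distance <= 0:
        return True  # assumption: 0 disables the filter
    tokens_between = begin_token_2 - end_token_1 - 1
    return tokens_between < max_distance

# First entity ends at token 2, second entity begins at token 5:
# tokens 3 and 4 lie in between.
print(within_token_distance(2, 5, 3))  # True
print(within_token_distance(2, 6, 3))  # False
```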
- setForceInputTypeValidation(etfm)#
- setInputCols(*value)#
Sets column names of input annotations.
- Parameters:
*value (List[str]) – Input columns for the annotator
- setLazyAnnotator(value)#
Sets whether Annotator should be evaluated lazily in a RecursivePipeline.
- Parameters:
value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline
- setMaxSyntacticDistance(distance)#
Sets maximal syntactic distance, as threshold (Default: 0).
Determines how far the “from entity” can be from the “to entity” in the text. Increasing this value will increase recall, but also increase the number of false positives.
- Parameters:
distance (int) – Maximal syntactic distance, as threshold (Default: 0)
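One common way to define the syntactic distance between two tokens is the number of edges on the path between them in the dependency tree. The sketch below computes that quantity from a list of head indices. The helpers and the toy tree are hypothetical, and the library's exact definition of syntactic distance is not given here.

```python
def path_to_root(heads, i):
    """Return the token indices from token i up to the root (head == -1)."""
    path = [i]
    while heads[path[-1]] != -1:
        path.append(heads[path[-1]])
    return path

def syntactic_distance(heads, a, b):
    """Number of dependency edges between tokens a and b."""
    depth_from_a = {node: d for d, node in enumerate(path_to_root(heads, a))}
    for depth_from_b, node in enumerate(path_to_root(heads, b)):
        if node in depth_from_a:  # lowest common ancestor
            return depth_from_a[node] + depth_from_b
    raise ValueError("tokens are not in the same tree")

# Toy tree: token 1 is the root; heads[i] is the head of token i.
heads = [1, -1, 1, 2, 3]
print(syntactic_distance(heads, 0, 4))  # 4
print(syntactic_distance(heads, 3, 4))  # 1
```

Under this reading, a pair would be kept only if its distance does not exceed the configured threshold.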
- setMultiClass(value: bool)#
Sets the model in multi class prediction mode (Default: false).
- Parameters:
value (bool) – Whether to return only the label with the highest confidence score or all labels
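The difference between returning only the best label and returning all labels can be sketched with a dictionary of confidence scores. `select_labels` and the label names are hypothetical; the actual output format of the annotator is not shown here.

```python
def select_labels(scores, multi_class=False):
    """Hypothetical helper: pick the top label, or all labels sorted by score."""
    if multi_class:
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best = max(scores, key=scores.get)
    return [(best, scores[best])]

scores = {"not_related": 0.3, "is_related": 0.7}
print(select_labels(scores))                    # [('is_related', 0.7)]
print(select_labels(scores, multi_class=True))  # [('is_related', 0.7), ('not_related', 0.3)]
```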
- setOutputCol(value)#
Sets output column name of annotations.
- Parameters:
value (str) – Name of output column
- setParamValue(paramName)#
Sets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- setParams()#
- setPredictionThreshold(threshold: float)#
Sets the minimal activation of the target unit to encode a new relation instance.
- Parameters:
threshold (float) – Minimal activation of the target unit to encode a new relation instance
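In effect, the threshold acts as a confidence cut-off on candidate relations. The sketch below shows that idea on plain dictionaries. `keep_confident` and the record layout are hypothetical, and using an inclusive comparison (>=) is an assumption.

```python
def keep_confident(relations, threshold):
    """Hypothetical helper: drop candidates below the confidence threshold."""
    return [r for r in relations if r["confidence"] >= threshold]

candidates = [
    {"chunk1": "upper", "chunk2": "brain stem", "confidence": 0.99},
    {"chunk1": "upper", "chunk2": "cerebellum", "confidence": 0.42},
]
print([r["chunk2"] for r in keep_confident(candidates, 0.9)])  # ['brain stem']
```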
- setRelationPairs(pairs)#
Sets the list of dash-separated pairs of named entities to be processed.
- Parameters:
pairs (str or list[str]) – List of dash-separated pairs of named entities to be processed.
- setRelationPairsCaseSensitive(value: bool)#
Sets the case sensitivity of relation pairs.
- Parameters:
value (bool) – whether relation pairs are case sensitive
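Case sensitivity matters because entity labels produced by an NER model (for example “Direction”) may differ in case from the configured pairs (for example “direction-internal_organ_or_component”). The sketch below shows the comparison in plain Python. `matches_pair` is a hypothetical helper, not a library function.

```python
def matches_pair(entity1, entity2, relation_pairs, case_sensitive=False):
    """Hypothetical helper: compare an entity pair against configured pairs."""
    candidate = f"{entity1}-{entity2}"
    if case_sensitive:
        return candidate in relation_pairs
    # Case-insensitive: lower-case both sides before comparing.
    return candidate.lower() in {pair.lower() for pair in relation_pairs}

configured = ["direction-internal_organ_or_component"]
print(matches_pair("Direction", "Internal_organ_or_component", configured))        # True
print(matches_pair("Direction", "Internal_organ_or_component", configured, True))  # False
```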
- setRelationTypePerPair(relationTypePairs)#
Set the list of entity pairs allowed for a given relation.
- Parameters:
relationTypePairs (dict[str, list[str]]) – Dictionary which maps relation types to a list of entity pairs.
- setScopeWindow(value)#
Sets the scope window of the feature generation for relations.
- Parameters:
value ([int, int]) – Left and right offset of the scope window. Offsets must be non-negative values
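One way to picture a left and right offset is as the number of context tokens kept on each side of an entity span. The sketch below is illustrative only: `scope_tokens` is a hypothetical helper, and treating a negative offset as "no limit" is an assumption suggested by the documented default of [-1, -1], not confirmed behaviour.

```python
def scope_tokens(tokens, begin, end, window):
    """Hypothetical helper: tokens[begin..end] plus context on each side."""
    left, right = window
    # Assumption: a negative offset means the window is unbounded on that side.
    start = 0 if left < 0 else max(0, begin - left)
    stop = len(tokens) if right < 0 else min(len(tokens), end + 1 + right)
    return tokens[start:stop]

tokens = "MRI demonstrated infarction in the upper brain stem".split()
# Entity "upper" is token 5; keep 2 tokens to the left and 1 to the right.
print(scope_tokens(tokens, 5, 5, [2, 1]))  # ['in', 'the', 'upper', 'brain']
```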
- transform(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = None) pyspark.sql.dataframe.DataFrame #
Transforms the input dataset with optional parameters.
New in version 1.3.0.
- Parameters:
dataset (pyspark.sql.DataFrame) – input dataset
params (dict, optional) – an optional param map that overrides embedded params.
- Returns:
transformed dataset
- Return type:
pyspark.sql.DataFrame
- write() JavaMLWriter #
Returns an MLWriter instance for this ML instance.