sparknlp_jsl.annotator.rag.vectordb_post_processor#

Contains the VectorDBPostProcessor class.

Module Contents#

Classes#

VectorDBPostProcessor

VectorDBPostProcessor is used to filter and sort the annotations from the sparknlp_jsl.annotator.resolution.VectorDBModel.

class VectorDBPostProcessor(classname='com.johnsnowlabs.nlp.annotators.rag.VectorDBPostProcessor', java_model=None)#

Bases: sparknlp_jsl.common.AnnotatorModelInternal

VectorDBPostProcessor is used to filter and sort the annotations from the sparknlp_jsl.annotator.resolution.VectorDBModel.

Input Annotation types: VECTOR_SIMILARITY_RANKINGS

Output Annotation type: VECTOR_SIMILARITY_RANKINGS

Parameters:
  • filterBy (str) – Selects and prioritizes the filter options. Options: metadata and diversity_by_threshold. Options can be given as a comma separated string like “metadata, diversity_by_threshold”; the filters are applied in the given order. - metadata: Filter by metadata fields; the metadataCriteria parameter should be set. - diversity_by_threshold: Filter by the distance between the sorted annotations; when this option is set, the diversityThreshold parameter should be used to set the threshold. Default: metadata

  • sortBy (str) – Selects the sorting option. Options: ascending, descending, lost_in_the_middle, diversity. - ascending: Sort by ascending order of distance. - descending: Sort by descending order of distance. - lost_in_the_middle: Sort using the lost-in-the-middle ranker; for example, five annotations with distances [1, 2, 3, 4, 5] are reordered as [1, 3, 5, 4, 2]. - diversity: Sort using the diversity ranker; the annotations are sorted by distance, the first annotation is selected, and each subsequent annotation is chosen to maximize the average distance from the already selected annotations. Default: ascending

  • caseSensitive (bool) – Whether the criteria of the string operators are case sensitive or not. For example, if set to False, the operator “equals” will match “John” with “john”. Default: False

  • diversityThreshold (float) – The diversityThreshold parameter is used to set the threshold for the diversityByThreshold filter. The diversityByThreshold filter selects the annotations by the distance between the sorted annotations. diversityThreshold must be greater than 0. Default: 0.01

  • maxTopKAfterFiltering (int) – The maxTopKAfterFiltering parameter is used to set the maximum number of annotations to return after filtering. If the number of annotations after filtering is greater than maxTopKAfterFiltering, the top maxTopKAfterFiltering annotations are selected. maxTopKAfterFiltering must be greater than 0. Default: 20

  • allowZeroContentAfterFiltering (bool) – Whether to allow zero annotations after filtering. If set to True, the output may contain zero annotations when all annotations are filtered out. If set to False, the processor tries to keep at least one annotation in the output. Default: False

  • metadataCriteria (list[dict]) – Filters the annotations by metadata fields; see setMetadataCriteria for the supported keys, and the combined usage example below.
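
Example:#

A minimal configuration sketch combining the parameters above. The input and output column names (“vector_db_results”, “post_processed”) are illustrative placeholders, and the import path simply mirrors this module; the class may also be re-exported from sparknlp_jsl.annotator.

>>> from sparknlp_jsl.annotator.rag.vectordb_post_processor import VectorDBPostProcessor
>>> post_processor = VectorDBPostProcessor() \
...     .setInputCols(["vector_db_results"]) \
...     .setOutputCol("post_processed") \
...     .setFilterBy("metadata, diversity_by_threshold") \
...     .setSortBy("ascending") \
...     .setDiversityThreshold(0.01) \
...     .setMaxTopKAfterFiltering(20) \
...     .setAllowZeroContentAfterFiltering(False)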

allowZeroContentAfterFiltering#
caseSensitive#
diversityThreshold#
filterBy#
getter_attrs = []#
inputAnnotatorTypes#
inputCols#
lazyAnnotator#
maxTopKAfterFiltering#
name = 'VectorDBPostProcessor'#
optionalInputAnnotatorTypes = []#
outputAnnotatorType = 'vector_similarity_rankings'#
outputCol#
skipLPInputColsValidation = True#
sortBy#
uid = ''#
clear(param: pyspark.ml.param.Param) None#

Clears a param from the param map if it has been explicitly set.

copy(extra: pyspark.ml._typing.ParamMap | None = None) JP#

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters:

extra (dict, optional) – Extra parameters to copy to the new instance

Returns:

Copy of this instance

Return type:

JavaParams

explainParam(param: str | Param) str#

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams() str#

Returns the documentation of all params with their optionally default values and user-supplied values.

extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap#

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:

extra (dict, optional) – extra param values

Returns:

merged param map

Return type:

dict

getInputCols()#

Gets current column names of input annotations.

getLazyAnnotator()#

Gets whether Annotator should be evaluated lazily in a RecursivePipeline.

getOrDefault(param: str) Any#
getOrDefault(param: Param[T]) T

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()#

Gets output column name of annotations.

getParam(paramName: str) Param#

Gets a param by its name.

getParamValue(paramName)#

Gets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

hasDefault(param: str | Param[Any]) bool#

Checks whether a param has a default value.

hasParam(paramName: str) bool#

Tests whether this instance contains a param with a given (string) name.

inputColsValidation(value)#
isDefined(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user or has a default value.

isSet(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user.

classmethod load(path: str) RL#

Reads an ML instance from the input path, a shortcut of read().load(path).

classmethod read()#

Returns an MLReader instance for this class.

save(path: str) None#

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param: Param, value: Any) None#

Sets a parameter in the embedded param map.

setAllowZeroContentAfterFiltering(value: bool)#

Sets whether to allow zero annotations after filtering. If set to True, the output may contain zero annotations when all annotations are filtered out. If set to False, the processor tries to keep at least one annotation in the output.

Default: False

Parameters:

value (bool) – Whether to allow zero annotation after filtering.

setCaseSensitive(value: bool)#

Sets whether the criteria of the string operators are case sensitive or not.

For example, if set to False, the operator “equals” will match “John” with “john”.

Default: False

Parameters:

value (bool) – Whether the criteria of the string operators are case sensitive or not.

setDiversityThreshold(value: float)#

Sets the threshold for the diversityByThreshold filter, which selects the annotations by the distance between the sorted annotations. diversityThreshold must be greater than 0. Default: 0.01

Parameters:

value (float) – The diversityThreshold parameter is used to set the threshold for the diversityByThreshold filter.

setFilterBy(value: str)#

Sets the filterBy parameter, which selects and prioritizes the filter options.

Options: metadata, and diversity_by_threshold. Options can be given as a comma separated string like “metadata, diversity_by_threshold”. The order of the options will be used to filter the annotations.

  • metadata: Filter by metadata fields. The metadataCriteria parameter should be set.

  • diversity_by_threshold: Filter by diversity threshold, i.e. by the distance between the sorted annotations.

When diversity_by_threshold option is set, diversityThreshold parameter should be used to set the threshold.

Default: metadata

Parameters:

value (str) – The filterBy parameter is used to select and prioritize filter options. Default: metadata
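
For instance, to apply metadata filtering first and then the diversity threshold filter (a sketch; the metadata field “source” and its value are hypothetical):

>>> post_processor = VectorDBPostProcessor() \
...     .setFilterBy("metadata, diversity_by_threshold") \
...     .setDiversityThreshold(0.05) \
...     .setMetadataCriteria([
...         {"field": "source", "fieldType": "string", "operator": "equals", "value": "pubmed"}])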

setForceInputTypeValidation(etfm)#
setInputCols(*value)#

Sets column names of input annotations.

Parameters:

*value (List[str]) – Input columns for the annotator

setLazyAnnotator(value)#

Sets whether Annotator should be evaluated lazily in a RecursivePipeline.

Parameters:

value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline

setMaxTopKAfterFiltering(value: int)#

Sets the maxTopKAfterFiltering parameter, the maximum number of annotations to return after filtering. If the number of annotations after filtering is greater than maxTopKAfterFiltering, only the top maxTopKAfterFiltering annotations are kept. maxTopKAfterFiltering must be greater than 0.

Default: 20

Parameters:

value (int) – The maxTopKAfterFiltering parameter is used to set the maximum number of annotations to return after filtering.

setMetadataCriteria(value: list)#

Sets the metadataCriteria parameter, which filters the annotations by metadata fields. The metadataCriteria param is a list of dictionaries. A dictionary should contain the following keys:

  • field: The field of the metadata to filter.

  • fieldType: The type of the field to filter. Options: string, int, float, date.

  • operator: The operator to apply to the filter. Options: equals, not_equals, greater_than, greater_than_or_equals, less_than, less_than_or_equals, contains, not_contains, regex.

  • value: The value to filter.

  • matchMode: The match mode to apply to the filter. Options: any, all, none.

  • matchValues: The values to filter.

  • dateFormats: The date formats to parse the date metadata field.

  • converterFallback: The converter fallback when hitting cast exception. Options: filter, not_filter, error.

Notes:#

  • field, fieldType, and operator are required. Other keys are optional.

  • If fieldType is set to string, the supported operators are: equals, not_equals, contains, not_contains, regex.

  • If fieldType is set to int, float, or date, the supported operators are: equals, not_equals, greater_than, greater_than_or_equals, less_than, less_than_or_equals.

  • If matchMode and matchValues are not set, value must be set.

  • If value is set, matchMode and matchValues are ignored.

  • If fieldType is set to date, dateFormats must be set.

  • matchMode and matchValues must be set together.

  • If converterFallback is set to error, the filter will throw an error when a cast exception occurs. Default: error.

Example:#

>>> VectorDBPostProcessor() \
...     .setFilterBy('metadata') \
...     .setMetadataCriteria([
...         {"field": "publish_date", "fieldType": "date", "operator": "greater_than", "value": "2022 May 11", "dateFormats": ["yyyy MMM dd", "yyyy MMM d"], "converterFallback": "filter"},
...         {"field": "distance", "fieldType": "float", "operator": "less_than", "value": "0.5470"},
...         {"field": "title", "fieldType": "string", "operator": "contains", "matchMode": "any", "matchValues": ["diabetes", "immune system"]}])
Parameters:

value (list[dict]) – The metadataCriteria parameter is used to filter the annotations by metadata fields.

setMetadataCriteriaAsStr(value: str)#

Sets the metadataCriteria parameter (as a string), which filters the annotations by metadata fields. The metadataCriteria param describes a list of dictionaries. A dictionary should contain the following keys:

  • field: The field of the metadata to filter.

  • fieldType: The type of the field to filter. Options: string, int, float, date.

  • operator: The operator to apply to the filter. Options: equals, not_equals, greater_than, greater_than_or_equals, less_than, less_than_or_equals, contains, not_contains, regex.

  • value: The value to filter.

  • matchMode: The match mode to apply to the filter. Options: any, all, none.

  • matchValues: The values to filter.

  • dateFormats: The date formats to parse the date metadata field.

  • converterFallback: The converter fallback when hitting cast exception. Options: filter, not_filter, error.

Notes:#

  • field, fieldType, and operator are required. Other keys are optional.

  • If fieldType is set to string, the supported operators are: equals, not_equals, contains, not_contains, regex.

  • If fieldType is set to int, float, or date, the supported operators are: equals, not_equals, greater_than, greater_than_or_equals, less_than, less_than_or_equals.

  • If matchMode and matchValues are not set, value must be set.

  • If value is set, matchMode and matchValues are ignored.

  • If fieldType is set to date, dateFormats must be set.

  • matchMode and matchValues must be set together.

  • If converterFallback is set to error, the filter will throw an error when a cast exception occurs. Default: error.

Parameters:

value (str) – The metadataCriteria parameter is used to filter the annotations by metadata fields.
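
A sketch of setting the criteria as a string, assuming the expected format is a JSON-encoded list of the criteria dictionaries described above (this encoding is an assumption, not something this page documents):

>>> import json
>>> criteria = [
...     {"field": "distance", "fieldType": "float", "operator": "less_than", "value": "0.5470"}]
>>> post_processor = VectorDBPostProcessor().setMetadataCriteriaAsStr(json.dumps(criteria))  # assumes a JSON encoding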

setOutputCol(value)#

Sets output column name of annotations.

Parameters:

value (str) – Name of output column

setParamValue(paramName)#

Sets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

setParams()#
setSortBy(value: str)#

Sets the sortBy parameter, which selects the sorting option. Options: ascending, descending, lost_in_the_middle, diversity.

  • ascending: Sort by ascending order of distance.

  • descending: Sort by descending order of distance.

  • lost_in_the_middle: Sort using the lost-in-the-middle ranker. For example, five annotations with distances [1, 2, 3, 4, 5] are reordered as [1, 3, 5, 4, 2] (a reordering sketch follows this method's parameters).

  • diversity: Sort using the diversity ranker. The annotations are sorted by distance, the first annotation is selected, and each subsequent annotation is chosen to maximize the average distance from the already selected annotations.

Default: ascending

Parameters:

value (str) – The sortBy parameter is used to select the sorting option. Default: ascending
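
The lost_in_the_middle ordering described above can be illustrated with a small pure-Python sketch (an illustration only, not the library's implementation): items sorted best-first are redistributed so the best land at both ends of the list and the worst in the middle.

>>> def lost_in_the_middle(sorted_items):
...     # sorted_items are ordered best-first (ascending distance);
...     # even positions fill the front in order, odd positions fill the back in reverse.
...     front, back = [], []
...     for i, item in enumerate(sorted_items):
...         if i % 2 == 0:
...             front.append(item)
...         else:
...             back.insert(0, item)
...     return front + back
>>> lost_in_the_middle([1, 2, 3, 4, 5])
[1, 3, 5, 4, 2]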

transform(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = None) pyspark.sql.dataframe.DataFrame#

Transforms the input dataset with optional parameters.

New in version 1.3.0.

Parameters:
  • dataset (pyspark.sql.DataFrame) – input dataset

  • params (dict, optional) – an optional param map that overrides embedded params.

Returns:

transformed dataset

Return type:

pyspark.sql.DataFrame

write() JavaMLWriter#

Returns an MLWriter instance for this ML instance.