sparknlp_jsl.annotator.rag.vectordb_post_processor
Contains the VectorDBPostProcessor class.
Module Contents#
Classes#
VectorDBPostProcessor is used to filter and sort the annotations from the VectorDBModel.
- class VectorDBPostProcessor(classname='com.johnsnowlabs.nlp.annotators.rag.VectorDBPostProcessor', java_model=None)#
Bases:
sparknlp_jsl.common.AnnotatorModelInternal
VectorDBPostProcessor is used to filter and sort the annotations from the
sparknlp_jsl.annotator.resolution.VectorDBModel.
Input Annotation types: VECTOR_SIMILARITY_RANKINGS
Output Annotation type: VECTOR_SIMILARITY_RANKINGS
- Parameters:
filterBy (str) – The filterBy parameter is used to select and prioritize filter options. Options: metadata, and diversity_by_threshold. Options can be given as a comma separated string like “metadata, diversity_by_threshold”. The order of the options will be used to filter the annotations. - metadata: Filter by metadata fields. The metadataCriteria parameter should be set. - diversity_by_threshold: Filter by diversity threshold. Filter by the distance between the sorted annotations. When diversity_by_threshold option is set, diversityThreshold parameter should be used to set the threshold. Default: metadata
sortBy (str) – The sortBy parameter is used to select the sorting option. Options: ascending, descending, lost_in_the_middle, diversity. - ascending: Sort by ascending order of distance. - descending: Sort by descending order of distance. - lost_in_the_middle: Sort with the lost-in-the-middle ranker. For example, 5 annotations with distances [1, 2, 3, 4, 5] are reordered as [1, 3, 5, 4, 2]. - diversity: Sort with the diversity ranker. The annotations are sorted by distance, the first annotation is selected, and then each next annotation is the one with the maximum average distance from the already selected annotations. Default: ascending
caseSensitive (bool) – Whether the criteria of the string operators are case sensitive or not. For example, if set to False, the operator “equals” will match “John” with “john”. Default: False
diversityThreshold (float) – The diversityThreshold parameter is used to set the threshold for the diversityByThreshold filter. The diversityByThreshold filter selects the annotations by the distance between the sorted annotations. diversityThreshold must be greater than 0. Default: 0.01
maxTopKAfterFiltering (int) – The maxTopKAfterFiltering parameter is used to set the maximum number of annotations to return after filtering. If the number of annotations after filtering is greater than maxTopKAfterFiltering, the top maxTopKAfterFiltering annotations are selected. maxTopKAfterFiltering must be greater than 0. Default: 20
allowZeroContentAfterFiltering (bool) – Whether to allow zero annotations after filtering. If set to True, the output may contain zero annotations if all annotations are filtered out. If set to False, the output will contain at least one annotation whenever possible. Default: False
metadataCriteria (list[dict]) – The metadataCriteria parameter is used to filter the annotations by metadata fields.
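The diversity_by_threshold behavior described above can be sketched in plain Python. This is an illustration of the documented semantics only, not the library's actual implementation; the exact distance comparison used internally may differ:

```python
def diversity_by_threshold(distances, threshold=0.01):
    """Keep sorted distances that differ from the last kept one by at least `threshold`."""
    kept = []
    for d in sorted(distances):
        # The first annotation is always kept; later ones must be
        # sufficiently far from the previously kept annotation.
        if not kept or d - kept[-1] >= threshold:
            kept.append(d)
    return kept

# Near-duplicates 0.105 and 0.131 fall within the threshold and are dropped.
print(diversity_by_threshold([0.10, 0.105, 0.13, 0.131, 0.20], threshold=0.01))
```

Combined with maxTopKAfterFiltering, this kind of filter trims redundant near-identical hits before the top-k cut is applied.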
- allowZeroContentAfterFiltering#
- caseSensitive#
- diversityThreshold#
- filterBy#
- getter_attrs = []#
- inputAnnotatorTypes#
- inputCols#
- lazyAnnotator#
- maxTopKAfterFiltering#
- name = 'VectorDBPostProcessor'#
- optionalInputAnnotatorTypes = []#
- outputAnnotatorType = 'vector_similarity_rankings'#
- outputCol#
- skipLPInputColsValidation = True#
- sortBy#
- uid = ''#
- clear(param: pyspark.ml.param.Param) None #
Clears a param from the param map if it has been explicitly set.
- copy(extra: pyspark.ml._typing.ParamMap | None = None) JP #
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
- Parameters:
extra (dict, optional) – Extra parameters to copy to the new instance
- Returns:
Copy of this instance
- Return type:
JavaParams
- explainParam(param: str | Param) str #
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams() str #
Returns the documentation of all params with their optionally default values and user-supplied values.
- extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap #
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
- Parameters:
extra (dict, optional) – extra param values
- Returns:
merged param map
- Return type:
dict
- getInputCols()#
Gets current column names of input annotations.
- getLazyAnnotator()#
Gets whether Annotator should be evaluated lazily in a RecursivePipeline.
- getOrDefault(param: str) Any #
- getOrDefault(param: Param[T]) T
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getOutputCol()#
Gets output column name of annotations.
- getParam(paramName: str) Param #
Gets a param by its name.
- getParamValue(paramName)#
Gets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- hasDefault(param: str | Param[Any]) bool #
Checks whether a param has a default value.
- hasParam(paramName: str) bool #
Tests whether this instance contains a param with a given (string) name.
- inputColsValidation(value)#
- isDefined(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user or has a default value.
- isSet(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user.
- classmethod load(path: str) RL #
Reads an ML instance from the input path, a shortcut of read().load(path).
- classmethod read()#
Returns an MLReader instance for this class.
- save(path: str) None #
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
- set(param: Param, value: Any) None #
Sets a parameter in the embedded param map.
- setAllowZeroContentAfterFiltering(value: bool)#
Sets whether to allow zero annotations after filtering. If set to True, the output may contain zero annotations if all annotations are filtered out. If set to False, the output will contain at least one annotation whenever possible.
Default: False
- Parameters:
value (bool) – Whether to allow zero annotation after filtering.
- setCaseSensitive(value: bool)#
Sets whether the criteria of the string operators are case sensitive or not.
For example, if set to False, the operator “equals” will match “John” with “john”.
Default: False
- Parameters:
value (bool) – Whether the criteria of the string operators are case sensitive or not.
- setDiversityThreshold(value: float)#
Sets the diversityThreshold parameter, the threshold for the diversityByThreshold filter. The diversityByThreshold filter selects annotations by the distance between the sorted annotations. diversityThreshold must be greater than 0. Default: 0.01
- Parameters:
value (float) – The threshold for the diversityByThreshold filter.
- setFilterBy(value: str)#
Sets the filterBy parameter, which is used to select and prioritize filter options.
Options: metadata, and diversity_by_threshold. Options can be given as a comma separated string like “metadata, diversity_by_threshold”. The order of the options will be used to filter the annotations.
metadata: Filter by metadata fields. The metadataCriteria parameter should be set.
diversity_by_threshold: Filter by diversity threshold. Filter by the distance between the sorted annotations.
When diversity_by_threshold option is set, diversityThreshold parameter should be used to set the threshold.
Default: metadata
- Parameters:
value (str) – The filter options to apply, as a comma-separated string. Default: metadata
- setForceInputTypeValidation(etfm)#
- setInputCols(*value)#
Sets column names of input annotations.
- Parameters:
*value (List[str]) – Input columns for the annotator
- setLazyAnnotator(value)#
Sets whether Annotator should be evaluated lazily in a RecursivePipeline.
- Parameters:
value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline
- setMaxTopKAfterFiltering(value: int)#
Sets the maxTopKAfterFiltering parameter which is used to set the maximum number of annotations to return after filtering. If the number of annotations after filtering is greater than maxTopKAfterFiltering, the top maxTopKAfterFiltering annotations are selected. maxTopKAfterFiltering must be greater than 0.
Default: 20
- Parameters:
value (int) – The maxTopKAfterFiltering parameter is used to set the maximum number of annotations to return after filtering.
- setMetadataCriteria(value: list)#
Sets the metadataCriteria parameter, which is used to filter the annotations by metadata fields. The metadataCriteria param is a list of dictionaries. Each dictionary should contain the following keys:
field: The field of the metadata to filter.
fieldType: The type of the field to filter. Options: string, int, float, date.
operator: The operator to apply to the filter. Options: equals, not_equals, greater_than, greater_than_or_equals, less_than, less_than_or_equals, contains, not_contains, regex.
value: The value to filter.
matchMode: The match mode to apply to the filter. Options: any, all, none.
matchValues: The values to filter.
dateFormats: The date formats to parse the date metadata field.
converterFallback: The converter fallback when hitting cast exception. Options: filter, not_filter, error.
Notes:#
field, fieldType, and operator are required. Other keys are optional.
If fieldType is set to string, the supported operators are: equals, not_equals, contains, not_contains, regex.
If fieldType is set to int, float, or date, the supported operators are: equals, not_equals, greater_than, greater_than_or_equals, less_than, less_than_or_equals.
If matchMode and matchValues are not set, value must be set.
If value is set, matchMode and matchValues are ignored.
If fieldType is set to date, dateFormats must be set.
matchMode and matchValues must be set together.
If converterFallback is set to error, the filter will throw an error when hitting a cast exception. Default: error.
Example:#
>>> VectorDBPostProcessor() \
...     .setFilterBy("metadata") \
...     .setMetadataCriteria([
...         {"field": "publish_date", "fieldType": "date", "operator": "greater_than", "value": "2022 May 11", "dateFormats": ["yyyy MMM dd", "yyyy MMM d"], "converterFallback": "filter"},
...         {"field": "distance", "fieldType": "float", "operator": "less_than", "value": "0.5470"},
...         {"field": "title", "fieldType": "string", "operator": "contains", "matchMode": "any", "matchValues": ["diabetes", "immune system"]}])
- Parameters:
value (list[dict]) – The metadataCriteria parameter is used to filter the annotations by metadata fields.
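The matchMode/matchValues semantics can be illustrated with a plain-Python sketch of the contains operator. The helper below is hypothetical and not part of the library API; it only demonstrates how any, all, and none combine multiple matchValues:

```python
def contains_match(field_value, match_mode, match_values):
    # One hit flag per candidate value, using substring containment ("contains").
    hits = [v in field_value for v in match_values]
    if match_mode == "any":
        return any(hits)    # at least one value must match
    if match_mode == "all":
        return all(hits)    # every value must match
    if match_mode == "none":
        return not any(hits)  # no value may match
    raise ValueError(f"unknown matchMode: {match_mode}")

title = "Diet, diabetes, and the immune system"
print(contains_match(title, "any", ["diabetes", "cancer"]))
print(contains_match(title, "all", ["diabetes", "cancer"]))
```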
- setMetadataCriteriaAsStr(value: str)#
Sets the metadataCriteria parameter from a string, which is used to filter the annotations by metadata fields. The metadataCriteria param is a list of dictionaries. Each dictionary should contain the following keys:
field: The field of the metadata to filter.
fieldType: The type of the field to filter. Options: string, int, float, date.
operator: The operator to apply to the filter. Options: equals, not_equals, greater_than, greater_than_or_equals, less_than, less_than_or_equals, contains, not_contains, regex.
value: The value to filter.
matchMode: The match mode to apply to the filter. Options: any, all, none.
matchValues: The values to filter.
dateFormats: The date formats to parse the date metadata field.
converterFallback: The converter fallback when hitting cast exception. Options: filter, not_filter, error.
Notes:#
field, fieldType, and operator are required. Other keys are optional.
If fieldType is set to string, the supported operators are: equals, not_equals, contains, not_contains, regex.
If fieldType is set to int, float, or date, the supported operators are: equals, not_equals, greater_than, greater_than_or_equals, less_than, less_than_or_equals.
If matchMode and matchValues are not set, value must be set.
If value is set, matchMode and matchValues are ignored.
If fieldType is set to date, dateFormats must be set.
matchMode and matchValues must be set together.
If converterFallback is set to error, the filter will throw an error when hitting a cast exception. Default: error.
- Parameters:
value (str) – The metadataCriteria parameter is used to filter the annotations by metadata fields.
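The docstring does not spell out the exact string encoding expected by setMetadataCriteriaAsStr; a JSON-serialized list of criteria dictionaries is a natural way to build such a string. Treat the encoding as an assumption and verify it against your library version:

```python
import json

# Hypothetical criteria list, serialized to a single string.
criteria = [
    {"field": "distance", "fieldType": "float", "operator": "less_than", "value": "0.5470"},
]
criteria_str = json.dumps(criteria)
print(criteria_str)
# processor.setMetadataCriteriaAsStr(criteria_str)  # assumed usage, not verified
```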
- setOutputCol(value)#
Sets output column name of annotations.
- Parameters:
value (str) – Name of output column
- setParamValue(paramName)#
Sets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- setParams()#
- setSortBy(value: str)#
Sets the sortBy parameter, which is used to select the sorting option. Options: ascending, descending, lost_in_the_middle, diversity.
ascending: Sort by ascending order of distance.
descending: Sort by descending order of distance.
lost_in_the_middle: Sort with the lost-in-the-middle ranker. For example, 5 annotations with distances [1, 2, 3, 4, 5] are reordered as [1, 3, 5, 4, 2].
diversity: Sort with the diversity ranker. The annotations are sorted by distance, the first annotation is selected, and then each next annotation is the one with the maximum average distance from the already selected annotations.
Default: ascending
- Parameters:
value (str) – The sortBy parameter is used to select the sorting option. Default: ascending
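The lost_in_the_middle ordering described above can be reproduced with a short sketch (illustrative only, not the library's implementation): take the ascending list, keep the odd-positioned items in order, then append the even-positioned items in reverse, so the best results land at both edges of the context rather than in the middle:

```python
def lost_in_the_middle(sorted_distances):
    # Best items go to the ends, worst to the middle: [1,2,3,4,5] -> [1,3,5,4,2].
    return sorted_distances[0::2] + sorted_distances[1::2][::-1]

print(lost_in_the_middle([1, 2, 3, 4, 5]))  # [1, 3, 5, 4, 2]
```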
- transform(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = None) pyspark.sql.dataframe.DataFrame #
Transforms the input dataset with optional parameters.
New in version 1.3.0.
- Parameters:
dataset (pyspark.sql.DataFrame) – input dataset
params (dict, optional) – an optional param map that overrides embedded params
- Returns:
transformed dataset
- Return type:
pyspark.sql.DataFrame
- write() JavaMLWriter #
Returns an MLWriter instance for this ML instance.