sparknlp_jsl.annotator.context.contextual_entity_filterer
#
Module Contents#
Classes#
ContextualEntityFilterer can filter chunks coming from CHUNK annotations based on entity (identifier, field) information in the metadata.
- class ContextualEntityFilterer(classname='com.johnsnowlabs.nlp.annotators.context.ContextualEntityFilterer', java_model=None)#
Bases: sparknlp_jsl.common.AnnotatorModelInternal, sparknlp_jsl.annotator.handle_exception_params.HandleExceptionParams
ContextualEntityFilterer can filter chunks coming from CHUNK annotations based on entity (identifier, field) information in the metadata. Filtering can be configured via whitelisted entities, blacklisted entities, blacklisted words, and whitelisted words, and can be applied at the sentence or document scope.
Input Annotation types: DOCUMENT, TOKEN, CHUNK
Output Annotation type: CHUNK
- Parameters:
ruleScope (str) – The scope at which the filter is applied. Options: sentence, document.
rules (list[dict]) – The filtering rules. Each rule is a dictionary with the following keys:
- entity: The target entity field for filtering.
- scopeWindow: A list of two integers [before, after], specifying how many tokens/chunks before and after the target to consider.
- whiteListEntities: The white list of entities. If one of the entities from this list appears within the scope window, the chunk is kept; a single match is enough.
- blackListEntities: The black list of entities. If an entity from this list appears within the scope window, the chunk is filtered out; every listed entity must be absent for the chunk to be kept.
- scopeWindowLevel: Determines whether the scopeWindow is applied at the token or chunk level. Options: token, chunk.
- blackListWords: The black list of words. If a word from this list appears within the scope window, the chunk is filtered out.
- whiteListWords: The white list of words. If a word from this list appears within the scope window, the chunk is kept.
- confidenceThreshold: The confidence threshold to filter the chunks. Filtering is only applied if the confidence of the chunk is below the threshold.
- possibleRegexContext: The possible regex context to filter the chunks. If the regex is found in the context (chunk), the chunk is kept.
- impossibleRegexContext: The impossible regex context to filter the chunks. If the regex is found in the context (chunk), the chunk is removed.
Examples
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from sparknlp_jsl.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentenceDetector = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
...     .setInputCols(["sentence", "token"]) \
...     .setOutputCol("embeddings")
>>> clinical_ner = MedicalNerModel.pretrained("ner_deid_generic_augmented", "en", "clinical/models") \
...     .setInputCols(["sentence", "token", "embeddings"]) \
...     .setOutputCol("ner")
>>> ner_converter = NerConverterInternal() \
...     .setInputCols(["sentence", "token", "ner"]) \
...     .setOutputCol("ner_chunks")
Define the ContextualEntityFilterer with the contextual rules used to filter the chunks.
>>> contextual_entity_filterer = ContextualEntityFilterer() \
...     .setInputCols("sentence", "token", "ner_chunks") \
...     .setOutputCol("filtered_ner_chunks") \
...     .setRules([{
...         "entity": "LOCATION",
...         "scopeWindow": [2, 2],
...         "whiteListEntities": ["AGE", "DATE"],
...         "blackListEntities": ["ID", "NAME"],
...         "scopeWindowLevel": "token",
...         "blackListWords": ["known", "in"],
...     },
...     {
...         "entity": "DATE",
...         "scopeWindow": [2, 2],
...         "whiteListEntities": ["AGE", "DATE"],
...         "blackListEntities": ["ID", "NAME"],
...         "scopeWindowLevel": "chunk",
...         "confidenceThreshold": 0.5
...     }]) \
...     .setRuleScope("sentence")
>>> nlpPipeline = Pipeline(stages=[
...     documentAssembler,
...     sentenceDetector,
...     tokenizer,
...     word_embeddings,
...     clinical_ner,
...     ner_converter,
...     contextual_entity_filterer
... ])
>>> df = spark.createDataFrame([[
...     "California, known for its beautiful beaches, and he is 36 years. "
...     "The Grand Canyon in Arizona, where the age is 37, is a stunning natural landmark. "
...     "It was founded on September 9, 1850, and Arizona on February 14, 1912."
... ]]).toDF("text")
>>> model = nlpPipeline.fit(df).transform(df)
Results:
>>> model.selectExpr("explode(filtered_ner_chunks) as filtered_chunks").show(20, truncate=False)
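The kept chunks can also be flattened for easier inspection. A minimal sketch, assuming the filtered chunks preserve the standard entity and confidence metadata keys produced by NerConverterInternal:
>>> import pyspark.sql.functions as F
>>> model.select(F.explode("filtered_ner_chunks").alias("chunk")) \
...     .select(
...         F.col("chunk.result").alias("chunk_text"),
...         F.col("chunk.metadata")["entity"].alias("entity"),
...         F.col("chunk.metadata")["confidence"].alias("confidence")
...     ).show(truncate=False)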
- caseSensitive#
- doExceptionHandling#
- getter_attrs = []#
- inputAnnotatorTypes#
- inputCols#
- lazyAnnotator#
- name = 'ContextualEntityFilterer'#
- optionalInputAnnotatorTypes = []#
- outputAnnotatorType = 'chunk'#
- outputCol#
- ruleScope#
- skipLPInputColsValidation = True#
- uid = ''#
- clear(param: pyspark.ml.param.Param) None #
Clears a param from the param map if it has been explicitly set.
- copy(extra: pyspark.ml._typing.ParamMap | None = None) JP #
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.
- Parameters:
extra (dict, optional) – Extra parameters to copy to the new instance
- Returns:
Copy of this instance
- Return type:
JavaParams
- explainParam(param: str | Param) str #
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams() str #
Returns the documentation of all params with their optionally default values and user-supplied values.
- extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap #
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
- Parameters:
extra (dict, optional) – extra param values
- Returns:
merged param map
- Return type:
dict
- getInputCols()#
Gets current column names of input annotations.
- getLazyAnnotator()#
Gets whether Annotator should be evaluated lazily in a RecursivePipeline.
- getOrDefault(param: str) Any #
- getOrDefault(param: Param[T]) T
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getOutputCol()#
Gets output column name of annotations.
- getParam(paramName: str) Param #
Gets a param by its name.
- getParamValue(paramName)#
Gets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- hasDefault(param: str | Param[Any]) bool #
Checks whether a param has a default value.
- hasParam(paramName: str) bool #
Tests whether this instance contains a param with a given (string) name.
- inputColsValidation(value)#
- isDefined(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user or has a default value.
- isSet(param: str | Param[Any]) bool #
Checks whether a param is explicitly set by user.
- classmethod load(path: str) RL #
Reads an ML instance from the input path, a shortcut of read().load(path).
- classmethod read()#
Returns an MLReader instance for this class.
- save(path: str) None #
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
- set(param: Param, value: Any) None #
Sets a parameter in the embedded param map.
- setCaseSensitive(value: bool)#
Sets the caseSensitive parameter, which controls whether word matching is case sensitive.
- Parameters:
value (bool) – Whether to match words case-sensitively.
- setDoExceptionHandling(value: bool)#
If True, exceptions are handled. If data that causes an exception is passed to the model, an error annotation containing the exception message is emitted and processing continues with the next record. This comes with a performance penalty.
- Parameters:
value (bool) – If True, exceptions are handled.
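For example, when a malformed row would otherwise fail the whole job, exception handling can be enabled on the filterer. A minimal sketch reusing the column names from the pipeline example above:
>>> contextual_entity_filterer = ContextualEntityFilterer() \
...     .setInputCols("sentence", "token", "ner_chunks") \
...     .setOutputCol("filtered_ner_chunks") \
...     .setDoExceptionHandling(True)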
- setForceInputTypeValidation(etfm)#
- setInputCols(*value)#
Sets column names of input annotations.
- Parameters:
*value (List[str]) – Input columns for the annotator
- setLazyAnnotator(value)#
Sets whether Annotator should be evaluated lazily in a RecursivePipeline.
- Parameters:
value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline
- setOutputCol(value)#
Sets output column name of annotations.
- Parameters:
value (str) – Name of output column
- setParamValue(paramName)#
Sets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- setParams()#
- setRuleScope(value: str)#
Sets the ruleScope parameter, the scope at which the filter is applied. Options: sentence, document.
- Parameters:
value (str) – The scope at which the filter is applied.
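For example, to consider context across the whole document instead of sentence by sentence (a minimal sketch reusing the column names from the pipeline example above):
>>> contextual_entity_filterer = ContextualEntityFilterer() \
...     .setInputCols("sentence", "token", "ner_chunks") \
...     .setOutputCol("filtered_ner_chunks") \
...     .setRuleScope("document")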
- setRules(value: list)#
Sets the rules parameter to filter chunks based on contextual rules. The rules parameter is a list of dictionaries. A dictionary should contain the following keys:
entity: The entity field to filter.
scopeWindow: The scope window around the entity, defined as a list of two integers [before, after], specifying how many chunks before and after should be considered.
whiteListEntities: The white list of entities. If an entity from this list appears within the scope window, the chunk will be kept.
blackListEntities: The black list of entities. If an entity from this list appears within the scope window, the chunk will be filtered out.
scopeWindowLevel: The level to apply the scope window. Options: token, chunk.
blackListWords: The black list of words. If a word from this list appears within the scope window, the chunk will be filtered out.
whiteListWords: The white list of words. If a word from this list appears within the scope window, the chunk will be kept.
confidenceThreshold: The confidence threshold to filter the chunks. Filtering is only applied if the confidence of the chunk is below the threshold.
possibleRegexContext: The possible regex context to filter the chunks. If the regex is found in the context (chunk), the chunk is kept.
impossibleRegexContext: The impossible regex context to filter the chunks. If the regex is found in the context (chunk), the chunk is removed.
Notes:#
entity, scopeWindow, and scopeWindowLevel are required. Other keys are optional.
scopeWindowLevel defines whether the window is applied at the token or chunk level.
scopeWindow is defined as [before, after], representing how many chunks or tokens to check around the target entity; whether the window is counted in tokens or chunks is determined by scopeWindowLevel.
whiteListEntities and blackListEntities are optional; if neither is set, no filtering occurs.
One element of whiteListEntities is enough to keep the chunk.
All elements of the blackList must be absent to keep the chunk.
For possibleRegexContext and impossibleRegexContext, when writing regular expressions in code, make sure special characters are double-escaped (e.g. "\\b" instead of "\b") so they survive string parsing.
Example:#
>>> contextual_entity_filterer = ContextualEntityFilterer() \
...     .setInputCols("sentences", "tokens", "chunks") \
...     .setRules([{
...         "entity": "LOCATION",
...         "scopeWindow": [2, 2],
...         "whiteListEntities": ["AGE", "DATE"],
...         "blackListEntities": ["ID", "NAME"],
...         "scopeWindowLevel": "token",
...         "blackListWords": ["known", "in"],
...         "impossibleRegexContext": "\\b(1[2-9]\\d|[2-9]\\d{2,}|\\d{4,})\\b"
...     },
...     {
...         "entity": "DATE",
...         "scopeWindow": [2, 2],
...         "whiteListEntities": ["AGE", "DATE"],
...         "blackListEntities": ["ID", "NAME"],
...         "scopeWindowLevel": "chunk",
...         "confidenceThreshold": 0.5
...     }])
- Parameters:
value (list[dict]) – The rules to filter the chunks based on contextual rules.
- setRulesAsStr(value: str)#
Sets the rules parameter, provided as a string, to filter chunks based on contextual rules.
The rules represent a list of dictionaries. A dictionary should contain the following keys:
entity: The entity field to filter.
scopeWindow: The scope window around the entity, defined as a list of two integers [before, after], specifying how many chunks before and after should be considered.
whiteListEntities: The white list of entities. If one of the entities from this list appears within the scope window, the chunk is kept; a single match is enough.
blackListEntities: The black list of entities. If an entity from this list appears within the scope window, the chunk will be filtered out. All elements must be absent to keep the chunk.
scopeWindowLevel: The level to apply the scope window. Options: token, chunk.
blackListWords: The black list of words. If a word from this list appears within the scope window, the chunk will be filtered out.
whiteListWords: The white list of words. If a word from this list appears within the scope window, the chunk will be kept.
confidenceThreshold: The confidence threshold to filter the chunks. Filtering is only applied if the confidence of the chunk is below the threshold.
possibleRegexContext: The possible regex context to filter the chunks. If the regex is found in the context (chunk), the chunk is kept.
impossibleRegexContext: The impossible regex context to filter the chunks. If the regex is found in the context (chunk), the chunk is removed.
Notes:#
entity, scopeWindow, and scopeWindowLevel are required. Other keys are optional.
scopeWindowLevel defines whether the window is applied at the token or chunk level.
scopeWindow is defined as [before, after], representing how many chunks or tokens to check around the target entity; whether the window is counted in tokens or chunks is determined by scopeWindowLevel.
whiteListEntities and blackListEntities are optional; if neither is set, no filtering occurs.
One element of whiteListEntities is enough to keep the chunk.
All elements of the blackListEntities must be absent to keep the chunk.
One element of whiteListWords is enough to keep the chunk.
All elements of the blackListWords must be absent to keep the chunk.
Filtering is only applied if the confidence of the chunk is below the threshold.
For possibleRegexContext and impossibleRegexContext, when writing regular expressions in code, make sure special characters are double-escaped (e.g. "\\b" instead of "\b") so they survive string parsing.
>>> contextual_entity_filterer = ContextualEntityFilterer() \
...     .setRules([{
...         "entity": "LOCATION",
...         "scopeWindow": [2, 2],
...         "whiteListEntities": ["AGE", "DATE"],
...         "blackListEntities": ["ID", "NAME"],
...         "scopeWindowLevel": "token",
...         "blackListWords": ["known", "in"],
...         "impossibleRegexContext": "\\b(1[2-9]\\d|[2-9]\\d{2,}|\\d{4,})\\b"
...     },
...     {
...         "entity": "DATE",
...         "scopeWindow": [2, 2],
...         "whiteListEntities": ["AGE", "DATE"],
...         "blackListEntities": ["ID", "NAME"],
...         "scopeWindowLevel": "chunk",
...         "confidenceThreshold": 0.5
...     }])
- Parameters:
value (str) – The rules to filter the chunks based on contextual rules, provided as a string.
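When rules are maintained as Python dictionaries, they can be serialized before being passed in. A minimal sketch, assuming setRulesAsStr accepts the rule list encoded as a JSON string (an assumption; this page does not state the expected string format):
>>> import json
>>> rules = [{
...     "entity": "DATE",
...     "scopeWindow": [2, 2],
...     "whiteListEntities": ["AGE", "DATE"],
...     "scopeWindowLevel": "chunk",
...     "confidenceThreshold": 0.5
... }]
>>> # Assumption: the string form of the rules is their JSON encoding.
>>> contextual_entity_filterer = ContextualEntityFilterer() \
...     .setRulesAsStr(json.dumps(rules))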
- transform(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = None) pyspark.sql.dataframe.DataFrame #
Transforms the input dataset with optional parameters.
New in version 1.3.0.
- Parameters:
dataset (pyspark.sql.DataFrame) – input dataset
params (dict, optional) – an optional param map that overrides embedded params.
- Returns:
transformed dataset
- Return type:
pyspark.sql.DataFrame
- write() JavaMLWriter #
Returns an MLWriter instance for this ML instance.