sparknlp_jsl.annotator.matcher.text_matcher_internal#
Contains classes for the TextMatcherInternal.
Module Contents#
Classes#
TextMatcherInternal
Annotator to match exact phrases (by token) provided in a file against a Document.
TextMatcherInternalModel
Instantiated model of the TextMatcherInternal.
- class TextMatcherInternal#
Bases: sparknlp_jsl.common.AnnotatorApproachInternal, sparknlp_jsl.annotator.matcher.text_matcher_params.TextMatcherParams
Annotator to match exact phrases (by token) provided in a file against a Document.
A text file of predefined phrases must be provided with setEntities().
Input Annotation types: DOCUMENT, TOKEN
Output Annotation type: CHUNK
- Parameters:
entities – ExternalResource for entities
caseSensitive – Whether to match regardless of case, by default True
mergeOverlapping – Whether to merge overlapping matched chunks, by default False
entityValue – Value for the entity metadata field
buildFromTokens – Whether the TextMatcherInternal should take the CHUNK from TOKEN
dictionary – External dictionary for the lemmatizer.
enableLemmatizer – Whether to enable lemmatizer, by default False.
enableStemmer – Whether to enable stemmer, by default False.
stopWords – List of stop words to be removed, by default None.
cleanStopWords – Whether to clean stop words, by default False.
shuffleEntitySubTokens – Whether to generate and use variations (permutations) of the entity phrases, by default False.
Examples
In this example, the entities file is of the form:
...
dolore magna aliqua, entity_name_1
lorem ipsum dolor. sit, entity_name_1
laborum, entity_name_1
...
where each line represents an entity phrase to be extracted.
>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> data = spark.createDataFrame([["Hello dolore magna aliqua. Lorem ipsum dolor. sit in laborum"]]).toDF("text")
>>> entityExtractor = TextMatcherInternal() \
...     .setInputCols(["document", "token"]) \
...     .setEntities("src/test/resources/entity-extractor/test-phrases.txt", ReadAs.TEXT) \
...     .setOutputCol("entity") \
...     .setCaseSensitive(False)
>>> pipeline = Pipeline().setStages([documentAssembler, tokenizer, entityExtractor])
>>> results = pipeline.fit(data).transform(data)
>>> results.selectExpr("explode(entity) as result").show(truncate=False)
+------------------------------------------------------------------------------------------+
|result                                                                                    |
+------------------------------------------------------------------------------------------+
|[chunk, 6, 24, dolore magna aliqua, [entity -> entity, sentence -> 0, chunk -> 0], []]    |
|[chunk, 27, 48, Lorem ipsum dolor. sit, [entity -> entity, sentence -> 0, chunk -> 1], []]|
|[chunk, 53, 59, laborum, [entity -> entity, sentence -> 0, chunk -> 2], []]               |
+------------------------------------------------------------------------------------------+
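The matcher can also normalize both the entity phrases and the input text before matching. The sketch below (not part of the original example) shows the main normalization switches; the entities file path entities.csv and the stop-word list are placeholders chosen only for illustration.
>>> entityExtractor = TextMatcherInternal() \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("matched_entity") \
...     .setEntities("entities.csv") \
...     .setCaseSensitive(False) \
...     .setEnableStemmer(True) \
...     .setCleanStopWords(True) \
...     .setStopWords(["the", "of"]) \
...     .setReturnChunks("original") \
...     .setMergeOverlapping(True)
>>> pipeline = Pipeline().setStages([documentAssembler, tokenizer, entityExtractor])
>>> pipeline.fit(data).transform(data).select("matched_entity.result").show(truncate=False)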
- buildFromTokens#
- caseSensitive#
- cleanKeywords#
- cleanStopWords#
- delimiter#
- enableLemmatizer#
- enableStemmer#
- entities#
- entityValue#
- excludePunctuation#
- excludeRegexPatterns#
- getter_attrs = []#
- inputAnnotatorTypes#
- inputCols#
- lazyAnnotator#
- lemmatizerDictionary#
- mergeOverlapping#
- optionalInputAnnotatorTypes = []#
- outputAnnotatorType = 'chunk'#
- outputCol#
- returnChunks#
- safeKeywords#
- shuffleEntitySubTokens#
- skipLPInputColsValidation = True#
- skipMatcherAugmentation#
- skipSourceTextAugmentation#
- stopWords#
- uid = ''#
- clear(param: pyspark.ml.param.Param) None#
Clears a param from the param map if it has been explicitly set.
- copy(extra: pyspark.ml._typing.ParamMap | None = None) JP#
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
- Parameters:
extra (dict, optional) – Extra parameters to copy to the new instance
- Returns:
Copy of this instance
- Return type:
JavaParams
- explainParam(param: str | Param) str#
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams() str#
Returns the documentation of all params with their optionally default values and user-supplied values.
- extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap#
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
- Parameters:
extra (dict, optional) – extra param values
- Returns:
merged param map
- Return type:
dict
- fit(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = ...) M#
- fit(dataset: pyspark.sql.dataframe.DataFrame, params: List[pyspark.ml._typing.ParamMap] | Tuple[pyspark.ml._typing.ParamMap]) List[M]
Fits a model to the input dataset with optional parameters.
New in version 1.3.0.
- Parameters:
dataset (pyspark.sql.DataFrame) – input dataset.
params (dict or list or tuple, optional) – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.
- Returns:
fitted model(s)
- Return type:
Transformer or a list of Transformer
- fitMultiple(dataset: pyspark.sql.dataframe.DataFrame, paramMaps: Sequence[pyspark.ml._typing.ParamMap]) Iterator[Tuple[int, M]]#
Fits a model to the input dataset for each param map in paramMaps.
New in version 2.3.0.
- Parameters:
dataset (pyspark.sql.DataFrame) – input dataset.
paramMaps (collections.abc.Sequence) – A Sequence of param maps.
- Returns:
A thread safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.
- Return type:
_FitMultipleIterator
- getCleanKeywords()#
Gets the additional keywords to be removed alongside default stopwords.
- getExcludeRegexPatterns()#
Gets the regex patterns used to drop matched chunks.
- getInputCols()#
Gets current column names of input annotations.
- getLazyAnnotator()#
Gets whether Annotator should be evaluated lazily in a RecursivePipeline.
- getOrDefault(param: str) Any#
- getOrDefault(param: Param[T]) T
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getOutputCol()#
Gets output column name of annotations.
- getParam(paramName: str) Param#
Gets a param by its name.
- getParamValue(paramName)#
Gets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- getReturnChunks()#
Gets whether to return the original text chunks from input or the matched (e.g., stemmed/lemmatized) phrases.
- getSafeKeywords()#
Gets the keywords to preserve during stopword removal when cleanStopWords is enabled.
- getStopWords()#
Gets the stop words to be removed.
- hasDefault(param: str | Param[Any]) bool#
Checks whether a param has a default value.
- hasParam(paramName: str) bool#
Tests whether this instance contains a param with a given (string) name.
- inputColsValidation(value)#
- isDefined(param: str | Param[Any]) bool#
Checks whether a param is explicitly set by user or has a default value.
- isSet(param: str | Param[Any]) bool#
Checks whether a param is explicitly set by user.
- classmethod load(path: str) RL#
Reads an ML instance from the input path, a shortcut of read().load(path).
- classmethod read()#
Returns an MLReader instance for this class.
- save(path: str) None#
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
- set(param: Param, value: Any) None#
Sets a parameter in the embedded param map.
- setBuildFromTokens(b)#
Sets whether the TextMatcherInternal should take the CHUNK from TOKEN.
- Parameters:
b (bool) – Whether the TextMatcherInternal should take the CHUNK from TOKEN
- setCaseSensitive(b)#
Sets whether to match regardless of case, by default True.
- Parameters:
b (bool) – Whether to match regardless of case
- setCleanKeywords(b)#
Sets the additional keywords to be removed alongside default stopwords. Defaults to empty.
- Parameters:
b (list) – List of additional keywords to be removed
- setCleanStopWords(b)#
Sets whether to clean stop words, by default False.
- Parameters:
b (bool) – Whether to clean stop words
- setDelimiter(b)#
Sets the value for the delimiter between the phrase and the entity in the entities file.
- Parameters:
b (str) – Delimiter between the phrase and the entity
- setEnableLemmatizer(b)#
Sets whether to enable lemmatizer, by default False.
- Parameters:
b (bool) – Whether to enable lemmatizer
- setEnableStemmer(b)#
Sets whether to enable stemmer, by default False.
- Parameters:
b (bool) – Whether to enable stemmer
- setEntities(path, read_as=ReadAs.TEXT, options={'format': 'text'})#
Sets the external resource for the entities.
- Parameters:
path (str) – Path to the external resource
read_as (str, optional) – How to read the resource, by default ReadAs.TEXT
options (dict, optional) – Options for reading the resource, by default {“format”: “text”}
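For instance, assuming a local file entities.csv whose lines follow the phrase, label layout shown earlier (both the file name and the comma delimiter are illustrative):
>>> from sparknlp.common import ReadAs
>>> matcher = TextMatcherInternal() \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("entity") \
...     .setEntities("entities.csv", read_as=ReadAs.TEXT, options={"format": "text"}) \
...     .setDelimiter(",")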
- setEntityValue(b)#
Sets value for the entity metadata field.
- Parameters:
b (str) – Value for the entity metadata field
- setExcludePunctuation(b)#
Sets whether to exclude punctuation, by default True.
- Parameters:
b (bool) – Whether to exclude punctuation
- setExcludeRegexPatterns(b)#
Sets the regex patterns used to drop matched chunks. Defaults to empty.
- Parameters:
b (list) – List of regex patterns
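For example, to drop matched chunks that are purely numeric (the pattern below is illustrative, not a recommended default):
>>> matcher = matcher.setExcludeRegexPatterns(["\\b\\d+\\b"])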
- setForceInputTypeValidation(etfm)#
- setInputCols(*value)#
Sets column names of input annotations.
- Parameters:
*value (List[str]) – Input columns for the annotator
- setLazyAnnotator(value)#
Sets whether Annotator should be evaluated lazily in a RecursivePipeline.
- Parameters:
value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline
- setLemmatizerDictionary(path, key_delimiter, value_delimiter, read_as=ReadAs.TEXT, options={'format': 'text'})#
Sets the external dictionary for the lemmatizer.
- Parameters:
path (str) – Path to the source files
key_delimiter (str) – Delimiter for the key
value_delimiter (str) – Delimiter for the values
read_as (str, optional) – How to read the file, by default ReadAs.TEXT
options (dict, optional) – Options to read the resource, by default {“format”: “text”}
Examples
Here, each key in the file is delimited by "->" and its values are delimited by \t:
...
pick -> pick picks picking picked
peck -> peck pecking pecked pecks
pickle -> pickle pickles pickled pickling
pepper -> pepper peppers peppered peppering
...
- setMergeOverlapping(b)#
Sets whether to merge overlapping matched chunks, by default False.
- Parameters:
b (bool) – Whether to merge overlapping matched chunks
- setOutputCol(value)#
Sets output column name of annotations.
- Parameters:
value (str) – Name of output column
- setParamValue(paramName)#
Sets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- setReturnChunks(b)#
Sets whether to return the original text chunks from input or the matched (e.g., stemmed/lemmatized) phrases. Can be ‘original’ or ‘matched’. Defaults to ‘original’.
- Parameters:
b (str) – ‘original’ or ‘matched’
- setSafeKeywords(b)#
Sets the keywords to preserve during stopword removal when cleanStopWords is enabled. This will filter out the safe keywords from the stopwords list.
- Parameters:
b (list) – List of safe keywords
- setShuffleEntitySubTokens(b)#
Sets whether to generate and use variations (permutations) of the entity phrases, by default False.
- Parameters:
b (bool) – Whether to generate and use variations (permutations) of the entity phrases
- setSkipMatcherAugmentation(b)#
Sets whether to skip matcher augmentation, by default False.
- Parameters:
b (bool) – Whether to skip matcher augmentation
- setSkipSourceTextAugmentation(b)#
Sets whether to skip source text augmentation, by default False.
- Parameters:
b (bool) – Whether to skip source text augmentation
- setStopWords(b)#
Sets the stop words to be removed.
- Parameters:
b (list) – List of stop words to be removed
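These stop-word parameters work together: setCleanStopWords turns the removal on, setStopWords and setCleanKeywords supply the words to drop, and setSafeKeywords protects terms that must survive the cleanup. A hedged sketch, with purely illustrative word lists and file name:
>>> matcher = TextMatcherInternal() \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("entity") \
...     .setEntities("entities.csv") \
...     .setCleanStopWords(True) \
...     .setCleanKeywords(["approximately", "suspected"]) \
...     .setSafeKeywords(["no", "not"])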
- write() JavaMLWriter#
Returns an MLWriter instance for this ML instance.
- class TextMatcherInternalModel(classname='com.johnsnowlabs.nlp.annotators.matcher.TextMatcherInternalModel', java_model=None)#
Bases: sparknlp_jsl.common.AnnotatorModelInternal, sparknlp_jsl.annotator.matcher.text_matcher_params.TextMatcherParams
Instantiated model of the TextMatcherInternal.
This is the instantiated model of the TextMatcherInternal. For training your own model, please see the documentation of that class.
Input Annotation types: DOCUMENT, TOKEN
Output Annotation type: CHUNK
- Parameters:
mergeOverlapping – Whether to merge overlapping matched chunks, by default False
entityValue – Value for the entity metadata field
buildFromTokens – Whether the TextMatcherInternal should take the CHUNK from TOKEN
enableLemmatizer – Whether to enable lemmatizer, by default False.
enableStemmer – Whether to enable stemmer, by default False.
stopWords – List of stop words to be removed, by default None.
cleanStopWords – Whether to clean stop words, by default False.
returnChunks – Whether to return original chunks or matched chunks. Defaults to original chunks.
safeKeywords – Keywords to preserve during stopword removal when cleanStopWords is enabled. Defaults to empty.
excludePunctuation – If True, punctuation will be removed from the text, by default True.
cleanKeywords – Additional keywords to be removed alongside default stopwords. Defaults to empty.
excludeRegexPatterns – Regex patterns used to drop matched chunks. Defaults to empty.
- buildFromTokens#
- caseSensitive#
- cleanKeywords#
- cleanStopWords#
- delimiter#
- enableLemmatizer#
- enableStemmer#
- entityValue#
- excludePunctuation#
- excludeRegexPatterns#
- getter_attrs = []#
- inputAnnotatorTypes#
- inputCols#
- lazyAnnotator#
- mergeOverlapping#
- name = 'TextMatcherInternalModel'#
- optionalInputAnnotatorTypes = []#
- outputAnnotatorType = 'chunk'#
- outputCol#
- returnChunks#
- safeKeywords#
- searchTrie#
- searchTrieInternal#
- skipLPInputColsValidation = True#
- skipMatcherAugmentation#
- skipSourceTextAugmentation#
- stopWords#
- uid = ''#
- clear(param: pyspark.ml.param.Param) None#
Clears a param from the param map if it has been explicitly set.
- copy(extra: pyspark.ml._typing.ParamMap | None = None) JP#
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
- Parameters:
extra (dict, optional) – Extra parameters to copy to the new instance
- Returns:
Copy of this instance
- Return type:
JavaParams
- explainParam(param: str | Param) str#
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams() str#
Returns the documentation of all params with their optionally default values and user-supplied values.
- extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap#
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
- Parameters:
extra (dict, optional) – extra param values
- Returns:
merged param map
- Return type:
dict
- getCaseSensitive()#
Gets whether the model matches regardless of case.
- getCleanKeywords()#
Gets the additional keywords to be removed alongside default stopwords.
- getDelimiter()#
Gets value for the delimiter between Phrase, Entity.
- getExcludeRegexPatterns()#
Gets the regex patterns used to drop matched chunks.
- getInputCols()#
Gets current column names of input annotations.
- getLazyAnnotator()#
Gets whether Annotator should be evaluated lazily in a RecursivePipeline.
- getOrDefault(param: str) Any#
- getOrDefault(param: Param[T]) T
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getOutputCol()#
Gets output column name of annotations.
- getParam(paramName: str) Param#
Gets a param by its name.
- getParamValue(paramName)#
Gets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- getReturnChunks()#
Gets whether to return the original text chunks from input or the matched (e.g., stemmed/lemmatized) phrases.
- getSafeKeywords()#
Gets the keywords to preserve during stopword removal when cleanStopWords is enabled.
- getStopWords()#
Gets the stop words to be removed.
- hasDefault(param: str | Param[Any]) bool#
Checks whether a param has a default value.
- hasParam(paramName: str) bool#
Tests whether this instance contains a param with a given (string) name.
- inputColsValidation(value)#
- isDefined(param: str | Param[Any]) bool#
Checks whether a param is explicitly set by user or has a default value.
- isSet(param: str | Param[Any]) bool#
Checks whether a param is explicitly set by user.
- classmethod load(path: str) RL#
Reads an ML instance from the input path, a shortcut of read().load(path).
- static pretrained(name, lang='en', remote_loc=None)#
Downloads and loads a pretrained model.
- Parameters:
name (str, optional) – Name of the pretrained model
lang (str, optional) – Language of the pretrained model, by default “en”
remote_loc (str, optional) – Optional remote address of the resource, by default None. Will use Spark NLP's repositories otherwise.
- Returns:
The restored model
- Return type:
TextMatcherInternalModel
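For example, a pretrained matcher could be loaded and plugged into a pipeline as sketched below; the model name "drug_matcher" and the "clinical/models" location are assumptions used only for illustration, not guaranteed identifiers.
>>> matcherModel = TextMatcherInternalModel.pretrained("drug_matcher", "en", "clinical/models") \
...     .setInputCols(["document", "token"]) \
...     .setOutputCol("entity")
>>> pipeline = Pipeline().setStages([documentAssembler, tokenizer, matcherModel])
>>> pipeline.fit(data).transform(data).select("entity.result").show(truncate=False)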
- classmethod read()#
Returns an MLReader instance for this class.
- save(path: str) None#
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
- set(param: Param, value: Any) None#
Sets a parameter in the embedded param map.
- setBuildFromTokens(b)#
Sets whether the TextMatcherInternal should take the CHUNK from TOKEN.
- Parameters:
b (bool) – Whether the TextMatcherInternal should take the CHUNK from TOKEN
- setCleanKeywords(b)#
Sets the additional keywords to be removed alongside default stopwords. Defaults to empty.
- Parameters:
b (list) – List of additional keywords to be removed
- setCleanStopWords(b)#
Sets whether to clean stop words, by default False.
- Parameters:
b (bool) – Whether to clean stop words
- setDelimiter(b)#
Sets the value for the delimiter between the phrase and the entity in the entities file.
- Parameters:
b (str) – Delimiter between the phrase and the entity
- setEnableLemmatizer(b)#
Sets whether to enable lemmatizer, by default False.
- Parameters:
b (bool) – Whether to enable lemmatizer
- setEnableStemmer(b)#
Sets whether to enable stemmer, by default False.
- Parameters:
b (bool) – Whether to enable stemmer
- setEntityValue(b)#
Sets value for the entity metadata field.
- Parameters:
b (str) – Value for the entity metadata field
- setExcludePunctuation(b)#
Sets whether to exclude punctuation, by default True.
- Parameters:
b (bool) – Whether to exclude punctuation
- setExcludeRegexPatterns(b)#
Sets the regex patterns used to drop matched chunks. Defaults to empty.
- Parameters:
b (list) – List of regex patterns
- setForceInputTypeValidation(etfm)#
- setInputCols(*value)#
Sets column names of input annotations.
- Parameters:
*value (List[str]) – Input columns for the annotator
- setLazyAnnotator(value)#
Sets whether Annotator should be evaluated lazily in a RecursivePipeline.
- Parameters:
value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline
- setMergeOverlapping(b)#
Sets whether to merge overlapping matched chunks, by default False.
- Parameters:
b (bool) – Whether to merge overlapping matched chunks
- setOutputCol(value)#
Sets output column name of annotations.
- Parameters:
value (str) – Name of output column
- setParamValue(paramName)#
Sets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- setParams()#
- setReturnChunks(b)#
Sets whether to return the original text chunks from input or the matched (e.g., stemmed/lemmatized) phrases. Can be ‘original’ or ‘matched’. Defaults to ‘original’.
- Parameters:
b (str) – ‘original’ or ‘matched’
- setSafeKeywords(b)#
Sets the keywords to preserve during stopword removal when cleanStopWords is enabled. This will filter out the safe keywords from the stopwords list.
- Parameters:
b (list) – List of safe keywords
- setSkipMatcherAugmentation(b)#
Sets whether to skip matcher augmentation, by default False.
- Parameters:
b (bool) – Whether to skip matcher augmentation
- setSkipSourceTextAugmentation(b)#
Sets whether to skip source text augmentation, by default False.
- Parameters:
b (bool) – Whether to skip source text augmentation
- setStopWords(b)#
Sets the stop words to be removed.
- Parameters:
b (list) – List of stop words to be removed
- transform(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = None) pyspark.sql.dataframe.DataFrame#
Transforms the input dataset with optional parameters.
New in version 1.3.0.
- Parameters:
dataset (pyspark.sql.DataFrame) – input dataset.
params (dict, optional) – an optional param map that overrides embedded params.
- Returns:
transformed dataset
- Return type:
pyspark.sql.DataFrame
- write() JavaMLWriter#
Returns an MLWriter instance for this ML instance.