com.johnsnowlabs.nlp.annotators.matcher
TextMatcherInternal
Companion object TextMatcherInternal
class TextMatcherInternal extends AnnotatorApproach[TextMatcherInternalModel] with TextMatcherInternalParams with ParamsAndFeaturesWritable
Annotator to match exact phrases (by token) provided in a file against a Document.
A text file of predefined phrases must be provided with setEntities. The text file can also be set directly as an ExternalResource.
For extended examples of usage, see the example below.
Example
In this example, the entities file is of the form
...
dolore magna aliqua
lorem ipsum dolor. sit
laborum
...
where each line represents an entity phrase to be extracted.
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.Tokenizer
import com.johnsnowlabs.nlp.annotator.TextMatcherInternal
import com.johnsnowlabs.nlp.util.io.ReadAs
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val data = Seq("Hello dolore magna aliqua. Lorem ipsum dolor. sit in laborum").toDF("text")

val entityExtractor = new TextMatcherInternal()
  .setInputCols("document", "token")
  .setEntities("src/test/resources/entity-extractor/test-phrases.txt", ReadAs.TEXT)
  .setOutputCol("entity")
  .setCaseSensitive(false)
  .setTokenizer(tokenizer.fit(data))

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, entityExtractor))
val results = pipeline.fit(data).transform(data)

results.selectExpr("explode(entity) as result").show(false)
+------------------------------------------------------------------------------------------+
|result                                                                                    |
+------------------------------------------------------------------------------------------+
|[chunk, 6, 24, dolore magna aliqua, [entity -> entity, sentence -> 0, chunk -> 0], []]    |
|[chunk, 27, 48, Lorem ipsum dolor. sit, [entity -> entity, sentence -> 0, chunk -> 1], []]|
|[chunk, 53, 59, laborum, [entity -> entity, sentence -> 0, chunk -> 2], []]               |
+------------------------------------------------------------------------------------------+
- By Inheritance
- TextMatcherInternal
- ParamsAndFeaturesWritable
- TextMatcherInternalParams
- HasFeatures
- AnnotatorApproach
- CanBeLazy
- DefaultParamsWritable
- MLWritable
- HasOutputAnnotatorType
- HasOutputAnnotationCol
- HasInputAnnotationCols
- Estimator
- PipelineStage
- Logging
- Params
- Serializable
- Serializable
- Identifiable
- AnyRef
- Any
Instance Constructors
Type Members
-
type
AnnotatorType = String
- Definition Classes
- HasOutputAnnotatorType
Value Members
-
final
def
!=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
final
def
##(): Int
- Definition Classes
- AnyRef → Any
-
final
def
$[T](param: Param[T]): T
- Attributes
- protected
- Definition Classes
- Params
-
def
$$[T](feature: StructFeature[T]): T
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
$$[K, V](feature: MapFeature[K, V]): Map[K, V]
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
$$[T](feature: SetFeature[T]): Set[T]
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
$$[T](feature: ArrayFeature[T]): Array[T]
- Attributes
- protected
- Definition Classes
- HasFeatures
-
final
def
==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
def
_fit(dataset: Dataset[_], recursiveStages: Option[PipelineModel]): TextMatcherInternalModel
- Attributes
- protected
- Definition Classes
- AnnotatorApproach
-
final
def
asInstanceOf[T0]: T0
- Definition Classes
- Any
-
def
beforeTraining(spark: SparkSession): Unit
- Definition Classes
- AnnotatorApproach
-
val
buildFromTokens: BooleanParam
Whether the TextMatcherInternal should take the CHUNK from TOKEN or not (Default: false)
-
def
cartesianTokenVariants(tokens: Seq[Annotation], lemmaDictionary: Map[String, String]): Seq[Seq[String]]
- Attributes
- protected
- Definition Classes
- TextMatcherInternalParams
-
val
caseSensitive: BooleanParam
Whether to match regardless of case (Default: true)
- Definition Classes
- TextMatcherInternalParams
-
final
def
checkSchema(schema: StructType, inputAnnotatorType: String): Boolean
- Attributes
- protected
- Definition Classes
- HasInputAnnotationCols
-
val
cleanKeywords: StringArrayParam
A parameter defining additional keywords to be removed during text processing, in addition to the standard stopwords. These keywords are appended to the default stopwords list and will be excluded from the text when cleanStopWords is enabled.
By default, this parameter is an empty array, meaning no additional keywords are filtered unless specified.
- Definition Classes
- TextMatcherInternalParams
-
val
cleanStopWords: BooleanParam
Parameter indicating whether to clean stop words during text processing. Defaults to true.
- Definition Classes
- TextMatcherInternalParams
-
final
def
clear(param: Param[_]): TextMatcherInternal.this.type
- Definition Classes
- Params
-
def
clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
-
final
def
copy(extra: ParamMap): Estimator[TextMatcherInternalModel]
- Definition Classes
- AnnotatorApproach → Estimator → PipelineStage → Params
-
def
copyValues[T <: Params](to: T, extra: ParamMap): T
- Attributes
- protected
- Definition Classes
- Params
-
final
def
defaultCopy[T <: Params](extra: ParamMap): T
- Attributes
- protected
- Definition Classes
- Params
-
val
delimiter: Param[String]
Value for the delimiter between Phrase, Entity in the entities file (Default: ,)
-
val
description: String
Extracts entities from target dataset given in a text file
- Definition Classes
- TextMatcherInternal → AnnotatorApproach
-
val
enableLemmatizer: BooleanParam
A Boolean parameter that controls whether lemmatization should be applied during text processing.
Lemmatization is the process of reducing words to their base or dictionary form (lemma). When this parameter is set to true:
- The incoming tokens (words from the input text) are lemmatized.
- The predefined entities (the terms you want to match against) are also lemmatized.
This allows for more flexible and accurate matching. For example, words like "running", "ran", or "runs" will all be reduced to "run", and can match consistently even if the exact form in the text differs.
Default value is false, meaning lemmatization is disabled unless explicitly turned on.
- Definition Classes
- TextMatcherInternalParams
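As an illustration, lemmatized matching might be configured as in this sketch; the file paths are hypothetical, and the delimiters follow the lemmatizerDictionary example further down:

```scala
import com.johnsnowlabs.nlp.annotator.TextMatcherInternal
import com.johnsnowlabs.nlp.util.io.ReadAs

// Sketch: with lemmatization enabled, tokens such as "running"/"ran"/"runs"
// and the entity phrases are both reduced to "run" before matching.
// "entities.txt" and "lemmas.txt" are hypothetical paths.
val lemmatizedMatcher = new TextMatcherInternal()
  .setInputCols("document", "token")
  .setOutputCol("entity")
  .setEntities("entities.txt", ReadAs.TEXT)
  .setEnableLemmatizer(true)
  .setLemmatizerDictionary("lemmas.txt", "->", "\t")
```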
-
val
enableStemmer: BooleanParam
A Boolean parameter that controls whether stemming should be applied during text processing.
Stemming reduces words to their root forms (e.g., "running", "runs", and "runner" → "run"). This can help match different word forms more effectively in tasks such as keyword matching and entity recognition.
When this parameter is set to true, stemming is applied in addition to the original form:
- Input tokens are matched both in their original and stemmed forms.
- Target entities can also be matched using their stemmed forms.
This does not replace original matching; it complements it. Matching is performed using both the original and processed (stemmed) versions to improve recall and flexibility.
Default value is false.
- Definition Classes
- TextMatcherInternalParams
-
val
entities: ExternalResourceParam
External resource for the entities, e.g. a text file where each line is the string of an entity
-
val
entityValue: Param[String]
Value for the entity metadata field (Default: "entity")
-
final
def
eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
def
equals(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
val
excludePunctuation: BooleanParam
A parameter indicating whether punctuation marks should be removed during text processing.
When set to true, most punctuation characters will be excluded from the processed text. This is typically used to clean text by removing non-word characters.
Defaults to true, meaning punctuation is removed unless explicitly disabled. Some characters may be preserved if specifically handled by other parameters (e.g., safe keywords).
- Definition Classes
- TextMatcherInternalParams
-
val
excludeRegexPatterns: StringArrayParam
A parameter specifying regular expression patterns used to exclude matching chunks during text processing.
Each string in this array is a regex pattern. If a detected chunk matches any of these patterns, it will be discarded and excluded from the final output.
This is useful for removing unwanted matches based on pattern rules (e.g., specific codes, formats, or noise). By default, this parameter is empty, meaning no chunks are dropped based on regex.
- Definition Classes
- TextMatcherInternalParams
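A sketch of dropping matches by pattern; the regexes below are illustrative assumptions, not part of the API:

```scala
import com.johnsnowlabs.nlp.annotator.TextMatcherInternal

// Sketch: discard any matched chunk that is purely numeric, or that looks
// like an internal code such as "AB-1234". Patterns are illustrative.
val filteredMatcher = new TextMatcherInternal()
  .setInputCols("document", "token")
  .setOutputCol("entity")
  .setExcludeRegexPatterns(Array("^\\d+$", "^[A-Z]{2}-\\d{4}$"))
```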
-
def
explainParam(param: Param[_]): String
- Definition Classes
- Params
-
def
explainParams(): String
- Definition Classes
- Params
-
final
def
extractParamMap(): ParamMap
- Definition Classes
- Params
-
final
def
extractParamMap(extra: ParamMap): ParamMap
- Definition Classes
- Params
-
val
features: ArrayBuffer[Feature[_, _, _]]
- Definition Classes
- HasFeatures
-
def
finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( classOf[java.lang.Throwable] )
-
final
def
fit(dataset: Dataset[_]): TextMatcherInternalModel
- Definition Classes
- AnnotatorApproach → Estimator
-
def
fit(dataset: Dataset[_], paramMaps: Seq[ParamMap]): Seq[TextMatcherInternalModel]
- Definition Classes
- Estimator
- Annotations
- @Since( "2.0.0" )
-
def
fit(dataset: Dataset[_], paramMap: ParamMap): TextMatcherInternalModel
- Definition Classes
- Estimator
- Annotations
- @Since( "2.0.0" )
-
def
fit(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): TextMatcherInternalModel
- Definition Classes
- Estimator
- Annotations
- @Since( "2.0.0" ) @varargs()
-
def
get[T](feature: StructFeature[T]): Option[T]
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
get[K, V](feature: MapFeature[K, V]): Option[Map[K, V]]
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
get[T](feature: SetFeature[T]): Option[Set[T]]
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
get[T](feature: ArrayFeature[T]): Option[Array[T]]
- Attributes
- protected
- Definition Classes
- HasFeatures
-
final
def
get[T](param: Param[T]): Option[T]
- Definition Classes
- Params
-
def
getBuildFromTokens: Boolean
Getter for buildFromTokens param
-
def
getCaseSensitive: Boolean
Whether to match regardless of case (Default: true)
-
final
def
getClass(): Class[_]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
def
getCleanKeywords: Array[String]
Retrieves the list of keywords to be filtered out.
- returns
an array of strings representing the keywords.
- Definition Classes
- TextMatcherInternalParams
-
def
getCleanStopWords: Boolean
Retrieves the current state of the cleanStopWords parameter.
- returns
true if the cleanStopWords option is enabled, false otherwise.
- Definition Classes
- TextMatcherInternalParams
-
final
def
getDefault[T](param: Param[T]): Option[T]
- Definition Classes
- Params
-
def
getDelimiter: String
Get the Phrase, Entity delimiter in the entities file (Default: ,)
-
def
getDictionary: ExternalResource
External dictionary to be used by the lemmatizer
-
def
getEnableLemmatizer: Boolean
Gets the current state of the lemmatizer enablement setting.
- returns
true if the lemmatizer is enabled, false otherwise.
- Definition Classes
- TextMatcherInternalParams
-
def
getEnableStemmer: Boolean
Retrieves the current value of the enableStemmer parameter.
- returns
true if stemming is enabled, false otherwise
- Definition Classes
- TextMatcherInternalParams
-
def
getEntityValue: String
Getter for the entity metadata field value
-
def
getExcludeRegexPattern: Array[String]
Retrieves the list of regex patterns used to exclude specific text matches during processing.
- returns
an array of strings representing the regex patterns to be excluded.
- Definition Classes
- TextMatcherInternalParams
-
def
getInputCols: Array[String]
- Definition Classes
- HasInputAnnotationCols
-
def
getLazyAnnotator: Boolean
- Definition Classes
- CanBeLazy
-
def
getMergeOverlapping: Boolean
Whether to merge overlapping matched chunks (Default: false)
-
final
def
getOrDefault[T](param: Param[T]): T
- Definition Classes
- Params
-
final
def
getOutputCol: String
- Definition Classes
- HasOutputAnnotationCol
-
def
getParam(paramName: String): Param[Any]
- Definition Classes
- Params
-
def
getReturnChunks: String
Retrieves the current value of the returnChunks parameter.
- returns
A string representing the configured value for the returnChunks setting.
- Definition Classes
- TextMatcherInternalParams
-
def
getSafeKeywords: Array[String]
Retrieves the list of safe keywords that are preserved during stopword cleaning.
- returns
an array of strings representing the keywords.
- Definition Classes
- TextMatcherInternalParams
-
def
getShuffleEntitySubTokens: Boolean
Getter for shuffleEntitySubTokens param
-
def
getSkipMatcherAugmentation: Boolean
Gets whether augmentation for matcher patterns is skipped.
- returns
true if augmentation for matcher patterns is skipped, false otherwise.
- Definition Classes
- TextMatcherInternalParams
-
def
getSkipSourceTextAugmentation: Boolean
Gets whether augmentation for source text is skipped.
- returns
true if augmentation for source text is skipped, false otherwise.
- Definition Classes
- TextMatcherInternalParams
-
def
getStopWords: Array[String]
Retrieves the list of stop words used within the text matching process.
- returns
an array of strings representing the stop words.
- Definition Classes
- TextMatcherInternalParams
-
def
getTokenVariants(token: Annotation, lemmaDictionary: Map[String, String]): Seq[String]
- Attributes
- protected
- Definition Classes
- TextMatcherInternalParams
-
def
getTokenizer: TokenizerModel
The Tokenizer to perform tokenization with
-
final
def
hasDefault[T](param: Param[T]): Boolean
- Definition Classes
- Params
-
def
hasParam(paramName: String): Boolean
- Definition Classes
- Params
-
def
hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
def
initializeLogIfNecessary(isInterpreter: Boolean, silent: Boolean): Boolean
- Attributes
- protected
- Definition Classes
- Logging
-
def
initializeLogIfNecessary(isInterpreter: Boolean): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
val
inputAnnotatorTypes: Array[String]
Input annotator types: DOCUMENT, TOKEN
- Definition Classes
- TextMatcherInternal → HasInputAnnotationCols
-
final
val
inputCols: StringArrayParam
- Attributes
- protected
- Definition Classes
- HasInputAnnotationCols
-
final
def
isDefined(param: Param[_]): Boolean
- Definition Classes
- Params
-
final
def
isInstanceOf[T0]: Boolean
- Definition Classes
- Any
-
final
def
isSet(param: Param[_]): Boolean
- Definition Classes
- Params
-
def
isTraceEnabled(): Boolean
- Attributes
- protected
- Definition Classes
- Logging
-
val
lazyAnnotator: BooleanParam
- Definition Classes
- CanBeLazy
-
val
lemmaDict: MapFeature[String, String]
lemmaDict
- Definition Classes
- TextMatcherInternalParams
-
val
lemmatizerDictionary: ExternalResourceParam
External dictionary to be used by the lemmatizer, which needs 'keyDelimiter' and 'valueDelimiter' for parsing the resource
Example
...
pick -> pick picks picking picked
peck -> peck pecking pecked pecks
pickle -> pickle pickles pickled pickling
pepper -> pepper peppers peppered peppering
...
where each key is delimited by -> and values are delimited by \t
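Given a resource of that shape, loading it might look like this sketch (the path is hypothetical):

```scala
import com.johnsnowlabs.nlp.annotator.TextMatcherInternal

// Sketch: keys are separated from values by "->", and the values by "\t",
// matching the example resource above. "lemmas.txt" is a hypothetical path.
val matcherWithLemmas = new TextMatcherInternal()
  .setInputCols("document", "token")
  .setOutputCol("entity")
  .setEnableLemmatizer(true)
  .setLemmatizerDictionary("lemmas.txt", "->", "\t")
```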
-
def
log: Logger
- Attributes
- protected
- Definition Classes
- Logging
-
def
logDebug(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logDebug(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logError(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logError(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logInfo(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logInfo(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logName: String
- Attributes
- protected
- Definition Classes
- Logging
-
def
logTrace(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logTrace(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logWarning(msg: ⇒ String, throwable: Throwable): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
def
logWarning(msg: ⇒ String): Unit
- Attributes
- protected
- Definition Classes
- Logging
-
val
mergeOverlapping: BooleanParam
Whether to merge overlapping matched chunks (Default: false)
-
def
msgHelper(schema: StructType): String
- Attributes
- protected
- Definition Classes
- HasInputAnnotationCols
-
final
def
ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
final
def
notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
final
def
notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
def
onTrained(model: TextMatcherInternalModel, spark: SparkSession): Unit
- Definition Classes
- AnnotatorApproach
-
def
onWrite(path: String, spark: SparkSession): Unit
- Attributes
- protected
- Definition Classes
- ParamsAndFeaturesWritable
-
val
optionalInputAnnotatorTypes: Array[String]
- Definition Classes
- HasInputAnnotationCols
-
val
outputAnnotatorType: AnnotatorType
Output annotator type: CHUNK
- Definition Classes
- TextMatcherInternal → HasOutputAnnotatorType
-
final
val
outputCol: Param[String]
- Attributes
- protected
- Definition Classes
- HasOutputAnnotationCol
-
lazy val
params: Array[Param[_]]
- Definition Classes
- Params
-
val
returnChunks: Param[String]
A string parameter that defines which version of the matched chunks should be returned: "original" or "matched".
- If set to "original" (default): the returned chunks reflect the exact text spans as they appeared in the original input. This ensures that the begin and end character indices accurately map to the source text.
- If set to "matched": the returned chunks are based on the processed form that triggered the match, such as a stemmed or lemmatized version of the phrase. This can be useful to see which normalized entity was matched, but the character indices (begin, end) may not align correctly with the original input text.
Use "original" if accurate text positioning is important (e.g., for highlighting), and "matched" if you want to inspect the normalized form used for the match.
- Definition Classes
- TextMatcherInternalParams
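A sketch contrasting the two settings, assuming stemming is enabled; the phrase used in the comments is illustrative:

```scala
import com.johnsnowlabs.nlp.annotator.TextMatcherInternal

// Sketch: with stemming on, "running shoes" may be matched via "run shoe".
//   "original" -> chunk text is "running shoes"; begin/end map to the input
//   "matched"  -> chunk text is the normalized "run shoe"; offsets may drift
val positionAccurateMatcher = new TextMatcherInternal()
  .setInputCols("document", "token")
  .setOutputCol("entity")
  .setEnableStemmer(true)
  .setReturnChunks("original") // keep indices aligned, e.g. for highlighting
```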
-
val
safeKeywords: StringArrayParam
A parameter representing an array of keywords that should be preserved during text cleaning, when stopword removal (cleanStopWords) is enabled.
When cleanStopWords is set to true, common stopwords are typically removed from the text. However, keywords specified in safeKeywords will be exempt from removal and retained in the processed text.
By default, this parameter is an empty array, meaning no exceptions are made unless explicitly provided.
- Definition Classes
- TextMatcherInternalParams
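For example, stopword cleaning, safe keywords, and extra clean keywords might be combined as in this sketch (the word lists are illustrative):

```scala
import com.johnsnowlabs.nlp.annotator.TextMatcherInternal

// Sketch: remove stopwords, but keep "in" even though it is a stopword,
// and additionally drop the domain word "lorem" during processing.
val cleanedMatcher = new TextMatcherInternal()
  .setInputCols("document", "token")
  .setOutputCol("entity")
  .setCleanStopWords(true)
  .setSafeKeywords(Array("in"))
  .setCleanKeywords(Array("lorem"))
```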
-
lazy val
safeLemmaDict: Map[String, String]
- Definition Classes
- TextMatcherInternalParams
-
def
save(path: String): Unit
- Definition Classes
- MLWritable
- Annotations
- @Since( "1.6.0" ) @throws( ... )
-
def
set[T](feature: StructFeature[T], value: T): TextMatcherInternal.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
set[K, V](feature: MapFeature[K, V], value: Map[K, V]): TextMatcherInternal.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
set[T](feature: SetFeature[T], value: Set[T]): TextMatcherInternal.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
set[T](feature: ArrayFeature[T], value: Array[T]): TextMatcherInternal.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
-
final
def
set(paramPair: ParamPair[_]): TextMatcherInternal.this.type
- Attributes
- protected
- Definition Classes
- Params
-
final
def
set(param: String, value: Any): TextMatcherInternal.this.type
- Attributes
- protected
- Definition Classes
- Params
-
final
def
set[T](param: Param[T], value: T): TextMatcherInternal.this.type
- Definition Classes
- Params
-
def
setBuildFromTokens(v: Boolean): TextMatcherInternal.this.type
Setter for buildFromTokens param
-
def
setCaseSensitive(v: Boolean): TextMatcherInternal.this.type
Whether to match regardless of case (Default: true)
-
def
setCleanKeywords(value: ArrayList[String]): TextMatcherInternal.this.type
- Definition Classes
- TextMatcherInternalParams
-
def
setCleanKeywords(values: Array[String]): TextMatcherInternal.this.type
Sets the list of keywords to be cleaned during text processing.
- returns
This instance with the updated configuration for cleaning keywords.
- Definition Classes
- TextMatcherInternalParams
-
def
setCleanStopWords(v: Boolean): TextMatcherInternal.this.type
Sets whether to clean stop words during text processing.
- v
Boolean value indicating whether to enable (true) or disable (false) the cleaning of stop words.
- returns
This instance with the updated configuration for cleaning stop words.
- Definition Classes
- TextMatcherInternalParams
-
def
setDefault[T](feature: StructFeature[T], value: () ⇒ T): TextMatcherInternal.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
setDefault[K, V](feature: MapFeature[K, V], value: () ⇒ Map[K, V]): TextMatcherInternal.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
setDefault[T](feature: SetFeature[T], value: () ⇒ Set[T]): TextMatcherInternal.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
-
def
setDefault[T](feature: ArrayFeature[T], value: () ⇒ Array[T]): TextMatcherInternal.this.type
- Attributes
- protected
- Definition Classes
- HasFeatures
-
final
def
setDefault(paramPairs: ParamPair[_]*): TextMatcherInternal.this.type
- Attributes
- protected
- Definition Classes
- Params
-
final
def
setDefault[T](param: Param[T], value: T): TextMatcherInternal.this.type
- Attributes
- protected[org.apache.spark.ml]
- Definition Classes
- Params
-
def
setDelimiter(v: String): TextMatcherInternal.this.type
Set the Phrase, Entity delimiter in the entities file (Default: ,)
-
def
setEnableLemmatizer(value: Boolean): TextMatcherInternal.this.type
Enables or disables the lemmatizer for text matching.
- value
If true, the lemmatizer will be enabled; if false, it will be disabled.
- returns
This TextMatcherInternal instance with the updated lemmatizer setting.
- Definition Classes
- TextMatcherInternalParams
-
def
setEnableStemmer(value: Boolean): TextMatcherInternal.this.type
Enables or disables the use of a stemmer for text processing.
- value
Boolean value indicating whether to enable (true) or disable (false) the stemmer.
- returns
Instance of this class with updated configuration.
- Definition Classes
- TextMatcherInternalParams
-
def
setEntities(path: String, readAs: Format, options: Map[String, String] = Map("format" -> "text")): TextMatcherInternal.this.type
Provides a file with phrases to match. Default: Looks up path in configuration.
- path
a path to a file that contains the entities in the specified format.
- readAs
the format of the file, can be one of {ReadAs.TEXT, ReadAs.SPARK}. Defaults to ReadAs.TEXT.
- options
a map of additional parameters. Defaults to Map("format" -> "text").
- returns
this
-
def
setEntities(value: ExternalResource): TextMatcherInternal.this.type
Provides a file with phrases to match (Default: Looks up path in configuration)
-
def
setEntityValue(v: String): TextMatcherInternal.this.type
Setter for the entity metadata field value
-
def
setExcludePunctuation(v: Boolean): TextMatcherInternal.this.type
Sets the value of the excludePunctuation parameter used for text processing.
- v
A boolean value indicating whether to exclude punctuation.
- returns
This instance with the updated excludePunctuation configuration.
- Definition Classes
- TextMatcherInternalParams
-
def
setExcludeRegexPatterns(v: Array[String]): TextMatcherInternal.this.type
Sets the regular expression patterns for excluding specific elements during text processing.
- v
Array of strings where each string represents a regular expression pattern to be used for excluding matching text elements.
- returns
This instance with the updated configuration for exclude regex patterns.
- Definition Classes
- TextMatcherInternalParams
-
final
def
setInputCols(value: String*): TextMatcherInternal.this.type
- Definition Classes
- HasInputAnnotationCols
-
def
setInputCols(value: Array[String]): TextMatcherInternal.this.type
- Definition Classes
- HasInputAnnotationCols
-
def
setLazyAnnotator(value: Boolean): TextMatcherInternal.this.type
- Definition Classes
- CanBeLazy
-
def
setLemmaDict(value: Map[String, String]): TextMatcherInternal.this.type
Sets the internal dictionary used for lemmatization.
- value
a map where keys are words and values are their corresponding lemmas.
- returns
this
- Definition Classes
- TextMatcherInternalParams
-
def
setLemmatizerDictionary(path: String, keyDelimiter: String, valueDelimiter: String, readAs: Format = ReadAs.TEXT, options: Map[String, String] = Map("format" -> "text")): TextMatcherInternal.this.type
External dictionary to be used by the lemmatizer, which needs keyDelimiter and valueDelimiter for parsing the resource
-
def
setLemmatizerDictionary(value: ExternalResource): TextMatcherInternal.this.type
-
def
setMergeOverlapping(v: Boolean): TextMatcherInternal.this.type
Whether to merge overlapping matched chunks (Default: false)
-
final
def
setOutputCol(value: String): TextMatcherInternal.this.type
- Definition Classes
- HasOutputAnnotationCol
-
def
setReturnChunks(v: String): TextMatcherInternal.this.type
Sets the value of the returnChunks parameter used for text processing.
- v
A string value that specifies the configuration for returning chunks.
- returns
This instance with the updated returnChunks configuration.
- Definition Classes
- TextMatcherInternalParams
-
def
setSafeKeywords(value: ArrayList[String]): TextMatcherInternal.this.type
- Definition Classes
- TextMatcherInternalParams
-
def
setSafeKeywords(v: Array[String]): TextMatcherInternal.this.type
Sets the list of safe keywords to be used in text processing.
- v
Array of strings representing the safe keywords.
- returns
This instance with the updated configuration for safe keywords.
- Definition Classes
- TextMatcherInternalParams
-
def
setShuffleEntitySubTokens(value: Boolean): TextMatcherInternal.this.type
Setter for shuffleEntitySubTokens param
-
def
setSkipMatcherAugmentation(value: Boolean): TextMatcherInternal.this.type
Sets whether to skip augmentation for matcher patterns.
- value
If true, matcher patterns won't be augmented with lemmatization, stemming, etc. If false, matcher patterns will be augmented if the corresponding features are enabled.
- returns
This instance with the updated configuration.
- Definition Classes
- TextMatcherInternalParams
-
def
setSkipSourceTextAugmentation(value: Boolean): TextMatcherInternal.this.type
Sets whether to skip augmentation for source text.
- value
If true, source text won't be augmented with lemmatization, stemming, etc. If false, source text will be augmented if the corresponding features are enabled.
- returns
This instance with the updated configuration.
- Definition Classes
- TextMatcherInternalParams
-
def
setStopWords(value: ArrayList[String]): TextMatcherInternal.this.type
- Definition Classes
- TextMatcherInternalParams
-
def
setStopWords(v: Array[String]): TextMatcherInternal.this.type
Sets the list of stop words to be used in text processing.
- v
Array of strings representing the stop words.
- returns
This instance with the updated stop words setting.
- Definition Classes
- TextMatcherInternalParams
-
def
setTokenizer(tokenizer: TokenizerModel): TextMatcherInternal.this.type
The Tokenizer to perform tokenization with
-
val
shuffleEntitySubTokens: BooleanParam
Whether to generate and use variations (permutations) of the entity phrases. Defaults to false.
-
val
skipMatcherAugmentation: BooleanParam
A Boolean parameter that controls whether to skip augmentation (lemmatization, stemming, etc.) for matcher patterns.
When set to true, the matcher patterns won't be augmented with lemmatization, stemming, stopword removal, etc., even if those features are enabled. This applies only to entities/patterns being matched, not the source text.
Default value is false, meaning matcher patterns will be augmented if the corresponding features are enabled.
- Definition Classes
- TextMatcherInternalParams
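A sketch of asymmetric augmentation: the source text is normalized before matching while the entity patterns are used exactly as written in the file:

```scala
import com.johnsnowlabs.nlp.annotator.TextMatcherInternal

// Sketch: tokens from the input are lemmatized before matching, while the
// entity phrases from the entities file are kept verbatim.
val asymmetricMatcher = new TextMatcherInternal()
  .setInputCols("document", "token")
  .setOutputCol("entity")
  .setEnableLemmatizer(true)
  .setSkipMatcherAugmentation(true)
  .setSkipSourceTextAugmentation(false)
```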
-
val
skipSourceTextAugmentation: BooleanParam
A Boolean parameter that controls whether to skip augmentation (lemmatization, stemming, etc.) for the source text.
When set to true, the source text won't be augmented with lemmatization, stemming, stopword removal, etc., even if those features are enabled. This applies only to the source text being analyzed, not the matcher patterns.
Default value is false, meaning source text will be augmented if the corresponding features are enabled.
- Definition Classes
- TextMatcherInternalParams
-
val
stopWords: StringArrayParam
A parameter representing the list of stop words to be filtered out during text processing.
By default, it is set to the English stop words provided by Spark ML.
- Definition Classes
- TextMatcherInternalParams
-
final
def
synchronized[T0](arg0: ⇒ T0): T0
- Definition Classes
- AnyRef
-
def
toString(): String
- Definition Classes
- Identifiable → AnyRef → Any
-
val
tokenizer: StructFeature[TokenizerModel]
The Tokenizer to perform tokenization with
-
def
train(dataset: Dataset[_], recursivePipeline: Option[PipelineModel]): TextMatcherInternalModel
- Definition Classes
- TextMatcherInternal → AnnotatorApproach
-
final
def
transformSchema(schema: StructType): StructType
- Definition Classes
- AnnotatorApproach → PipelineStage
-
def
transformSchema(schema: StructType, logging: Boolean): StructType
- Attributes
- protected
- Definition Classes
- PipelineStage
- Annotations
- @DeveloperApi()
-
val
uid: String
- Definition Classes
- TextMatcherInternal → Identifiable
-
def
validate(schema: StructType): Boolean
- Attributes
- protected
- Definition Classes
- AnnotatorApproach
-
final
def
wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
-
def
write: MLWriter
- Definition Classes
- ParamsAndFeaturesWritable → DefaultParamsWritable → MLWritable
Inherited from ParamsAndFeaturesWritable
Inherited from TextMatcherInternalParams
Inherited from HasFeatures
Inherited from AnnotatorApproach[TextMatcherInternalModel]
Inherited from CanBeLazy
Inherited from DefaultParamsWritable
Inherited from MLWritable
Inherited from HasOutputAnnotatorType
Inherited from HasOutputAnnotationCol
Inherited from HasInputAnnotationCols
Inherited from Estimator[TextMatcherInternalModel]
Inherited from PipelineStage
Inherited from Logging
Inherited from Params
Inherited from Serializable
Inherited from Serializable
Inherited from Identifiable
Inherited from AnyRef
Inherited from Any
Annotator types
Required input and expected output annotator types
Parameters
A list of (hyper-)parameter keys this annotator can take. Users can set and get the parameter values through setters and getters, respectively.