sparknlp.annotator.ChunkTokenizer

class sparknlp.annotator.ChunkTokenizer[source]

Bases: sparknlp.annotator.Tokenizer

Tokenizes and flattens extracted NER chunks.

The ChunkTokenizer will split the extracted NER CHUNK type Annotations and will create TOKEN type Annotations. The result is then flattened, resulting in a single array.

Input Annotation types: CHUNK

Output Annotation type: TOKEN

Parameters
None

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from sparknlp.common import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentenceDetector = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("token")
>>> entityExtractor = TextMatcher() \
...     .setInputCols(["sentence", "token"]) \
...     .setEntities("src/test/resources/entity-extractor/test-chunks.txt", ReadAs.TEXT) \
...     .setOutputCol("entity")
>>> chunkTokenizer = ChunkTokenizer() \
...     .setInputCols(["entity"]) \
...     .setOutputCol("chunk_token")
>>> pipeline = Pipeline().setStages([
...         documentAssembler,
...         sentenceDetector,
...         tokenizer,
...         entityExtractor,
...         chunkTokenizer
... ])
>>> data = spark.createDataFrame([
...     ["Hello world, my name is Michael, I am an artist and I work at Benezar"],
...     ["Robert, an engineer from Farendell, graduated last year. The other one, Lucas, graduated last week."]
... ]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("entity.result as entity", "chunk_token.result as chunk_token").show(truncate=False)
+-----------------------------------------------+---------------------------------------------------+
|entity                                         |chunk_token                                        |
+-----------------------------------------------+---------------------------------------------------+
|[world, Michael, work at Benezar]              |[world, Michael, work, at, Benezar]                |
|[engineer from Farendell, last year, last week]|[engineer, from, Farendell, last, year, last, week]|
+-----------------------------------------------+---------------------------------------------------+

Methods

__init__()

addContextChars(value)

Adds an additional character to the list used to separate from token boundaries.

addException(value)

Adds an additional word that won't be affected by tokenization rules.

addInfixPattern(value)

Adds an additional regex pattern that matches tokens within a single target.

addSplitChars(value)

Adds an additional character to separate from the inside of tokens.

clear(param)

Clears a param from the param map if it has been explicitly set.

copy([extra])

Creates a copy of this instance with the same uid and some extra params.

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optionally default values and user-supplied values.

extractParamMap([extra])

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

fit(dataset[, params])

Fits a model to the input dataset with optional parameters.

fitMultiple(dataset, paramMaps)

Fits a model to the input dataset for each param map in paramMaps.

getCaseSensitiveExceptions()

Gets whether to care for case sensitiveness in exceptions.

getContextChars()

Gets character list used to separate from token boundaries.

getExceptions()

Gets words that won't be affected by tokenization rules.

getInfixPatterns()

Gets regex patterns that match tokens within a single target.

getInputCols()

Gets current column names of input annotations.

getLazyAnnotator()

Gets whether Annotator should be evaluated lazily in a RecursivePipeline.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value.

getOutputCol()

Gets output column name of annotations.

getParam(paramName)

Gets a param by its name.

getParamValue(paramName)

Gets the value of a parameter.

getPrefixPattern()

Gets regex with groups and begins with \A to match target prefix.

getSplitChars()

Gets character list used to separate from the inside of tokens.

getSuffixPattern()

Gets regex with groups and ends with \z to match target suffix.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of 'write().save(path)'.

set(param, value)

Sets a parameter in the embedded param map.

setCaseSensitiveExceptions(value)

Sets whether to care for case sensitiveness in exceptions, by default True.

setContextChars(value)

Sets character list used to separate from token boundaries, by default ['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '"', "'"].

setExceptions(value)

Sets words that won't be affected by tokenization rules.

setInfixPatterns(value)

Sets regex patterns that match tokens within a single target.

setInputCols(*value)

Sets column names of input annotations.

setLazyAnnotator(value)

Sets whether Annotator should be evaluated lazily in a RecursivePipeline.

setMaxLength(value)

Sets the maximum allowed length for each token, by default 99999.

setMinLength(value)

Sets the minimum allowed length for each token, by default 0.

setOutputCol(value)

Sets output column name of annotations.

setParamValue(paramName)

Sets the value of a parameter.

setPrefixPattern(value)

Sets regex with groups and begins with \A to match target prefix, by default \A([^\s\w\$\.]*).

setSplitChars(value)

Sets character list used to separate from the inside of tokens.

setSplitPattern(value)

Sets pattern to separate from the inside of tokens.

setSuffixPattern(value)

Sets regex with groups and ends with \z to match target suffix, by default ([^\s\w]?)([^\s\w]*)\z.

setTargetPattern(value)

Sets pattern to grab from text as token candidates, by default \S+.

write()

Returns an MLWriter instance for this ML instance.

Attributes

caseSensitiveExceptions

contextChars

exceptions

exceptionsPath

getter_attrs

infixPatterns

inputCols

lazyAnnotator

maxLength

minLength

name

outputCol

params

Returns all params ordered by name.

prefixPattern

splitChars

splitPattern

suffixPattern

targetPattern

addContextChars(value)

Adds an additional character to the list used to separate from token boundaries.

Parameters
value : str

Additional context character

addException(value)

Adds an additional word that won’t be affected by tokenization rules.

Parameters
value : str

Additional word that won’t be affected by tokenization rules
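For example (the exception value is only illustrative), a multi-word term inside a chunk can be kept as a single token:

>>> chunkTokenizer = ChunkTokenizer() \
...     .setInputCols(["entity"]) \
...     .setOutputCol("chunk_token") \
...     .addException("New York")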

addInfixPattern(value)

Adds an additional regex pattern that matches tokens within a single target. Groups identify different sub-tokens.

Parameters
value : str

Regex pattern that matches tokens within a single target
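As an illustrative sketch (the regex is an example, not a recommended default), a pattern with two groups could split a currency symbol from the digits that follow it, so "$45" inside a chunk would yield the sub-tokens "$" and "45":

>>> chunkTokenizer = ChunkTokenizer() \
...     .setInputCols(["entity"]) \
...     .setOutputCol("chunk_token") \
...     .addInfixPattern("(\\$)(\\d+)")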

addSplitChars(value)

Adds an additional character to separate from the inside of tokens.

Parameters
value : str

Additional character to separate from the inside of tokens
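For instance (illustrative only), adding a hyphen would split hyphenated words inside a chunk into separate tokens:

>>> chunkTokenizer = ChunkTokenizer() \
...     .setInputCols(["entity"]) \
...     .setOutputCol("chunk_token") \
...     .addSplitChars("-")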

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then make a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optionally default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

fit(dataset, params=None)

Fits a model to the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params. If a list/tuple of param maps is given, this calls fit on each param map and returns a list of models.

Returns

fitted model(s)

New in version 1.3.0.

fitMultiple(dataset, paramMaps)

Fits a model to the input dataset for each param map in paramMaps.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame.

  • paramMaps – A Sequence of param maps.

Returns

A thread safe iterable which contains one model for each param map. Each call to next(modelIterator) will return (index, model) where model was fit using paramMaps[index]. index values may not be sequential.

New in version 2.3.0.

getCaseSensitiveExceptions()

Gets whether to care for case sensitiveness in exceptions.

Returns
bool

Whether to care for case sensitiveness in exceptions

getContextChars()

Gets character list used to separate from token boundaries.

Returns
List[str]

Character list used to separate from token boundaries

getExceptions()

Gets words that won’t be affected by tokenization rules.

Returns
List[str]

Words that won’t be affected by tokenization rules

getInfixPatterns()

Gets regex patterns that match tokens within a single target. Groups identify different sub-tokens.

Returns
List[str]

The infix patterns

getInputCols()

Gets current column names of input annotations.

getLazyAnnotator()

Gets whether Annotator should be evaluated lazily in a RecursivePipeline.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()

Gets output column name of annotations.

getParam(paramName)

Gets a param by its name.

getParamValue(paramName)

Gets the value of a parameter.

Parameters
paramName : str

Name of the parameter

getPrefixPattern()

Gets regex with groups and begins with \A to match target prefix.

Returns
str

The prefix pattern

getSplitChars()

Gets character list used to separate from the inside of tokens.

Returns
List[str]

Character list used to separate from the inside of tokens

getSuffixPattern()

Gets regex with groups and ends with \z to match target suffix.

Returns
str

The suffix pattern

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of 'write().save(path)'.

set(param, value)

Sets a parameter in the embedded param map.

setCaseSensitiveExceptions(value)

Sets whether to care for case sensitiveness in exceptions, by default True.

Parameters
value : bool

Whether to care for case sensitiveness in exceptions
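A minimal sketch combining exceptions with case-insensitive matching (values are illustrative), so that "new york" inside a chunk would also be kept whole:

>>> chunkTokenizer = ChunkTokenizer() \
...     .setInputCols(["entity"]) \
...     .setOutputCol("chunk_token") \
...     .setExceptions(["New York"]) \
...     .setCaseSensitiveExceptions(False)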

setContextChars(value)

Sets character list used to separate from token boundaries, by default ['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '"', "'"].

Parameters
value : List[str]

Character list used to separate from token boundaries
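For example, a reduced character list would leave the remaining punctuation attached to its token (the list below is illustrative, not the default):

>>> chunkTokenizer = ChunkTokenizer() \
...     .setInputCols(["entity"]) \
...     .setOutputCol("chunk_token") \
...     .setContextChars([".", ",", ";", ":", "!", "?"])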

setExceptions(value)

Sets words that won’t be affected by tokenization rules.

Parameters
value : List[str]

Words that won’t be affected by tokenization rules

setInfixPatterns(value)

Sets regex patterns that match tokens within a single target. Groups identify different sub-tokens.

Parameters
value : List[str]

Regex patterns that match tokens within a single target

setInputCols(*value)

Sets column names of input annotations.

Parameters
*value : str

Input columns for the annotator

setLazyAnnotator(value)

Sets whether Annotator should be evaluated lazily in a RecursivePipeline.

Parameters
value : bool

Whether Annotator should be evaluated lazily in a RecursivePipeline

setMaxLength(value)

Sets the maximum allowed length for each token, by default 99999.

Parameters
value : int

Maximum allowed length for each token

setMinLength(value)

Sets the minimum allowed length for each token, by default 0.

Parameters
value : int

Minimum allowed length for each token
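Both bounds can be combined to drop very short or very long tokens (the limits below are illustrative):

>>> chunkTokenizer = ChunkTokenizer() \
...     .setInputCols(["entity"]) \
...     .setOutputCol("chunk_token") \
...     .setMinLength(2) \
...     .setMaxLength(30)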

setOutputCol(value)

Sets output column name of annotations.

Parameters
value : str

Name of output column

setParamValue(paramName)

Sets the value of a parameter.

Parameters
paramName : str

Name of the parameter

setPrefixPattern(value)

Sets regex with groups and begins with \A to match target prefix, by default \A([^\s\w\$\.]*).

Parameters
value : str

Regex with groups and begins with \A to match target prefix
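As an illustrative variant of the default (sketch only), removing $ and . from the negated character class would also split those characters off as token prefixes:

>>> chunkTokenizer = ChunkTokenizer() \
...     .setInputCols(["entity"]) \
...     .setOutputCol("chunk_token") \
...     .setPrefixPattern("\\A([^\\s\\w]*)")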

setSplitChars(value)

Sets character list used to separate from the inside of tokens.

Parameters
value : List[str]

Character list used to separate from the inside of tokens

setSplitPattern(value)

Sets pattern to separate from the inside of tokens. Takes priority over splitChars.

Parameters
value : str

Pattern used to separate from the inside of tokens
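For example, a single regex can take the place of the split character list (the pattern below, splitting on hyphens and underscores, is only illustrative):

>>> chunkTokenizer = ChunkTokenizer() \
...     .setInputCols(["entity"]) \
...     .setOutputCol("chunk_token") \
...     .setSplitPattern("[-_]")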

setSuffixPattern(value)

Sets regex with groups and ends with \z to match target suffix, by default ([^\s\w]?)([^\s\w]*)\z.

Parameters
value : str

Regex with groups and ends with \z to match target suffix
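An illustrative variant (not the default) that keeps trailing periods attached to the token by excluding . from the suffix groups:

>>> chunkTokenizer = ChunkTokenizer() \
...     .setInputCols(["entity"]) \
...     .setOutputCol("chunk_token") \
...     .setSuffixPattern("([^\\s\\w\\.]?)([^\\s\\w\\.]*)\\z")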

setTargetPattern(value)

Sets pattern to grab from text as token candidates, by default \S+.

Parameters
value : str

Pattern to grab from text as token candidates
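For instance (illustrative only; the default \S+ grabs any non-whitespace run), token candidates could be restricted to alphabetic runs:

>>> chunkTokenizer = ChunkTokenizer() \
...     .setInputCols(["entity"]) \
...     .setOutputCol("chunk_token") \
...     .setTargetPattern("[A-Za-z]+")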

uid

A unique id for the object.

write()

Returns an MLWriter instance for this ML instance.