sparknlp_jsl.annotator.splitter.internal_document_splitter#

Module Contents#

Classes#

InternalDocumentSplitter

Annotator which splits large documents into small documents.

class InternalDocumentSplitter(classname='com.johnsnowlabs.nlp.annotators.splitter.InternalDocumentSplitter', java_model=None)#

Bases: sparknlp_jsl.common.AnnotatorModelInternal, sparknlp.annotator.DocumentCharacterTextSplitter

Annotator which splits large documents into small documents.

InternalDocumentSplitter provides the setSplitMode method to decide how to split documents. If splitMode is ‘recursive’, it takes the separators in order and splits subtexts that exceed the chunk length, considering an optional overlap between chunks.

For example, given chunk size 20 and overlap 5:

"He was, I take it, the most perfect reasoning and observing machine that the world has seen."

["He was, I take it,", "it, the most", "most perfect", "reasoning and", "and observing", "machine that the", "the world has seen."]

Additionally, you can set (see the sketch after this list):

  • custom patterns with setSplitPatterns

  • whether patterns should be interpreted as regex with setPatternsAreRegex

  • whether to keep the separators with setKeepSeparators

  • whether to trim whitespaces with setTrimWhitespace

  • whether to explode the splits to individual rows with setExplodeSplits
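A minimal sketch combining these setters is shown below. The chunk sizes and split patterns are illustrative assumptions, not library defaults, and the example assumes InternalDocumentSplitter is importable from sparknlp_jsl.annotator as with other JSL annotators.

>>> from sparknlp_jsl.annotator import InternalDocumentSplitter
>>> # Illustrative values below; the chunk sizes and split patterns are
>>> # assumptions for this sketch, not library defaults.
>>> customSplitter = InternalDocumentSplitter() \
...     .setInputCols(["document"]) \
...     .setOutputCol("splits") \
...     .setSplitMode("recursive") \
...     .setChunkSize(1000) \
...     .setChunkOverlap(100) \
...     .setSplitPatterns(["\n\n", "\n", " ", ""]) \
...     .setPatternsAreRegex(False) \
...     .setKeepSeparators(True) \
...     .setTrimWhitespace(True) \
...     .setExplodeSplits(True)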

Input Annotation types: DOCUMENT

Output Annotation type: DOCUMENT

Optional Input Annotation types: DOCUMENT(Sentence), TOKEN

chunkSize

Size of each chunk of text. This param is applicable only for “recursive” splitMode.

chunkOverlap

Length of the overlap between text chunks, by default 0. This param is applicable only for “recursive” splitMode.

splitPatterns

Patterns to split the document. Default for recursive mode: ["\n\n", "\n", " ", ""]; for regex mode: ["(?x) (?: [\t\r\f\v]*? \n ){2} [\t\r\f\v]*?"].

patternsAreRegex

Whether to interpret the split patterns as regular expressions, by default True.

keepSeparators

Whether to keep the separators in the final result, by default True. This param is applicable only for “recursive” splitMode.

explodeSplits

Whether to explode split chunks to separate rows, by default False.

trimWhitespace

Whether to trim whitespaces of extracted chunks, by default True.

splitMode
The split mode to determine how text should be segmented. Default: ‘regex’. It should be one of the following values:
  • “char”: Split text based on individual characters.

  • “token”: Split text based on tokens. You should supply tokens from inputCols.

  • “sentence”: Split text based on sentences. You should supply sentences from inputCols.

  • “recursive”: Split text recursively using a specific algorithm.

  • “regex”: Split text based on a regular expression pattern.

sentenceAwareness

Whether to split the document with sentence awareness if possible. If true, the split process can stop before maxLength. If true, you should supply sentences from inputCols. Default: False. This param is not applicable for “regex” and “recursive” splitMode.

maxLength

The maximum length allowed for splitting. The mode in which the maximum length is specified:
  • “char”: Maximum length is measured in characters. Default: 512

  • “token”: Maximum length is measured in tokens. Default: 128

  • “sentence”: Maximum length is measured in sentences. Default: 8

customBoundsStrategy

The custom bounds strategy for text splitting using regular expressions. This param is applicable only for “regex” splitMode.

caseSensitive

Whether regex matching is case sensitive, by default False. This param is applicable only for “regex” splitMode.

metaDataFields

Metadata fields: data from the specified columns is added to the metadata of the split documents. Set the names of the columns to read.

enableSentenceIncrement

Whether the sentence index should be incremented in the metadata of the annotator. When set to true, the annotator will increment the sentence index in the metadata for each split document. Default: False.

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> textDF = spark.read.text(
...    "/home/john/resources/spell/sherlockholmes.txt",
...    wholetext=True
... ).toDF("text")
>>> documentAssembler = DocumentAssembler().setInputCol("text")
>>> textSplitter = InternalDocumentSplitter() \
...     .setInputCols(["document"]) \
...     .setOutputCol("splits") \
...     .setSplitMode("recursive") \
...     .setChunkSize(20000) \
...     .setChunkOverlap(200) \
...     .setExplodeSplits(True)
>>> pipeline = Pipeline().setStages([documentAssembler, textSplitter])
>>> result = pipeline.fit(textDF).transform(textDF)
>>> result.selectExpr(
...       "splits.result",
...       "splits[0].begin",
...       "splits[0].end",
...       "splits[0].end - splits[0].begin as length") \
...     .show(8, truncate = 80)
    +--------------------------------------------------------------------------------+---------------+-------------+------+
    |                                                                          result|splits[0].begin|splits[0].end|length|
    +--------------------------------------------------------------------------------+---------------+-------------+------+
    |[ Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyl...|              0|        19994| 19994|
    |["And Mademoiselle's address?" he asked.
“Is Briony Lodge, Serpentine Aven…| 19798| 39395| 19597|

|[“How did that help you?”

“It was all-important. When a woman thinks that …| 39371| 59242| 19871|

|[“‘But,’ said I, ‘there would be millions of red-headed men who

would apply….| 59166| 77833| 18667|

|[My friend was an enthusiastic musician, being himself not only a

very capab…| 77835| 97769| 19934|

|[“And yet I am not convinced of it,” I answered. “The cases which

come to li…| 97771| 117248| 19477|

|[“Well, she had a slate-coloured, broad-brimmed straw hat, with a

feather of…| 117250| 137242| 19992|

|[“That sounds a little paradoxical.”

“But it is profoundly True. Singulari…| 137244| 157171| 19927|
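For the “sentence” split mode, the documentation above states that sentences must be supplied via inputCols. The following is a minimal sketch under that assumption, using a standard SentenceDetector stage; the column names "sentence" and "splits" are illustrative, and documentAssembler and Pipeline are reused from the example above.

>>> from sparknlp.annotator import SentenceDetector
>>> # Sentences are produced by a SentenceDetector stage and passed to the
>>> # splitter through inputCols, as "sentence" mode requires.
>>> sentenceDetector = SentenceDetector() \
...     .setInputCols(["document"]) \
...     .setOutputCol("sentence")
>>> sentenceSplitter = InternalDocumentSplitter() \
...     .setInputCols(["document", "sentence"]) \
...     .setOutputCol("splits") \
...     .setSplitMode("sentence") \
...     .setMaxLength(8) \
...     .setExplodeSplits(True)
>>> sentencePipeline = Pipeline().setStages(
...     [documentAssembler, sentenceDetector, sentenceSplitter])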
caseSensitive#
chunkOverlap#
chunkSize#
customBoundsStrategy#
enableSentenceIncrement#
explodeSplits#
getter_attrs = []#
inputAnnotatorTypes#
inputCols#
keepSeparators#
lazyAnnotator#
maxLength#
metaDataFields#
name = 'InternalDocumentSplitter'#
optionalInputAnnotatorTypes#
outputAnnotatorType#
outputCol#
patternsAreRegex#
sentenceAwareness#
skipLPInputColsValidation = True#
splitMode#
splitPatterns#
trimWhitespace#
uid#
clear(param: pyspark.ml.param.Param) None#

Clears a param from the param map if it has been explicitly set.

copy(extra: pyspark.ml._typing.ParamMap | None = None) JP#

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.

Parameters:

extra (dict, optional) – Extra parameters to copy to the new instance

Returns:

Copy of this instance

Return type:

JavaParams

explainParam(param: str | Param) str#

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams() str#

Returns the documentation of all params with their optionally default values and user-supplied values.

extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) pyspark.ml._typing.ParamMap#

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters:

extra (dict, optional) – extra param values

Returns:

merged param map

Return type:

dict

getInputCols()#

Gets current column names of input annotations.

getLazyAnnotator()#

Gets whether Annotator should be evaluated lazily in a RecursivePipeline.

getOrDefault(param: str) Any#
getOrDefault(param: Param[T]) T

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getOutputCol()#

Gets output column name of annotations.

getParam(paramName: str) Param#

Gets a param by its name.

getParamValue(paramName)#

Gets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

hasDefault(param: str | Param[Any]) bool#

Checks whether a param has a default value.

hasParam(paramName: str) bool#

Tests whether this instance contains a param with a given (string) name.

inputColsValidation(value)#
isDefined(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user or has a default value.

isSet(param: str | Param[Any]) bool#

Checks whether a param is explicitly set by user.

classmethod load(path: str) RL#

Reads an ML instance from the input path, a shortcut of read().load(path).

static pretrained(name='document_splitter', lang='en', remote_loc='clinical/models')#

Download a pre-trained InternalDocumentSplitter.

Parameters:
  • name (str) – Name of the pre-trained model, by default “document_splitter”

  • lang (str) – Language of the pre-trained model, by default “en”

  • remote_loc (str) – Remote location of the pre-trained model. If None, use the open-source location. Other values are “clinical/models”, “finance/models”, or “legal/models”.

Returns:

A pre-trained InternalDocumentSplitter

Return type:

InternalDocumentSplitter
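A usage sketch based on the defaults documented above; model availability depends on your license and environment.

>>> # Downloads the pre-trained splitter using the documented default
>>> # name, language, and remote location.
>>> documentSplitter = InternalDocumentSplitter.pretrained(
...     "document_splitter", "en", "clinical/models") \
...     .setInputCols(["document"]) \
...     .setOutputCol("splits")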

classmethod read()#

Returns an MLReader instance for this class.

save(path: str) None#

Save this ML instance to the given path, a shortcut of ‘write().save(path)’.

set(param: Param, value: Any) None#

Sets a parameter in the embedded param map.

setCaseSensitive(value: bool)#

Sets whether regex matching is case sensitive.

Parameters:

value (bool) – Whether regex matching is case sensitive, by default False.

setChunkOverlap(value)#

Sets length of the overlap between text chunks, by default 0.

Parameters:

value (int) – Length of the overlap between text chunks

setChunkSize(value)#

Sets size of each chunk of text.

Parameters:

value (int) – Size of each chunk of text

setCustomBoundsStrategy(value: str)#

Sets the custom bounds strategy for text splitting using regular expressions.

The custom bounds strategy to be set, by default “prepend”. It should be one of the following values:
  • “none”: No custom bounds are applied.

  • “prepend”: Custom bounds are prepended to the split documents.

  • “append”: Custom bounds are appended to the split documents.

Parameters:

value (str) – The custom bounds strategy for text splitting using regular expressions, by default “prepend”.
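A sketch of a regex-mode splitter that prepends the matched bounds to each split document. The section-heading pattern is a made-up example for illustration, not a library default.

>>> # The section-heading regex below is an illustrative assumption about the
>>> # input documents, not a library default.
>>> regexSplitter = InternalDocumentSplitter() \
...     .setInputCols(["document"]) \
...     .setOutputCol("splits") \
...     .setSplitMode("regex") \
...     .setPatternsAreRegex(True) \
...     .setSplitPatterns(["(?m)^[A-Z ]+:"]) \
...     .setCustomBoundsStrategy("prepend")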

setEnableSentenceIncrement(value: bool)#

Sets whether the sentence index should be incremented in the metadata of the annotator.

When set to true, the annotator will increment the sentence index in the metadata for each split document. Default: False.

Parameters:

value (bool) – Whether the sentence index should be incremented in the metadata of the annotator. Default: False.

setExplodeSplits(value)#

Sets whether to explode split chunks to separate rows, by default False.

Parameters:

value (bool) – Whether to explode split chunks to separate rows

setForceInputTypeValidation(etfm)#
setInputCols(*value)#

Sets column names of input annotations.

Parameters:

*value (str) – Input columns for the annotator

setKeepSeparators(value)#

Sets whether to keep the separators in the final result, by default True.

Parameters:

value (bool) – Whether to keep the separators in the final result

setLazyAnnotator(value)#

Sets whether Annotator should be evaluated lazily in a RecursivePipeline.

Parameters:

value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline

setMaxLength(value: int)#

Sets the maximum length allowed for splitting.

The mode in which the maximum length is specified:
  • “char”: Maximum length is measured in characters. Default: 512

  • “token”: Maximum length is measured in tokens. Default: 128

  • “sentence”: Maximum length is measured in sentences. Default: 8

Parameters:

value (int) – The maximum length allowed for splitting.

setMetaDataFields(value)#

Sets the metadata fields: data from the specified columns is added to the metadata of the split documents.

Provide the names of the columns to read.

Parameters:

value (List[str], list of column names) – Columns whose data is added to the metadata of the split documents.
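For example, the sketch below copies a hypothetical source_file column from the input DataFrame into the metadata of each split document; the column name is an assumption for illustration.

>>> # "source_file" is a hypothetical column assumed to exist in the input DataFrame.
>>> splitterWithMeta = InternalDocumentSplitter() \
...     .setInputCols(["document"]) \
...     .setOutputCol("splits") \
...     .setSplitMode("recursive") \
...     .setChunkSize(1000) \
...     .setMetaDataFields(["source_file"])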

setOutputCol(value)#

Sets output column name of annotations.

Parameters:

value (str) – Name of output column

setParamValue(paramName)#

Sets the value of a parameter.

Parameters:

paramName (str) – Name of the parameter

setParams()#
setPatternsAreRegex(value)#

Sets whether to interpret the split patterns as regular expressions, by default False.

Parameters:

value (bool) – Whether to interpret the split patterns as regular expressions

setSentenceAwareness(value: bool)#

Sets whether to split the document with sentence awareness if possible.

If true, the split process can stop before maxLength. If true, you should supply sentences from inputCols. Default: False.

Parameters:

value (bool) – Whether to enable sentence awareness

setSplitMode(value: str)#

Sets the split mode to determine how text should be segmented.

Parameters:

value (str) – The split mode to be set, by default ‘regex’. It should be one of the following values:
  • “char”: Split text based on individual characters.

  • “token”: Split text based on tokens. You should supply tokens from inputCols.

  • “sentence”: Split text based on sentences. You should supply sentences from inputCols.

  • “recursive”: Split text recursively using a specific algorithm.

  • “regex”: Split text based on a regular expression pattern.

setSplitPatterns(value)#

Sets patterns to separate the text by in decreasing priority, by default `["\n\n", "\n", " ", ""]`.

Parameters:

value (List[str]) – Patterns to separate the text by in decreasing priority

setTrimWhitespace(value)#

Sets whether to trim whitespaces of extracted chunks, by default True.

Parameters:

value (bool) – Whether to trim whitespaces of extracted chunks

transform(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = None) pyspark.sql.dataframe.DataFrame#

Transforms the input dataset with optional parameters.

New in version 1.3.0.

Parameters:
  • dataset (pyspark.sql.DataFrame) – input dataset

  • params (dict, optional) – an optional param map that overrides embedded params.

Returns:

transformed dataset

Return type:

pyspark.sql.DataFrame

write() JavaMLWriter#

Returns an MLWriter instance for this ML instance.