sparknlp_jsl.annotator.splitter.internal_document_splitter#
Module Contents#
Classes#
InternalDocumentSplitter: Annotator which splits large documents into small documents.
- class InternalDocumentSplitter(classname='com.johnsnowlabs.nlp.annotators.splitter.InternalDocumentSplitter', java_model=None)#
Bases: sparknlp_jsl.common.AnnotatorModelInternal, sparknlp.annotator.DocumentCharacterTextSplitter
Annotator which splits large documents into small documents.
InternalDocumentSplitter provides a setSplitMode method to decide how to split documents. If splitMode is ‘recursive’, it takes the separators in order and splits subtexts if they are over the chunk length, considering optional overlap of the chunks.
For example, given chunk size 20 and overlap 5:
"He was, I take it, the most perfect reasoning and observing machine that the world has seen." ["He was, I take it,", "it, the most", "most perfect", "reasoning and", "and observing", "machine that the", "the world has seen."]
Additionally, you can set:
- custom patterns with setSplitPatterns
- whether patterns should be interpreted as regex with setPatternsAreRegex
- whether to keep the separators with setKeepSeparators
- whether to trim whitespaces with setTrimWhitespace
- whether to explode the splits to individual rows with setExplodeSplits
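The sketch below illustrates these optional setters on one annotator instance; the values are illustrative (here assuming the recursive split mode) and are not necessarily the defaults.

# Sketch: optional configuration of an InternalDocumentSplitter (illustrative values).
from sparknlp_jsl.annotator import InternalDocumentSplitter

splitter = (
    InternalDocumentSplitter()
    .setInputCols(["document"])
    .setOutputCol("splits")
    .setSplitMode("recursive")
    # custom separator patterns, here the recursive-mode defaults
    .setSplitPatterns(["\n\n", "\n", " ", ""])
    # treat the patterns as plain strings rather than regular expressions
    .setPatternsAreRegex(False)
    # keep the separators in the resulting chunks
    .setKeepSeparators(True)
    # trim whitespace around the extracted chunks
    .setTrimWhitespace(True)
    # emit one output row per split instead of one array per document
    .setExplodeSplits(True)
)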
| Input Annotation types | Output Annotation type | Optional Input Annotation types |
|---|---|---|
| DOCUMENT | DOCUMENT | DOCUMENT(Sentence), TOKEN |

Parameters
- chunkSize
Size of each chunk of text. This param is applicable only for “recursive” splitMode.
- chunkOverlap
Length of the overlap between text chunks, by default 0. This param is applicable only for “recursive” splitMode.
- splitPatterns
Patterns to split the document. Default for recursive mode ["\n\n", "\n", " ", ""], for regex mode ["(?x) (?: [\ \t\r\f\v]*? \n ){2} [\ \t\r\f\v]*?"].
- patternsAreRegex
Whether to interpret the split patterns as regular expressions, by default True.
- keepSeparators
Whether to keep the separators in the final result, by default True. This param is applicable only for “recursive” splitMode.
- explodeSplits
Whether to explode split chunks to separate rows, by default False.
- trimWhitespace
Whether to trim whitespaces of extracted chunks, by default True.
- splitMode
The split mode to determine how text should be segmented. Default: ‘regex’. It should be one of the following values (a short sketch of the sentence-based modes follows this parameter list):
- “char”: Split text based on individual characters.
- “token”: Split text based on tokens. You should supply tokens from inputCols.
- “sentence”: Split text based on sentences. You should supply sentences from inputCols.
- “recursive”: Split text recursively using a specific algorithm.
- “regex”: Split text based on a regular expression pattern.
- sentenceAwareness
Whether to split the document by sentence awareness if possible. If true, it can stop the split process before maxLength. If true, you should supply sentences from inputCols. Default: False. This param is not applicable for “regex” and “recursive” splitMode.
- maxLength
The maximum length allowed for splitting. The unit depends on the splitMode:
- “char”: maximum length is measured in characters. Default: 512
- “token”: maximum length is measured in tokens. Default: 128
- “sentence”: maximum length is measured in sentences. Default: 8
- customBoundsStrategy
The custom bounds strategy for text splitting using regular expressions. This param is applicable only for “regex” splitMode.
- caseSensitive
Whether regex matching should be case sensitive, by default False. This param is applicable only for “regex” splitMode.
- metaDataFields
Names of columns whose values should be added to the metadata of the split documents; see the regex-mode sketch after the example below.
- enableSentenceIncrement
Whether the sentence index should be incremented in the metadata of the annotator. When set to true, the annotator will increment the sentence index in the metadata for each split document. Default: False.
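Before the full example below, here is a brief hedged sketch of the sentence-based configuration referenced by splitMode, sentenceAwareness, and maxLength. It assumes a running SparkSession and that sentences are supplied through inputCols; the column names are illustrative.

# Sketch: sentence-based splitting, with sentences supplied via inputCols.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector
from sparknlp_jsl.annotator import InternalDocumentSplitter

document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
# Supply sentences to the splitter, as required for the "sentence" split mode.
sentence_detector = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")

splitter = (
    InternalDocumentSplitter()
    .setInputCols(["document", "sentence"])
    .setOutputCol("splits")
    .setSplitMode("sentence")
    # maximum length is measured in sentences for this mode (default 8)
    .setMaxLength(8)
    .setExplodeSplits(True)
)

pipeline = Pipeline().setStages([document_assembler, sentence_detector, splitter])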
Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> textDF = spark.read.text(
...     "/home/john/resources/spell/sherlockholmes.txt",
...     wholetext=True
... ).toDF("text")
>>> documentAssembler = DocumentAssembler().setInputCol("text")
>>> textSplitter = InternalDocumentSplitter() \
...     .setInputCols(["document"]) \
...     .setOutputCol("splits") \
...     .setSplitMode("recursive") \
...     .setChunkSize(20000) \
...     .setChunkOverlap(200) \
...     .setExplodeSplits(True)
>>> pipeline = Pipeline().setStages([documentAssembler, textSplitter])
>>> result = pipeline.fit(textDF).transform(textDF)
>>> result.selectExpr(
...     "splits.result",
...     "splits[0].begin",
...     "splits[0].end",
...     "splits[0].end - splits[0].begin as length") \
...     .show(8, truncate = 80)
+--------------------------------------------------------------------------------+---------------+-------------+------+
|                                                                          result|splits[0].begin|splits[0].end|length|
+--------------------------------------------------------------------------------+---------------+-------------+------+
|[ Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyl...|              0|        19994| 19994|
|["And Mademoiselle's address?" he asked. "Is Briony Lodge, Serpentine Aven…     |          19798|        39395| 19597|
|[“How did that help you?” “It was all-important. When a woman thinks that …     |          39371|        59242| 19871|
|[“‘But,’ said I, ‘there would be millions of red-headed men who would apply….   |          59166|        77833| 18667|
|[My friend was an enthusiastic musician, being himself not only a very capab…   |          77835|        97769| 19934|
|[“And yet I am not convinced of it,” I answered. “The cases which come to li…   |          97771|       117248| 19477|
|[“Well, she had a slate-coloured, broad-brimmed straw hat, with a feather of…   |         117250|       137242| 19992|
|[“That sounds a little paradoxical.” “But it is profoundly true. Singulari…     |         137244|       157171| 19927|
+--------------------------------------------------------------------------------+---------------+-------------+------+
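A further hedged sketch, complementing the example above: regex split mode with column values copied into the split documents' metadata via setMetaDataFields. It assumes a running SparkSession named spark; the DataFrame columns "doc_id" and "text" are illustrative, not part of the API.

# Sketch: regex split mode with metadata columns (illustrative values).
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp_jsl.annotator import InternalDocumentSplitter

data = spark.createDataFrame(
    [("doc-1", "First paragraph.\n\nSecond paragraph.\n\nThird paragraph.")]
).toDF("doc_id", "text")

document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

splitter = (
    InternalDocumentSplitter()
    .setInputCols(["document"])
    .setOutputCol("splits")
    .setSplitMode("regex")
    # split on blank lines; the pattern is treated as a regular expression
    .setSplitPatterns(["\\n\\n"])
    .setPatternsAreRegex(True)
    .setCustomBoundsStrategy("prepend")
    .setCaseSensitive(False)
    # copy the "doc_id" column into the metadata of each split document
    .setMetaDataFields(["doc_id"])
    .setExplodeSplits(True)
)

result = Pipeline().setStages([document_assembler, splitter]).fit(data).transform(data)
result.selectExpr("splits.result", "splits.metadata").show(truncate=False)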
- caseSensitive#
- chunkOverlap#
- chunkSize#
- customBoundsStrategy#
- enableSentenceIncrement#
- explodeSplits#
- getter_attrs = []#
- inputAnnotatorTypes#
- inputCols#
- keepSeparators#
- lazyAnnotator#
- maxLength#
- metaDataFields#
- name = 'InternalDocumentSplitter'#
- optionalInputAnnotatorTypes#
- outputAnnotatorType#
- outputCol#
- patternsAreRegex#
- sentenceAwareness#
- skipLPInputColsValidation = True#
- splitMode#
- splitPatterns#
- trimWhitespace#
- uid#
- clear(param: pyspark.ml.param.Param) → None#
Clears a param from the param map if it has been explicitly set.
- copy(extra: pyspark.ml._typing.ParamMap | None = None) → JP#
Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.
- Parameters:
extra (dict, optional) – Extra parameters to copy to the new instance
- Returns:
Copy of this instance
- Return type:
JavaParams
- explainParam(param: str | Param) → str#
Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
- explainParams() → str#
Returns the documentation of all params with their optionally default values and user-supplied values.
- extractParamMap(extra: pyspark.ml._typing.ParamMap | None = None) → pyspark.ml._typing.ParamMap#
Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
- Parameters:
extra (dict, optional) – extra param values
- Returns:
merged param map
- Return type:
dict
- getInputCols()#
Gets current column names of input annotations.
- getLazyAnnotator()#
Gets whether Annotator should be evaluated lazily in a RecursivePipeline.
- getOrDefault(param: str) → Any#
- getOrDefault(param: Param[T]) → T
Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.
- getOutputCol()#
Gets output column name of annotations.
- getParam(paramName: str) Param #
Gets a param by its name.
- getParamValue(paramName)#
Gets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- hasDefault(param: str | Param[Any]) → bool#
Checks whether a param has a default value.
- hasParam(paramName: str) → bool#
Tests whether this instance contains a param with a given (string) name.
- inputColsValidation(value)#
- isDefined(param: str | Param[Any]) → bool#
Checks whether a param is explicitly set by user or has a default value.
- isSet(param: str | Param[Any]) → bool#
Checks whether a param is explicitly set by user.
- classmethod load(path: str) → RL#
Reads an ML instance from the input path, a shortcut of read().load(path).
- static pretrained(name='document_splitter', lang='en', remote_loc='clinical/models')#
Download a pre-trained InternalDocumentSplitter.
- Parameters:
name (str) – Name of the pre-trained model, by default “document_splitter”
lang (str) – Language of the pre-trained model, by default “en”
remote_loc (str) – Remote location of the pre-trained model. If None, use the open-source location. Other values are “clinical/models”, “finance/models”, or “legal/models”.
- Returns:
A pre-trained InternalDocumentSplitter
- Return type:
InternalDocumentSplitter
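A hedged sketch of loading the default pre-trained model named above; it assumes a licensed Spark NLP for Healthcare session and that the model is available from the "clinical/models" location.

# Sketch: download the default pre-trained splitter (name, language, and location as documented above).
from sparknlp_jsl.annotator import InternalDocumentSplitter

splitter = InternalDocumentSplitter.pretrained("document_splitter", "en", "clinical/models") \
    .setInputCols(["document"]) \
    .setOutputCol("splits")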
- classmethod read()#
Returns an MLReader instance for this class.
- save(path: str) → None#
Save this ML instance to the given path, a shortcut of ‘write().save(path)’.
- set(param: Param, value: Any) → None#
Sets a parameter in the embedded param map.
- setCaseSensitive(value: bool)#
Sets whether regex matching should be case sensitive, by default False.
- Parameters:
value (bool) – Whether regex matching should be case sensitive
- setChunkOverlap(value)#
Sets length of the overlap between text chunks, by default 0.
- Parameters:
value (int) – Length of the overlap between text chunks
- setChunkSize(value)#
Sets size of each chunk of text.
- Parameters:
value (int) – Size of each chunk of text
- setCustomBoundsStrategy(value: str)#
Sets the custom bounds strategy for text splitting using regular expressions.
- The custom bounds strategy to be set. It should be one of the following values:
“none”: No custom bounds are applied.
“prepend”: Custom bounds are prepended to the split documents.
“append”: Custom bounds are appended to the split documents.
Default: “prepend”.
- Parameters:
value (str) – The custom bounds strategy for text splitting using regular expressions, by default “prepend”.
- setEnableSentenceIncrement(value: bool)#
Sets whether the sentence index should be incremented in the metadata of the annotator.
When set to true, the annotator will increment the sentence index in the metadata for each split document. Default: False.
- Parameters:
value (bool) – Whether the sentence index should be incremented in the metadata of the annotator. Default: False.
- setExplodeSplits(value)#
Sets whether to explode split chunks to separate rows, by default False.
- Parameters:
value (bool) – Whether to explode split chunks to separate rows
- setForceInputTypeValidation(etfm)#
- setInputCols(*value)#
Sets column names of input annotations.
- Parameters:
*value (str) – Input columns for the annotator
- setKeepSeparators(value)#
Sets whether to keep the separators in the final result, by default True.
- Parameters:
value (bool) – Whether to keep the separators in the final result
- setLazyAnnotator(value)#
Sets whether Annotator should be evaluated lazily in a RecursivePipeline.
- Parameters:
value (bool) – Whether Annotator should be evaluated lazily in a RecursivePipeline
- setMaxLength(value: int)#
Sets the maximum length allowed for splitting.
The unit depends on the splitMode:
- “char”: maximum length is measured in characters. Default: 512
- “token”: maximum length is measured in tokens. Default: 128
- “sentence”: maximum length is measured in sentences. Default: 8
- Parameters:
value (int) – The maximum length allowed for splitting.
- setMetaDataFields(value)#
Sets the columns whose values should be added to the metadata of the split documents.
Provide the names of the columns to read.
- Parameters:
value (List[str], list of column names) – Names of the columns whose values are added to the metadata of the split documents.
- setOutputCol(value)#
Sets output column name of annotations.
- Parameters:
value (str) – Name of output column
- setParamValue(paramName)#
Sets the value of a parameter.
- Parameters:
paramName (str) – Name of the parameter
- setParams()#
- setPatternsAreRegex(value)#
Sets whether to interpret the split patterns as regular expressions, by default False.
- Parameters:
value (bool) – Whether to interpret the split patterns as regular expressions
- setSentenceAwareness(value: bool)#
Sets whether to split the document by sentence awareness if possible.
If true, it can stop the split process before maxLength. If true, you should supply sentences from inputCols. Default: False.
- Parameters:
value (bool) – Whether to enable sentence awareness
- setSplitMode(value: str)#
Sets the split mode to determine how text should be segmented.
- Parameters:
value (str) – The split mode to be set. Default: ‘regex’. It should be one of the following values:
- “char”: Split text based on individual characters.
- “token”: Split text based on tokens. You should supply tokens from inputCols.
- “sentence”: Split text based on sentences. You should supply sentences from inputCols.
- “recursive”: Split text recursively using a specific algorithm.
- “regex”: Split text based on a regular expression pattern.
- setSplitPatterns(value)#
Sets patterns to separate the text by in decreasing priority, by default `["\n\n", "\n", " ", ""]`.
- Parameters:
value (List[str]) – Patterns to separate the text by in decreasing priority
- setTrimWhitespace(value)#
Sets whether to trim whitespaces of extracted chunks, by default True.
- Parameters:
value (bool) – Whether to trim whitespaces of extracted chunks
- transform(dataset: pyspark.sql.dataframe.DataFrame, params: pyspark.ml._typing.ParamMap | None = None) → pyspark.sql.dataframe.DataFrame#
Transforms the input dataset with optional parameters.
New in version 1.3.0.
- Parameters:
dataset (pyspark.sql.DataFrame) – input dataset
params (dict, optional) – an optional param map that overrides embedded params.
- Returns:
transformed dataset
- Return type:
pyspark.sql.DataFrame
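As a hedged sketch of calling transform directly, assume splitter is an already configured InternalDocumentSplitter (for example from the sketches above) and data is a DataFrame that already contains the "document" annotation column; the variable names are illustrative.

# Sketch: transform without wrapping the annotator in a Pipeline.
split_df = splitter.transform(data)
split_df.selectExpr("splits.result").show(truncate=False)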
- write() → JavaMLWriter#
Returns an MLWriter instance for this ML instance.