package splitter
Type Members
- trait DocumentSplitterParams extends Params
A trait that contains all the params that InternalDocumentSplitter has.
- See also
- case class Entity(begin: Int, end: Int, part: String) extends Product with Serializable
- class InternalDocumentSplitter extends DocumentCharacterTextSplitter with DocumentSplitterParams with CheckLicense
Annotator which splits large documents into small documents.
The setSplitMode method of InternalDocumentSplitter determines how documents are split.
If splitMode is recursive, the annotator takes the separators in order and splits any subtext that is still over the chunk length, taking the optional chunk overlap into account. For example, given chunk size 20 and overlap 5, the text
He was, I take it, the most perfect reasoning and observing machine that the world has seen.
is split into
["He was, I take it,", "it, the most", "most perfect", "reasoning and", "and observing", "machine that the", "the world has seen."]
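The recursive mode described above can be illustrated with a small, library-free sketch. This is not the InternalDocumentSplitter implementation: `RecursiveSplitSketch`, `split`, `merge`, and `joinedLength` are hypothetical names, separators are treated as literal strings and dropped from the output, and the rule that a chunk must stay strictly shorter than the chunk size is an assumption chosen because it reproduces the example above. The library's actual length accounting may differ.

```scala
import java.util.regex.Pattern
import scala.collection.mutable.ListBuffer

// Illustrative sketch only; all names here are hypothetical.
object RecursiveSplitSketch {

  // Length of `pieces` once joined with `sep`.
  private def joinedLength(pieces: Vector[String], sep: String): Int =
    pieces.map(_.length).sum + sep.length * math.max(0, pieces.size - 1)

  // Greedily packs pieces into chunks shorter than `chunkSize`, carrying
  // trailing pieces of at most `overlap` characters into the next chunk.
  private def merge(pieces: List[String], sep: String, chunkSize: Int, overlap: Int): List[String] = {
    val chunks = ListBuffer.empty[String]
    var window = Vector.empty[String]
    for (piece <- pieces) {
      if (window.nonEmpty && joinedLength(window :+ piece, sep) >= chunkSize) {
        chunks += window.mkString(sep)
        // Drop leading pieces until the carried tail respects the overlap
        // and leaves room for the incoming piece.
        while (window.nonEmpty &&
               (joinedLength(window, sep) > overlap ||
                joinedLength(window :+ piece, sep) >= chunkSize))
          window = window.tail
      }
      window = window :+ piece
    }
    if (window.nonEmpty) chunks += window.mkString(sep)
    chunks.toList
  }

  // Tries the separators in order; pieces that are still too long are split
  // again with the remaining separators.
  def split(text: String, separators: List[String], chunkSize: Int, overlap: Int): List[String] =
    if (text.length < chunkSize) List(text)
    else separators match {
      case Nil =>
        // No separators left: fall back to a hard cut.
        text.grouped(math.max(1, chunkSize - 1)).toList
      case sep :: rest =>
        def loop(pieces: List[String]): List[String] = {
          val (small, remaining) = pieces.span(_.length < chunkSize)
          val merged = merge(small, sep, chunkSize, overlap)
          remaining match {
            case big :: tail => merged ++ split(big, rest, chunkSize, overlap) ++ loop(tail)
            case Nil         => merged
          }
        }
        loop(text.split(Pattern.quote(sep)).toList.filter(_.nonEmpty))
    }
}
```

Under these assumptions, calling `RecursiveSplitSketch.split(text, List("\n\n", "\n", " "), 20, 5)` on the sentence above yields the same seven chunks as the documented example.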
Additionally, you can set
- custom patterns with setSplitPatterns
- whether patterns should be interpreted as regex with setPatternsAreRegex
- whether to keep the separators with setKeepSeparators
- whether to trim whitespace with setTrimWhitespace
- whether to explode the splits to individual rows with setExplodeSplits
Example
val textDF = spark.read
  .option("wholetext", "true")
  .text("src/test/resources/spell/sherlockholmes.txt")
  .toDF("text")

val documentAssembler = new DocumentAssembler().setInputCol("text")

val textSplitter = new InternalDocumentSplitter()
  .setInputCols("document")
  .setOutputCol("splits")
  .setSplitMode("recursive")
  .setChunkSize(20000)
  .setChunkOverlap(200)
  .setExplodeSplits(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, textSplitter))
val result = pipeline.fit(textDF).transform(textDF)

result
  .selectExpr(
    "splits.result",
    "splits[0].begin",
    "splits[0].end",
    "splits[0].end - splits[0].begin as length")
  .show(8, truncate = 80)

+--------------------------------------------------------------------------------+---------------+-------------+------+
|                                                                          result|splits[0].begin|splits[0].end|length|
+--------------------------------------------------------------------------------+---------------+-------------+------+
|[ Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyl...|              0|        19994| 19994|
|["And Mademoiselle's address?" he asked.\n\n"Is Briony Lodge, Serpentine Aven...|          19798|        39395| 19597|
|["How did that help you?"\n\n"It was all-important. When a woman thinks that ...|          39371|        59242| 19871|
|["'But,' said I, 'there would be millions of red-headed men who\nwould apply....|          59166|        77833| 18667|
|[My friend was an enthusiastic musician, being himself not only a\nvery capab...|          77835|        97769| 19934|
|["And yet I am not convinced of it," I answered. "The cases which\ncome to li...|          97771|       117248| 19477|
|["Well, she had a slate-coloured, broad-brimmed straw hat, with a\nfeather of...|         117250|       137242| 19992|
|["That sounds a little paradoxical."\n\n"But it is profoundly true. Singulari...|         137244|       157171| 19927|
+--------------------------------------------------------------------------------+---------------+-------------+------+
- trait ReadablePretrainedInternalDocumentSplitter extends DefaultParamsReadable[InternalDocumentSplitter] with HasPretrained[InternalDocumentSplitter]
Value Members
- object InternalDocumentSplitter extends ReadablePretrainedInternalDocumentSplitter with Serializable
This is the companion object of InternalDocumentSplitter. Please refer to that class for the documentation.