package splitter

Type Members

  1. trait DocumentSplitterParams extends Params

    A trait that contains all the params that InternalDocumentSplitter has.

    See also

    InternalDocumentSplitter

  2. case class Entity(begin: Int, end: Int, part: String) extends Product with Serializable
  3. class InternalDocumentSplitter extends DocumentCharacterTextSplitter with DocumentSplitterParams with CheckLicense

    Annotator which splits large documents into small documents.

    Use the setSplitMode method of InternalDocumentSplitter to choose how documents are split.

    If splitMode is recursive, it takes the separators in order and splits any subtext that exceeds the chunk length, optionally overlapping consecutive chunks. For example, given chunk size 20 and overlap 5:

    He was, I take it, the most perfect reasoning and observing machine that the world has seen.
    
    ["He was, I take it,", "it, the most", "most perfect", "reasoning and", "and observing", "machine that the", "the world has seen."]
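    To make the roles of chunk size and overlap concrete, the sketch below checks the example output above against both limits. It is plain Scala and does not call the library; the object name ChunkInvariants and the helper sharedOverlap are hypothetical names introduced only for this illustration, and overlap is measured here simply as the longest text shared between the end of one chunk and the start of the next.

```scala
object ChunkInvariants {
  // Values taken from the example above.
  val chunkSize = 20
  val chunkOverlap = 5
  val chunks: Seq[String] = Seq(
    "He was, I take it,", "it, the most", "most perfect", "reasoning and",
    "and observing", "machine that the", "the world has seen.")

  // Hypothetical helper: length of the longest suffix of `left`
  // that is also a prefix of `right` (0 if there is none).
  def sharedOverlap(left: String, right: String): Int = {
    val maxLen = math.min(left.length, right.length)
    (maxLen to 1 by -1).find(n => left.endsWith(right.take(n))).getOrElse(0)
  }

  // Character length of every chunk.
  val lengths: Seq[Int] = chunks.map(_.length)

  // Shared text between each chunk and the one that follows it.
  val overlaps: Seq[Int] =
    chunks.zip(chunks.tail).map { case (a, b) => sharedOverlap(a, b) }

  def main(args: Array[String]): Unit = {
    println(lengths.mkString(", "))             // 18, 12, 12, 13, 13, 16, 19
    println(overlaps.mkString(", "))            // 3, 4, 0, 3, 0, 3
    println(lengths.forall(_ <= chunkSize))     // true
    println(overlaps.forall(_ <= chunkOverlap)) // true
  }
}
```

    Every chunk stays within 20 characters, and the text repeated between neighbours ("it,", "most", "and", "the") never exceeds 5 characters. Some neighbours share nothing, which suggests the overlap setting acts as an upper bound rather than a fixed amount.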

    Additionally, you can set

    Example

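    // Imports needed by this example. InternalDocumentSplitter itself is
    // defined in this package (splitter) and must be imported from there.
    // `spark` is assumed to be an active SparkSession.
    import com.johnsnowlabs.nlp.DocumentAssembler
    import org.apache.spark.ml.Pipeline

    val textDF =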
      spark.read
        .option("wholetext", "true")
        .text("src/test/resources/spell/sherlockholmes.txt")
        .toDF("text")
    
    val documentAssembler = new DocumentAssembler().setInputCol("text")
    val textSplitter = new InternalDocumentSplitter()
      .setInputCols("document")
      .setOutputCol("splits")
      .setSplitMode("recursive")
      .setChunkSize(20000)
      .setChunkOverlap(200)
      .setExplodeSplits(true)
    
    val pipeline = new Pipeline().setStages(Array(documentAssembler, textSplitter))
    val result = pipeline.fit(textDF).transform(textDF)
    
    result
      .selectExpr(
        "splits.result",
        "splits[0].begin",
        "splits[0].end",
        "splits[0].end - splits[0].begin as length")
      .show(8, truncate = 80)
    
    +--------------------------------------------------------------------------------+---------------+-------------+------+
    |                                                                          result|splits[0].begin|splits[0].end|length|
    +--------------------------------------------------------------------------------+---------------+-------------+------+
    |[ Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyl...|              0|        19994| 19994|
    |["And Mademoiselle's address?" he asked.\n\n"Is Briony Lodge, Serpentine Aven...|          19798|        39395| 19597|
    |["How did that help you?"\n\n"It was all-important. When a woman thinks that ...|          39371|        59242| 19871|
    |["'But,' said I, 'there would be millions of red-headed men who\nwould apply....|          59166|        77833| 18667|
    |[My friend was an enthusiastic musician, being himself not only a\nvery capab...|          77835|        97769| 19934|
    |["And yet I am not convinced of it," I answered. "The cases which\ncome to li...|          97771|       117248| 19477|
    |["Well, she had a slate-coloured, broad-brimmed straw hat, with a\nfeather of...|         117250|       137242| 19992|
    |["That sounds a little paradoxical."\n\n"But it is profoundly true. Singulari...|         137244|       157171| 19927|
    +--------------------------------------------------------------------------------+---------------+-------------+------+
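    The offsets in the table can be used to see how much consecutive splits actually overlap. The sketch below is plain Scala with the first five begin/end pairs copied from the table; the object name SplitOffsets is hypothetical, and begin and end are interpreted the same way the length column does (end minus begin).

```scala
object SplitOffsets {
  // Values taken from the example above.
  val chunkSize = 20000
  val chunkOverlap = 200

  // (begin, end) pairs copied from the first five rows of the table.
  val spans: Seq[(Int, Int)] = Seq(
    (0, 19994), (19798, 39395), (39371, 59242), (59166, 77833), (77835, 97769))

  // Same computation as the `length` column: end - begin.
  val lengths: Seq[Int] = spans.map { case (begin, end) => end - begin }

  // Positive value: characters shared with the previous split.
  // Negative value: characters skipped between the two splits.
  val overlaps: Seq[Int] =
    spans.zip(spans.tail).map { case ((_, prevEnd), (nextBegin, _)) => prevEnd - nextBegin }

  def main(args: Array[String]): Unit = {
    println(lengths.mkString(", "))             // 19994, 19597, 19871, 18667, 19934
    println(overlaps.mkString(", "))            // 196, 24, 76, -2
    println(lengths.forall(_ <= chunkSize))     // true
    println(overlaps.forall(_ <= chunkOverlap)) // true
  }
}
```

    In these rows every split stays under the configured chunk size of 20000 and no overlap exceeds the configured 200. The negative value indicates a two-character gap between splits, presumably a separator that was not carried into either chunk; that reading is an assumption, not something the table states.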

Value Members

  1. object InternalDocumentSplitter extends DefaultParamsReadable[InternalDocumentSplitter] with Serializable

    This is the companion object of InternalDocumentSplitter. Please refer to that class for the documentation.
