class InternalDocumentSplitter extends DocumentCharacterTextSplitter with DocumentSplitterParams with CheckLicense

Annotator that splits large documents into smaller documents.

Use the setSplitMode method of InternalDocumentSplitter to choose how documents are split.

If splitMode is "recursive", the annotator takes the separators in order and splits any subtext that exceeds the chunk length, optionally overlapping consecutive chunks. For example, given chunk size 20 and overlap 5:

He was, I take it, the most perfect reasoning and observing machine that the world has seen.

["He was, I take it,", "it, the most", "most perfect", "reasoning and", "and observing", "machine that the", "the world has seen."]
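
The recursive mode described above can be approximated in plain Scala. The following is only a minimal sketch under stated assumptions, not the annotator's implementation: the object name RecursiveSplitSketch, the separator list, and the choice to drop separators and rejoin pieces with a single space are all hypothetical. It is not guaranteed to reproduce the exact chunks listed above.

```scala
import java.util.regex.Pattern
import scala.collection.mutable.ListBuffer

// Hypothetical helper for illustration; not part of the library.
object RecursiveSplitSketch {

  // Try the separators in order; any piece still longer than chunkSize is
  // split again with the remaining separators (hard cut when none are left).
  def split(text: String, separators: List[String], chunkSize: Int): List[String] =
    if (text.length <= chunkSize) List(text)
    else separators match {
      case Nil => text.grouped(chunkSize).toList
      case sep :: rest =>
        text.split(Pattern.quote(sep)).toList
          .filter(_.nonEmpty)
          .flatMap(piece => split(piece, rest, chunkSize))
    }

  // Length of the pieces once they are joined with single spaces.
  private def joinedLength(pieces: Vector[String]): Int =
    if (pieces.isEmpty) 0 else pieces.map(_.length).sum + pieces.size - 1

  // Greedily merge pieces into chunks of at most chunkSize characters,
  // carrying whole trailing pieces of at most `overlap` characters forward.
  def merge(pieces: List[String], chunkSize: Int, overlap: Int): List[String] = {
    val chunks = ListBuffer.empty[String]
    var window = Vector.empty[String]
    for (piece <- pieces) {
      if (window.nonEmpty && joinedLength(window :+ piece) > chunkSize) {
        chunks += window.mkString(" ")
        // Drop leading pieces until the carry fits the overlap and leaves room for `piece`.
        while (window.nonEmpty &&
               (joinedLength(window) > overlap || joinedLength(window :+ piece) > chunkSize))
          window = window.tail
      }
      window = window :+ piece
    }
    if (window.nonEmpty) chunks += window.mkString(" ")
    chunks.toList
  }

  // Assumed separator order: paragraphs, then lines, then words.
  def chunk(text: String, chunkSize: Int, overlap: Int): List[String] =
    merge(split(text, List("\n\n", "\n", " "), chunkSize), chunkSize, overlap)
}
```

Calling RecursiveSplitSketch.chunk(text, 20, 5) on the sentence above yields chunks of at most 20 characters each, where a chunk may begin with the last few words of the previous one.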

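Additionally, you can set custom split patterns (setSplitPatterns), whether the patterns are interpreted as regular expressions (setPatternsAreRegex), whether to keep the separators (setKeepSeparators), whether to trim whitespace (setTrimWhitespace), and whether to explode the splits into individual rows (setExplodeSplits).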

Example

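import com.johnsnowlabs.nlp.DocumentAssembler
import org.apache.spark.ml.Pipeline
// InternalDocumentSplitter must also be imported from the licensed library.
// `spark` is assumed to be an active SparkSession.

// Read the whole file as a single row so that it is treated as one document.
val textDF =
  spark.read
    .option("wholetext", "true")
    .text("src/test/resources/spell/sherlockholmes.txt")
    .toDF("text")

val documentAssembler = new DocumentAssembler().setInputCol("text")
// Split recursively into chunks of up to 20000 characters with up to 200 characters of overlap,
// emitting one row per split.
val textSplitter = new InternalDocumentSplitter()
  .setInputCols("document")
  .setOutputCol("splits")
  .setSplitMode("recursive")
  .setChunkSize(20000)
  .setChunkOverlap(200)
  .setExplodeSplits(true)

val pipeline = new Pipeline().setStages(Array(documentAssembler, textSplitter))
val result = pipeline.fit(textDF).transform(textDF)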

result
  .selectExpr(
    "splits.result",
    "splits[0].begin",
    "splits[0].end",
    "splits[0].end - splits[0].begin as length")
  .show(8, truncate = 80)
+--------------------------------------------------------------------------------+---------------+-------------+------+
|                                                                          result|splits[0].begin|splits[0].end|length|
+--------------------------------------------------------------------------------+---------------+-------------+------+
|[ Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyl...|              0|        19994| 19994|
|["And Mademoiselle's address?" he asked.\n\n"Is Briony Lodge, Serpentine Aven...|          19798|        39395| 19597|
|["How did that help you?"\n\n"It was all-important. When a woman thinks that ...|          39371|        59242| 19871|
|["'But,' said I, 'there would be millions of red-headed men who\nwould apply....|          59166|        77833| 18667|
|[My friend was an enthusiastic musician, being himself not only a\nvery capab...|          77835|        97769| 19934|
|["And yet I am not convinced of it," I answered. "The cases which\ncome to li...|          97771|       117248| 19477|
|["Well, she had a slate-coloured, broad-brimmed straw hat, with a\nfeather of...|         117250|       137242| 19992|
|["That sounds a little paradoxical."\n\n"But it is profoundly true. Singulari...|         137244|       157171| 19927|
+--------------------------------------------------------------------------------+---------------+-------------+------+
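
The relationship between consecutive splits in the table above can be checked with a little arithmetic. The sketch below copies the begin and end values from the table and treats them as plain offsets, the same way the length column does. The object name OverlapCheck is hypothetical. A positive difference means the next split starts before the previous one ends, and a negative one means a small gap between the two splits.

```scala
// Hypothetical helper for illustration; the values are copied from the table above.
object OverlapCheck {
  // (begin, end) of the first eight splits.
  val spans: Seq[(Int, Int)] = Seq(
    (0, 19994), (19798, 39395), (39371, 59242), (59166, 77833),
    (77835, 97769), (97771, 117248), (117250, 137242), (137244, 157171))

  // For each pair of neighbours: end of the previous split minus begin of the next.
  val deltas: Seq[Int] =
    spans.zip(spans.tail).map { case ((_, prevEnd), (nextBegin, _)) => prevEnd - nextBegin }

  // Chunk lengths as shown in the `length` column.
  val lengths: Seq[Int] = spans.map { case (begin, end) => end - begin }
}
```

With these values, no difference exceeds the configured chunkOverlap of 200 and no length exceeds the chunkSize of 20000.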
Linear Supertypes
CheckLicense, DocumentSplitterParams, DocumentCharacterTextSplitter, HasSimpleAnnotate[DocumentCharacterTextSplitter], AnnotatorModel[DocumentCharacterTextSplitter], CanBeLazy, RawAnnotator[DocumentCharacterTextSplitter], HasOutputAnnotationCol, HasInputAnnotationCols, HasOutputAnnotatorType, ParamsAndFeaturesWritable, HasFeatures, DefaultParamsWritable, MLWritable, Model[DocumentCharacterTextSplitter], Transformer, PipelineStage, Logging, Params, Serializable, Serializable, Identifiable, AnyRef, Any

Instance Constructors

  1. new InternalDocumentSplitter()
  2. new InternalDocumentSplitter(uid: String)

    uid

    required uid for storing annotator to disk

Type Members

  1. type AnnotationContent = Seq[Row]
    Attributes
    protected
    Definition Classes
    AnnotatorModel
  2. type AnnotatorType = String
    Definition Classes
    HasOutputAnnotatorType

Value Members

  1. final def !=(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  2. final def ##(): Int
    Definition Classes
    AnyRef → Any
  3. final def $[T](param: Param[T]): T
    Attributes
    protected
    Definition Classes
    Params
  4. def $$[T](feature: StructFeature[T]): T
    Attributes
    protected
    Definition Classes
    HasFeatures
  5. def $$[K, V](feature: MapFeature[K, V]): Map[K, V]
    Attributes
    protected
    Definition Classes
    HasFeatures
  6. def $$[T](feature: SetFeature[T]): Set[T]
    Attributes
    protected
    Definition Classes
    HasFeatures
  7. def $$[T](feature: ArrayFeature[T]): Array[T]
    Attributes
    protected
    Definition Classes
    HasFeatures
  8. final def ==(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  9. def _transform(dataset: Dataset[_], recursivePipeline: Option[PipelineModel]): DataFrame
    Attributes
    protected
    Definition Classes
    AnnotatorModel
  10. def afterAnnotate(dataset: DataFrame): DataFrame
    Definition Classes
    InternalDocumentSplitter → DocumentCharacterTextSplitter → AnnotatorModel
  11. def annotate(annotations: Seq[Annotation]): Seq[Annotation]
    Definition Classes
    InternalDocumentSplitter → DocumentCharacterTextSplitter → HasSimpleAnnotate
  12. final def asInstanceOf[T0]: T0
    Definition Classes
    Any
  13. def beforeAnnotate(dataset: Dataset[_]): Dataset[_]
    Attributes
    protected
    Definition Classes
    AnnotatorModel
  14. val caseSensitive: BooleanParam

    Whether regex matching is case sensitive (Default: false)

    Definition Classes
    DocumentSplitterParams
  15. final def checkSchema(schema: StructType, inputAnnotatorType: String): Boolean
    Attributes
    protected
    Definition Classes
    HasInputAnnotationCols
  16. def checkValidEnvironment(spark: Option[SparkSession], scopes: Seq[String]): Unit
    Definition Classes
    CheckLicense
  17. def checkValidScope(scope: String): Unit
    Definition Classes
    CheckLicense
  18. def checkValidScopeAndEnvironment(scope: String, spark: Option[SparkSession], checkLp: Boolean): Unit
    Definition Classes
    CheckLicense
  19. def checkValidScopesAndEnvironment(scopes: Seq[String], spark: Option[SparkSession], checkLp: Boolean): Unit
    Definition Classes
    CheckLicense
  20. val chunkOverlap: IntParam
    Definition Classes
    DocumentCharacterTextSplitter
  21. val chunkSize: IntParam
    Definition Classes
    DocumentCharacterTextSplitter
  22. final def clear(param: Param[_]): InternalDocumentSplitter.this.type
    Definition Classes
    Params
  23. def clone(): AnyRef
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  24. def copy(extra: ParamMap): DocumentCharacterTextSplitter
    Definition Classes
    RawAnnotator → Model → Transformer → PipelineStage → Params
  25. def copyValues[T <: Params](to: T, extra: ParamMap): T
    Attributes
    protected
    Definition Classes
    Params
  26. val customBoundsStrategy: Param[String]

    The custom bounds strategy for text parsing using regular expressions.

    Definition Classes
    DocumentSplitterParams
  27. final def defaultCopy[T <: Params](extra: ParamMap): T
    Attributes
    protected
    Definition Classes
    Params
  28. def dfAnnotate: UserDefinedFunction
    Definition Classes
    HasSimpleAnnotate
  29. val enableSentenceIncrement: BooleanParam

    Controls whether the sentence index should be incremented in the metadata of the annotator. When set to true, the annotator increments the sentence index in the metadata for each split document. Default: false

    Definition Classes
    DocumentSplitterParams
  30. final def eq(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  31. def equals(arg0: Any): Boolean
    Definition Classes
    AnyRef → Any
  32. def explainParam(param: Param[_]): String
    Definition Classes
    Params
  33. def explainParams(): String
    Definition Classes
    Params
  34. val explodeSplits: BooleanParam
    Definition Classes
    DocumentCharacterTextSplitter
  35. def extraValidate(structType: StructType): Boolean
    Attributes
    protected
    Definition Classes
    RawAnnotator
  36. def extraValidateMsg: String
    Attributes
    protected
    Definition Classes
    RawAnnotator
  37. final def extractParamMap(): ParamMap
    Definition Classes
    Params
  38. final def extractParamMap(extra: ParamMap): ParamMap
    Definition Classes
    Params
  39. val features: ArrayBuffer[Feature[_, _, _]]
    Definition Classes
    HasFeatures
  40. def finalize(): Unit
    Attributes
    protected[lang]
    Definition Classes
    AnyRef
    Annotations
    @throws( classOf[java.lang.Throwable] )
  41. def get[T](feature: StructFeature[T]): Option[T]
    Attributes
    protected
    Definition Classes
    HasFeatures
  42. def get[K, V](feature: MapFeature[K, V]): Option[Map[K, V]]
    Attributes
    protected
    Definition Classes
    HasFeatures
  43. def get[T](feature: SetFeature[T]): Option[Set[T]]
    Attributes
    protected
    Definition Classes
    HasFeatures
  44. def get[T](feature: ArrayFeature[T]): Option[Array[T]]
    Attributes
    protected
    Definition Classes
    HasFeatures
  45. final def get[T](param: Param[T]): Option[T]
    Definition Classes
    Params
  46. def getCaseSensitive: Boolean

    Gets whether matching is case sensitive (Default: false)

    Definition Classes
    DocumentSplitterParams
  47. def getChunkOverlap: Int
    Definition Classes
    DocumentCharacterTextSplitter
  48. def getChunkSize: Int
    Definition Classes
    DocumentCharacterTextSplitter
  49. final def getClass(): Class[_]
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  50. def getCustomBoundsStrategy: String

    Gets customBoundsStrategy param

    Definition Classes
    DocumentSplitterParams
  51. final def getDefault[T](param: Param[T]): Option[T]
    Definition Classes
    Params
  52. def getEnableSentenceIncrement: Boolean

    Gets whether the sentence index should be incremented in the metadata of the annotator.

    Definition Classes
    DocumentSplitterParams
  53. def getExplodeSplits: Boolean
    Definition Classes
    DocumentCharacterTextSplitter
  54. def getInputCols: Array[String]
    Definition Classes
    HasInputAnnotationCols
  55. def getKeepSeparators: Boolean
    Definition Classes
    DocumentCharacterTextSplitter
  56. def getLazyAnnotator: Boolean
    Definition Classes
    CanBeLazy
  57. def getMaxLength: Int

    Gets maxLength param

    Definition Classes
    DocumentSplitterParams
  58. def getMetaDataFields: Array[String]

    Gets metaDataFields param

    Definition Classes
    DocumentSplitterParams
  59. final def getOrDefault[T](param: Param[T]): T
    Definition Classes
    Params
  60. final def getOutputCol: String
    Definition Classes
    HasOutputAnnotationCol
  61. def getParam(paramName: String): Param[Any]
    Definition Classes
    Params
  62. def getPatternsAreRegex: Boolean
    Definition Classes
    DocumentCharacterTextSplitter
  63. def getSentenceAwareness: Boolean

    Gets sentenceAwareness param

    Definition Classes
    DocumentSplitterParams
  64. def getSplitMode: String

    Gets splitMode param

    Definition Classes
    DocumentSplitterParams
  65. def getSplitPatterns: Array[String]

    Gets splitPatterns param

    Definition Classes
    InternalDocumentSplitter → DocumentCharacterTextSplitter
  66. def getTrimWhitespace: Boolean
    Definition Classes
    DocumentCharacterTextSplitter
  67. def handleMetaDataFields(df: DataFrame, outputCol: String, columnNames: Array[String]): DataFrame
  68. final def hasDefault[T](param: Param[T]): Boolean
    Definition Classes
    Params
  69. def hasParam(paramName: String): Boolean
    Definition Classes
    Params
  70. def hasParent: Boolean
    Definition Classes
    Model
  71. def hashCode(): Int
    Definition Classes
    AnyRef → Any
    Annotations
    @native()
  72. def initializeLogIfNecessary(isInterpreter: Boolean, silent: Boolean): Boolean
    Attributes
    protected
    Definition Classes
    Logging
  73. def initializeLogIfNecessary(isInterpreter: Boolean): Unit
    Attributes
    protected
    Definition Classes
    Logging
  74. val inputAnnotatorTypes: Array[String]
    Definition Classes
    InternalDocumentSplitter → DocumentCharacterTextSplitter → HasInputAnnotationCols
  75. final val inputCols: StringArrayParam
    Attributes
    protected
    Definition Classes
    HasInputAnnotationCols
  76. final def isDefined(param: Param[_]): Boolean
    Definition Classes
    Params
  77. final def isInstanceOf[T0]: Boolean
    Definition Classes
    Any
  78. final def isSet(param: Param[_]): Boolean
    Definition Classes
    Params
  79. def isTraceEnabled(): Boolean
    Attributes
    protected
    Definition Classes
    Logging
  80. val keepSeparators: BooleanParam
    Definition Classes
    DocumentCharacterTextSplitter
  81. val lazyAnnotator: BooleanParam
    Definition Classes
    CanBeLazy
  82. def log: Logger
    Attributes
    protected
    Definition Classes
    Logging
  83. def logDebug(msg: ⇒ String, throwable: Throwable): Unit
    Attributes
    protected
    Definition Classes
    Logging
  84. def logDebug(msg: ⇒ String): Unit
    Attributes
    protected
    Definition Classes
    Logging
  85. def logError(msg: ⇒ String, throwable: Throwable): Unit
    Attributes
    protected
    Definition Classes
    Logging
  86. def logError(msg: ⇒ String): Unit
    Attributes
    protected
    Definition Classes
    Logging
  87. def logInfo(msg: ⇒ String, throwable: Throwable): Unit
    Attributes
    protected
    Definition Classes
    Logging
  88. def logInfo(msg: ⇒ String): Unit
    Attributes
    protected
    Definition Classes
    Logging
  89. def logName: String
    Attributes
    protected
    Definition Classes
    Logging
  90. def logTrace(msg: ⇒ String, throwable: Throwable): Unit
    Attributes
    protected
    Definition Classes
    Logging
  91. def logTrace(msg: ⇒ String): Unit
    Attributes
    protected
    Definition Classes
    Logging
  92. def logWarning(msg: ⇒ String, throwable: Throwable): Unit
    Attributes
    protected
    Definition Classes
    Logging
  93. def logWarning(msg: ⇒ String): Unit
    Attributes
    protected
    Definition Classes
    Logging
  94. val maxLength: IntParam

    The maximum length for text parsing based on the specified mode.

    Definition Classes
    DocumentSplitterParams
  95. val metaDataFields: StringArrayParam

    Metadata fields whose column values are added to the metadata of the split documents. Set this to the names of the columns to read. Default: Array.empty

    Definition Classes
    DocumentSplitterParams
  96. def msgHelper(schema: StructType): String
    Attributes
    protected
    Definition Classes
    HasInputAnnotationCols
  97. final def ne(arg0: AnyRef): Boolean
    Definition Classes
    AnyRef
  98. final def notify(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  99. final def notifyAll(): Unit
    Definition Classes
    AnyRef
    Annotations
    @native()
  100. def onWrite(path: String, spark: SparkSession): Unit
    Attributes
    protected
    Definition Classes
    ParamsAndFeaturesWritable
  101. val optionalInputAnnotatorTypes: Array[String]
    Definition Classes
    InternalDocumentSplitter → HasInputAnnotationCols
  102. val outputAnnotatorType: AnnotatorType
    Definition Classes
    InternalDocumentSplitter → DocumentCharacterTextSplitter → HasOutputAnnotatorType
  103. final val outputCol: Param[String]
    Attributes
    protected
    Definition Classes
    HasOutputAnnotationCol
  104. lazy val params: Array[Param[_]]
    Definition Classes
    Params
  105. var parent: Estimator[DocumentCharacterTextSplitter]
    Definition Classes
    Model
  106. val patternsAreRegex: BooleanParam
    Definition Classes
    DocumentCharacterTextSplitter
  107. def save(path: String): Unit
    Definition Classes
    MLWritable
    Annotations
    @Since( "1.6.0" ) @throws( ... )
  108. val sentenceAwareness: BooleanParam

    Whether to split the document with sentence awareness where possible. If true, the split process can stop before maxLength, and you should supply sentences through inputCols. Default: false.

    Definition Classes
    DocumentSplitterParams
  109. def set[T](feature: StructFeature[T], value: T): InternalDocumentSplitter.this.type
    Attributes
    protected
    Definition Classes
    HasFeatures
  110. def set[K, V](feature: MapFeature[K, V], value: Map[K, V]): InternalDocumentSplitter.this.type
    Attributes
    protected
    Definition Classes
    HasFeatures
  111. def set[T](feature: SetFeature[T], value: Set[T]): InternalDocumentSplitter.this.type
    Attributes
    protected
    Definition Classes
    HasFeatures
  112. def set[T](feature: ArrayFeature[T], value: Array[T]): InternalDocumentSplitter.this.type
    Attributes
    protected
    Definition Classes
    HasFeatures
  113. final def set(paramPair: ParamPair[_]): InternalDocumentSplitter.this.type
    Attributes
    protected
    Definition Classes
    Params
  114. final def set(param: String, value: Any): InternalDocumentSplitter.this.type
    Attributes
    protected
    Definition Classes
    Params
  115. final def set[T](param: Param[T], value: T): InternalDocumentSplitter.this.type
    Definition Classes
    Params
  116. def setCaseSensitive(value: Boolean): InternalDocumentSplitter.this.type

    Sets whether regex matching is case sensitive (Default: false)

    Definition Classes
    DocumentSplitterParams
  117. def setChunkOverlap(value: Int): InternalDocumentSplitter.this.type
    Definition Classes
    DocumentCharacterTextSplitter
  118. def setChunkSize(value: Int): InternalDocumentSplitter.this.type
    Definition Classes
    DocumentCharacterTextSplitter
  119. def setCustomBoundsStrategy(value: String): InternalDocumentSplitter.this.type

    Sets the custom bounds strategy for text parsing using regular expressions.

    value

    The custom bounds strategy to be set. It should be one of the following values:

    • "none": No custom bounds are applied.
    • "prepend": Custom bounds are prepended to the split documents.
    • "append": Custom bounds are appended to the split documents.
    • Default: "prepend".
    Definition Classes
    DocumentSplitterParams
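
The three customBoundsStrategy values listed for setCustomBoundsStrategy can be illustrated with a small standalone sketch. It assumes that "prepend" attaches the matched bound to the start of the following split and "append" attaches it to the end of the preceding one, which this page does not spell out. The object BoundsStrategySketch and its split method are hypothetical and are not the annotator's code.

```scala
import java.util.regex.Pattern
import scala.collection.mutable.ListBuffer

// Hypothetical helper for illustration; not part of the library.
object BoundsStrategySketch {

  // Split `text` at every match of `boundRegex` and place the matched bound
  // according to `strategy` ("none", "prepend" or "append").
  def split(text: String, boundRegex: String, strategy: String): List[String] = {
    val matcher = Pattern.compile(boundRegex).matcher(text)
    val out = ListBuffer.empty[String]
    var start = 0      // where the text of the current split begins
    var pending = ""   // bound waiting to be prepended to the next split
    while (matcher.find()) {
      val body = text.substring(start, matcher.start())
      val bound = matcher.group()
      strategy match {
        case "append" =>
          out += (body + bound)
        case "prepend" =>
          out += (pending + body)
          pending = bound
        case _ => // "none": the bound is dropped
          out += body
      }
      start = matcher.end()
    }
    out += (pending + text.substring(start))
    out.filter(_.nonEmpty).toList
  }
}
```

For the input "a;b;c" with the bound ";", this sketch returns a, b, c for "none", then a, ;b, ;c for "prepend", and a;, b;, c for "append".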
  120. def setDefault[T](feature: StructFeature[T], value: () ⇒ T): InternalDocumentSplitter.this.type
    Attributes
    protected
    Definition Classes
    HasFeatures
  121. def setDefault[K, V](feature: MapFeature[K, V], value: () ⇒ Map[K, V]): InternalDocumentSplitter.this.type
    Attributes
    protected
    Definition Classes
    HasFeatures
  122. def setDefault[T](feature: SetFeature[T], value: () ⇒ Set[T]): InternalDocumentSplitter.this.type
    Attributes
    protected
    Definition Classes
    HasFeatures
  123. def setDefault[T](feature: ArrayFeature[T], value: () ⇒ Array[T]): InternalDocumentSplitter.this.type
    Attributes
    protected
    Definition Classes
    HasFeatures
  124. final def setDefault(paramPairs: ParamPair[_]*): InternalDocumentSplitter.this.type
    Attributes
    protected
    Definition Classes
    Params
  125. final def setDefault[T](param: Param[T], value: T): InternalDocumentSplitter.this.type
    Attributes
    protected[org.apache.spark.ml]
    Definition Classes
    Params
  126. def setEnableSentenceIncrement(value: Boolean): InternalDocumentSplitter.this.type

    Controls whether the sentence index should be incremented in the metadata of the annotator. When set to true, the annotator increments the sentence index in the metadata for each split document. Default: false

    Definition Classes
    DocumentSplitterParams
  127. def setExplodeSplits(value: Boolean): InternalDocumentSplitter.this.type
    Definition Classes
    DocumentCharacterTextSplitter
  128. def setInputCols(value: Array[String]): InternalDocumentSplitter.this.type
    Definition Classes
    InternalDocumentSplitter → HasInputAnnotationCols
  129. final def setInputCols(value: String*): InternalDocumentSplitter.this.type
    Definition Classes
    HasInputAnnotationCols
  130. def setKeepSeparators(value: Boolean): InternalDocumentSplitter.this.type
    Definition Classes
    DocumentCharacterTextSplitter
  131. def setLazyAnnotator(value: Boolean): InternalDocumentSplitter.this.type
    Definition Classes
    CanBeLazy
  132. def setMaxLength(value: Int): InternalDocumentSplitter.this.type

    Sets the maximum length for text parsing based on the specified mode.

    Definition Classes
    DocumentSplitterParams
  133. def setMetaDataFields(value: Array[String]): InternalDocumentSplitter.this.type

    Sets the metadata fields whose column values are added to the metadata of the split documents. Set this to the names of the columns to read. Default: Array.empty

    Definition Classes
    DocumentSplitterParams
  134. final def setOutputCol(value: String): InternalDocumentSplitter.this.type
    Definition Classes
    HasOutputAnnotationCol
  135. def setParent(parent: Estimator[DocumentCharacterTextSplitter]): DocumentCharacterTextSplitter
    Definition Classes
    Model
  136. def setPatternsAreRegex(value: Boolean): InternalDocumentSplitter.this.type
    Definition Classes
    DocumentCharacterTextSplitter
  137. def setSentenceAwareness(value: Boolean): InternalDocumentSplitter.this.type

    Sets whether to split the document with sentence awareness where possible. If true, the split process can stop before maxLength, and you should supply sentences through inputCols. Default: false.

    Definition Classes
    DocumentSplitterParams
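
The idea of stopping a split before maxLength so that sentences stay whole can be sketched in plain Scala. This is a minimal illustration under assumptions, not the annotator's algorithm: the object SentenceAwareSketch and its pack method are hypothetical, sentences are joined with a single space, and a sentence longer than maxLength is kept as its own oversized chunk.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical helper for illustration; not part of the library.
object SentenceAwareSketch {

  // Greedily pack whole sentences into chunks of at most maxLength characters.
  // A chunk is closed early when the next sentence would not fit.
  def pack(sentences: Seq[String], maxLength: Int): List[String] = {
    val chunks = ListBuffer.empty[String]
    val current = new StringBuilder
    for (sentence <- sentences) {
      val needed =
        if (current.isEmpty) sentence.length
        else current.length + 1 + sentence.length
      if (current.nonEmpty && needed > maxLength) {
        chunks += current.toString
        current.clear()
      }
      if (current.nonEmpty) current.append(' ')
      current.append(sentence)
    }
    if (current.nonEmpty) chunks += current.toString
    chunks.toList
  }
}
```

With the sentences "One two.", "Three four." and "Five.", a maxLength of 20 gives two chunks and a maxLength of 15 gives three, because each chunk ends at a sentence boundary rather than at exactly maxLength characters.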
  138. def setSplitMode(value: String): InternalDocumentSplitter.this.type

    Sets the split mode to determine how text should be segmented. Default: 'regex'

    value

    The split mode to be set. It should be one of the following values:

    • "char": Split text based on individual characters.
    • "token": Split text based on tokens. You should supply tokens from inputCols.
    • "sentence": Split text based on sentences. You should supply sentences from inputCols.
    • "recursive": Split text recursively: take the separators in order and re-split any piece that exceeds the chunk length.
    • "regex": Split text based on a regular expression pattern.
    Definition Classes
    DocumentSplitterParams
  139. def setSplitPatterns(value: Array[String]): InternalDocumentSplitter.this.type
    Definition Classes
    DocumentCharacterTextSplitter
  140. def setTrimWhitespace(value: Boolean): InternalDocumentSplitter.this.type
    Definition Classes
    DocumentCharacterTextSplitter
  141. val splitMode: Param[String]

    The split mode to determine how text should be segmented. Default: 'regex'

    Definition Classes
    DocumentSplitterParams
  142. val splitPatterns: StringArrayParam
    Definition Classes
    DocumentCharacterTextSplitter
  143. final def synchronized[T0](arg0: ⇒ T0): T0
    Definition Classes
    AnyRef
  144. def toString(): String
    Definition Classes
    Identifiable → AnyRef → Any
  145. final def transform(dataset: Dataset[_]): DataFrame
    Definition Classes
    AnnotatorModel → Transformer
  146. def transform(dataset: Dataset[_], paramMap: ParamMap): DataFrame
    Definition Classes
    Transformer
    Annotations
    @Since( "2.0.0" )
  147. def transform(dataset: Dataset[_], firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): DataFrame
    Definition Classes
    Transformer
    Annotations
    @Since( "2.0.0" ) @varargs()
  148. final def transformSchema(schema: StructType): StructType
    Definition Classes
    RawAnnotator → PipelineStage
  149. def transformSchema(schema: StructType, logging: Boolean): StructType
    Attributes
    protected
    Definition Classes
    PipelineStage
    Annotations
    @DeveloperApi()
  150. def trimAndAdjust(entities: Seq[Entity]): Seq[Entity]
    Attributes
    protected
  151. val trimWhitespace: BooleanParam
    Definition Classes
    DocumentCharacterTextSplitter
  152. val uid: String
    Definition Classes
    InternalDocumentSplitter → DocumentCharacterTextSplitter → Identifiable
  153. def validate(schema: StructType): Boolean
    Attributes
    protected
    Definition Classes
    RawAnnotator
  154. final def wait(): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  155. final def wait(arg0: Long, arg1: Int): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... )
  156. final def wait(arg0: Long): Unit
    Definition Classes
    AnyRef
    Annotations
    @throws( ... ) @native()
  157. def wrapColumnMetadata(col: Column): Column
    Attributes
    protected
    Definition Classes
    RawAnnotator
  158. def write: MLWriter
    Definition Classes
    ParamsAndFeaturesWritable → DefaultParamsWritable → MLWritable


Parameters

A list of (hyper-)parameter keys this annotator can take. Users can set and get the parameter values through setters and getters, respectively.
