sparknlp.base.DocumentAssembler

class sparknlp.base.DocumentAssembler[source]

Bases: sparknlp.internal.AnnotatorTransformer

Prepares data into a format that is processable by Spark NLP.

This is the entry point for every Spark NLP pipeline. The DocumentAssembler reads either a String column or an Array[String] column. Additionally, setCleanupMode() can be used to pre-process the text (default: disabled). For the possible options, please refer to the parameters section.

For more extensive examples on document pre-processing, see the Spark NLP Workshop.

Input Annotation types: NONE

Output Annotation type: DOCUMENT

Parameters

inputCol

Input column name.

outputCol

Output column name.

idCol

Name of String type column for row id.

metadataCol

Name of Map type column with metadata information.

calculationsCol

Name of float vector map column to use for embeddings and other representations.

cleanupMode

How to clean up the document (default: disabled). Possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from pyspark.ml import Pipeline
>>> data = spark.createDataFrame([["Spark NLP is an open-source text processing library."]]).toDF("text")
>>> documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
>>> result = documentAssembler.transform(data)
>>> result.select("document").show(truncate=False)
+----------------------------------------------------------------------------------------------+
|document                                                                                      |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+
>>> result.select("document").printSchema()
root
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
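
Since the assembler can also read an Array[String] column, here is a minimal sketch of that variant; the texts column name is illustrative, and one DOCUMENT annotation per array element is the expected result:

>>> arrayData = spark.createDataFrame([[["First text.", "Second text."]]]).toDF("texts")
>>> arrayAssembler = DocumentAssembler().setInputCol("texts").setOutputCol("document")
>>> arrayAssembler.transform(arrayData).selectExpr("explode(document.result)").show(truncate=False)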

Methods

__init__()

clear(param)

Clears a param from the param map if it has been explicitly set.

copy([extra])

Creates a copy of this instance with the same uid and some extra params.

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap([extra])

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value.

getParam(paramName)

Gets a param by its name.

getParamValue(paramName)

Gets the value of a parameter.

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of 'write().save(path)'.

set(param, value)

Sets a parameter in the embedded param map.

setCalculationsCol(value)

Sets name of float vector map column to use for embeddings and other representations.

setCleanupMode(value)

Sets how to clean up the document (default: disabled).

setIdCol(value)

Sets name of string type column for row id.

setInputCol(value)

Sets input column name.

setMetadataCol(value)

Sets name for Map type column with metadata information.

setOutputCol(value)

Sets output column name.

setParamValue(paramName)

Sets the value of a parameter.

setParams()

transform(dataset[, params])

Transforms the input dataset with optional parameters.

write()

Returns an MLWriter instance for this ML instance.

Attributes

calculationsCol

cleanupMode

getter_attrs

idCol

inputCol

metadataCol

name

outputCol

params

Returns all params ordered by name.

clear(param)

Clears a param from the param map if it has been explicitly set.

copy(extra=None)

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params. So both the Python wrapper and the Java pipeline component get copied.

Parameters

extra – Extra parameters to copy to the new instance

Returns

Copy of this instance

explainParam(param)

Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams()

Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra=None)

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra – extra param values

Returns

merged param map

getOrDefault(param)

Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getParam(paramName)

Gets a param by its name.

getParamValue(paramName)

Gets the value of a parameter.

Parameters
paramName : str

Name of the parameter

hasDefault(param)

Checks whether a param has a default value.

hasParam(paramName)

Tests whether this instance contains a param with a given (string) name.

isDefined(param)

Checks whether a param is explicitly set by user or has a default value.

isSet(param)

Checks whether a param is explicitly set by user.

classmethod load(path)

Reads an ML instance from the input path, a shortcut of read().load(path).

property params

Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

classmethod read()

Returns an MLReader instance for this class.

save(path)

Save this ML instance to the given path, a shortcut of 'write().save(path)'.
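
A minimal persistence round trip, following the standard PySpark ML pattern and reusing the documentAssembler from the example above (the path is illustrative):

>>> documentAssembler.save("/tmp/document_assembler")
>>> loadedAssembler = DocumentAssembler.load("/tmp/document_assembler")

The loaded instance keeps the same uid and params, so it can be dropped into a pipeline in place of the original.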

set(param, value)

Sets a parameter in the embedded param map.

setCalculationsCol(value)[source]

Sets name of float vector map column to use for embeddings and other representations.

Parameters
value : str

Name of the calculations column

setCleanupMode(value)[source]

Sets how to clean up the document (default: disabled). Possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full

Parameters
value : str

Cleanup mode
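
For example, the shrink mode (which, per the Spark NLP documentation, merges repeated spaces, newlines, and tabs into a single space) can be enabled as in the minimal sketch below:

>>> messy = spark.createDataFrame([["Spark NLP   is\n\nan open-source library."]]).toDF("text")
>>> cleaningAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document") \
...     .setCleanupMode("shrink")
>>> cleaningAssembler.transform(messy).selectExpr("document[0].result").show(truncate=False)

With shrink, the result should come back as the single-spaced sentence.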

setIdCol(value)[source]

Sets name of string type column for row id.

Parameters
value : str

Name of the id column

setInputCol(value)[source]

Sets input column name.

Parameters
value : str

Name of the input column

setMetadataCol(value)[source]

Sets name for Map type column with metadata information.

Parameters
value : str

Name of the metadata column
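
A hypothetical sketch, reusing the data DataFrame from the example above and assuming a Map type column (here named meta) whose key/value pairs should accompany each document annotation's metadata:

>>> from pyspark.sql.functions import create_map, lit
>>> dataWithMeta = data.withColumn("meta", create_map(lit("source"), lit("example")))
>>> metaAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document") \
...     .setMetadataCol("meta")
>>> result = metaAssembler.transform(dataWithMeta)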

setOutputCol(value)[source]

Sets output column name.

Parameters
value : str

Name of the output column

setParamValue(paramName)

Sets the value of a parameter.

Parameters
paramName : str

Name of the parameter

transform(dataset, params=None)

Transforms the input dataset with optional parameters.

Parameters
  • dataset – input dataset, which is an instance of pyspark.sql.DataFrame

  • params – an optional param map that overrides embedded params.

Returns

transformed dataset

New in version 1.3.0.
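
As with any PySpark Transformer, the optional param map overrides embedded params for that single call. A minimal sketch, reusing documentAssembler and data from the example above with the cleanupMode Param attribute:

>>> result = documentAssembler.transform(data, {documentAssembler.cleanupMode: "shrink"})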

uid

A unique id for the object.

write()

Returns an MLWriter instance for this ML instance.