DocumentAssembler: Getting data in

To get through the NLP process, we need to annotate the raw data. The DocumentAssembler is a special transformer that does this for us: it creates the first annotation of type DOCUMENT, which may be used by annotators down the road. It can read either a String column or an Array[String].

Settable parameters are:

  • setInputCol()
  • setOutputCol()
  • setIdCol() -> OPTIONAL: String type column with id information
  • setMetadataCol() -> OPTIONAL: Map type column with metadata information
  • setCleanupMode(disabled) -> Cleanup options, possible values:
    • disabled: source is kept as original.
    • inplace: removes new lines and tabs.
    • inplace_full: removes new lines and tabs, including those that were converted to strings (i.e. \n).
    • shrink: removes new lines and tabs, and merges multiple spaces and blank lines into a single space.
    • shrink_full: removes new lines and tabs, including stringified ones, and shrinks spaces and blank lines.
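The effect of each cleanup mode can be approximated in plain Python. The sketch below is illustrative only, not Spark NLP's actual implementation; the helper name `cleanup` is an assumption made here for demonstration:

```python
import re

def cleanup(text, mode="disabled"):
    """Approximate the DocumentAssembler cleanup modes (illustrative only)."""
    if mode == "disabled":
        return text  # source kept as original
    if mode in ("inplace", "inplace_full", "shrink", "shrink_full"):
        # real new lines and tabs become spaces
        cleaned = text.replace("\n", " ").replace("\t", " ")
        if mode in ("inplace_full", "shrink_full"):
            # also remove the stringified variants (literal backslash-n / backslash-t)
            cleaned = cleaned.replace("\\n", " ").replace("\\t", " ")
        if mode in ("shrink", "shrink_full"):
            # merge runs of spaces into one and trim the ends
            cleaned = re.sub(r" +", " ", cleaned).strip()
        return cleaned
    raise ValueError(f"unknown mode: {mode}")

sample = "first line\n\nsecond\tline \\n literal"
print(cleanup(sample, "shrink"))       # first line second line \n literal
print(cleanup(sample, "shrink_full"))  # first line second line literal
```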


Refer to the DocumentAssembler Scala docs for more details on the API.

// Java
import com.johnsnowlabs.nlp.*;
import com.johnsnowlabs.nlp.annotators.*;

DocumentAssembler documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document");

# Python
from sparknlp.annotator import *
from sparknlp.common import *
from sparknlp.base import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

// Scala
import com.johnsnowlabs.nlp._
import com.johnsnowlabs.nlp.annotators._

val documentAssembler = new DocumentAssembler()
    .setInputCol("text")
    .setOutputCol("document")

TokenAssembler: Getting data reshaped

This transformer reconstructs a DOCUMENT type annotation from tokens, usually after these have been normalized, lemmatized, spell checked, etc., in order to use this document annotation in further annotators.

Settable parameters are:

  • setInputCols(inputs: Array[String])
  • setOutputCol(output: String)
  • setPreservePosition(preservePosition: Boolean): whether to preserve the actual positions of the tokens or reduce the gaps between them to a single space
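What preservePosition controls can be sketched in plain Python: either the original character offsets (and hence the gaps between tokens) are kept, or every gap collapses to a single space. This is an illustrative sketch, not Spark NLP code; the `assemble` helper is a name invented here:

```python
# Each token carries the begin/end character offsets assigned by the tokenizer.
tokens = [("Hello", 0, 4), ("world", 8, 12)]  # note the 3-char gap after "Hello"

def assemble(tokens, preserve_position):
    if preserve_position:
        # Rebuild the text keeping the original distance between tokens.
        out = []
        prev_end = -1
        for text, begin, end in tokens:
            out.append(" " * (begin - prev_end - 1))
            out.append(text)
            prev_end = end
        return "".join(out)
    # Otherwise reduce every gap to a single space.
    return " ".join(t[0] for t in tokens)

print(assemble(tokens, True))   # Hello   world
print(assemble(tokens, False))  # Hello world
```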


Refer to the TokenAssembler Scala docs for more details on the API.

// Java
TokenAssembler token_assembler = new TokenAssembler()
    .setInputCols(new String[]{"sentence", "token"})
    .setOutputCol("token_assembled");

# Python
token_assembler = TokenAssembler() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("token_assembled")

// Scala
val token_assembler = new TokenAssembler()
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("token_assembled")


Doc2Chunk

Converts DOCUMENT type annotations into CHUNK type annotations using the contents of a chunkCol. The chunk text must be contained within the input DOCUMENT. The chunkCol may be either StringType or ArrayType[StringType] (controlled by the isArray param). Useful for annotators that require a CHUNK type input.

Settable parameters are:

  • setInputCol()
  • setOutputCol()
  • setIsArray(bool) -> whether the target chunkCol is ArrayType[StringType]
  • setChunkCol(string) -> String or String-array column with the chunks that belong to the inputCol target
  • setStartCol(string) -> target INT column pointing to the token index (split by whitespace)
  • setStartColByTokenIndex(bool) -> whether startCol uses a token index (split by whitespace) or a character index
  • setFailOnMissing(bool) -> whether to fail when a chunk is not found within the inputCol
  • setLowerCase(bool) -> whether to improve matching by lowercasing everything before matching


Refer to the Doc2Chunk Scala docs for more details on the API.

# Python
chunker = Doc2Chunk() \
    .setInputCols("document") \
    .setChunkCol("target") \
    .setOutputCol("chunk")

// Scala
val chunker = new Doc2Chunk()
    .setInputCols("document")
    .setChunkCol("target")
    .setOutputCol("chunk")
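The difference between token-index and character-index addressing (setStartColByTokenIndex) can be sketched in plain Python. This is illustrative only, not Spark NLP code; `find_chunk` is a helper name invented here:

```python
# A startCol value can point at a token position or a character position.
text = "Peter is a good person"
chunk = "good"

def find_chunk(text, chunk, start, by_token_index):
    if by_token_index:
        tokens = text.split(" ")  # token index: split by whitespace
        return tokens[start] == chunk
    # character index: compare the substring at that offset
    return text[start:start + len(chunk)] == chunk

print(find_chunk(text, chunk, 3, by_token_index=True))    # token #3 is "good"
print(find_chunk(text, chunk, 11, by_token_index=False))  # chars 11..14 are "good"
```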


Chunk2Doc

Converts a CHUNK type column back into DOCUMENT. Useful when trying to re-tokenize or do further analysis on a CHUNK result.

Settable parameters are:

  • setInputCol()
  • setOutputCol()


Refer to the Chunk2Doc Scala docs for more details on the API.

# Python
chunk_doc = Chunk2Doc() \
    .setInputCols("chunk") \
    .setOutputCol("chunk_doc")

// Scala
val chunk_doc = new Chunk2Doc()
    .setInputCols("chunk")
    .setOutputCol("chunk_doc")


Finisher: Getting data out

Once we have our NLP pipeline ready to go, we might want to use our annotation results somewhere else where they are easy to use. The Finisher outputs annotation values as strings.

Settable parameters are:

  • setInputCols()
  • setOutputCols()
  • setCleanAnnotations(True) -> whether to remove intermediate annotations
  • setValueSplitSymbol("#") -> character used to split values within an annotation
  • setAnnotationSplitSymbol("@") -> character used to split values between annotations
  • setIncludeMetadata(False) -> whether to include metadata keys; sometimes useful for some annotations
  • setOutputAsArray(False) -> whether to output as Array; useful as input for other Spark transformers


Refer to the Finisher Scala docs for more details on the API.

# Python
finisher = Finisher() \
    .setInputCols(["token"]) \
    .setIncludeMetadata(True)  # set to False to remove metadata

// Scala
val finisher = new Finisher()
    .setInputCols("token")
    .setIncludeMetadata(true) // set to false to remove metadata
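How the two split symbols shape the finished string output (when setOutputAsArray(False)) can be sketched in plain Python, assuming the default "#" and "@" symbols. This is an illustration of the described behavior, not Spark NLP's implementation; `finish` is a helper name invented here:

```python
# Values inside one annotation are joined by valueSplitSymbol ("#");
# the annotations themselves are joined by annotationSplitSymbol ("@").
def finish(annotations, value_split="#", annotation_split="@"):
    return annotation_split.join(value_split.join(values) for values in annotations)

token_annotations = [["Hello", "world"], ["How", "are", "you"]]
print(finish(token_annotations))  # Hello#world@How#are#you
```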


EmbeddingsFinisher: Getting data out

This transformer is designed to deal with embedding annotators: WordEmbeddings, BertEmbeddings, SentenceEmbeddings, and ChunkEmbeddings. By using EmbeddingsFinisher you can easily transform your embeddings into arrays of floats or Vectors which are compatible with Spark ML functions such as LDA, K-means, Random Forest classifier, or any other function that requires a featureCol.

Settable parameters are:

  • setInputCols()
  • setOutputCols()
  • setCleanAnnotations(True) -> whether to remove and clean up the rest of the annotators (columns)
  • setOutputAsVector(False) -> if enabled, the embeddings are output as Vectors instead of arrays


Refer to the EmbeddingsFinisher Scala docs for more details on the API.

# Python
embeddings_finisher = EmbeddingsFinisher() \
    .setInputCols("sentence_embeddings") \
    .setOutputCols("sentence_embeddings_vectors") \
    .setOutputAsVector(True)

// Scala
val embeddingsFinisher = new EmbeddingsFinisher()
    .setInputCols("sentence_embeddings", "embeddings")
    .setOutputCols("finished_sentence_embeddings", "finished_embeddings")