SparkNLP - Quick Start


Spark NLP runs on top of Apache Spark 2.1.2, so it is advised to have basic knowledge of the framework and a working environment before using Spark NLP. Refer to the Spark documentation to get started.

To start using the library, execute any of the following lines depending on your desired use case:

spark-shell --packages JohnSnowLabs:spark-nlp:1.4.0
pyspark --packages JohnSnowLabs:spark-nlp:1.4.0
spark-submit --packages JohnSnowLabs:spark-nlp:1.4.0

Another way to use the library is to append the jar file (which can be downloaded here) to the Spark classpath: run spark-shell or spark-submit with the appropriate --jars /path/to/spark-nlp_2.11-1.4.0.jar option.

For further alternatives and documentation, check out our README page on GitHub.


Spark ML provides a set of Machine Learning applications whose logic consists of two main components: Estimators and Transformers. An Estimator has a fit() method that trains on a piece of data and produces a model; a Transformer, which is generally the result of a fitting process, applies changes to the target dataset. These components have been embedded into Spark NLP. Pipelines are a mechanism for combining multiple estimators and transformers within a single workflow, allowing multiple chained transformations along a Machine Learning task. Refer to the Spark ML library documentation for more information.
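The relationship between these components can be sketched with plain Spark ML (column names and the sample stages here are illustrative, not part of Spark NLP itself):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Transformers: transform() appends new columns to the dataset
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")

// A Pipeline chains stages; fit() turns it into a PipelineModel,
// which is itself a Transformer
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF))
// val model  = pipeline.fit(trainingData)   // Estimator step
// val result = model.transform(testData)    // Transformer step
```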


An annotation is the basic form of the result of a Spark NLP operation. Its structure is made of:

  • annotatorType: which annotator generated this annotation
  • begin: the index where the matched content begins, relative to the raw text
  • end: the index where the matched content ends, relative to the raw text
  • metadata: content of matched result and additional information

This object is automatically generated by annotators after a transform process; no manual work is required, but it must be understood in order to use it efficiently.
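Assuming a pipeline has already produced an annotation column named token (the column name is illustrative), the annotation fields can be inspected directly on the resulting DataFrame:

```scala
import org.apache.spark.sql.functions.{col, explode}

// Each annotation column holds an array of annotation structs;
// explode it to inspect the individual annotations
result
  .select(explode(col("token")).as("ann"))
  .select("ann.annotatorType", "ann.begin", "ann.end", "ann.metadata")
  .show(truncate = false)
```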


Annotators are the spearhead of NLP functionalities in SparkNLP. There are two forms of annotators:

  • Annotator Approaches: represent a Spark ML Estimator and require a training stage. They have a fit(data) method that trains a model on the given data and produces the second type of annotator, an annotator model or transformer.
  • Annotator Models: are Spark models or transformers, meaning they have a transform(data) method that takes a dataset and adds a column with the result of the annotation. All transformers are additive: they append to the current data and never replace or delete previous information.

Both forms of annotators can be included in a Pipeline and will automatically go through all stages in the provided order, transforming the data accordingly. A Pipeline is turned into a PipelineModel after the fit() stage. Both the Pipeline and the resulting PipelineModel can be saved to disk and re-loaded at any time.
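Saving and re-loading goes through the standard Spark ML persistence API. As a sketch, assuming a pipeline and a trainingData DataFrame are already in scope (the paths below are illustrative):

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}

// Persist an unfitted Pipeline...
pipeline.write.overwrite().save("/tmp/unfitted-pipeline")
val samePipeline = Pipeline.load("/tmp/unfitted-pipeline")

// ...or a fitted PipelineModel
val model = pipeline.fit(trainingData)
model.write.overwrite().save("/tmp/fitted-pipeline")
val sameModel = PipelineModel.load("/tmp/fitted-pipeline")
```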

Common Functions

  • setInputCols(column_names): Takes a list of column names of annotations required by this annotator
  • setOutputCol(column_name): Defines the name of the column containing the result of this annotator. Use this name as an input for other annotators requiring the annotations of this one.

First Example

Annotator types

Every annotator has a type. Annotators that share a type can be used interchangeably, meaning you can use any of them where that type is needed. For example, when another annotator, such as a sentiment analysis annotator, requires a token-type input, you can provide either a normalized token or a lemma, as both are of type token.
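As a sketch of this interchangeability, both of the annotators below produce annotations of type token, so either output column could feed an annotator that requires token-type input (the column names are illustrative, and the downstream wiring shown in comments is hypothetical):

```scala
import com.johnsnowlabs.nlp.annotators.{Lemmatizer, Normalizer}

val normalizer = new Normalizer()
  .setInputCols(Array("token"))
  .setOutputCol("normalized")

val lemmatizer = new Lemmatizer()
  .setInputCols(Array("token"))
  .setOutputCol("lemma")

// A sentiment annotator could take either column as its token input:
//   .setInputCols(Array("sentence", "normalized"))   or
//   .setInputCols(Array("sentence", "lemma"))
```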

DocumentAssembler: Getting data in

In order to get through the NLP process, we need the raw data to be annotated. There is a special transformer that does this for us: the DocumentAssembler. It creates the first annotation, of type Document, which may be used by annotators down the road.

import com.johnsnowlabs.nlp._
import com.johnsnowlabs.nlp.annotators._

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

Sentence detection and tokenization

In this quick example, we now proceed to identify the sentences in each of our document lines. The SentenceDetector requires a Document annotation, which is provided by the DocumentAssembler output, and its own output is also of Document type. The Tokenizer requires a Document annotation type, meaning it works with either the DocumentAssembler or the SentenceDetector output; here, we use the sentence output.

import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector

val sentenceDetector = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val regexTokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

Using Pipelines

Now we want to put all of this together and retrieve the results; we use a Pipeline for this. We also include another special transformer, called the Finisher, to output the tokens in a human-readable format.

val finisher = new Finisher()
  .setInputCols("token")

import org.apache.spark.ml.Pipeline

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    sentenceDetector,
    regexTokenizer,
    finisher
  ))
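To run the pipeline, fit it on a DataFrame that has a text column and transform that data (the sample sentences below are illustrative):

```scala
import spark.implicits._

val data = Seq(
  "Hello, this is an example sentence. And this is a second one."
).toDF("text")

val result = pipeline.fit(data).transform(data)

// The Finisher prefixes its output columns with "finished_";
// the exact column name may vary by version
result.select("finished_token").show(truncate = false)
```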