Getting Started

Spark NLP Cheat Sheet

This cheat sheet can be used as a quick reference on how to set up your environment:

# Install Spark NLP from PyPI
pip install spark-nlp==3.3.1

# Install Spark NLP from Anaconda/Conda
conda install -c johnsnowlabs spark-nlp==3.3.1

# Load Spark NLP with Spark Shell
spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.1

# Load Spark NLP with PySpark
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.1

# Load Spark NLP with Spark Submit
spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.1

# Load Spark NLP as an external JAR after compiling and building Spark NLP with `sbt assembly`
spark-shell --jars spark-nlp-assembly-3.3.1.jar

Requirements

Spark NLP is built on top of Apache Spark 3.x. To use Spark NLP you need:

  • Java 8

  • Apache Spark 3.1.x (or 3.0.x, or 2.4.x, or 2.3.x)

  • Python 3.8.x if you are using PySpark 3.x

    • Python 3.6.x and 3.7.x if you are using PySpark 2.3.x or 2.4.x

It is recommended that you have basic knowledge of the framework and a working environment before using Spark NLP. Please refer to the Spark documentation to get started with Spark.

Installation

First, let’s make sure the installed Java version is Java 8 (Oracle or OpenJDK):

java -version
# openjdk version "1.8.0_292"

Using Conda

Let’s create a new conda environment, sparknlp, to manage all the dependencies there, and install the spark-nlp, pyspark, and jupyter packages with conda:

conda create -n sparknlp python=3.8 -y
conda activate sparknlp
conda install -c johnsnowlabs spark-nlp==3.3.1 pyspark==3.0.3 jupyter

Now you should be ready to create a Jupyter notebook with Spark NLP running:

jupyter notebook

Using Virtualenv

We can also create a Python Virtualenv:

virtualenv sparknlp --python=python3.8 # depends on how your Python installation is set up
source sparknlp/bin/activate
pip install spark-nlp==3.3.1 pyspark==3.0.3 jupyter

Now you should be ready to create a Jupyter notebook with Spark NLP running:

jupyter notebook

Starting a Spark NLP Session from Python

A Spark session for Spark NLP can be created (or retrieved) by using sparknlp.start():

import sparknlp
spark = sparknlp.start()

If you need to start the SparkSession manually because sparknlp.start() does not include configurations you need, you can do so with:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Spark NLP") \
    .master("local[4]") \
    .config("spark.driver.memory", "16G") \
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.kryoserializer.buffer.max", "2000M") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.1") \
    .getOrCreate()