SparkNLP - Articles & Additional Resources


Additional Resources

Browse through our collection of videos and blog posts to deepen your knowledge of and experience with Spark NLP.


Comparing production-grade NLP libraries: Training Spark-NLP and spaCy pipelines

By Saif Addin Ellafi|February 28, 2018

Comparing production-grade NLP libraries: Running Spark-NLP and spaCy pipelines

By Saif Addin Ellafi|February 28, 2018

Comparing production-grade NLP libraries: Accuracy, performance, and scalability

By Saif Addin Ellafi|February 28, 2018

Introducing the Natural Language Processing Library for Apache Spark

By David Talby|October 19, 2017


What are the requirements of SparkNLP?

The library runs mainly on top of Spark; over time we have added integrations with the Amazon AWS free API, Google TensorFlow, and Facebook RocksDB. Make sure you have a working Spark environment and supply the SparkNLP jar to the SparkSession classpath. The rest should be handled automatically.
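As a hedged sketch of supplying the jar, assuming a Python setup and the library's Maven coordinates (the version shown is illustrative — pick the release matching your Spark and Scala versions):

```shell
# Start PySpark with Spark NLP resolved from Maven (version is illustrative)
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4

# Or point to a local jar explicitly
pyspark --jars /path/to/spark-nlp-assembly.jar
```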

Does SparkNLP rely on any other NLP library?

No. SparkNLP is self-contained, and all algorithms are implemented within its code base.

What do I need to learn in order to use the library?

Either Scala or Python, and then mostly Spark and Spark ML. SparkNLP uses the same logic and syntax as any other machine learning transformer in Spark and can be included in the same pipelines. After reviewing a few examples, you should be ready to get going.

Can I save trained models or pipelines?

Yes, the same way you would do it for any other Spark ML component.

Is the PySpark NLP API as good as its Scala counterpart?

The answer is comparable to that of Spark in Scala vs. Python generally. We rely on the exact same implementation underneath. Spark NLP doesn't use Python UDFs, so serialization happens on the JVM rather than through inter-process communication. Feature-wise, the two are about the same, with a few exceptions in some utility classes (such as the ResourceHelper or the Annotation implicit functions).

Can I contribute?

Yes! Any kind of contribution is welcome: feedback, ideas, project management, documentation, testing, corpora for training and testing, development, or even code review. Refer to the contribute page for more information.

Installation and infrastructure

Does Spark-NLP work with Jupyter Notebooks / Apache Zeppelin?

Yes. Check out our getting started page or the README in our GitHub repository.

Does Spark-NLP work with Databricks, Cloudera, Azure, |insert-your-preferred-framework-here|?

Spark-NLP has been tested on various frameworks and should work as expected. Sometimes a few tricks are required here and there, but feel free to report any issue or jump into our Slack channel for help.

When using Spark-NLP in Python I get a ‘JavaPackage’ object is not callable error.

This means your Python setup is not finding the Spark-NLP jar on the JVM classpath. Make sure you pass --jars or --packages correctly for your framework.
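For example, when submitting an application (my_app.py is a placeholder, and the version shown is illustrative), either flag attaches the jar to the JVM classpath:

```shell
# Attach a local jar explicitly:
spark-submit --jars /path/to/spark-nlp-assembly.jar my_app.py

# Or resolve the dependency from Maven instead:
spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.4 my_app.py
```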

Library Concepts

What are annotator types?

Each annotator has a type that may be shared with other annotators. Whenever an annotator requires an input of a certain type, it means you can provide in inputCols the output column of any annotator that produces that type. For instance, Normalizer and SpellChecker are both token-type annotators, and either or both may feed a Sentiment Analysis model.

When should I use LightPipelines?

Sometimes you only need to test a few sentences, or run against a very small dataset that fits on a single machine. For these scenarios, LightPipelines are much faster than Spark's Pipeline, since there is no driver-executor relationship involved. LightPipelines rely on multi-threaded parallel computation, which makes them much faster, particularly for real-time or streaming demands. Rough estimates show that LightPipelines are worthwhile below roughly 50k rows (depending on sentence size and the pipeline itself).

Should I always use RecursivePipelines instead of SparkML Pipelines?

As of this writing, there isn't much benefit to RecursivePipelines, but the answer is ideally yes. Even though only a few annotators utilize RecursivePipelines internally, the default behavior falls back to that of the common Pipeline, so there's nothing to lose.

How do word embeddings work?

We use an internal cluster-supported technique to index a database of vectors, backed by RocksDB indexing. We then distribute this index to all Spark workers so that word-embedding lookups can be resolved locally at high speed.

Models and Support

Are there any pretrained models to work with?

Yes, there is a pretrained API that automatically downloads pretrained models from our S3 servers. Links for offline download are available in the GitHub README. Note that these models are not meant for production use; they serve merely as examples for experimentation. Only English models are available. Finally, aside from Pipelines, each downloadable model corresponds to a specific stage of a pipeline; there is no single pretrained model "for everything". Spark-NLP is a modular library.

I need help, is there any support or chat?

We use Slack; check our homepage to request an invite. John Snow Labs offers commercial support for the library, including data, pretrained models, and infrastructure help (Big Data and Security). Find out more in


I am getting a Java core dump when running an OCR transformation using Tesseract.

Set the LC_ALL=C environment variable before launching your Spark job.
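For example (my_ocr_job.py and the jar path are placeholders):

```shell
# Export the locale override, then launch the job as usual
export LC_ALL=C
spark-submit --jars /path/to/spark-nlp-assembly.jar my_ocr_job.py
```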

Getting 'org.apache.pdfbox.filter.MissingImageReaderException: Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed' when running an OCR transformation

Add '--packages com.github.jai-imageio:jai-imageio-jpeg2000:1.3.0' to your Spark job. This library is non-free, so we can't include it as a Spark-NLP dependency by default.