The most widely used NLP library in the enterprise

Source: 2020 NLP Industry Survey, by Gradient Flow.
100% Open Source
Including pre-trained models and pipelines
Natively scalable
The only NLP library built natively on Apache Spark
Multiple Languages
Full Python, Scala, and Java support
Transformers at Scale
Spark NLP is the only open-source NLP library in production that offers state-of-the-art transformers such as BERT, ALBERT, ELECTRA, XLNet, DistilBERT, RoBERTa, XLM-RoBERTa, Longformer, ELMO, Universal Sentence Encoder, Google T5, MarianMT, and OpenAI GPT-2, not only to Python and R but also to the JVM ecosystem (Java, Scala, and Kotlin), at scale, by extending Apache Spark natively.
# Install Spark NLP from PyPI
$ pip install spark-nlp==3.4.4 pyspark==3.1.2
# Install Spark NLP from Anaconda/Conda
$ conda install -c johnsnowlabs spark-nlp
# Load Spark NLP with Spark Shell
$ spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.4
# Load Spark NLP with PySpark
$ pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.4
# Load Spark NLP with Spark Submit
$ spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.4
# Load Spark NLP as an external Fat JAR
$ spark-shell --jars spark-nlp-assembly-3.4.4.jar
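The `--packages` argument in the commands above is a Maven coordinate: group, artifact name with a Scala binary version suffix, and the library version. The following is a minimal sketch of how that string is put together, which can help keep the JVM package version in step with the PyPI one. The helper names `spark_nlp_coordinate` and `launch_command` are illustrative assumptions and are not part of Spark NLP.

```python
# Versions taken from the install commands above.
SPARK_NLP_VERSION = "3.4.4"
SCALA_BINARY_VERSION = "2.12"

SUPPORTED_TOOLS = ("spark-shell", "pyspark", "spark-submit")


def spark_nlp_coordinate(version=SPARK_NLP_VERSION, scala=SCALA_BINARY_VERSION):
    """Build the Maven coordinate passed to --packages (hypothetical helper)."""
    return f"com.johnsnowlabs.nlp:spark-nlp_{scala}:{version}"


def launch_command(tool, version=SPARK_NLP_VERSION, scala=SCALA_BINARY_VERSION):
    """Return the argument list for one of the Spark launchers (hypothetical helper)."""
    if tool not in SUPPORTED_TOOLS:
        raise ValueError(f"unsupported tool: {tool!r}")
    return [tool, "--packages", spark_nlp_coordinate(version, scala)]


print(" ".join(launch_command("pyspark")))
# pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.4
```

Building the coordinate from a single version constant is one way to avoid the Python package and the JVM package drifting apart.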
Right Out of The Box
Spark NLP ships with many NLP features, pre-trained models and pipelines
NLP Features
- Tokenization
- Word Segmentation
- Stop Words Removal
- Normalizer
- Stemmer
- Lemmatizer
- NGrams
- Regex Matching
- Text Matching
- Chunking
- Date Matcher
- Part-of-speech tagging
- Sentence Detector (DL models)
- Dependency parsing
- Sentiment Detection (ML models)
- Spell Checker (ML & DL models)
- Doc2Vec Embeddings (Word2Vec)
- Word2Vec Embeddings (Word2Vec)
- Word Embeddings (GloVe & Word2Vec)
- BERT Embeddings
- DistilBERT Embeddings
- CamemBERT Embeddings
- RoBERTa Embeddings
- DeBERTa Embeddings
- XLM-RoBERTa Embeddings
- Longformer Embeddings
- ALBERT Embeddings
- XLNet Embeddings
- ELMO Embeddings
- Universal Sentence Encoder
- Sentence Embeddings
- Chunk Embeddings
- Neural Machine Translation (MarianMT)
- Text-To-Text Transfer Transformer (Google T5)
- Generative Pre-trained Transformer 2 (OpenAI GPT-2)
- Unsupervised keywords extraction
- Language Detection & Identification (up to 375 languages)
- Multi-class Text Classification (DL model)
- Multi-label Text Classification (DL model)
- Multi-class Sentiment Analysis (DL model)
- BERT for Token & Sequence Classification
- DistilBERT for Token & Sequence Classification
- ALBERT for Token & Sequence Classification
- RoBERTa for Token & Sequence Classification
- DeBERTa for Token & Sequence Classification
- XLM-RoBERTa for Token & Sequence Classification
- XLNet for Token & Sequence Classification
- Longformer for Token & Sequence Classification
- Named entity recognition (DL model)
- Easy TensorFlow integration
- GPU Support
- Full integration with Spark ML functions
- 3200+ pre-trained models in 200+ languages!
- 1700+ pre-trained pipelines in 200+ languages!
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.pretrained import PretrainedPipeline
import sparknlp
# Start Spark Session with Spark NLP
spark = sparknlp.start()
# Download a pre-trained pipeline
pipeline = PretrainedPipeline('explain_document_dl', lang='en')
# Annotate your testing dataset
result = pipeline.annotate("The Mona Lisa is a 16th century oil painting created by Leonardo. It's held at the Louvre in Paris.")
# What's in the pipeline
list(result.keys())
# Output: ['entities', 'stem', 'checked', 'lemma', 'document', 'pos', 'token', 'ner', 'embeddings', 'sentence']
# Check the results
result['entities']
# Output: ['Mona Lisa', 'Leonardo', 'Louvre', 'Paris']
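The dictionary returned by `annotate` maps each output name to a list of strings, so the token-level outputs can be lined up side by side. The sketch below assumes that the `token`, `lemma`, `pos`, and `ner` lists have one entry per token and therefore the same length; it raises an error if they do not. The helper `align_token_outputs` is hypothetical, and `sample_result` is a hand-written stand-in, not real pipeline output.

```python
TOKEN_LEVEL_KEYS = ("token", "lemma", "pos", "ner")


def align_token_outputs(result, keys=TOKEN_LEVEL_KEYS):
    """Zip per-token output lists into one dict per token (hypothetical helper)."""
    columns = [result[key] for key in keys]
    # Assumption: every token-level output has exactly one entry per token.
    if len({len(column) for column in columns}) > 1:
        raise ValueError("token-level outputs differ in length")
    return [dict(zip(keys, values)) for values in zip(*columns)]


# Hand-written stand-in for a result dictionary; the tag values are illustrative only.
sample_result = {
    "token": ["Leonardo", "painted", "it"],
    "lemma": ["Leonardo", "paint", "it"],
    "pos": ["NNP", "VBD", "PRP"],
    "ner": ["B-PER", "O", "O"],
}

for row in align_token_outputs(sample_result):
    print(row)
```

With a real pipeline, the same helper would be called as `align_token_outputs(result)` on the dictionary produced above.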
Benchmark
Spark NLP 3.x obtained the best-performing results in peer-reviewed academic benchmarks
Training NER
- State-of-the-art Deep Learning algorithms
- Achieve high accuracy within a few minutes
- Achieve high accuracy with a few lines of code
- Blazing fast training
- Use CPU or GPU
- 80+ Pretrained Embeddings including GloVe, Word2Vec, BERT, DistilBERT, CamemBERT, RoBERTa, DeBERTa, XLM-RoBERTa, Longformer, ELMO, ELECTRA, ALBERT, XLNet, BioBERT, etc.
- Multi-lingual NER models in Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, German, Hebrew, Italian, Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, and Urdu
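As a rough illustration of the "few lines of code" point, the sketch below trains a deep-learning NER model on a CoNLL-formatted file. It is a minimal example under stated assumptions, not a tuned recipe: the function name `train_ner_model`, the file path, the choice of the `glove_100d` embeddings, and the epoch count are all illustrative. Spark NLP and PySpark must be installed for the function to run.

```python
def train_ner_model(spark, conll_path, max_epochs=5):
    """Fit a NerDLApproach model on a CoNLL 2003-style file (illustrative sketch)."""
    # Imported lazily so the definition itself loads without Spark NLP installed.
    from pyspark.ml import Pipeline
    from sparknlp.annotator import NerDLApproach, WordEmbeddingsModel
    from sparknlp.training import CoNLL

    # The CoNLL reader yields document, sentence, token, pos, and label columns.
    training_data = CoNLL().readDataset(spark, conll_path)

    # Assumption: GloVe 100d embeddings; any supported embeddings could be swapped in.
    embeddings = (
        WordEmbeddingsModel.pretrained("glove_100d", "en")
        .setInputCols(["sentence", "token"])
        .setOutputCol("embeddings")
    )

    ner_tagger = (
        NerDLApproach()
        .setInputCols(["sentence", "token", "embeddings"])
        .setLabelColumn("label")
        .setOutputCol("ner")
        .setMaxEpochs(max_epochs)
    )

    return Pipeline(stages=[embeddings, ner_tagger]).fit(training_data)
```

A call might look like `model = train_ner_model(sparknlp.start(), "path/to/train.conll")`, where the path is a placeholder for your own training file.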
SYSTEM | YEAR | LANGUAGE | CONLL '03 |
---|---|---|---|
Spark NLP v3 | 2021 | Python/Scala/Java/R | 93 (test F1), 95 (dev F1) |
spaCy v3 | 2021 | Python | 91.6 |
Stanza (StanfordNLP) | 2020 | Python | 92.1 |
Flair | 2018 | Python | 93.1 |
CoreNLP | 2015 | Java | 89.6 |
SYSTEM | YEAR | LANGUAGE | ONTONOTES |
---|---|---|---|
Spark NLP v3 | 2021 | Python/Scala/Java/R | 90.0 (test F1), 92.5 (dev F1) |
spaCy RoBERTa | 2020 | Python | 89.7 (dev F1) |
Stanza (StanfordNLP) | 2020 | Python | 88.8 (dev F1) |
Flair | 2018 | Python | 89.7 |
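The tables report F1 scores. The source does not state exactly how each score was computed, so the sketch below only shows the conventional entity-level definition often used for NER benchmarks: an entity counts as correct when its span and type both match exactly, and F1 is the harmonic mean of precision and recall. The function `entity_f1` and the example spans are illustrative assumptions.

```python
def entity_f1(gold, predicted):
    """Exact-match entity-level F1 over (start, end, label) tuples (illustrative)."""
    gold_set, predicted_set = set(gold), set(predicted)
    true_positives = len(gold_set & predicted_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up spans: two of three predictions match a gold entity exactly.
gold_entities = [(0, 2, "PER"), (5, 6, "LOC"), (8, 9, "ORG")]
predicted_entities = [(0, 2, "PER"), (5, 6, "ORG"), (8, 9, "ORG")]

# Precision and recall are both 2/3 here, so F1 is also 2/3.
print(round(100 * entity_f1(gold_entities, predicted_entities), 1))
```

Multiplying by 100, as in the last line, gives a figure on the same scale as the table values.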