Spark NLP: State of the Art Natural Language Processing

The first production-grade versions of the latest deep learning NLP research

The most widely used NLP library in the enterprise

Source: 2020 NLP Industry Survey, by Gradient Flow.

100% Open Source

Including pre-trained models and pipelines

Natively scalable

The only NLP library built natively on Apache Spark

Multiple Languages

Full Python, Scala, and Java support

Transformers at Scale

Spark NLP is the only open-source NLP library in production that offers state-of-the-art transformers such as BERT, ALBERT, ELECTRA, XLNet, DistilBERT, RoBERTa, XLM-RoBERTa, Longformer, ELMO, Universal Sentence Encoder, Google T5, and MarianMT at scale. By extending Apache Spark natively, it makes them available not only to Python and R but also to the JVM ecosystem (Java, Scala, and Kotlin).

Quick and Easy

Spark NLP is available on PyPI, Conda, and Maven:

    # Install Spark NLP from PyPI
    $ pip install spark-nlp==3.3.4 pyspark==3.1.2

    # Install Spark NLP from Anaconda/Conda
    $ conda install -c johnsnowlabs spark-nlp

    # Load Spark NLP with Spark Shell
    $ spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.4

    # Load Spark NLP with PySpark
    $ pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.4

    # Load Spark NLP with Spark Submit
    $ spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.4

    # Load Spark NLP as an external JAR after compiling and building Spark NLP with `sbt assembly`
    $ spark-shell --jars spark-nlp-assembly-3.3.4.jar
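
The `--packages` commands above all pass the same Maven coordinate, which follows the pattern `group:artifact_scalaVersion:version`. As a minimal sketch, the coordinate and launch command could be assembled in Python like this. The helper names `spark_nlp_coordinate` and `launch_command` are hypothetical and are not part of Spark NLP.

```python
from typing import List

# Launchers that appear in the install commands above.
SUPPORTED_TOOLS = ("spark-shell", "pyspark", "spark-submit")


def spark_nlp_coordinate(nlp_version: str, scala_version: str = "2.12") -> str:
    """Build the Maven coordinate group:artifact_scalaVersion:version."""
    return f"com.johnsnowlabs.nlp:spark-nlp_{scala_version}:{nlp_version}"


def launch_command(tool: str, nlp_version: str, scala_version: str = "2.12") -> List[str]:
    """Return the argument list for launching a Spark tool with Spark NLP."""
    if tool not in SUPPORTED_TOOLS:
        raise ValueError(f"unsupported tool: {tool!r}")
    return [tool, "--packages", spark_nlp_coordinate(nlp_version, scala_version)]


print(" ".join(launch_command("pyspark", "3.3.4")))
# prints: pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.4
```

Keeping the version in one place helps avoid a mismatch between the `pip` package version and the JAR resolved by `--packages`.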

Right Out of The Box

Spark NLP ships with many NLP features, pre-trained models and pipelines

NLP Features

  • Tokenization
  • Word Segmentation
  • Stop Words Removal
  • Normalizer
  • Stemmer
  • Lemmatizer
  • NGrams
  • Regex Matching
  • Text Matching
  • Chunking
  • Date Matcher
  • Part-of-speech tagging
  • Sentence Detector (DL models)
  • Dependency parsing
  • Sentiment Detection (ML models)
  • Spell Checker (ML & DL models)
  • Word Embeddings (GloVe & Word2Vec)
  • Doc2Vec Embeddings (Word2Vec)
  • BERT Embeddings
  • DistilBERT Embeddings
  • RoBERTa Embeddings
  • XLM-RoBERTa Embeddings
  • Longformer Embeddings
  • ALBERT Embeddings
  • XLNet Embeddings
  • ELMO Embeddings
  • Universal Sentence Encoder
  • Sentence Embeddings
  • Chunk Embeddings
  • Neural Machine Translation (MarianMT)
  • Text-To-Text Transfer Transformer (Google T5)
  • Unsupervised keywords extraction
  • Language Detection & Identification (up to 375 languages)
  • Multi-class Text Classification (DL model)
  • Multi-label Text Classification (DL model)
  • Multi-class Sentiment Analysis (DL model)
  • BERT for Sequence Classification
  • DistilBERT for Sequence Classification
  • BERT for Token Classification
  • DistilBERT for Token Classification
  • ALBERT for Token Classification
  • RoBERTa for Token Classification
  • XLM-RoBERTa for Token Classification
  • XLNet for Token Classification
  • Longformer for Token Classification
  • Named entity recognition (DL model)
  • Easy TensorFlow integration
  • GPU Support
  • Full integration with Spark ML functions
  • 2000+ pre-trained models in 200+ languages!
  • 1700+ pre-trained pipelines in 200+ languages!

    from sparknlp.base import *
    from sparknlp.annotator import *
    from sparknlp.pretrained import PretrainedPipeline
    import sparknlp

    # Start Spark Session with Spark NLP
    spark = sparknlp.start()

    # Download a pre-trained pipeline
    pipeline = PretrainedPipeline('explain_document_dl', lang='en')

    # Annotate your testing dataset
    result = pipeline.annotate("The Mona Lisa is a 16th century oil painting created by Leonardo. It's held at the Louvre in Paris.")

    # What's in the pipeline
    print(list(result.keys()))
    # Output: ['entities', 'stem', 'checked', 'lemma', 'document', 'pos', 'token', 'ner', 'embeddings', 'sentence']

    # Check the results
    print(result['entities'])
    # Output: ['Mona Lisa', 'Leonardo', 'Louvre', 'Paris']
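
The dictionary returned by `annotate` maps each output column to a list of strings. As a rough sketch, token-level columns such as `token`, `pos`, and `ner` can be zipped into one row per token. The helper `align_tokens` is hypothetical, and the `sample` dictionary is a hand-written stand-in with made-up tags, not real pipeline output. The sketch assumes the chosen columns are token-aligned and raises an error if their lengths differ.

```python
from typing import Dict, List, Tuple


def align_tokens(result: Dict[str, List[str]], *keys: str) -> List[Tuple[str, ...]]:
    """Zip 'token' with other token-level columns, one tuple per token."""
    names = ("token",) + keys
    columns = [result[name] for name in names]
    # Chunk- or sentence-level columns (e.g. 'entities') have other lengths.
    if len({len(column) for column in columns}) != 1:
        raise ValueError(f"columns {names} are not token-aligned")
    return list(zip(*columns))


# Hand-written stand-in for pipeline.annotate(...); values are illustrative only.
sample = {
    "token": ["Leonardo", "painted", "it"],
    "pos": ["NNP", "VBD", "PRP"],
    "ner": ["B-PER", "O", "O"],
    "entities": ["Leonardo"],
}

for row in align_tokens(sample, "pos", "ner"):
    print(row)
```

With a real pipeline, `sample` would be replaced by the `result` dictionary from the example above.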


Spark NLP 3.x obtained the best performing academic peer-reviewed results

Training NER

  • State-of-the-art Deep Learning algorithms
  • Achieve high accuracy within a few minutes
  • Achieve high accuracy with a few lines of code
  • Blazing fast training
  • Use CPU or GPU
  • 80+ Pretrained Embeddings including GloVe, Word2Vec, BERT, DistilBERT, RoBERTa, XLM-RoBERTa, Longformer, ELMO, ELECTRA, ALBERT, XLNet, BioBERT, etc.
  • Multi-lingual NER models in Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, German, Hebrew, Italian, Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, and Urdu

System                 Year  Languages            Score
Spark NLP v3           2021  Python/Scala/Java/R  93.2 (test F1), 95 (dev F1)
spaCy v3               2021  Python               91.6
Stanza (StanfordNLP)   2020  Python               92.1
Flair                  2018  Python               93.1
CoreNLP                2015  Java                 89.6

System                 Year  Languages            Score
Spark NLP v3           2021  Python/Scala/Java/R  90.0 (test F1), 92.5 (dev F1)
spaCy RoBERTa          2020  Python               89.7 (dev F1)
Stanza (StanfordNLP)   2020  Python               88.8 (dev F1)
Flair                  2018  Python               89.7
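
The scores above are reported as F1 on test and dev sets. The source does not say how they were computed, so the following is only a sketch of the common span-level F1 used for NER, where a prediction counts as correct only if both its boundaries and its type match a gold entity. The `Span` layout, the `span_f1` function, and the toy spans are assumptions made for illustration.

```python
from typing import Set, Tuple

# Assumed layout: (start token index, end token index, entity type).
Span = Tuple[int, int, str]


def span_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) using exact span-and-type matching."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy data: the second prediction has the right boundaries but the wrong type.
gold = {(0, 1, "PER"), (5, 5, "LOC"), (8, 9, "ORG")}
predicted = {(0, 1, "PER"), (5, 5, "ORG"), (8, 9, "ORG")}

precision, recall, f1 = span_f1(gold, predicted)
print(f"{100 * f1:.1f}")
# prints: 66.7
```

Two of the three predictions match exactly, so precision and recall are both 2/3, and so is F1.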

Active Community Support