Spark NLP: State of the Art
Natural Language Processing

The first production-grade versions of the latest deep learning NLP research

The most widely used NLP library in the enterprise

Backed by O'Reilly's "AI Adoption in the Enterprise" survey, published in February 2019

100% Open Source

Including pre-trained models and pipelines

Natively scalable

The only NLP library built natively on Apache Spark
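
Because Spark NLP annotators are standard Spark ML estimators and transformers, they compose with any Spark ML Pipeline and run on distributed DataFrames. A minimal sketch, assuming a session started with sparknlp.start() (the sample data and column names are illustrative):

# Minimal sketch: Spark NLP stages inside a standard Spark ML Pipeline
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer
from pyspark.ml import Pipeline

spark = sparknlp.start()

# A distributed DataFrame; in practice this can be arbitrarily large
df = spark.createDataFrame([("Spark NLP scales with your cluster.",)], ["text"])

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Spark NLP stages compose with any other Spark ML stage
pipeline = Pipeline(stages=[document_assembler, tokenizer])
pipeline.fit(df).transform(df).select("token.result").show(truncate=False)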

Multiple Languages

Full Python, Scala, and Java support

Quick and Easy

Spark NLP is available on PyPI, Conda, Maven, and Spark Packages

# Install Spark NLP from PyPI
$ pip install spark-nlp==2.1.0

# Install Spark NLP from Anaconda/Conda
$ conda install -c johnsnowlabs spark-nlp

# Load Spark NLP with Spark Shell
$ spark-shell --packages JohnSnowLabs:spark-nlp:2.1.0

# Load Spark NLP with PySpark
$ pyspark --packages JohnSnowLabs:spark-nlp:2.1.0

# Load Spark NLP with Spark Submit
$ spark-submit --packages JohnSnowLabs:spark-nlp:2.1.0
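
# Load Spark NLP with its Maven coordinates
# (coordinate assumed for this release: the Scala 2.11 build)
$ spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.1.0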

# Load Spark NLP as an external JAR after compiling and building Spark NLP with `sbt assembly`
$ spark-shell --jars spark-nlp-assembly-2.1.0.jar

Right Out of the Box

Spark NLP ships with many NLP features, pre-trained models, and pipelines

NLP Features

  • Tokenization
  • Normalizer
  • Stemmer
  • Lemmatizer
  • Regex Matching
  • Text Matching
  • Chunking
  • Date Matcher
  • Part-of-speech tagging
  • Sentence Detector
  • Sentiment Detection (ML model)
  • Spell Checker (ML and DL models)
  • Word Embeddings (BERT and GloVe)
  • Named entity recognition
  • Dependency parsing (labeled/unlabeled)
  • Easy TensorFlow integration
  • 30+ pre-trained models in 4 languages
  • 15+ pre-trained pipelines!

# Import Spark NLP
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.embeddings import *
from sparknlp.pretrained import PretrainedPipeline
import sparknlp

# Start Spark Session with Spark NLP
spark = sparknlp.start()

# Download a pre-trained pipeline 
pipeline = PretrainedPipeline('explain_document_dl', lang='en')

# Your testing dataset
text = """
The Mona Lisa is a 16th century oil painting created by Leonardo. 
It's held at the Louvre in Paris.
"""

# Annotate your testing dataset
result = pipeline.annotate(text)

# What's in the pipeline
list(result.keys())
Output: ['entities', 'stem', 'checked', 'lemma', 'document',
'pos', 'token', 'ner', 'embeddings', 'sentence']

# Check the results
result['entities']
Output: ['Mona Lisa', 'Leonardo', 'Louvre', 'Paris']
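
If a pre-trained pipeline isn't a fit, the annotators from the feature list above can be wired together by hand. A minimal sketch that reuses the session, imports, and text from the snippet above (the choice of stages and column names is illustrative):

# Minimal sketch: composing individual annotators into a custom pipeline
from pyspark.ml import Pipeline

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
token = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
normalized = Normalizer().setInputCols(["token"]).setOutputCol("normalized")
stem = Stemmer().setInputCols(["normalized"]).setOutputCol("stem")

custom_pipeline = Pipeline(stages=[document, sentence, token, normalized, stem])

# Fit on a DataFrame with a 'text' column, then transform it
df = spark.createDataFrame([(text,)], ["text"])
custom_pipeline.fit(df).transform(df).select("stem.result").show(truncate=False)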
            

Active Community Support