Annotators


How to read this section

All annotators in Spark NLP share a common interface:

  • Annotation: Annotation(annotatorType, begin, end, result, metadata, embeddings)
  • AnnotatorType: Some annotators share a type. This is not only nominal; it also determines the structure of the metadata map in the Annotation. This is the type referred to in the inputs and outputs of annotators.
  • Inputs: Represents how many and which annotator types are expected in setInputCols(). These are the output column names of other annotators in the DataFrame.
  • Output: Represents the type of the output column set with setOutputCol().
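
As an illustration of this structure, the fields of an Annotation can be inspected with a LightPipeline (a minimal sketch, assuming an already fitted pipelineModel whose tokenizer stage writes to a "token" output column):

from sparknlp.base import LightPipeline

# Assumed: `pipelineModel` is a fitted PipelineModel with a "token" output column
light = LightPipeline(pipelineModel)
annotated = light.fullAnnotate("Spark NLP annotators share one Annotation structure.")[0]

for annotation in annotated["token"]:
    # Each Annotation carries annotatorType, begin, end, result, metadata and embeddings
    print(annotation.annotator_type, annotation.begin, annotation.end, annotation.result, annotation.metadata)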

There are two types of Annotators:

  • Approach: AnnotatorApproach extends Estimator and is meant to be trained through fit()
  • Model: AnnotatorModel extends Transformer and is meant to transform DataFrames through transform()

The Model suffix is explicitly stated when the annotator is the result of a training process. Some annotators, such as Tokenizer, are transformers but do not contain the word Model, since they are not trained annotators.

Model annotators have a pretrained() method on their static object, to retrieve the public pre-trained version of a model.

  • pretrained(name, language, extra_location) -> by default, pretrained() brings a default model. Sometimes more than one model is offered; in that case, you may have to specify the name, language or extra location to download the right one.
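
For example, a specific pretrained model can be requested by name and language (a minimal sketch; "pos_anc"/"en" refer to the public default English part-of-speech model and are only meant as an illustration):

from sparknlp.annotator import PerceptronModel

# Default pretrained model for this annotator
posTagger = PerceptronModel.pretrained() \
    .setInputCols("sentence", "token") \
    .setOutputCol("pos")

# Or request a specific model by name and language
posTaggerEn = PerceptronModel.pretrained("pos_anc", "en") \
    .setInputCols("sentence", "token") \
    .setOutputCol("pos")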

Available Annotators

Annotator Description Version
BigTextMatcher Annotator to match exact phrases (by token) provided in a file against a Document. Opensource
Chunk2Doc Converts a CHUNK type column back into DOCUMENT. Useful when trying to re-tokenize or do further analysis on a CHUNK result. Opensource
ChunkEmbeddings This annotator utilizes WordEmbeddings, BertEmbeddings etc. to generate chunk embeddings from either Chunker, NGramGenerator, or NerConverter outputs. Opensource
Chunker This annotator matches a pattern of part-of-speech tags in order to return meaningful phrases from the document. Opensource
ClassifierDL ClassifierDL for generic Multi-class Text Classification. Opensource
ContextSpellChecker Implements a deep-learning based Noisy Channel Model Spell Algorithm. Opensource
DateMatcher Matches standard date formats into a provided format. Opensource
DependencyParser Unlabeled parser that finds a grammatical relation between two words in a sentence. Opensource
Doc2Chunk Converts DOCUMENT type annotations into CHUNK type with the contents of a chunkCol. Opensource
DocumentAssembler Prepares data into a format that is processable by Spark NLP. This is the entry point for every Spark NLP pipeline. Opensource
DocumentNormalizer Annotator which normalizes raw text from tagged text, e.g. scraped web pages or xml documents, from document type columns into Sentence. Opensource
EmbeddingsFinisher Extracts embeddings from Annotations into a more easily usable form. Opensource
Finisher Converts annotation results into a format that is easier to use. It is useful to extract the results from Spark NLP Pipelines. Opensource
LanguageDetectorDL Language Identification and Detection by using CNN and RNN architectures in TensorFlow. Opensource
Lemmatizer Finds lemmas out of words with the objective of returning a base dictionary word. Opensource
MultiClassifierDL Multi-label Text Classification. Opensource
MultiDateMatcher Matches standard date formats into a provided format. Opensource
NGramGenerator A feature transformer that converts the input array of strings (annotatorType TOKEN) into an array of n-grams (annotatorType CHUNK). Opensource
NerConverter Converts an IOB or IOB2 representation of NER to a user-friendly one, by associating the tokens of recognized entities and their label. Opensource
NerCrf Extracts Named Entities based on a CRF Model. Opensource
NerDL This Named Entity recognition annotator is a generic NER model based on Neural Networks. Opensource
NerOverwriter Overwrites entities of specified strings. Opensource
Normalizer Removes all dirty characters from text following a regex pattern and transforms words based on a provided dictionary. Opensource
NorvigSweeting Spellchecker Retrieves tokens and makes corrections automatically if not found in an English dictionary. Opensource
POSTagger (Part of speech tagger) Averaged Perceptron model to tag words part-of-speech. Opensource
RecursiveTokenizer Tokenizes raw text recursively based on a handful of definable rules. Opensource
RegexMatcher Uses a reference file to match a set of regular expressions and associate them with a provided identifier. Opensource
RegexTokenizer A tokenizer that splits text by a regex pattern. Opensource
SentenceDetector Detects sentence boundaries using any provided approach. Opensource
SentenceDetectorDL Detects sentence boundaries using a deep learning approach. Opensource
SentenceEmbeddings Converts the results from WordEmbeddings, BertEmbeddings, or ElmoEmbeddings into sentence or document embeddings by either summing up or averaging all the word embeddings in a sentence or a document (depending on the inputCols). Opensource
SentimentDL Annotator for multi-class sentiment analysis. Opensource
SentimentDetector Rule based sentiment detector, which calculates a score based on predefined keywords. Opensource
Stemmer Returns hard-stems out of words with the objective of retrieving the meaningful part of the word. Opensource
StopWordsCleaner This annotator takes a sequence of strings (e.g. the output of a Tokenizer, Normalizer, Lemmatizer, and Stemmer) and drops all the stop words from the input sequences. Opensource
SymmetricDelete Spellchecker Symmetric Delete spelling correction algorithm. Opensource
TextMatcher Matches exact phrases (by token) provided in a file against a Document. Opensource
Token2Chunk Converts TOKEN type Annotations to CHUNK type. Opensource
TokenAssembler This transformer reconstructs a DOCUMENT type annotation from tokens, usually after these have been normalized, lemmatized, spell checked, etc., in order to use this document annotation in further annotators. Opensource
Tokenizer Tokenizes raw text into word pieces (tokens). Identifies tokens with tokenization open standards. A few rules help customize it if the defaults do not fit user needs. Opensource
TypedDependencyParser Labeled parser that finds a grammatical relation between two words in a sentence. Opensource
ViveknSentiment Sentiment analyser inspired by the algorithm by Vivek Narayanan. Opensource
WordEmbeddings Word Embeddings lookup annotator that maps tokens to vectors. Opensource
WordSegmenter Tokenizes non-English or non-whitespace-separated texts. Opensource
Yake Unsupervised, Corpus-Independent, Domain and Language-Independent and Single-Document keyword extraction. Opensource

Available Transformers

Additionally, these transformer-based annotators are available; most of them generate embeddings.

Transformer Description Version
AlbertEmbeddings ALBERT: A Lite BERT for Self-supervised Learning of Language Representations Opensource
BertEmbeddings Token-level embeddings using BERT. BERT (Bidirectional Encoder Representations from Transformers) provides dense vector representations for natural language by using a deep, pre-trained neural network with the Transformer architecture. Opensource
BertSentenceEmbeddings Sentence-level embeddings using BERT. BERT (Bidirectional Encoder Representations from Transformers) provides dense vector representations for natural language by using a deep, pre-trained neural network with the Transformer architecture. Opensource
DistilBertEmbeddings DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. Opensource
ElmoEmbeddings Word embeddings from ELMo (Embeddings from Language Models), a language model trained on the 1 Billion Word Benchmark. Opensource
MarianTransformer Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. Opensource
RoBertaEmbeddings RoBERTa: A Robustly Optimized BERT Pretraining Approach Opensource
T5Transformer T5 reconsiders all NLP tasks into a unified text-to-text-format where the input and output are always text strings, in contrast to BERT-style models that can only output either a class label or a span of the input. Opensource
UniversalSentenceEncoder The Universal Sentence Encoder encodes text into high dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks. Opensource
XlmRoBertaEmbeddings XlmRoBerta is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl data. Opensource
XlnetEmbeddings XLNet is a new unsupervised language representation learning method based on a novel generalized permutation language modeling objective. Opensource

BigTextMatcher

Annotator to match exact phrases (by token) provided in a file against a Document.

A text file of predefined phrases must be provided with setStoragePath.

In contrast to the normal TextMatcher, the BigTextMatcher is designed for large corpora.

For extended examples of usage, see the BigTextMatcherTestSpec.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: CHUNK

API: BigTextMatcher Source: BigTextMatcher
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline
# In this example, the entities file is of the form
#
# ...
# dolore magna aliqua
# lorem ipsum dolor. sit
# laborum
# ...
#
# where each line represents an entity phrase to be extracted.

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

data = spark.createDataFrame([["Hello dolore magna aliqua. Lorem ipsum dolor. sit in laborum"]]).toDF("text")
entityExtractor = BigTextMatcher() \
    .setInputCols("document", "token") \
    .setStoragePath("src/test/resources/entity-extractor/test-phrases.txt", ReadAs.TEXT) \
    .setOutputCol("entity") \
    .setCaseSensitive(False)

pipeline = Pipeline().setStages([documentAssembler, tokenizer, entityExtractor])
results = pipeline.fit(data).transform(data)
results.selectExpr("explode(entity)").show(truncate=False)
+--------------------------------------------------------------------+
|col                                                                 |
+--------------------------------------------------------------------+
|[chunk, 6, 24, dolore magna aliqua, [sentence -> 0, chunk -> 0], []]|
|[chunk, 53, 59, laborum, [sentence -> 0, chunk -> 1], []]           |
+--------------------------------------------------------------------+
// In this example, the entities file is of the form
//
// ...
// dolore magna aliqua
// lorem ipsum dolor. sit
// laborum
// ...
//
// where each line represents an entity phrase to be extracted.
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.Tokenizer
import com.johnsnowlabs.nlp.annotator.BigTextMatcher
import com.johnsnowlabs.nlp.util.io.ReadAs
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val data = Seq("Hello dolore magna aliqua. Lorem ipsum dolor. sit in laborum").toDF("text")
val entityExtractor = new BigTextMatcher()
  .setInputCols("document", "token")
  .setStoragePath("src/test/resources/entity-extractor/test-phrases.txt", ReadAs.TEXT)
  .setOutputCol("entity")
  .setCaseSensitive(false)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, entityExtractor))
val results = pipeline.fit(data).transform(data)
results.selectExpr("explode(entity)").show(false)
+--------------------------------------------------------------------+
|col                                                                 |
+--------------------------------------------------------------------+
|[chunk, 6, 24, dolore magna aliqua, [sentence -> 0, chunk -> 0], []]|
|[chunk, 53, 59, laborum, [sentence -> 0, chunk -> 1], []]           |
+--------------------------------------------------------------------+

Chunk2Doc

Converts a CHUNK type column back into DOCUMENT. Useful when trying to re-tokenize or do further analysis on a CHUNK result.

For more extended examples on document pre-processing see the Spark NLP Workshop.

Input Annotator Types: CHUNK

Output Annotator Type: DOCUMENT

API: Chunk2Doc Source: Chunk2Doc
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline
from sparknlp.pretrained import PretrainedPipeline
# Location entities are extracted and converted back into `DOCUMENT` type for further processing

data = spark.createDataFrame([[1, "New York and New Jersey aren't that far apart actually."]]).toDF("id", "text")

# Extracts Named Entities amongst other things
pipeline = PretrainedPipeline("explain_document_dl")

chunkToDoc = Chunk2Doc().setInputCols("entities").setOutputCol("chunkConverted")
explainResult = pipeline.transform(data)

result = chunkToDoc.transform(explainResult)
result.selectExpr("explode(chunkConverted)").show(truncate=False)
+------------------------------------------------------------------------------+
|col                                                                           |
+------------------------------------------------------------------------------+
|[document, 0, 7, New York, [entity -> LOC, sentence -> 0, chunk -> 0], []]    |
|[document, 13, 22, New Jersey, [entity -> LOC, sentence -> 0, chunk -> 1], []]|
+------------------------------------------------------------------------------+
// Location entities are extracted and converted back into `DOCUMENT` type for further processing
import spark.implicits._
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.Chunk2Doc

val data = Seq((1, "New York and New Jersey aren't that far apart actually.")).toDF("id", "text")

// Extracts Named Entities amongst other things
val pipeline = PretrainedPipeline("explain_document_dl")

val chunkToDoc = new Chunk2Doc().setInputCols("entities").setOutputCol("chunkConverted")
val explainResult = pipeline.transform(data)

val result = chunkToDoc.transform(explainResult)
result.selectExpr("explode(chunkConverted)").show(false)
+------------------------------------------------------------------------------+
|col                                                                           |
+------------------------------------------------------------------------------+
|[document, 0, 7, New York, [entity -> LOC, sentence -> 0, chunk -> 0], []]    |
|[document, 13, 22, New Jersey, [entity -> LOC, sentence -> 0, chunk -> 1], []]|
+------------------------------------------------------------------------------+

ChunkEmbeddings

This annotator utilizes WordEmbeddings, BertEmbeddings etc. to generate chunk embeddings from either Chunker, NGramGenerator, or NerConverter outputs.

For extended examples of usage, see the Spark NLP Workshop and the ChunkEmbeddingsTestSpec.

Input Annotator Types: CHUNK, WORD_EMBEDDINGS

Output Annotator Type: WORD_EMBEDDINGS

API: ChunkEmbeddings Source: ChunkEmbeddings
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

# Extract the Embeddings from the NGrams
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

nGrams = NGramGenerator() \
    .setInputCols(["token"]) \
    .setOutputCol("chunk") \
    .setN(2)

embeddings = WordEmbeddingsModel.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(False)

# Convert the NGram chunks into Word Embeddings
chunkEmbeddings = ChunkEmbeddings() \
    .setInputCols(["chunk", "embeddings"]) \
    .setOutputCol("chunk_embeddings") \
    .setPoolingStrategy("AVERAGE")

pipeline = Pipeline() \
    .setStages([
      documentAssembler,
      sentence,
      tokenizer,
      nGrams,
      embeddings,
      chunkEmbeddings
    ])

data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("explode(chunk_embeddings) as result") \
    .select("result.annotatorType", "result.result", "result.embeddings") \
    .show(5, 80)
+---------------+----------+--------------------------------------------------------------------------------+
|  annotatorType|    result|                                                                      embeddings|
+---------------+----------+--------------------------------------------------------------------------------+
|word_embeddings|   This is|[-0.55661, 0.42829502, 0.86661, -0.409785, 0.06316501, 0.120775, -0.0732005, ...|
|word_embeddings|      is a|[-0.40674996, 0.22938299, 0.50597, -0.288195, 0.555655, 0.465145, 0.140118, 0...|
|word_embeddings|a sentence|[0.17417, 0.095253006, -0.0530925, -0.218465, 0.714395, 0.79860497, 0.0129999...|
|word_embeddings|sentence .|[0.139705, 0.177955, 0.1887775, -0.45545, 0.20030999, 0.461557, -0.07891501, ...|
+---------------+----------+--------------------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.annotators.{NGramGenerator, Tokenizer}
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import com.johnsnowlabs.nlp.embeddings.ChunkEmbeddings
import org.apache.spark.ml.Pipeline

// Extract the Embeddings from the NGrams
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val nGrams = new NGramGenerator()
  .setInputCols("token")
  .setOutputCol("chunk")
  .setN(2)

val embeddings = WordEmbeddingsModel.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")
  .setCaseSensitive(false)

// Convert the NGram chunks into Word Embeddings
val chunkEmbeddings = new ChunkEmbeddings()
  .setInputCols("chunk", "embeddings")
  .setOutputCol("chunk_embeddings")
  .setPoolingStrategy("AVERAGE")

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    sentence,
    tokenizer,
    nGrams,
    embeddings,
    chunkEmbeddings
  ))

val data = Seq("This is a sentence.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(chunk_embeddings) as result")
  .select("result.annotatorType", "result.result", "result.embeddings")
  .show(5, 80)
+---------------+----------+--------------------------------------------------------------------------------+
|  annotatorType|    result|                                                                      embeddings|
+---------------+----------+--------------------------------------------------------------------------------+
|word_embeddings|   This is|[-0.55661, 0.42829502, 0.86661, -0.409785, 0.06316501, 0.120775, -0.0732005, ...|
|word_embeddings|      is a|[-0.40674996, 0.22938299, 0.50597, -0.288195, 0.555655, 0.465145, 0.140118, 0...|
|word_embeddings|a sentence|[0.17417, 0.095253006, -0.0530925, -0.218465, 0.714395, 0.79860497, 0.0129999...|
|word_embeddings|sentence .|[0.139705, 0.177955, 0.1887775, -0.45545, 0.20030999, 0.461557, -0.07891501, ...|
+---------------+----------+--------------------------------------------------------------------------------+

Chunker

This annotator matches a pattern of part-of-speech tags in order to return meaningful phrases from the document. Extracted part-of-speech tags are mapped onto the sentence, which can then be parsed by regular expressions. The part-of-speech tags are wrapped by angle brackets <> to be easily distinguishable in the text itself. This example sentence will result in the following form:

"Peter Pipers employees are picking pecks of pickled peppers."
"<NNP><NNP><NNS><VBP><VBG><NNS><IN><JJ><NNS><.>"

To then extract these tags, regexParsers need to be set with e.g.:

val chunker = new Chunker()
  .setInputCols("sentence", "pos")
  .setOutputCol("chunk")
  .setRegexParsers(Array("<NNP>+", "<NNS>+"))

When defining the regular expressions, tags enclosed in angle brackets are treated as groups, so here specifically "<NNP>+" means 1 or more proper nouns in succession. Additional patterns can also be set with addRegexParsers.

For more extended examples see the Spark NLP Workshop and the ChunkerTestSpec.

Input Annotator Types: DOCUMENT, POS

Output Annotator Type: CHUNK

API: Chunker Source: Chunker
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols("document") \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

POSTag = PerceptronModel.pretrained() \
    .setInputCols("document", "token") \
    .setOutputCol("pos")

chunker = Chunker() \
    .setInputCols("sentence", "pos") \
    .setOutputCol("chunk") \
    .setRegexParsers(["<NNP>+", "<NNS>+"])

pipeline = Pipeline() \
    .setStages([
      documentAssembler,
      sentence,
      tokenizer,
      POSTag,
      chunker
    ])

data = spark.createDataFrame([["Peter Pipers employees are picking pecks of pickled peppers."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("explode(chunk) as result").show(truncate=False)
+-------------------------------------------------------------+
|result                                                       |
+-------------------------------------------------------------+
|[chunk, 0, 11, Peter Pipers, [sentence -> 0, chunk -> 0], []]|
|[chunk, 13, 21, employees, [sentence -> 0, chunk -> 1], []]  |
|[chunk, 35, 39, pecks, [sentence -> 0, chunk -> 2], []]      |
|[chunk, 52, 58, peppers, [sentence -> 0, chunk -> 3], []]    |
+-------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.{Chunker, Tokenizer}
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val POSTag = PerceptronModel.pretrained()
  .setInputCols("document", "token")
  .setOutputCol("pos")

val chunker = new Chunker()
  .setInputCols("sentence", "pos")
  .setOutputCol("chunk")
  .setRegexParsers(Array("<NNP>+", "<NNS>+"))

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    sentence,
    tokenizer,
    POSTag,
    chunker
  ))

val data = Seq("Peter Pipers employees are picking pecks of pickled peppers.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(chunk) as result").show(false)
+-------------------------------------------------------------+
|result                                                       |
+-------------------------------------------------------------+
|[chunk, 0, 11, Peter Pipers, [sentence -> 0, chunk -> 0], []]|
|[chunk, 13, 21, employees, [sentence -> 0, chunk -> 1], []]  |
|[chunk, 35, 39, pecks, [sentence -> 0, chunk -> 2], []]      |
|[chunk, 52, 58, peppers, [sentence -> 0, chunk -> 3], []]    |
+-------------------------------------------------------------+

ClassifierDL

Trains a ClassifierDL for generic Multi-class Text Classification.

ClassifierDL uses the state-of-the-art Universal Sentence Encoder as an input for text classifications. The ClassifierDL annotator uses a deep learning model (DNNs) we have built inside TensorFlow and supports up to 100 classes.

For instantiated/pretrained models, see ClassifierDLModel.

Notes:

  • This annotator accepts a label column of a single item in either type of String, Int, Float, or Double.
  • UniversalSentenceEncoder, BertSentenceEmbeddings, or SentenceEmbeddings can be used for the inputCol.

For extended examples of usage, see the Spark NLP Workshop [1] [2] and the ClassifierDLTestSpec.

Input Annotator Types: SENTENCE_EMBEDDINGS

Output Annotator Type: CATEGORY

API: ClassifierDLApproach Source: ClassifierDLApproach
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline
# In this example, the training data `"sentiment.csv"` has the form of
#
# text,label
# This movie is the best movie I have wached ever! In my opinion this movie can win an award.,0
# This was a terrible movie! The acting was bad really bad!,1
# ...
#
# Then training can be done like so:

smallCorpus = spark.read.option("header","True").csv("src/test/resources/classifier/sentiment.csv")

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

useEmbeddings = UniversalSentenceEncoder.pretrained() \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

docClassifier = ClassifierDLApproach() \
    .setInputCols("sentence_embeddings") \
    .setOutputCol("category") \
    .setLabelColumn("label") \
    .setBatchSize(64) \
    .setMaxEpochs(20) \
    .setLr(5e-3) \
    .setDropout(0.5)

pipeline = Pipeline() \
    .setStages(
      [
        documentAssembler,
        useEmbeddings,
        docClassifier
      ]
    )

pipelineModel = pipeline.fit(smallCorpus)
// In this example, the training data `"sentiment.csv"` has the form of
//
// text,label
// This movie is the best movie I have wached ever! In my opinion this movie can win an award.,0
// This was a terrible movie! The acting was bad really bad!,1
// ...
//
// Then training can be done like so:
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder
import com.johnsnowlabs.nlp.annotators.classifier.dl.ClassifierDLApproach
import org.apache.spark.ml.Pipeline

val smallCorpus = spark.read.option("header","true").csv("src/test/resources/classifier/sentiment.csv")

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val useEmbeddings = UniversalSentenceEncoder.pretrained()
  .setInputCols("document")
  .setOutputCol("sentence_embeddings")

val docClassifier = new ClassifierDLApproach()
  .setInputCols("sentence_embeddings")
  .setOutputCol("category")
  .setLabelColumn("label")
  .setBatchSize(64)
  .setMaxEpochs(20)
  .setLr(5e-3f)
  .setDropout(0.5f)

val pipeline = new Pipeline()
  .setStages(
    Array(
      documentAssembler,
      useEmbeddings,
      docClassifier
    )
  )

val pipelineModel = pipeline.fit(smallCorpus)

ContextSpellChecker

Trains a deep-learning based Noisy Channel Model Spell Algorithm. Correction candidates are extracted combining context information and word information.

For instantiated/pretrained models, see ContextSpellCheckerModel.

Spell Checking is a sequence to sequence mapping problem. Given an input sequence, potentially containing a certain number of errors, ContextSpellChecker will rank correction sequences according to three things:

  1. Different correction candidates for each word — word level.
  2. The surrounding text of each word, i.e. its context — sentence level.
  3. The relative cost of different correction candidates according to the edit operations at the character level it requires — subword level.

For an in-depth explanation of the module see the article Applying Context Aware Spell Checking in Spark NLP.

For extended examples of usage, see the article Training a Contextual Spell Checker for Italian Language, the Spark NLP Workshop and the ContextSpellCheckerTestSpec.
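
For inference, a pretrained model can be plugged into a pipeline directly (a minimal sketch, using the default English model returned by ContextSpellCheckerModel.pretrained()):

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

spellModel = ContextSpellCheckerModel.pretrained() \
    .setInputCols("token") \
    .setOutputCol("corrected")

pipeline = Pipeline().setStages([documentAssembler, tokenizer, spellModel])
data = spark.createDataFrame([["Plaese alliow me tao introdduce myhself."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("corrected.result").show(truncate=False)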

Input Annotator Types: TOKEN

Output Annotator Type: TOKEN

API: ContextSpellCheckerApproach Source: ContextSpellCheckerApproach
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline
# For this example, we use the first Sherlock Holmes book as the training dataset.


documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")


tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

spellChecker = ContextSpellCheckerApproach() \
    .setInputCols("token") \
    .setOutputCol("corrected") \
    .setWordMaxDistance(3) \
    .setBatchSize(24) \
    .setEpochs(8) \
    .setLanguageModelClasses(1650)  # dependent on vocabulary size
    # .addVocabClass("_NAME_", names) # Extra classes for correction could be added like this

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    spellChecker
])

path = "sherlockholmes.txt"
dataset = spark.read.text(path) \
    .toDF("text")
pipelineModel = pipeline.fit(dataset)
// For this example, we use the first Sherlock Holmes book as the training dataset.
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.spell.context.ContextSpellCheckerApproach

import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")


val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val spellChecker = new ContextSpellCheckerApproach()
  .setInputCols("token")
  .setOutputCol("corrected")
  .setWordMaxDistance(3)
  .setBatchSize(24)
  .setEpochs(8)
  .setLanguageModelClasses(1650)  // dependent on vocabulary size
  // .addVocabClass("_NAME_", names) // Extra classes for correction could be added like this

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  spellChecker
))

val path = "src/test/resources/spell/sherlockholmes.txt"
val dataset = spark.sparkContext.textFile(path)
  .toDF("text")
val pipelineModel = pipeline.fit(dataset)

DateMatcher

Matches standard date formats into a provided format.

Reads from different forms of date and time expressions and converts them to a provided date format.

Extracts only one date per document. Use with a sentence detector to find matches in each sentence. To extract multiple dates from a document, please use the MultiDateMatcher.

Reads the following kind of dates:

"1978-01-28", "1984/04/02,1/02/1980", "2/28/79", "The 31st of April in the year 2008",
"Fri, 21 Nov 1997", "Jan 21, ‘97", "Sun", "Nov 21", "jan 1st", "next thursday",
"last wednesday", "today", "tomorrow", "yesterday", "next week", "next month",
"next year", "day after", "the day before", "0600h", "06:00 hours", "6pm", "5:30 a.m.",
"at 5", "12:59", "23:59", "1988/11/23 6pm", "next week at 7.30", "5 am tomorrow"

For example "The 31st of April in the year 2008" will be converted into 2008/04/31.

Pretrained pipelines are available for this module, see Pipelines.

For extended examples of usage, see the Spark NLP Workshop and the DateMatcherTestSpec.

Input Annotator Types: DOCUMENT

Output Annotator Type: DATE

API: DateMatcher Source: DateMatcher
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

date = DateMatcher() \
    .setInputCols("document") \
    .setOutputCol("date") \
    .setAnchorDateYear(2020) \
    .setAnchorDateMonth(1) \
    .setAnchorDateDay(11) \
    .setDateFormat("yyyy/MM/dd")

pipeline = Pipeline().setStages([
    documentAssembler,
    date
])

data = spark.createDataFrame([["Fri, 21 Nov 1997"], ["next week at 7.30"], ["see you a day after"]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("date").show(truncate=False)
+-------------------------------------------------+
|date                                             |
+-------------------------------------------------+
|[[date, 5, 15, 1997/11/21, [sentence -> 0], []]] |
|[[date, 0, 8, 2020/01/18, [sentence -> 0], []]]  |
|[[date, 10, 18, 2020/01/12, [sentence -> 0], []]]|
+-------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.DateMatcher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val date = new DateMatcher()
  .setInputCols("document")
  .setOutputCol("date")
  .setAnchorDateYear(2020)
  .setAnchorDateMonth(1)
  .setAnchorDateDay(11)
  .setDateFormat("yyyy/MM/dd")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  date
))

val data = Seq("Fri, 21 Nov 1997", "next week at 7.30", "see you a day after").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("date").show(false)
+-------------------------------------------------+
|date                                             |
+-------------------------------------------------+
|[[date, 5, 15, 1997/11/21, [sentence -> 0], []]] |
|[[date, 0, 8, 2020/01/18, [sentence -> 0], []]]  |
|[[date, 10, 18, 2020/01/12, [sentence -> 0], []]]|
+-------------------------------------------------+

DependencyParser

Trains an unlabeled parser that finds grammatical relations between two words in a sentence.

For instantiated/pretrained models, see DependencyParserModel.

Dependency parser provides information about word relationships. For example, dependency parsing can tell you what the subjects and objects of a verb are, as well as which words are modifying (describing) the subject. This can help you find precise answers to specific questions.

The required training data can be set in two different ways (only one can be chosen for a particular model):

  • Dependency treebank in the Penn Treebank format, set with setDependencyTreeBank
  • Dataset in the CoNLL-U format, set with setConllU

Apart from that, no additional training data is needed.

See DependencyParserApproachTestSpec for further reference on how to use this API.
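
As a sketch of the second option, the same approach can instead be pointed at a CoNLL-U training file (the path below is only illustrative):

dependencyParser = DependencyParserApproach() \
    .setInputCols("sentence", "pos", "token") \
    .setOutputCol("dependency") \
    .setConllU("path/to/train.conllu")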

Input Annotator Types: DOCUMENT, POS, TOKEN

Output Annotator Type: DEPENDENCY

API: DependencyParserApproach Source: DependencyParserApproach
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols("document") \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols("sentence") \
    .setOutputCol("token")

posTagger = PerceptronModel.pretrained() \
    .setInputCols("sentence", "token") \
    .setOutputCol("pos")

dependencyParserApproach = DependencyParserApproach() \
    .setInputCols("sentence", "pos", "token") \
    .setOutputCol("dependency") \
    .setDependencyTreeBank("src/test/resources/parser/unlabeled/dependency_treebank")

pipeline = Pipeline().setStages([
    documentAssembler,
    sentence,
    tokenizer,
    posTagger,
    dependencyParserApproach
])

# Additional training data is not needed, the dependency parser relies on the dependency tree bank / CoNLL-U only.
emptyDataSet = spark.createDataFrame([[""]]).toDF("text")
pipelineModel = pipeline.fit(emptyDataSet)
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
import com.johnsnowlabs.nlp.annotators.parser.dep.DependencyParserApproach
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val posTagger = PerceptronModel.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("pos")

val dependencyParserApproach = new DependencyParserApproach()
  .setInputCols("sentence", "pos", "token")
  .setOutputCol("dependency")
  .setDependencyTreeBank("src/test/resources/parser/unlabeled/dependency_treebank")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentence,
  tokenizer,
  posTagger,
  dependencyParserApproach
))

// Additional training data is not needed, the dependency parser relies on the dependency tree bank / CoNLL-U only.
val emptyDataSet = Seq.empty[String].toDF("text")
val pipelineModel = pipeline.fit(emptyDataSet)

Doc2Chunk

Converts DOCUMENT type annotations into CHUNK type with the contents of a chunkCol. Chunk text must be contained within input DOCUMENT. May be either StringType or ArrayType[StringType] (using setIsArray). Useful for annotators that require a CHUNK type input.

For more extended examples on document pre-processing see the Spark NLP Workshop.

Input Annotator Types: DOCUMENT

Output Annotator Type: CHUNK

API: Doc2Chunk Source: Doc2Chunk
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
chunkAssembler = Doc2Chunk() \
    .setInputCols("document") \
    .setChunkCol("target") \
    .setOutputCol("chunk") \
    .setIsArray(True)

data = spark.createDataFrame([[
    "Spark NLP is an open-source text processing library for advanced natural language processing.",
      ["Spark NLP", "text processing library", "natural language processing"]
]]).toDF("text", "target")

pipeline = Pipeline().setStages([documentAssembler, chunkAssembler]).fit(data)
result = pipeline.transform(data)

result.selectExpr("chunk.result", "chunk.annotatorType").show(truncate=False)
+-----------------------------------------------------------------+---------------------+
|result                                                           |annotatorType        |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.{Doc2Chunk, DocumentAssembler}
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val chunkAssembler = new Doc2Chunk()
  .setInputCols("document")
  .setChunkCol("target")
  .setOutputCol("chunk")
  .setIsArray(true)

val data = Seq(
  ("Spark NLP is an open-source text processing library for advanced natural language processing.",
    Seq("Spark NLP", "text processing library", "natural language processing"))
).toDF("text", "target")

val pipeline = new Pipeline().setStages(Array(documentAssembler, chunkAssembler)).fit(data)
val result = pipeline.transform(data)

result.selectExpr("chunk.result", "chunk.annotatorType").show(false)
+-----------------------------------------------------------------+---------------------+
|result                                                           |annotatorType        |
+-----------------------------------------------------------------+---------------------+
|[Spark NLP, text processing library, natural language processing]|[chunk, chunk, chunk]|
+-----------------------------------------------------------------+---------------------+

DocumentAssembler

Prepares data into a format that is processable by Spark NLP. This is the entry point for every Spark NLP pipeline. The DocumentAssembler can read either a String column or an Array[String]. Additionally, setCleanupMode can be used to pre-process the text (Default: disabled). For possible options please refer to the parameters section.
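
As a sketch, a cleanup mode can be set directly when defining the assembler (here "shrink", one of the documented options, which collapses whitespace such as new lines, tabs and repeated spaces):

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document") \
    .setCleanupMode("shrink")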

For more extended examples on document pre-processing see the Spark NLP Workshop.

Input Annotator Types: NONE

Output Annotator Type: DOCUMENT

API: DocumentAssembler Source: DocumentAssembler
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

data = spark.createDataFrame([["Spark NLP is an open-source text processing library."]]).toDF("text")
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

result = documentAssembler.transform(data)

result.select("document").show(truncate=False)
+----------------------------------------------------------------------------------------------+
|document                                                                                      |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+

result.select("document").printSchema()
root
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler

val data = Seq("Spark NLP is an open-source text processing library.").toDF("text")
val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")

val result = documentAssembler.transform(data)

result.select("document").show(false)
+----------------------------------------------------------------------------------------------+
|document                                                                                      |
+----------------------------------------------------------------------------------------------+
|[[document, 0, 51, Spark NLP is an open-source text processing library., [sentence -> 0], []]]|
+----------------------------------------------------------------------------------------------+

result.select("document").printSchema
root
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)

DocumentNormalizer

Annotator which normalizes raw text from tagged text, e.g. scraped web pages or XML documents, from document type columns into Sentence. Removes all dirty characters from text following one or more input regex patterns. Can remove unwanted characters according to a specific policy. Can apply lowercase normalization.

For extended examples of usage, see the Spark NLP Workshop.

Input Annotator Types: DOCUMENT

Output Annotator Type: DOCUMENT

API: DocumentNormalizer Source: DocumentNormalizer
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

cleanUpPatterns = ["<[^>]*>"]

documentNormalizer = DocumentNormalizer() \
    .setInputCols("document") \
    .setOutputCol("normalizedDocument") \
    .setAction("clean") \
    .setPatterns(cleanUpPatterns) \
    .setReplacement(" ") \
    .setPolicy("pretty_all") \
    .setLowercase(True)

pipeline = Pipeline().setStages([
    documentAssembler,
    documentNormalizer
])

text = """
<div id="theworldsgreatest" class='my-right my-hide-small my-wide toptext' style="font-family:'Segoe UI',Arial,sans-serif">
    THE WORLD'S LARGEST WEB DEVELOPER SITE
    <h1 style="font-size:300%;">THE WORLD'S LARGEST WEB DEVELOPER SITE</h1>
    <p style="font-size:160%;">Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum..</p>
</div>

</div>"""
data = spark.createDataFrame([[text]]).toDF("text")
pipelineModel = pipeline.fit(data)

result = pipelineModel.transform(data)
result.selectExpr("normalizedDocument.result").show(truncate=False)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[ the world's largest web developer site the world's largest web developer site lorem ipsum is simply dummy text of the printing and typesetting industry. lorem ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. it has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. it was popularised in the 1960s with the release of letraset sheets containing lorem ipsum passages, and more recently with desktop publishing software like aldus pagemaker including versions of lorem ipsum..]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.DocumentNormalizer
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val cleanUpPatterns = Array("<[^>]*>")

val documentNormalizer = new DocumentNormalizer()
  .setInputCols("document")
  .setOutputCol("normalizedDocument")
  .setAction("clean")
  .setPatterns(cleanUpPatterns)
  .setReplacement(" ")
  .setPolicy("pretty_all")
  .setLowercase(true)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  documentNormalizer
))

val text =
  """
<div id="theworldsgreatest" class='my-right my-hide-small my-wide toptext' style="font-family:'Segoe UI',Arial,sans-serif">
  THE WORLD'S LARGEST WEB DEVELOPER SITE
  <h1 style="font-size:300%;">THE WORLD'S LARGEST WEB DEVELOPER SITE</h1>
  <p style="font-size:160%;">Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum..</p>
</div>

</div>"""
val data = Seq(text).toDF("text")
val pipelineModel = pipeline.fit(data)

val result = pipelineModel.transform(data)
result.selectExpr("normalizedDocument.result").show(false)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[ the world's largest web developer site the world's largest web developer site lorem ipsum is simply dummy text of the printing and typesetting industry. lorem ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. it has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. it was popularised in the 1960s with the release of letraset sheets containing lorem ipsum passages, and more recently with desktop publishing software like aldus pagemaker including versions of lorem ipsum..]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

EmbeddingsFinisher

Extracts embeddings from Annotations into a more easily usable form.

This is useful, for example, for the outputs of WordEmbeddings, BertEmbeddings, SentenceEmbeddings and ChunkEmbeddings.

By using EmbeddingsFinisher you can easily transform your embeddings into arrays of floats or Spark ML vectors, which are compatible with Spark ML functions such as LDA, K-means, Random Forest classifiers or any other function that requires a features column.
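
For instance, once the finished column holds Spark ML vectors (setOutputAsVector(True)), it can be exploded and fed to any Spark ML estimator that expects a features column (a minimal sketch, reusing the result DataFrame produced in the example below):

from pyspark.ml.clustering import KMeans

# One Spark ML vector per row, taken from the finished embeddings column
features = result.selectExpr("explode(finished_sentence_embeddings) AS features")

# Cluster the vectors with a standard Spark ML estimator
kmeans = KMeans(k=2, featuresCol="features")
kmeansModel = kmeans.fit(features)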

For more extended examples see the Spark NLP Workshop.

Input Annotator Types: EMBEDDINGS

Output Annotator Type: NONE

API: EmbeddingsFinisher Source: EmbeddingsFinisher
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols("document") \
    .setOutputCol("token")

normalizer = Normalizer() \
    .setInputCols("token") \
    .setOutputCol("normalized")

stopwordsCleaner = StopWordsCleaner() \
    .setInputCols("normalized") \
    .setOutputCol("cleanTokens") \
    .setCaseSensitive(False)

gloveEmbeddings = WordEmbeddingsModel.pretrained() \
    .setInputCols("document", "cleanTokens") \
    .setOutputCol("embeddings") \
    .setCaseSensitive(False)

embeddingsFinisher = EmbeddingsFinisher() \
    .setInputCols("embeddings") \
    .setOutputCols("finished_sentence_embeddings") \
    .setOutputAsVector(True) \
    .setCleanAnnotations(False)

data = spark.createDataFrame([["Spark NLP is an open-source text processing library."]]) \
    .toDF("text")
pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    normalizer,
    stopwordsCleaner,
    gloveEmbeddings,
    embeddingsFinisher
]).fit(data)

result = pipeline.transform(data)
resultWithSize = result.selectExpr("explode(finished_sentence_embeddings) as embeddings")

resultWithSize.show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                      embeddings|
+--------------------------------------------------------------------------------+
|[0.1619900017976761,0.045552998781204224,-0.03229299932718277,-0.685609996318...|
|[-0.42416998744010925,1.1378999948501587,-0.5717899799346924,-0.5078899860382...|
|[0.08621499687433243,-0.15772999823093414,-0.06067200005054474,0.395359992980...|
|[-0.4970499873161316,0.7164199948310852,0.40119001269340515,-0.05761000141501...|
|[-0.08170200139284134,0.7159299850463867,-0.20677000284194946,0.0295659992843...|
+--------------------------------------------------------------------------------+
import spark.implicits._
import org.apache.spark.ml.Pipeline
import com.johnsnowlabs.nlp.{DocumentAssembler, EmbeddingsFinisher}
import com.johnsnowlabs.nlp.annotator.{Normalizer, StopWordsCleaner, Tokenizer, WordEmbeddingsModel}

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val normalizer = new Normalizer()
  .setInputCols("token")
  .setOutputCol("normalized")

val stopwordsCleaner = new StopWordsCleaner()
  .setInputCols("normalized")
  .setOutputCol("cleanTokens")
  .setCaseSensitive(false)

val gloveEmbeddings = WordEmbeddingsModel.pretrained()
  .setInputCols("document", "cleanTokens")
  .setOutputCol("embeddings")
  .setCaseSensitive(false)

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("embeddings")
  .setOutputCols("finished_sentence_embeddings")
  .setOutputAsVector(true)
  .setCleanAnnotations(false)

val data = Seq("Spark NLP is an open-source text processing library.")
  .toDF("text")
val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  normalizer,
  stopwordsCleaner,
  gloveEmbeddings,
  embeddingsFinisher
)).fit(data)

val result = pipeline.transform(data)
val resultWithSize = result.selectExpr("explode(finished_sentence_embeddings)")
  .map { row =>
    val vector = row.getAs[org.apache.spark.ml.linalg.DenseVector](0)
    (vector.size, vector)
  }.toDF("size", "vector")

resultWithSize.show(5, 80)
+----+--------------------------------------------------------------------------------+
|size|                                                                          vector|
+----+--------------------------------------------------------------------------------+
| 100|[0.1619900017976761,0.045552998781204224,-0.03229299932718277,-0.685609996318...|
| 100|[-0.42416998744010925,1.1378999948501587,-0.5717899799346924,-0.5078899860382...|
| 100|[0.08621499687433243,-0.15772999823093414,-0.06067200005054474,0.395359992980...|
| 100|[-0.4970499873161316,0.7164199948310852,0.40119001269340515,-0.05761000141501...|
| 100|[-0.08170200139284134,0.7159299850463867,-0.20677000284194946,0.0295659992843...|
+----+--------------------------------------------------------------------------------+
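
For instance, since setOutputAsVector(True) produces Spark ML Vectors, the finished column can be fed straight into a Spark ML estimator. The following is a minimal Python sketch, not part of the original example, reusing the result DataFrame from the Python code above; the choice of KMeans and k=2 is purely illustrative.

from pyspark.ml.clustering import KMeans

# Each exploded row holds one Spark ML Vector produced by the EmbeddingsFinisher.
features = result.selectExpr("explode(finished_sentence_embeddings) as features")

# Cluster the vectors; the "features" column is picked up via setFeaturesCol.
kmeans = KMeans().setK(2).setSeed(42).setFeaturesCol("features").setPredictionCol("cluster")
clustered = kmeans.fit(features).transform(features)
clustered.select("cluster").show(5)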

Finisher

Converts annotation results into a format that is easier to use. It is useful for extracting the results from Spark NLP pipelines. The Finisher outputs annotation values as Strings, by default as an Array of Strings; see the variation after the example below.

For more extended examples on document pre-processing see the Spark NLP Workshop.

Input Annotator Types: ANY

Output Annotator Type: NONE

API: Finisher Source: Finisher
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline
from sparknlp.pretrained import PretrainedPipeline

data = spark.createDataFrame([[1, "New York and New Jersey aren't that far apart actually."]]).toDF("id", "text")

# Extracts Named Entities amongst other things
pipeline = PretrainedPipeline("explain_document_dl")

finisher = Finisher().setInputCols("entities").setOutputCols("output")
explainResult = pipeline.transform(data)

explainResult.selectExpr("explode(entities)").show(truncate=False)
+------------------------------------------------------------------------------------------------------------------------------------------------------+
|entities                                                                                                                                              |
+------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[chunk, 0, 7, New York, [entity -> LOC, sentence -> 0, chunk -> 0], []], [chunk, 13, 22, New Jersey, [entity -> LOC, sentence -> 0, chunk -> 1], []]]|
+------------------------------------------------------------------------------------------------------------------------------------------------------+

result = finisher.transform(explainResult)
result.select("output").show(truncate=False)
+----------------------+
|output                |
+----------------------+
|[New York, New Jersey]|
+----------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.Finisher

val data = Seq((1, "New York and New Jersey aren't that far apart actually.")).toDF("id", "text")

// Extracts Named Entities amongst other things
val pipeline = PretrainedPipeline("explain_document_dl")

val finisher = new Finisher().setInputCols("entities").setOutputCols("output")
val explainResult = pipeline.transform(data)

explainResult.selectExpr("explode(entities)").show(false)
+------------------------------------------------------------------------------------------------------------------------------------------------------+
|entities                                                                                                                                              |
+------------------------------------------------------------------------------------------------------------------------------------------------------+
|[[chunk, 0, 7, New York, [entity -> LOC, sentence -> 0, chunk -> 0], []], [chunk, 13, 22, New Jersey, [entity -> LOC, sentence -> 0, chunk -> 1], []]]|
+------------------------------------------------------------------------------------------------------------------------------------------------------+

val result = finisher.transform(explainResult)
result.select("output").show(false)
+----------------------+
|output                |
+----------------------+
|[New York, New Jersey]|
+----------------------+
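
By default the finished column is an Array of Strings. As a small variation of the example above (a sketch, assuming setOutputAsArray is available in your Spark NLP version), setting it to False makes the Finisher emit a single delimited String instead:

finisherAsString = Finisher() \
    .setInputCols(["entities"]) \
    .setOutputCols(["output"]) \
    .setOutputAsArray(False)

# "output" now holds one delimited String per row instead of an Array of Strings.
finisherAsString.transform(explainResult).select("output").show(truncate=False)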

LanguageDetectorDL

Language Identification and Detection by using CNN and RNN architectures in TensorFlow.

LanguageDetectorDL is an annotator that detects the language of documents or sentences depending on the inputCols. The models are trained on large datasets such as Wikipedia and Tatoeba. Depending on the language (how similar the characters are), the LanguageDetectorDL works best with text longer than 140 characters. The output is a language code in Wiki Code style.

Pretrained models can be loaded with pretrained of the companion object:

val languageDetector = LanguageDetectorDL.pretrained()
  .setInputCols("sentence")
  .setOutputCol("language")

If no values are provided, the default model is "ld_wiki_tatoeba_cnn_21" and the default language is "xx" (multilingual). For available pretrained models please see the Models Hub.
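
In Python, the same default model can also be requested explicitly by name and language; a minimal sketch using the defaults mentioned above:

languageDetector = LanguageDetectorDL.pretrained("ld_wiki_tatoeba_cnn_21", "xx") \
    .setInputCols(["document"]) \
    .setOutputCol("language")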

For extended examples of usage, see the Spark NLP Workshop and the LanguageDetectorDLTestSpec.

Input Annotator Types: DOCUMENT

Output Annotator Type: LANGUAGE

API: LanguageDetectorDL Source: LanguageDetectorDL
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

languageDetector = LanguageDetectorDL.pretrained() \
    .setInputCols("document") \
    .setOutputCol("language")

pipeline = Pipeline() \
    .setStages([
      documentAssembler,
      languageDetector
    ])

data = spark.createDataFrame([
    ["Spark NLP is an open-source text processing library for advanced natural language processing for the Python, Java and Scala programming languages."],
    ["Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala."],
    ["Spark NLP ist eine Open-Source-Textverarbeitungsbibliothek für fortgeschrittene natürliche Sprachverarbeitung für die Programmiersprachen Python, Java und Scala."]
]).toDF("text")
result = pipeline.fit(data).transform(data)

result.select("language.result").show(truncate=False)
+------+
|result|
+------+
|[en]  |
|[fr]  |
|[de]  |
+------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.ld.dl.LanguageDetectorDL
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val languageDetector = LanguageDetectorDL.pretrained()
  .setInputCols("document")
  .setOutputCol("language")

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    languageDetector
  ))

val data = Seq(
  "Spark NLP is an open-source text processing library for advanced natural language processing for the Python, Java and Scala programming languages.",
  "Spark NLP est une bibliothèque de traitement de texte open source pour le traitement avancé du langage naturel pour les langages de programmation Python, Java et Scala.",
  "Spark NLP ist eine Open-Source-Textverarbeitungsbibliothek für fortgeschrittene natürliche Sprachverarbeitung für die Programmiersprachen Python, Java und Scala."
).toDF("text")
val result = pipeline.fit(data).transform(data)

result.select("language.result").show(false)
+------+
|result|
+------+
|[en]  |
|[fr]  |
|[de]  |
+------+

Lemmatizer

Class to find lemmas out of words with the objective of returning a base dictionary word. Retrieves the significant part of a word. A dictionary of predefined lemmas must be provided with setDictionary. The dictionary can be set as a delimited text file. Pretrained models can be loaded with LemmatizerModel.pretrained.

For available pretrained models please see the Models Hub. For extended examples of usage, see the Spark NLP Workshop and the LemmatizerTestSpec.
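
Instead of training with a dictionary, a pretrained LemmatizerModel can be dropped into the same kind of pipeline; a minimal Python sketch, assuming the default pretrained model:

lemmatizer = LemmatizerModel.pretrained() \
    .setInputCols(["token"]) \
    .setOutputCol("lemma")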

Input Annotator Types: TOKEN

Output Annotator Type: TOKEN

API: Lemmatizer Source: Lemmatizer
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline
# In this example, the lemma dictionary `lemmas_small.txt` has the form of
#
# ...
# pick	->	pick	picks	picking	picked
# peck	->	peck	pecking	pecked	pecks
# pickle	->	pickle	pickles	pickled	pickling
# pepper	->	pepper	peppers	peppered	peppering
# ...
#
# where each key is delimited by `->` and values are delimited by `\t`

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

lemmatizer = Lemmatizer() \
    .setInputCols(["token"]) \
    .setOutputCol("lemma") \
    .setDictionary("src/test/resources/lemma-corpus-small/lemmas_small.txt", "->", "\t")

pipeline = Pipeline() \
    .setStages([
      documentAssembler,
      sentenceDetector,
      tokenizer,
      lemmatizer
    ])

data = spark.createDataFrame([["Peter Pipers employees are picking pecks of pickled peppers."]]) \
    .toDF("text")

result = pipeline.fit(data).transform(data)
result.selectExpr("lemma.result").show(truncate=False)
+------------------------------------------------------------------+
|result                                                            |
+------------------------------------------------------------------+
|[Peter, Pipers, employees, are, pick, peck, of, pickle, pepper, .]|
+------------------------------------------------------------------+
// In this example, the lemma dictionary `lemmas_small.txt` has the form of
//
// ...
// pick	->	pick	picks	picking	picked
// peck	->	peck	pecking	pecked	pecks
// pickle	->	pickle	pickles	pickled	pickling
// pepper	->	pepper	peppers	peppered	peppering
// ...
//
// where each key is delimited by `->` and values are delimited by `\t`
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.Tokenizer
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.annotators.Lemmatizer
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val lemmatizer = new Lemmatizer()
  .setInputCols(Array("token"))
  .setOutputCol("lemma")
  .setDictionary("src/test/resources/lemma-corpus-small/lemmas_small.txt", "->", "\t")

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    lemmatizer
  ))

val data = Seq("Peter Pipers employees are picking pecks of pickled peppers.")
  .toDF("text")

val result = pipeline.fit(data).transform(data)
result.selectExpr("lemma.result").show(false)
+------------------------------------------------------------------+
|result                                                            |
+------------------------------------------------------------------+
|[Peter, Pipers, employees, are, pick, peck, of, pickle, pepper, .]|
+------------------------------------------------------------------+

MultiClassifierDL

Trains a MultiClassifierDL for Multi-label Text Classification.

MultiClassifierDL uses a Bidirectional GRU with a convolutional model that we have built inside TensorFlow and supports up to 100 classes.

For instantiated/pretrained models, see MultiClassifierDLModel.

The input to MultiClassifierDL are Sentence Embeddings such as the state-of-the-art UniversalSentenceEncoder, BertSentenceEmbeddings, or SentenceEmbeddings.

In machine learning, multi-label classification and the strongly related problem of multi-output classification are variants of the classification problem where multiple labels may be assigned to each instance. Multi-label classification is a generalization of multiclass classification, which is the single-label problem of categorizing instances into precisely one of more than two classes; in the multi-label problem there is no constraint on how many of the classes the instance can be assigned to. Formally, multi-label classification is the problem of finding a model that maps inputs x to binary vectors y (assigning a value of 0 or 1 for each element (label) in y).

For extended examples of usage, see the Spark NLP Workshop and the MultiClassifierDLTestSpec.

Input Annotator Types: SENTENCE_EMBEDDINGS

Output Annotator Type: CATEGORY

Note: This annotator requires the label column to contain an array of labels (Array of Strings), as shown in the examples below. UniversalSentenceEncoder, BertSentenceEmbeddings, or SentenceEmbeddings can be used for the inputCol.

API: MultiClassifierDLApproach Source: MultiClassifierDLApproach
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline
# In this example, the training data has the form
#
# +----------------+--------------------+--------------------+
# |              id|                text|              labels|
# +----------------+--------------------+--------------------+
# |ed58abb40640f983|PN NewsYou mean ... |             [toxic]|
# |a1237f726b5f5d89|Dude.  Place the ...|   [obscene, insult]|
# |24b0d6c8733c2abe|Thanks  - thanks ...|            [insult]|
# |8c4478fb239bcfc0|" Gee, 5 minutes ...|[toxic, obscene, ...|
# +----------------+--------------------+--------------------+

# Process training data to create text with an associated array of labels.
# (Loading `trainDataset` is omitted here; it is assumed to be a DataFrame with
# the schema shown below.)

trainDataset.printSchema()
# root
#  |-- id: string (nullable = true)
#  |-- text: string (nullable = true)
#  |-- labels: array (nullable = true)
#  |    |-- element: string (containsNull = true)


# Then create pipeline for training
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document") \
    .setCleanupMode("shrink")

embeddings = UniversalSentenceEncoder.pretrained() \
    .setInputCols("document") \
    .setOutputCol("embeddings")

docClassifier = MultiClassifierDLApproach() \
    .setInputCols("embeddings") \
    .setOutputCol("category") \
    .setLabelColumn("labels") \
    .setBatchSize(128) \
    .setMaxEpochs(10) \
    .setLr(1e-3) \
    .setThreshold(0.5) \
    .setValidationSplit(0.1)

pipeline = Pipeline() \
    .setStages(
      [
        documentAssembler,
        embeddings,
        docClassifier
      ]
    )

pipelineModel = pipeline.fit(trainDataset)
// In this example, the training data has the form (Note: labels can be arbitrary)
//
// mr,ref
// "name[Alimentum], area[city centre], familyFriendly[no], near[Burger King]",Alimentum is an adult establish found in the city centre area near Burger King.
// "name[Alimentum], area[city centre], familyFriendly[yes]",Alimentum is a family-friendly place in the city centre.
// ...
//
// It needs some pre-processing first, so the labels are of type `Array[String]`. This can be done like so:
import spark.implicits._
import com.johnsnowlabs.nlp.annotators.classifier.dl.MultiClassifierDLApproach
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.functions.{col, udf}

// Process training data to create text with associated array of labels
def splitAndTrim = udf { labels: String =>
  labels.split(", ").map(x=>x.trim)
}

val smallCorpus = spark.read
  .option("header", true)
  .option("inferSchema", true)
  .option("mode", "DROPMALFORMED")
  .csv("src/test/resources/classifier/e2e.csv")
  .withColumn("labels", splitAndTrim(col("mr")))
  .withColumn("text", col("ref"))
  .drop("mr")

smallCorpus.printSchema()
// root
// |-- ref: string (nullable = true)
// |-- labels: array (nullable = true)
// |    |-- element: string (containsNull = true)
// |-- text: string (nullable = true)

// Then create pipeline for training
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")
  .setCleanupMode("shrink")

val embeddings = UniversalSentenceEncoder.pretrained()
  .setInputCols("document")
  .setOutputCol("embeddings")

val docClassifier = new MultiClassifierDLApproach()
  .setInputCols("embeddings")
  .setOutputCol("category")
  .setLabelColumn("labels")
  .setBatchSize(128)
  .setMaxEpochs(10)
  .setLr(1e-3f)
  .setThreshold(0.5f)
  .setValidationSplit(0.1f)

val pipeline = new Pipeline()
  .setStages(
    Array(
      documentAssembler,
      embeddings,
      docClassifier
    )
  )

val pipelineModel = pipeline.fit(smallCorpus)
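
Once fitted, the pipeline model can be applied to new text. A minimal inference sketch in Python, assuming the pipelineModel from the Python training example above (the sample sentence is made up):

testData = spark.createDataFrame([["This is rude and insulting."]]).toDF("text")

# Since the task is multi-label, each row may receive zero, one, or several labels.
prediction = pipelineModel.transform(testData)
prediction.select("category.result").show(truncate=False)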

MultiDateMatcher

Matches standard date formats into a provided format.

Reads the following kinds of dates:

"1978-01-28", "1984/04/02,1/02/1980", "2/28/79", "The 31st of April in the year 2008",
"Fri, 21 Nov 1997", "Jan 21, ‘97", "Sun", "Nov 21", "jan 1st", "next thursday",
"last wednesday", "today", "tomorrow", "yesterday", "next week", "next month",
"next year", "day after", "the day before", "0600h", "06:00 hours", "6pm", "5:30 a.m.",
"at 5", "12:59", "23:59", "1988/11/23 6pm", "next week at 7.30", "5 am tomorrow"

For example "The 31st of April in the year 2008" will be converted into 2008/04/31.

For extended examples of usage, see the Spark NLP Workshop and the MultiDateMatcherTestSpec.

Input Annotator Types: DOCUMENT

Output Annotator Type: DATE

API: MultiDateMatcher Source: MultiDateMatcher
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

date = MultiDateMatcher() \
    .setInputCols("document") \
    .setOutputCol("date") \
    .setAnchorDateYear(2020) \
    .setAnchorDateMonth(1) \
    .setAnchorDateDay(11) \
    .setDateFormat("yyyy/MM/dd")

pipeline = Pipeline().setStages([
    documentAssembler,
    date
])

data = spark.createDataFrame([["I saw him yesterday and he told me that he will visit us next week"]]) \
    .toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("explode(date) as dates").show(truncate=False)
+-----------------------------------------------+
|dates                                          |
+-----------------------------------------------+
|[date, 57, 65, 2020/01/18, [sentence -> 0], []]|
|[date, 10, 18, 2020/01/10, [sentence -> 0], []]|
+-----------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.MultiDateMatcher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val date = new MultiDateMatcher()
  .setInputCols("document")
  .setOutputCol("date")
  .setAnchorDateYear(2020)
  .setAnchorDateMonth(1)
  .setAnchorDateDay(11)
  .setDateFormat("yyyy/MM/dd")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  date
))

val data = Seq("I saw him yesterday and he told me that he will visit us next week")
  .toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(date) as dates").show(false)
+-----------------------------------------------+
|dates                                          |
+-----------------------------------------------+
|[date, 57, 65, 2020/01/18, [sentence -> 0], []]|
|[date, 10, 18, 2020/01/10, [sentence -> 0], []]|
+-----------------------------------------------+

NGramGenerator

A feature transformer that converts the input array of strings (annotatorType TOKEN) into an array of n-grams (annotatorType CHUNK). Null values in the input array are ignored. It returns an array of n-grams where each n-gram is represented by a space-separated string of words.

When the input is empty, an empty array is returned. When the input array length is less than n (number of elements per n-gram), no n-grams are returned.

For more extended examples see the Spark NLP Workshop and the NGramGeneratorTestSpec.

Input Annotator Types: TOKEN

Output Annotator Type: CHUNK

API: NGramGenerator Source: NGramGenerator
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

nGrams = NGramGenerator() \
    .setInputCols(["token"]) \
    .setOutputCol("ngrams") \
    .setN(2)

pipeline = Pipeline().setStages([
      documentAssembler,
      sentence,
      tokenizer,
      nGrams
    ])

data = spark.createDataFrame([["This is my sentence."]]).toDF("text")
results = pipeline.fit(data).transform(data)

results.selectExpr("explode(ngrams) as result").show(truncate=False)
+------------------------------------------------------------+
|result                                                      |
+------------------------------------------------------------+
|[chunk, 0, 6, This is, [sentence -> 0, chunk -> 0], []]     |
|[chunk, 5, 9, is my, [sentence -> 0, chunk -> 1], []]       |
|[chunk, 8, 18, my sentence, [sentence -> 0, chunk -> 2], []]|
|[chunk, 11, 19, sentence ., [sentence -> 0, chunk -> 3], []]|
+------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.NGramGenerator
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val nGrams = new NGramGenerator()
  .setInputCols("token")
  .setOutputCol("ngrams")
  .setN(2)

val pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    sentence,
    tokenizer,
    nGrams
  ))

val data = Seq("This is my sentence.").toDF("text")
val results = pipeline.fit(data).transform(data)

results.selectExpr("explode(ngrams) as result").show(false)
+------------------------------------------------------------+
|result                                                      |
+------------------------------------------------------------+
|[chunk, 0, 6, This is, [sentence -> 0, chunk -> 0], []]     |
|[chunk, 5, 9, is my, [sentence -> 0, chunk -> 1], []]       |
|[chunk, 8, 18, my sentence, [sentence -> 0, chunk -> 2], []]|
|[chunk, 11, 19, sentence ., [sentence -> 0, chunk -> 3], []]|
+------------------------------------------------------------+
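
A sketch of a variation on the example above, assuming setEnableCumulative is available in your Spark NLP version: with a larger n and cumulative mode enabled, the generator emits all n-grams up to the chosen size rather than only one size.

triGrams = NGramGenerator() \
    .setInputCols(["token"]) \
    .setOutputCol("ngrams") \
    .setN(3) \
    .setEnableCumulative(True)  # emit 1-, 2- and 3-grams instead of only 3-grams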

NerConverter

Converts an IOB or IOB2 representation of NER to a user-friendly one, by associating the tokens of recognized entities with their label. Results in a CHUNK Annotation type.

NER chunks can then be filtered by setting a whitelist with setWhiteList; see the whitelist sketch after the example below. Chunks with no associated entity (tagged “O”) are filtered out.

See also Inside–outside–beginning (tagging) for more information.

Input Annotator Types: DOCUMENT, TOKEN, NAMED_ENTITY

Output Annotator Type: CHUNK

API: NerConverter Source: NerConverter
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline
# This is a continuation of the example of the NerDLModel. See that class
# on how to extract the entities.
# The output of the NerDLModel follows the Annotator schema and can be converted like so:
#
# result.selectExpr("explode(ner)").show(truncate=False)
# +----------------------------------------------------+
# |col                                                 |
# +----------------------------------------------------+
# |[named_entity, 0, 2, B-ORG, [word -> U.N], []]      |
# |[named_entity, 3, 3, O, [word -> .], []]            |
# |[named_entity, 5, 12, O, [word -> official], []]    |
# |[named_entity, 14, 18, B-PER, [word -> Ekeus], []]  |
# |[named_entity, 20, 24, O, [word -> heads], []]      |
# |[named_entity, 26, 28, O, [word -> for], []]        |
# |[named_entity, 30, 36, B-LOC, [word -> Baghdad], []]|
# |[named_entity, 37, 37, O, [word -> .], []]          |
# +----------------------------------------------------+
#
# After the converter is used:
converter = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("entities")

converter.transform(result).selectExpr("explode(entities)").show(truncate=False)
+------------------------------------------------------------------------+
|col                                                                     |
+------------------------------------------------------------------------+
|[chunk, 0, 2, U.N, [entity -> ORG, sentence -> 0, chunk -> 0], []]      |
|[chunk, 14, 18, Ekeus, [entity -> PER, sentence -> 0, chunk -> 1], []]  |
|[chunk, 30, 36, Baghdad, [entity -> LOC, sentence -> 0, chunk -> 2], []]|
+------------------------------------------------------------------------+
// This is a continuation of the example of the [[com.johnsnowlabs.nlp.annotators.ner.dl.NerDLModel NerDLModel]]. See that class
// on how to extract the entities.
// The output of the NerDLModel follows the Annotator schema and can be converted like so:
//
// result.selectExpr("explode(ner)").show(false)
// +----------------------------------------------------+
// |col                                                 |
// +----------------------------------------------------+
// |[named_entity, 0, 2, B-ORG, [word -> U.N], []]      |
// |[named_entity, 3, 3, O, [word -> .], []]            |
// |[named_entity, 5, 12, O, [word -> official], []]    |
// |[named_entity, 14, 18, B-PER, [word -> Ekeus], []]  |
// |[named_entity, 20, 24, O, [word -> heads], []]      |
// |[named_entity, 26, 28, O, [word -> for], []]        |
// |[named_entity, 30, 36, B-LOC, [word -> Baghdad], []]|
// |[named_entity, 37, 37, O, [word -> .], []]          |
// +----------------------------------------------------+
//
// After the converter is used:
val converter = new NerConverter()
  .setInputCols("sentence", "token", "ner")
  .setOutputCol("entities")
  .setPreservePosition(false)

converter.transform(result).selectExpr("explode(entities)").show(false)
+------------------------------------------------------------------------+
|col                                                                     |
+------------------------------------------------------------------------+
|[chunk, 0, 2, U.N, [entity -> ORG, sentence -> 0, chunk -> 0], []]      |
|[chunk, 14, 18, Ekeus, [entity -> PER, sentence -> 0, chunk -> 1], []]  |
|[chunk, 30, 36, Baghdad, [entity -> LOC, sentence -> 0, chunk -> 2], []]|
+------------------------------------------------------------------------+
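
To keep only selected entity types, a whitelist can be set on the converter. A minimal Python sketch, assuming the same sentence, token and ner columns as above (the chosen labels are only an illustration):

converterFiltered = NerConverter() \
    .setInputCols(["sentence", "token", "ner"]) \
    .setOutputCol("entities") \
    .setWhiteList(["ORG", "LOC"])  # drop chunks of any other entity type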

NerCrf

Algorithm for training a Named Entity Recognition Model.

For instantiated/pretrained models, see NerCrfModel.

This Named Entity Recognition annotator allows a generic model to be trained by utilizing a CRF machine learning algorithm. The training data should be a labeled Spark Dataset, e.g. CoNLL 2003 IOB with Annotation type columns. The data should have columns of type DOCUMENT, TOKEN, POS, WORD_EMBEDDINGS and an additional label column of annotator type NAMED_ENTITY. Excluding the label, these columns can be produced with, for example, a SentenceDetector, a Tokenizer, a PerceptronModel and a WordEmbeddingsModel.
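
A minimal Python sketch of those prerequisite annotators (excluding the label column), assuming a DataFrame with a raw "text" column and the default pretrained models:

documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
tokenizer = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
posTagger = PerceptronModel.pretrained().setInputCols(["sentence", "token"]).setOutputCol("pos")
embeddings = WordEmbeddingsModel.pretrained().setInputCols(["sentence", "token"]).setOutputCol("embeddings")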

Optionally the user can provide an entity dictionary file with setExternalFeatures for better accuracy.

For extended examples of usage, see the Spark NLP Workshop and the NerCrfApproachTestSpec.

Input Annotator Types: DOCUMENT, TOKEN, POS, WORD_EMBEDDINGS

Output Annotator Type: NAMED_ENTITY

API: NerCrfApproach Source: NerCrfApproach
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline
# This CoNLL dataset already includes the sentence, token, pos and label column with their respective annotator types.
# If a custom dataset is used, these need to be defined.

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = WordEmbeddingsModel.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(False)

nerTagger = NerCrfApproach() \
    .setInputCols(["sentence", "token", "pos", "embeddings"]) \
    .setLabelColumn("label") \
    .setMinEpochs(1) \
    .setMaxEpochs(3) \
    .setC0(34) \
    .setL2(3.0) \
    .setOutputCol("ner")

pipeline = Pipeline().setStages([
    documentAssembler,
    embeddings,
    nerTagger
])


conll = CoNLL()
trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")

pipelineModel = pipeline.fit(trainingData)
// This CoNLL dataset already includes the sentence, token, pos and label column with their respective annotator types.
// If a custom dataset is used, these need to be defined.
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import com.johnsnowlabs.nlp.annotator.NerCrfApproach
import com.johnsnowlabs.nlp.training.CoNLL
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val embeddings = WordEmbeddingsModel.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")
  .setCaseSensitive(false)

val nerTagger = new NerCrfApproach()
  .setInputCols("sentence", "token", "pos", "embeddings")
  .setLabelColumn("label")
  .setMinEpochs(1)
  .setMaxEpochs(3)
  .setC0(34)
  .setL2(3.0)
  .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  embeddings,
  nerTagger
))


val conll = CoNLL()
val trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")

val pipelineModel = pipeline.fit(trainingData)

NerDL

This Named Entity Recognition annotator allows you to train a generic NER model based on neural networks.

The architecture of the neural network is a Char CNNs - BiLSTM - CRF that achieves state-of-the-art results on most datasets.

For instantiated/pretrained models, see NerDLModel.

The training data should be a labeled Spark Dataset, in the format of CoNLL 2003 IOB with Annotation type columns. The data should have columns of type DOCUMENT, TOKEN, WORD_EMBEDDINGS and an additional label column of annotator type NAMED_ENTITY. Excluding the label, these columns can be produced with, for example, a SentenceDetector, a Tokenizer and a WordEmbeddingsModel, as in the example below.

For extended examples of usage, see the Spark NLP Workshop and the NerDLSpec.

Input Annotator Types: DOCUMENT, TOKEN, WORD_EMBEDDINGS

Output Annotator Type: NAMED_ENTITY

API: NerDLApproach Source: NerDLApproach
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

# First extract the prerequisites for the NerDLApproach
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

embeddings = BertEmbeddings.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

# Then the training can start
nerTagger = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setMaxEpochs(1) \
    .setRandomSeed(0) \
    .setVerbose(0)

pipeline = Pipeline().setStages([
    documentAssembler,
    sentence,
    tokenizer,
    embeddings,
    nerTagger
])

# We use the text and labels from the CoNLL dataset
conll = CoNLL()
trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")

pipelineModel = pipeline.fit(trainingData)
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.embeddings.BertEmbeddings
import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLApproach
import com.johnsnowlabs.nlp.training.CoNLL
import org.apache.spark.ml.Pipeline

// First extract the prerequisites for the NerDLApproach
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val embeddings = BertEmbeddings.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")

// Then the training can start
val nerTagger = new NerDLApproach()
  .setInputCols("sentence", "token", "embeddings")
  .setLabelColumn("label")
  .setOutputCol("ner")
  .setMaxEpochs(1)
  .setRandomSeed(0)
  .setVerbose(0)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentence,
  tokenizer,
  embeddings,
  nerTagger
))

// We use the text and labels from the CoNLL dataset
val conll = CoNLL()
val trainingData = conll.readDataset(spark, "src/test/resources/conll2003/eng.train")

val pipelineModel = pipeline.fit(trainingData)

NerOverwriter

Overwrites entities of specified strings.

The input for this Annotator has to be entities that are already extracted, of Annotator type NAMED_ENTITY. The strings specified with setStopWords will have new entities assigned to them, as specified with setNewResult.

Input Annotator Types: NAMED_ENTITY

Output Annotator Type: NAMED_ENTITY

API: NerOverwriter Source: NerOverwriter
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

# First extract the prerequisite Entities
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("bert")

nerTagger = NerDLModel.pretrained() \
    .setInputCols(["sentence", "token", "bert"]) \
    .setOutputCol("ner")

pipeline = Pipeline().setStages([
    documentAssembler,
    sentence,
    tokenizer,
    embeddings,
    nerTagger
])

data = spark.createDataFrame([["Spark NLP Crosses Five Million Downloads, John Snow Labs Announces."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("explode(ner)").show(truncate=False)
# +------------------------------------------------------+
# |col                                                   |
# +------------------------------------------------------+
# |[named_entity, 0, 4, B-ORG, [word -> Spark], []]      |
# |[named_entity, 6, 8, I-ORG, [word -> NLP], []]        |
# |[named_entity, 10, 16, O, [word -> Crosses], []]      |
# |[named_entity, 18, 21, O, [word -> Five], []]         |
# |[named_entity, 23, 29, O, [word -> Million], []]      |
# |[named_entity, 31, 39, O, [word -> Downloads], []]    |
# |[named_entity, 40, 40, O, [word -> ,], []]            |
# |[named_entity, 42, 45, B-ORG, [word -> John], []]     |
# |[named_entity, 47, 50, I-ORG, [word -> Snow], []]     |
# |[named_entity, 52, 55, I-ORG, [word -> Labs], []]     |
# |[named_entity, 57, 65, I-ORG, [word -> Announces], []]|
# |[named_entity, 66, 66, O, [word -> .], []]            |
# +------------------------------------------------------+

# The recognized entities can then be overwritten
nerOverwriter = NerOverwriter() \
    .setInputCols(["ner"]) \
    .setOutputCol("ner_overwritten") \
    .setStopWords(["Million"]) \
    .setNewResult("B-CARDINAL")

nerOverwriter.transform(result).selectExpr("explode(ner_overwritten)").show(truncate=False)
+---------------------------------------------------------+
|col                                                      |
+---------------------------------------------------------+
|[named_entity, 0, 4, B-ORG, [word -> Spark], []]         |
|[named_entity, 6, 8, I-ORG, [word -> NLP], []]           |
|[named_entity, 10, 16, O, [word -> Crosses], []]         |
|[named_entity, 18, 21, O, [word -> Five], []]            |
|[named_entity, 23, 29, B-CARDINAL, [word -> Million], []]|
|[named_entity, 31, 39, O, [word -> Downloads], []]       |
|[named_entity, 40, 40, O, [word -> ,], []]               |
|[named_entity, 42, 45, B-ORG, [word -> John], []]        |
|[named_entity, 47, 50, I-ORG, [word -> Snow], []]        |
|[named_entity, 52, 55, I-ORG, [word -> Labs], []]        |
|[named_entity, 57, 65, I-ORG, [word -> Announces], []]   |
|[named_entity, 66, 66, O, [word -> .], []]               |
+---------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import com.johnsnowlabs.nlp.annotators.ner.dl.NerDLModel
import com.johnsnowlabs.nlp.annotators.ner.NerOverwriter
import org.apache.spark.ml.Pipeline

// First extract the prerequisite Entities
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("bert")

val nerTagger = NerDLModel.pretrained()
  .setInputCols("sentence", "token", "bert")
  .setOutputCol("ner")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentence,
  tokenizer,
  embeddings,
  nerTagger
))

val data = Seq("Spark NLP Crosses Five Million Downloads, John Snow Labs Announces.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(ner)").show(false)
/*
+------------------------------------------------------+
|col                                                   |
+------------------------------------------------------+
|[named_entity, 0, 4, B-ORG, [word -> Spark], []]      |
|[named_entity, 6, 8, I-ORG, [word -> NLP], []]        |
|[named_entity, 10, 16, O, [word -> Crosses], []]      |
|[named_entity, 18, 21, O, [word -> Five], []]         |
|[named_entity, 23, 29, O, [word -> Million], []]      |
|[named_entity, 31, 39, O, [word -> Downloads], []]    |
|[named_entity, 40, 40, O, [word -> ,], []]            |
|[named_entity, 42, 45, B-ORG, [word -> John], []]     |
|[named_entity, 47, 50, I-ORG, [word -> Snow], []]     |
|[named_entity, 52, 55, I-ORG, [word -> Labs], []]     |
|[named_entity, 57, 65, I-ORG, [word -> Announces], []]|
|[named_entity, 66, 66, O, [word -> .], []]            |
+------------------------------------------------------+
*/
// The recognized entities can then be overwritten
val nerOverwriter = new NerOverwriter()
  .setInputCols("ner")
  .setOutputCol("ner_overwritten")
  .setStopWords(Array("Million"))
  .setNewResult("B-CARDINAL")

nerOverwriter.transform(result).selectExpr("explode(ner_overwritten)").show(false)
+---------------------------------------------------------+
|col                                                      |
+---------------------------------------------------------+
|[named_entity, 0, 4, B-ORG, [word -> Spark], []]         |
|[named_entity, 6, 8, I-ORG, [word -> NLP], []]           |
|[named_entity, 10, 16, O, [word -> Crosses], []]         |
|[named_entity, 18, 21, O, [word -> Five], []]            |
|[named_entity, 23, 29, B-CARDINAL, [word -> Million], []]|
|[named_entity, 31, 39, O, [word -> Downloads], []]       |
|[named_entity, 40, 40, O, [word -> ,], []]               |
|[named_entity, 42, 45, B-ORG, [word -> John], []]        |
|[named_entity, 47, 50, I-ORG, [word -> Snow], []]        |
|[named_entity, 52, 55, I-ORG, [word -> Labs], []]        |
|[named_entity, 57, 65, I-ORG, [word -> Announces], []]   |
|[named_entity, 66, 66, O, [word -> .], []]               |
+---------------------------------------------------------+

Normalizer

Annotator that cleans out tokens; it requires TOKEN-type input (e.g. from a Tokenizer). Removes all dirty characters from text following a regex pattern and transforms words based on a provided dictionary; see the slang-dictionary sketch after the example below.

For extended examples of usage, see the Spark NLP Workshop.

Input Annotator Types: TOKEN

Output Annotator Type: TOKEN

API: Normalizer Source: Normalizer
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized") \
    .setLowercase(True) \
    .setCleanupPatterns(["""[^\w\d\s]"""]) # remove punctuations (keep alphanumeric chars)
# if we don't set CleanupPatterns, it will only keep alphabet letters ([^A-Za-z])

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    normalizer
])

data = spark.createDataFrame([["John and Peter are brothers. However they don't support each other that much."]]) \
    .toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("normalized.result").show(truncate = False)
+----------------------------------------------------------------------------------------+
|result                                                                                  |
+----------------------------------------------------------------------------------------+
|[john, and, peter, are, brothers, however, they, dont, support, each, other, that, much]|
+----------------------------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.{Normalizer, Tokenizer}
import org.apache.spark.ml.Pipeline
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val normalizer = new Normalizer()
  .setInputCols("token")
  .setOutputCol("normalized")
  .setLowercase(true)
  .setCleanupPatterns(Array("""[^\w\d\s]""")) // remove punctuations (keep alphanumeric chars)
// if we don't set CleanupPatterns, it will only keep alphabet letters ([^A-Za-z])

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  normalizer
))

val data = Seq("John and Peter are brothers. However they don't support each other that much.")
  .toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("normalized.result").show(truncate = false)
+----------------------------------------------------------------------------------------+
|result                                                                                  |
+----------------------------------------------------------------------------------------+
|[john, and, peter, are, brothers, however, they, dont, support, each, other, that, much]|
+----------------------------------------------------------------------------------------+
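
The word transformation mentioned above is driven by a dictionary. A sketch in Python, assuming setSlangDictionary is available in your Spark NLP version and a hypothetical slangs.txt file with one comma-delimited pair per line (e.g. gr8,great):

normalizerWithDict = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized") \
    .setLowercase(True) \
    .setSlangDictionary("slangs.txt", ",")  # hypothetical file of "slang,replacement" pairs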

NorvigSweeting Spellchecker

Trains an annotator that retrieves tokens and makes corrections automatically if they are not found in an English dictionary.

The Symmetric Delete spelling correction algorithm reduces the complexity of edit candidate generation and dictionary lookup for a given Damerau-Levenshtein distance. It is six orders of magnitude faster (than the standard approach with deletes + transposes + replaces + inserts) and language independent. A dictionary of correct spellings must be provided with setDictionary as a text file, where each word is parsed by a regex pattern.

Inspired by Norvig model and SymSpell.

For instantiated/pretrained models, see NorvigSweetingModel.

For extended examples of usage, see the Spark NLP Workshop and the NorvigSweetingTestSpec.

Input Annotator Types: TOKEN

Output Annotator Type: TOKEN

API: NorvigSweetingApproach Source: NorvigSweetingApproach
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline
# In this example, the dictionary `"words.txt"` has the form of
#
# ...
# gummy
# gummic
# gummier
# gummiest
# gummiferous
# ...
#
# This dictionary is then set to be the basis of the spell checker.

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

spellChecker = NorvigSweetingApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("spell") \
    .setDictionary("src/test/resources/spell/words.txt")

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    spellChecker
])

# `trainingData` is assumed to be a DataFrame with a "text" column; its loading is omitted here.
pipelineModel = pipeline.fit(trainingData)
// In this example, the dictionary `"words.txt"` has the form of
//
// ...
// gummy
// gummic
// gummier
// gummiest
// gummiferous
// ...
//
// This dictionary is then set to be the basis of the spell checker.
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.spell.norvig.NorvigSweetingApproach
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val spellChecker = new NorvigSweetingApproach()
  .setInputCols("token")
  .setOutputCol("spell")
  .setDictionary("src/test/resources/spell/words.txt")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  spellChecker
))

// `trainingData` is assumed to be a DataFrame with a "text" column; its loading is omitted here.
val pipelineModel = pipeline.fit(trainingData)
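
For correcting text without training, the pretrained model mentioned above can be used directly; a minimal Python sketch, assuming the default pretrained English model:

spellChecker = NorvigSweetingModel.pretrained() \
    .setInputCols(["token"]) \
    .setOutputCol("spell")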

POSTagger (Part of speech tagger)

Trains an averaged Perceptron model to tag words' parts of speech. It sets a POS tag to each word within a sentence.

For pretrained models please see the PerceptronModel.

The training data needs to be in a Spark DataFrame, where the column needs to consist of Annotations of type POS. The Annotation needs to have the member result set to the POS tag and have a "word" mapping to its word inside of the member metadata. This DataFrame for training can easily be created by the helper class POS.

POS().readDataset(spark, datasetPath).selectExpr("explode(tags) as tags").show(false)
+---------------------------------------------+
|tags                                         |
+---------------------------------------------+
|[pos, 0, 5, NNP, [word -> Pierre], []]       |
|[pos, 7, 12, NNP, [word -> Vinken], []]      |
|[pos, 14, 14, ,, [word -> ,], []]            |
|[pos, 31, 34, MD, [word -> will], []]        |
|[pos, 36, 39, VB, [word -> join], []]        |
|[pos, 41, 43, DT, [word -> the], []]         |
|[pos, 45, 49, NN, [word -> board], []]       |
                      ...

For extended examples of usage, see the Spark NLP Workshop and PerceptronApproach tests.

Input Annotator Types: TOKEN, DOCUMENT

Output Annotator Type: POS

API: PerceptronApproach Source: PerceptronApproach
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

datasetPath = "src/test/resources/anc-pos-corpus-small/test-training.txt"
trainingPerceptronDF = POS().readDataset(spark, datasetPath)

trainedPos = PerceptronApproach() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("pos") \
    .setPosColumn("tags") \
    .fit(trainingPerceptronDF)

pipeline = Pipeline().setStages([
    documentAssembler,
    sentence,
    tokenizer,
    trainedPos
])

data = spark.createDataFrame([["To be or not to be, is this the question?"]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("pos.result").show(truncate=False)
+--------------------------------------------------+
|result                                            |
+--------------------------------------------------+
|[NNP, NNP, CD, JJ, NNP, NNP, ,, MD, VB, DT, CD, .]|
+--------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.training.POS
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronApproach
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val datasetPath = "src/test/resources/anc-pos-corpus-small/test-training.txt"
val trainingPerceptronDF = POS().readDataset(spark, datasetPath)

val trainedPos = new PerceptronApproach()
  .setInputCols("document", "token")
  .setOutputCol("pos")
  .setPosColumn("tags")
  .fit(trainingPerceptronDF)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentence,
  tokenizer,
  trainedPos
))

val data = Seq("To be or not to be, is this the question?").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("pos.result").show(false)
+--------------------------------------------------+
|result                                            |
+--------------------------------------------------+
|[NNP, NNP, CD, JJ, NNP, NNP, ,, MD, VB, DT, CD, .]|
+--------------------------------------------------+
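
For tagging without training, the pretrained PerceptronModel mentioned above can be plugged into the same kind of pipeline; a minimal Python sketch, assuming the default pretrained English model:

posTagger = PerceptronModel.pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("pos")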

RecursiveTokenizer

Tokenizes raw text recursively based on a handful of definable rules.

Unlike the Tokenizer, the RecursiveTokenizer operates based on these array string parameters only (see the sketch below):

  • prefixes: Strings that will be split when found at the beginning of a token.
  • suffixes: Strings that will be split when found at the end of a token.
  • infixes: Strings that will be split when found in the middle of a token.
  • whitelist: Whitelist of strings not to split.

For extended examples of usage, see the Spark NLP Workshop and the TokenizerTestSpec.
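
A sketch of a tokenizer with custom rules (the particular prefixes, suffixes and whitelist entries are only an illustration, and the setters are assumed to be available in your Spark NLP version):

customTokenizer = RecursiveTokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token") \
    .setPrefixes(["(", "\"", "'"]) \
    .setSuffixes([")", "\"", "'", ",", ".", "!", "?"]) \
    .setWhitelist(["won't", "don't"])  # keep these strings intact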

Input Annotator Types: DOCUMENT

Output Annotator Type: TOKEN

API: RecursiveTokenizer Source: RecursiveTokenizer
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = RecursiveTokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer
])

data = spark.createDataFrame([["One, after the Other, (and) again. PO, QAM,"]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.select("token.result").show(truncate=False)
+------------------------------------------------------------------+
|result                                                            |
+------------------------------------------------------------------+
|[One, ,, after, the, Other, ,, (, and, ), again, ., PO, ,, QAM, ,]|
+------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.RecursiveTokenizer
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new RecursiveTokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer
))

val data = Seq("One, after the Other, (and) again. PO, QAM,").toDF("text")
val result = pipeline.fit(data).transform(data)

result.select("token.result").show(false)
+------------------------------------------------------------------+
|result                                                            |
+------------------------------------------------------------------+
|[One, ,, after, the, Other, ,, (, and, ), again, ., PO, ,, QAM, ,]|
+------------------------------------------------------------------+

RegexMatcher

Uses a reference file to match a set of regular expressions and associate them with a provided identifier.

A dictionary of predefined regular expressions must be provided with setExternalRules. The dictionary can be set as a delimited text file.

Pretrained pipelines are available for this module, see Pipelines.

For extended examples of usage, see the Spark NLP Workshop and the RegexMatcherTestSpec.

Input Annotator Types: DOCUMENT

Output Annotator Type: CHUNK

API: RegexMatcher Source: RegexMatcher
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline
# In this example, the `rules.txt` has the form of
#
# the\s\w+, followed by 'the'
# ceremonies, ceremony
#
# where each regex is separated by the identifier by `","`

documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")

regexMatcher = RegexMatcher() \
    .setExternalRules("src/test/resources/regex-matcher/rules.txt",  ",") \
    .setInputCols(["sentence"]) \
    .setOutputCol("regex") \
    .setStrategy("MATCH_ALL")

pipeline = Pipeline().setStages([documentAssembler, sentence, regexMatcher])

data = spark.createDataFrame([[
    "My first sentence with the first rule. This is my second sentence with ceremonies rule."
]]).toDF("text")
results = pipeline.fit(data).transform(data)

results.selectExpr("explode(regex) as result").show(truncate=False)
+--------------------------------------------------------------------------------------------+
|result                                                                                      |
+--------------------------------------------------------------------------------------------+
|[chunk, 23, 31, the first, [identifier -> followed by 'the', sentence -> 0, chunk -> 0], []]|
|[chunk, 71, 80, ceremonies, [identifier -> ceremony, sentence -> 1, chunk -> 0], []]        |
+--------------------------------------------------------------------------------------------+
// In this example, the `rules.txt` has the form of
//
// the\s\w+, followed by 'the'
// ceremonies, ceremony
//
// where each regex is separated from its identifier by `","`
import ResourceHelper.spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.annotators.RegexMatcher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")

val sentence = new SentenceDetector().setInputCols("document").setOutputCol("sentence")

val regexMatcher = new RegexMatcher()
  .setExternalRules("src/test/resources/regex-matcher/rules.txt",  ",")
  .setInputCols(Array("sentence"))
  .setOutputCol("regex")
  .setStrategy("MATCH_ALL")

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentence, regexMatcher))

val data = Seq(
  "My first sentence with the first rule. This is my second sentence with ceremonies rule."
).toDF("text")
val results = pipeline.fit(data).transform(data)

results.selectExpr("explode(regex) as result").show(false)
+--------------------------------------------------------------------------------------------+
|result                                                                                      |
+--------------------------------------------------------------------------------------------+
|[chunk, 23, 31, the first, [identifier -> followed by 'the', sentence -> 0, chunk -> 0], []]|
|[chunk, 71, 80, ceremonies, [identifier -> ceremony, sentence -> 1, chunk -> 0], []]        |
+--------------------------------------------------------------------------------------------+

RegexTokenizer

A tokenizer that splits text by a regex pattern.

The delimiting pattern, i.e. how the tokens should be split, needs to be set with setPattern. By default this pattern is \s+, which means that tokens are split by one or more whitespace characters.
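
For example, to split on punctuation as well as whitespace, a different pattern can be supplied; a minimal sketch (the pattern below is only an illustration):

regexTokenizer = RegexTokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("regexToken") \
    .setPattern("[\\s.,;:!?]+")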

Input Annotator Types: DOCUMENT

Output Annotator Type: TOKEN

API: RegexTokenizer Source: RegexTokenizer
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

regexTokenizer = RegexTokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("regexToken") \
    .setToLowercase(True) \
    .setPattern("\\s+")

pipeline = Pipeline().setStages([
      documentAssembler,
      regexTokenizer
    ])

data = spark.createDataFrame([["This is my first sentence.\nThis is my second."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("regexToken.result").show(truncate=False)
+-------------------------------------------------------+
|result                                                 |
+-------------------------------------------------------+
|[this, is, my, first, sentence., this, is, my, second.]|
+-------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.RegexTokenizer
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val regexTokenizer = new RegexTokenizer()
  .setInputCols("document")
  .setOutputCol("regexToken")
  .setToLowercase(true)
  .setPattern("\\s+")

val pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    regexTokenizer
  ))

val data = Seq("This is my first sentence.\nThis is my second.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("regexToken.result").show(false)
+-------------------------------------------------------+
|result                                                 |
+-------------------------------------------------------+
|[this, is, my, first, sentence., this, is, my, second.]|
+-------------------------------------------------------+

SentenceDetector

Annotator that detects sentence boundaries using any provided approach.

Each extracted sentence can be returned in an Array or exploded to separate rows, if explodeSentences is set to true.
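
A minimal sketch of exploding sentences to separate rows (assuming the setter name follows the explodeSentences parameter):

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence") \
    .setExplodeSentences(True)

Each detected sentence then becomes its own row in the transformed DataFrame.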

For extended examples of usage, see the Spark NLP Workshop.

Input Annotator Types: DOCUMENT

Output Annotator Type: DOCUMENT

API: SentenceDetector Source: SentenceDetector
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

pipeline = Pipeline().setStages([
    documentAssembler,
    sentence
])

data = spark.createDataFrame([["This is my first sentence. This my second. How about a third?"]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("explode(sentence) as sentences").show(truncate=False)
+------------------------------------------------------------------+
|sentences                                                         |
+------------------------------------------------------------------+
|[document, 0, 25, This is my first sentence., [sentence -> 0], []]|
|[document, 27, 41, This my second., [sentence -> 1], []]          |
|[document, 43, 60, How about a third?, [sentence -> 2], []]       |
+------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentence
))

val data = Seq("This is my first sentence. This my second. How about a third?").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(sentence) as sentences").show(false)
+------------------------------------------------------------------+
|sentences                                                         |
+------------------------------------------------------------------+
|[document, 0, 25, This is my first sentence., [sentence -> 0], []]|
|[document, 27, 41, This my second., [sentence -> 1], []]          |
|[document, 43, 60, How about a third?, [sentence -> 2], []]       |
+------------------------------------------------------------------+

SentenceDetectorDL

Trains an annotator that detects sentence boundaries using a deep learning approach.

For pretrained models see SentenceDetectorDLModel.

Currently, only the CNN model is supported for training, but in the future the architecture of the model can be set with setModelArchitecture.

The default model "cnn" is based on the paper Deep-EOS: General-Purpose Neural Networks for Sentence Boundary Detection (2020, Stefan Schweter, Sajawel Ahmed) using a CNN architecture. The original implementation was additionally modified to handle broken sentences and some impossible end-of-line characters.

Each extracted sentence can be returned in an Array or exploded to separate rows, if explodeSentences is set to true.

For extended examples of usage, see the SentenceDetectorDLSpec.

Input Annotator Types: DOCUMENT

Output Annotator Type: DOCUMENT

API: SentenceDetectorDLApproach Source: SentenceDetectorDLApproach
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline
# The training process needs data, where each data point is a sentence.
# In this example the `train.txt` file has the form of
#
# ...
# Slightly more moderate language would make our present situation – namely the lack of progress – a little easier.
# His political successors now have great responsibilities to history and to the heritage of values bequeathed to them by Nelson Mandela.
# ...
#
# where each line is one sentence.
# Training can then be started like so:

trainingData = spark.read.text("train.txt").toDF("text")

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLApproach() \
    .setInputCols(["document"]) \
    .setOutputCol("sentences") \
    .setEpochsNumber(100)

pipeline = Pipeline().setStages([documentAssembler, sentenceDetector])

model = pipeline.fit(trainingData)
// The training process needs data, where each data point is a sentence.
// In this example the `train.txt` file has the form of
//
// ...
// Slightly more moderate language would make our present situation – namely the lack of progress – a little easier.
// His political successors now have great responsibilities to history and to the heritage of values bequeathed to them by Nelson Mandela.
// ...
//
// where each line is one sentence.
// Training can then be started like so:
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.sentence_detector_dl.SentenceDetectorDLApproach
import org.apache.spark.ml.Pipeline

val trainingData = spark.read.text("train.txt").toDF("text")

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetectorDLApproach()
  .setInputCols(Array("document"))
  .setOutputCol("sentences")
  .setEpochsNumber(100)

val pipeline = new Pipeline().setStages(Array(documentAssembler, sentenceDetector))

val model = pipeline.fit(trainingData)

SentenceEmbeddings

Converts the results from WordEmbeddings, BertEmbeddings, or ElmoEmbeddings into sentence or document embeddings by either summing up or averaging all the word embeddings in a sentence or a document (depending on the inputCols).

This can be configured with setPoolingStrategy, which can be either "AVERAGE" or "SUM".

For more extended examples see the Spark NLP Workshop and the SentenceEmbeddingsTestSpec.

NOTE: If you choose document as your input for Tokenizer, WordEmbeddings/BertEmbeddings, and SentenceEmbeddings, then it averages/sums all the embeddings into one array of embeddings. However, if you choose sentence as inputCols, then SentenceEmbeddings generates one array of embeddings per sentence.

TIP: Here is how you can explode and convert these embeddings into Vectors, also known as a feature column, so they can be used in Spark ML regression or clustering functions:

from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import explode, udf

# Let's create a UDF to take the array of embeddings and output Spark ML Vectors
@udf(returnType=VectorUDT())
def convertToVectorUDF(matrix):
    return Vectors.dense(matrix)


# Now let's explode the sentence_embeddings column and have a new feature column for Spark ML
pipelineDF.select(explode("sentence_embeddings.embeddings").alias("sentence_embedding")) \
    .withColumn("features", convertToVectorUDF("sentence_embedding"))
import spark.implicits._
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{explode, udf}

// Let's create a UDF to take the array of embeddings and output Spark ML Vectors
val convertToVectorUDF = udf((matrix : Seq[Float]) => {
    Vectors.dense(matrix.toArray.map(_.toDouble))
})

// Now let's explode the sentence_embeddings column and have a new feature column for Spark ML
pipelineDF.select(explode($"sentence_embeddings.embeddings").as("sentence_embedding"))
  .withColumn("features", convertToVectorUDF($"sentence_embedding"))

Input Annotator Types: DOCUMENT, WORD_EMBEDDINGS

Output Annotator Type: SENTENCE_EMBEDDINGS

API: SentenceEmbeddings Source: SentenceEmbeddings
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = WordEmbeddingsModel.pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

embeddingsSentence = SentenceEmbeddings() \
    .setInputCols(["document", "embeddings"]) \
    .setOutputCol("sentence_embeddings") \
    .setPoolingStrategy("AVERAGE")

embeddingsFinisher = EmbeddingsFinisher() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCols("finished_embeddings") \
    .setOutputAsVector(True) \
    .setCleanAnnotations(False)

pipeline = Pipeline() \
    .setStages([
      documentAssembler,
      tokenizer,
      embeddings,
      embeddingsSentence,
      embeddingsFinisher
    ])

data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[-0.22093398869037628,0.25130119919776917,0.41810303926467896,-0.380883991718...|
+--------------------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel
import com.johnsnowlabs.nlp.embeddings.SentenceEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")

val embeddings = WordEmbeddingsModel.pretrained()
  .setInputCols("document", "token")
  .setOutputCol("embeddings")

val embeddingsSentence = new SentenceEmbeddings()
  .setInputCols(Array("document", "embeddings"))
  .setOutputCol("sentence_embeddings")
  .setPoolingStrategy("AVERAGE")

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("sentence_embeddings")
  .setOutputCols("finished_embeddings")
  .setOutputAsVector(true)
  .setCleanAnnotations(false)

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    tokenizer,
    embeddings,
    embeddingsSentence,
    embeddingsFinisher
  ))

val data = Seq("This is a sentence.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[-0.22093398869037628,0.25130119919776917,0.41810303926467896,-0.380883991718...|
+--------------------------------------------------------------------------------+

SentimentDL

Trains a SentimentDL, an annotator for multi-class sentiment analysis.

In natural language processing, sentiment analysis is the task of classifying the affective state or subjective view of a text. A common example is classifying whether a product review or a tweet expresses a positive or negative opinion.

For the instantiated/pretrained models, see SentimentDLModel.

For extended examples of usage, see the Spark NLP Workshop and the SentimentDLTestSpec.

Input Annotator Types: SENTENCE_EMBEDDINGS

Output Annotator Type: CATEGORY

API: SentimentDLApproach Source: SentimentDLApproach
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline
# In this example, `sentiment.csv` is in the form
#
# text,label
# This movie is the best movie I have watched ever! In my opinion this movie can win an award.,0
# This was a terrible movie! The acting was bad really bad!,1
#
# The model can then be trained with

smallCorpus = spark.read.option("header", "True").csv("src/test/resources/classifier/sentiment.csv")

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

useEmbeddings = UniversalSentenceEncoder.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

docClassifier = SentimentDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("sentiment") \
    .setLabelColumn("label") \
    .setBatchSize(32) \
    .setMaxEpochs(1) \
    .setLr(5e-3) \
    .setDropout(0.5)

pipeline = Pipeline() \
    .setStages(
      [
        documentAssembler,
        useEmbeddings,
        docClassifier
      ]
    )

pipelineModel = pipeline.fit(smallCorpus)
// In this example, `sentiment.csv` is in the form
//
// text,label
// This movie is the best movie I have watched ever! In my opinion this movie can win an award.,0
// This was a terrible movie! The acting was bad really bad!,1
//
// The model can then be trained with
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.UniversalSentenceEncoder
import com.johnsnowlabs.nlp.annotators.classifier.dl.{SentimentDLApproach, SentimentDLModel}
import org.apache.spark.ml.Pipeline

val smallCorpus = spark.read.option("header", "true").csv("src/test/resources/classifier/sentiment.csv")

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val useEmbeddings = UniversalSentenceEncoder.pretrained()
  .setInputCols("document")
  .setOutputCol("sentence_embeddings")

val docClassifier = new SentimentDLApproach()
  .setInputCols("sentence_embeddings")
  .setOutputCol("sentiment")
  .setLabelColumn("label")
  .setBatchSize(32)
  .setMaxEpochs(1)
  .setLr(5e-3f)
  .setDropout(0.5f)

val pipeline = new Pipeline()
  .setStages(
    Array(
      documentAssembler,
      useEmbeddings,
      docClassifier
    )
  )

val pipelineModel = pipeline.fit(smallCorpus)
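
Once fitted, the resulting pipelineModel can score new text. A minimal usage sketch in Python (column names follow the example above):

testData = spark.createDataFrame([["This movie was a lot of fun to watch!"]]).toDF("text")
pipelineModel.transform(testData).select("sentiment.result").show(truncate=False)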

SentimentDetector

Trains a rule-based sentiment detector, which calculates a score based on predefined keywords.

A dictionary of predefined sentiment keywords must be provided with setDictionary, where each line is a word followed by its class (either positive or negative), separated by a delimiter. The dictionary can be set as a delimited text file.

By default, the sentiment score is mapped to the label "positive" if it is >= 0 and to "negative" otherwise. To retrieve the raw sentiment scores instead, enableScore needs to be set to true, as in the sketch below.
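
A minimal sketch of retrieving raw scores (assuming the setter name follows the enableScore parameter; file names follow the example below):

sentimentDetector = SentimentDetector() \
    .setInputCols(["lemma", "document"]) \
    .setOutputCol("sentimentScore") \
    .setDictionary("default-sentiment-dict.txt", ",", ReadAs.TEXT) \
    .setEnableScore(True)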

For extended examples of usage, see the Spark NLP Workshop and the SentimentTestSpec.

Input Annotator Types: TOKEN, DOCUMENT

Output Annotator Type: SENTIMENT

API: SentimentDetector Source: SentimentDetector
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline
# In this example, the dictionary `default-sentiment-dict.txt` has the form of
#
# ...
# cool,positive
# superb,positive
# bad,negative
# uninspired,negative
# ...
#
# where each sentiment keyword is delimited by `","`.

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

lemmatizer = Lemmatizer() \
    .setInputCols(["token"]) \
    .setOutputCol("lemma") \
    .setDictionary("lemmas_small.txt", "->", "\t")

sentimentDetector = SentimentDetector() \
    .setInputCols(["lemma", "document"]) \
    .setOutputCol("sentimentScore") \
    .setDictionary("default-sentiment-dict.txt", ",", ReadAs.TEXT)

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    lemmatizer,
    sentimentDetector,
])

data = spark.createDataFrame([
    ["The staff of the restaurant is nice"],
    ["I recommend others to avoid because it is too expensive"]
]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("sentimentScore.result").show(truncate=False)
+----------+  #  +------+ for enableScore set to True
|result    |  #  |result|
+----------+  #  +------+
|[positive]|  #  |[1.0] |
|[negative]|  #  |[-2.0]|
+----------+  #  +------+
// In this example, the dictionary `default-sentiment-dict.txt` has the form of
//
// ...
// cool,positive
// superb,positive
// bad,negative
// uninspired,negative
// ...
//
// where each sentiment keyword is delimited by `","`.
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.Tokenizer
import com.johnsnowlabs.nlp.annotators.Lemmatizer
import com.johnsnowlabs.nlp.annotators.sda.pragmatic.SentimentDetector
import com.johnsnowlabs.nlp.util.io.ReadAs
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val lemmatizer = new Lemmatizer()
  .setInputCols("token")
  .setOutputCol("lemma")
  .setDictionary("src/test/resources/lemma-corpus-small/lemmas_small.txt", "->", "\t")

val sentimentDetector = new SentimentDetector()
  .setInputCols("lemma", "document")
  .setOutputCol("sentimentScore")
  .setDictionary("src/test/resources/sentiment-corpus/default-sentiment-dict.txt", ",", ReadAs.TEXT)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  lemmatizer,
  sentimentDetector,
))

val data = Seq(
  "The staff of the restaurant is nice",
  "I recommend others to avoid because it is too expensive"
).toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("sentimentScore.result").show(false)
+----------+  //  +------+ for enableScore set to true
|result    |  //  |result|
+----------+  //  +------+
|[positive]|  //  |[1.0] |
|[negative]|  //  |[-2.0]|
+----------+  //  +------+

Stemmer

Returns hard-stems out of words with the objective of retrieving the meaningful part of the word. For extended examples of usage, see the Spark NLP Workshop.

Input Annotator Types: TOKEN

Output Annotator Type: TOKEN

API: Stemmer Source: Stemmer
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

stemmer = Stemmer() \
    .setInputCols(["token"]) \
    .setOutputCol("stem")

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    stemmer
])

data = spark.createDataFrame([["Peter Pipers employees are picking pecks of pickled peppers."]]) \
    .toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("stem.result").show(truncate = False)
+-------------------------------------------------------------+
|result                                                       |
+-------------------------------------------------------------+
|[peter, piper, employe, ar, pick, peck, of, pickl, pepper, .]|
+-------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.{Stemmer, Tokenizer}
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val stemmer = new Stemmer()
  .setInputCols("token")
  .setOutputCol("stem")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  stemmer
))

val data = Seq("Peter Pipers employees are picking pecks of pickled peppers.")
  .toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("stem.result").show(truncate = false)
+-------------------------------------------------------------+
|result                                                       |
+-------------------------------------------------------------+
|[peter, piper, employe, ar, pick, peck, of, pickl, pepper, .]|
+-------------------------------------------------------------+

StopWordsCleaner

This annotator takes a sequence of strings (e.g. the output of a Tokenizer, Normalizer, Lemmatizer, or Stemmer) and drops all the stop words from the input sequences.

By default, it uses stop words from MLlib's StopWordsRemover. Stop words can also be defined explicitly with setStopWords(value: Array[String]) or loaded from pretrained models using pretrained of its companion object.

val stopWords = StopWordsCleaner.pretrained()
  .setInputCols("token")
  .setOutputCol("cleanTokens")
  .setCaseSensitive(false)
// will load the default pretrained model `"stopwords_en"`.

For available pretrained models please see the Models Hub.

For extended examples of usage, see the Spark NLP Workshop and StopWordsCleanerTestSpec.

NOTE: If you need to setStopWords from a text file, you can first read and convert it into an array of string as follows.

# your stop words text file, each line is one stop word
stopwords = sc.textFile("/tmp/stopwords/english.txt").collect()

# simply use it in StopWordsCleaner
stopWordsCleaner = StopWordsCleaner()\
      .setInputCols("token")\
      .setOutputCol("cleanTokens")\
      .setStopWords(stopwords)\
      .setCaseSensitive(False)

# or you can use pretrained models for StopWordsCleaner
stopWordsCleaner = StopWordsCleaner.pretrained() \
      .setInputCols("token")\
      .setOutputCol("cleanTokens")\
      .setCaseSensitive(False)

// your stop words text file, each line is one stop word
val stopwords = sc.textFile("/tmp/stopwords/english.txt").collect()

// simply use it in StopWordsCleaner
val stopWordsCleaner = new StopWordsCleaner()
      .setInputCols("token")
      .setOutputCol("cleanTokens")
      .setStopWords(stopwords)
      .setCaseSensitive(false)

// or you can use pretrained models for StopWordsCleaner
val stopWordsCleaner = StopWordsCleaner.pretrained()
      .setInputCols("token")
      .setOutputCol("cleanTokens")
      .setCaseSensitive(false)      

Input Annotator Types: TOKEN

Output Annotator Type: TOKEN

API: StopWordsCleaner Source: StopWordsCleaner
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

stopWords = StopWordsCleaner() \
    .setInputCols(["token"]) \
    .setOutputCol("cleanTokens") \
    .setCaseSensitive(False)

pipeline = Pipeline().setStages([
      documentAssembler,
      sentenceDetector,
      tokenizer,
      stopWords
    ])

data = spark.createDataFrame([
    ["This is my first sentence. This is my second."],
    ["This is my third sentence. This is my forth."]
]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("cleanTokens.result").show(truncate=False)
+-------------------------------+
|result                         |
+-------------------------------+
|[first, sentence, ., second, .]|
|[third, sentence, ., forth, .] |
+-------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.Tokenizer
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.annotators.StopWordsCleaner
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols(Array("document"))
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols(Array("sentence"))
  .setOutputCol("token")

val stopWords = new StopWordsCleaner()
  .setInputCols("token")
  .setOutputCol("cleanTokens")
  .setCaseSensitive(false)

val pipeline = new Pipeline().setStages(Array(
    documentAssembler,
    sentenceDetector,
    tokenizer,
    stopWords
  ))

val data = Seq(
  "This is my first sentence. This is my second.",
  "This is my third sentence. This is my forth."
).toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("cleanTokens.result").show(false)
+-------------------------------+
|result                         |
+-------------------------------+
|[first, sentence, ., second, .]|
|[third, sentence, ., forth, .] |
+-------------------------------+

SymmetricDelete Spellchecker

Trains a Symmetric Delete spelling correction algorithm. Retrieves tokens and utilizes distance metrics to compute possible derived words.

Inspired by SymSpell.

For instantiated/pretrained models, see SymmetricDeleteModel.

See SymmetricDeleteModelTestSpec for further reference.

Input Annotator Types: TOKEN

Output Annotator Type: TOKEN

API: SymmetricDeleteApproach Source: SymmetricDeleteApproach
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline
# In this example, the dictionary `"words.txt"` has the form of
#
# ...
# gummy
# gummic
# gummier
# gummiest
# gummiferous
# ...
#
# This dictionary is then set to be the basis of the spell checker.

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

spellChecker = SymmetricDeleteApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("spell") \
    .setDictionary("src/test/resources/spell/words.txt")

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    spellChecker
])
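
# Note: `trainingData` is assumed to be any DataFrame with a "text" column, for example:
# trainingData = spark.createDataFrame([["spmetimes i wrtie wrogn"]]).toDF("text")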

pipelineModel = pipeline.fit(trainingData)
// In this example, the dictionary `"words.txt"` has the form of
//
// ...
// gummy
// gummic
// gummier
// gummiest
// gummiferous
// ...
//
// This dictionary is then set to be the basis of the spell checker.
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.spell.symmetric.SymmetricDeleteApproach
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val spellChecker = new SymmetricDeleteApproach()
  .setInputCols("token")
  .setOutputCol("spell")
  .setDictionary("src/test/resources/spell/words.txt")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  spellChecker
))

val pipelineModel = pipeline.fit(trainingData)

TextMatcher

Annotator to match exact phrases (by token) provided in a file against a Document.

A text file of predefined phrases must be provided with setEntities.

For extended examples of usage, see the Spark NLP Workshop and the TextMatcherTestSpec.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: CHUNK

API: TextMatcher Source: TextMatcher
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline
# In this example, the entities file is of the form
#
# ...
# dolore magna aliqua
# lorem ipsum dolor. sit
# laborum
# ...
#
# where each line represents an entity phrase to be extracted.

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

data = spark.createDataFrame([["Hello dolore magna aliqua. Lorem ipsum dolor. sit in laborum"]]).toDF("text")
entityExtractor = TextMatcher() \
    .setInputCols(["document", "token"]) \
    .setEntities("src/test/resources/entity-extractor/test-phrases.txt", ReadAs.TEXT) \
    .setOutputCol("entity") \
    .setCaseSensitive(False)

pipeline = Pipeline().setStages([documentAssembler, tokenizer, entityExtractor])
results = pipeline.fit(data).transform(data)

results.selectExpr("explode(entity) as result").show(truncate=False)
+------------------------------------------------------------------------------------------+
|result                                                                                    |
+------------------------------------------------------------------------------------------+
|[chunk, 6, 24, dolore magna aliqua, [entity -> entity, sentence -> 0, chunk -> 0], []]    |
|[chunk, 27, 48, Lorem ipsum dolor. sit, [entity -> entity, sentence -> 0, chunk -> 1], []]|
|[chunk, 53, 59, laborum, [entity -> entity, sentence -> 0, chunk -> 2], []]               |
+------------------------------------------------------------------------------------------+
// In this example, the entities file is of the form
//
// ...
// dolore magna aliqua
// lorem ipsum dolor. sit
// laborum
// ...
//
// where each line represents an entity phrase to be extracted.
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.Tokenizer
import com.johnsnowlabs.nlp.annotator.TextMatcher
import com.johnsnowlabs.nlp.util.io.ReadAs
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val data = Seq("Hello dolore magna aliqua. Lorem ipsum dolor. sit in laborum").toDF("text")
val entityExtractor = new TextMatcher()
  .setInputCols("document", "token")
  .setEntities("src/test/resources/entity-extractor/test-phrases.txt", ReadAs.TEXT)
  .setOutputCol("entity")
  .setCaseSensitive(false)
  .setTokenizer(tokenizer.fit(data))

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, entityExtractor))
val results = pipeline.fit(data).transform(data)

results.selectExpr("explode(entity) as result").show(false)
+------------------------------------------------------------------------------------------+
|result                                                                                    |
+------------------------------------------------------------------------------------------+
|[chunk, 6, 24, dolore magna aliqua, [entity -> entity, sentence -> 0, chunk -> 0], []]    |
|[chunk, 27, 48, Lorem ipsum dolor. sit, [entity -> entity, sentence -> 0, chunk -> 1], []]|
|[chunk, 53, 59, laborum, [entity -> entity, sentence -> 0, chunk -> 2], []]               |
+------------------------------------------------------------------------------------------+

Token2Chunk

Converts TOKEN type Annotations to CHUNK type.

This can be useful if entities have already been extracted as TOKEN and following annotators require CHUNK types.

Input Annotator Types: TOKEN

Output Annotator Type: CHUNK

API: Token2Chunk Source: Token2Chunk
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline


documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

token2chunk = Token2Chunk() \
    .setInputCols(["token"]) \
    .setOutputCol("chunk")

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    token2chunk
])

data = spark.createDataFrame([["One Two Three Four"]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("explode(chunk) as result").show(truncate=False)
+------------------------------------------+
|result                                    |
+------------------------------------------+
|[chunk, 0, 2, One, [sentence -> 0], []]   |
|[chunk, 4, 6, Two, [sentence -> 0], []]   |
|[chunk, 8, 12, Three, [sentence -> 0], []]|
|[chunk, 14, 17, Four, [sentence -> 0], []]|
+------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.{Token2Chunk, Tokenizer}

import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val token2chunk = new Token2Chunk()
  .setInputCols("token")
  .setOutputCol("chunk")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  tokenizer,
  token2chunk
))

val data = Seq("One Two Three Four").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(chunk) as result").show(false)
+------------------------------------------+
|result                                    |
+------------------------------------------+
|[chunk, 0, 2, One, [sentence -> 0], []]   |
|[chunk, 4, 6, Two, [sentence -> 0], []]   |
|[chunk, 8, 12, Three, [sentence -> 0], []]|
|[chunk, 14, 17, Four, [sentence -> 0], []]|
+------------------------------------------+

TokenAssembler

This transformer reconstructs a DOCUMENT type annotation from tokens, usually after these have been normalized, lemmatized, spell checked, etc., in order to use this document annotation in further annotators. Requires DOCUMENT and TOKEN type annotations as input.

For more extended examples on document pre-processing see the Spark NLP Workshop.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: DOCUMENT

API: TokenAssembler Source: TokenAssembler
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

# First, the text is tokenized and cleaned
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentences")

tokenizer = Tokenizer() \
    .setInputCols(["sentences"]) \
    .setOutputCol("token")

normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized") \
    .setLowercase(False)

stopwordsCleaner = StopWordsCleaner() \
    .setInputCols(["normalized"]) \
    .setOutputCol("cleanTokens") \
    .setCaseSensitive(False)

# Then the TokenAssembler turns the cleaned tokens into a `DOCUMENT` type structure.
tokenAssembler = TokenAssembler() \
    .setInputCols(["sentences", "cleanTokens"]) \
    .setOutputCol("cleanText")

data = spark.createDataFrame([["Spark NLP is an open-source text processing library for advanced natural language processing."]]) \
    .toDF("text")

pipeline = Pipeline().setStages([
    documentAssembler,
    sentenceDetector,
    tokenizer,
    normalizer,
    stopwordsCleaner,
    tokenAssembler
]).fit(data)

result = pipeline.transform(data)
result.select("cleanText").show(truncate=False)
+---------------------------------------------------------------------------------------------------------------------------+
|cleanText                                                                                                                  |
+---------------------------------------------------------------------------------------------------------------------------+
|[[document, 0, 80, Spark NLP opensource text processing library advanced natural language processing, [sentence -> 0], []]]|
+---------------------------------------------------------------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.annotator.Tokenizer
import com.johnsnowlabs.nlp.annotator.{Normalizer, StopWordsCleaner}
import com.johnsnowlabs.nlp.TokenAssembler
import org.apache.spark.ml.Pipeline

// First, the text is tokenized and cleaned
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentences")

val tokenizer = new Tokenizer()
  .setInputCols("sentences")
  .setOutputCol("token")

val normalizer = new Normalizer()
  .setInputCols("token")
  .setOutputCol("normalized")
  .setLowercase(false)

val stopwordsCleaner = new StopWordsCleaner()
  .setInputCols("normalized")
  .setOutputCol("cleanTokens")
  .setCaseSensitive(false)

// Then the TokenAssembler turns the cleaned tokens into a `DOCUMENT` type structure.
val tokenAssembler = new TokenAssembler()
  .setInputCols("sentences", "cleanTokens")
  .setOutputCol("cleanText")

val data = Seq("Spark NLP is an open-source text processing library for advanced natural language processing.")
  .toDF("text")

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  tokenizer,
  normalizer,
  stopwordsCleaner,
  tokenAssembler
)).fit(data)

val result = pipeline.transform(data)
result.select("cleanText").show(false)
+---------------------------------------------------------------------------------------------------------------------------+
|cleanText                                                                                                                  |
+---------------------------------------------------------------------------------------------------------------------------+
|[[document, 0, 80, Spark NLP opensource text processing library advanced natural language processing, [sentence -> 0], []]]|
+---------------------------------------------------------------------------------------------------------------------------+

Tokenizer

Tokenizes raw text in document type columns into TokenizedSentence.

This class represents a non-fitted tokenizer. Fitting it will cause the internal RuleFactory to construct the rules for tokenizing from the input configuration.

Identifies tokens using open tokenization standards. A few rules help customize it if the defaults do not fit user needs, as in the sketch below.
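
A minimal sketch of customizing the rules (the exception and split-character values below are illustrative assumptions):

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token") \
    .setExceptions(["New York", "e-mail"]) \
    .setSplitChars(["-"])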

For extended examples of usage see the Spark NLP Workshop and Tokenizer test class

Input Annotator Types: DOCUMENT

Output Annotator Type: TOKEN

Note: All these APIs receive regular expressions so please make sure that you escape special characters according to Java conventions.

API: Tokenizer Source: Tokenizer
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

data = spark.createDataFrame([["I'd like to say we didn't expect that. Jane's boyfriend."]]).toDF("text")
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token").fit(data)

pipeline = Pipeline().setStages([documentAssembler, tokenizer]).fit(data)
result = pipeline.transform(data)

result.selectExpr("token.result").show(truncate=False)
+-----------------------------------------------------------------------+
|result                                                                 |
+-----------------------------------------------------------------------+
|[I'd, like, to, say, we, didn't, expect, that, ., Jane's, boyfriend, .]|
+-----------------------------------------------------------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import org.apache.spark.ml.Pipeline

val data = Seq("I'd like to say we didn't expect that. Jane's boyfriend.").toDF("text")
val documentAssembler = new DocumentAssembler().setInputCol("text").setOutputCol("document")
val tokenizer = new Tokenizer().setInputCols("document").setOutputCol("token").fit(data)

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer)).fit(data)
val result = pipeline.transform(data)

result.selectExpr("token.result").show(false)
+-----------------------------------------------------------------------+
|result                                                                 |
+-----------------------------------------------------------------------+
|[I'd, like, to, say, we, didn't, expect, that, ., Jane's, boyfriend, .]|
+-----------------------------------------------------------------------+

TypedDependencyParser

Labeled parser that finds a grammatical relation between two words in a sentence. Its input is either a CoNLL2009 or ConllU dataset.

For instantiated/pretrained models, see TypedDependencyParserModel.

Dependency parsers provide information about word relationships. For example, dependency parsing can tell you what the subjects and objects of a verb are, as well as which words are modifying (describing) the subject. This can help you find precise answers to specific questions.

The parser requires the unlabeled dependencies beforehand, e.g. from DependencyParser. The required training data can be set in two different ways (only one can be chosen for a particular model): either a dataset in the CoNLL 2009 format (set with setConll2009) or a dataset in the CoNLL-U format (set with setConllU, as in the example below).

Apart from that, no additional training data is needed.

See TypedDependencyParserApproachTestSpec for further reference on this API.

Input Annotator Types: TOKEN, POS, DEPENDENCY

Output Annotator Type: LABELED_DEPENDENCY

API: TypedDependencyParserApproach Source: TypedDependencyParserApproach
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

posTagger = PerceptronModel.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("pos")

dependencyParser = DependencyParserModel.pretrained() \
    .setInputCols(["sentence", "pos", "token"]) \
    .setOutputCol("dependency")

typedDependencyParser = TypedDependencyParserApproach() \
    .setInputCols(["dependency", "pos", "token"]) \
    .setOutputCol("dependency_type") \
    .setConllU("src/test/resources/parser/labeled/train_small.conllu.txt") \
    .setNumberOfIterations(1)

pipeline = Pipeline().setStages([
    documentAssembler,
    sentence,
    tokenizer,
    posTagger,
    dependencyParser,
    typedDependencyParser
])

# Additional training data is not needed, the dependency parser relies on CoNLL-U only.
emptyDataSet = spark.createDataFrame([[""]]).toDF("text")
pipelineModel = pipeline.fit(emptyDataSet)
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel
import com.johnsnowlabs.nlp.annotators.parser.dep.DependencyParserModel
import com.johnsnowlabs.nlp.annotators.parser.typdep.TypedDependencyParserApproach
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentence = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")

val posTagger = PerceptronModel.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("pos")

val dependencyParser = DependencyParserModel.pretrained()
  .setInputCols("sentence", "pos", "token")
  .setOutputCol("dependency")

val typedDependencyParser = new TypedDependencyParserApproach()
  .setInputCols("dependency", "pos", "token")
  .setOutputCol("dependency_type")
  .setConllU("src/test/resources/parser/labeled/train_small.conllu.txt")
  .setNumberOfIterations(1)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentence,
  tokenizer,
  posTagger,
  dependencyParser,
  typedDependencyParser
))

// Additional training data is not needed, the dependency parser relies on CoNLL-U only.
val emptyDataSet = Seq.empty[String].toDF("text")
val pipelineModel = pipeline.fit(emptyDataSet)

ViveknSentiment

Trains a sentiment analyzer inspired by the algorithm by Vivek Narayanan (https://github.com/vivekn/sentiment/).

The algorithm is based on the paper “Fast and accurate sentiment classification using an enhanced Naive Bayes model”.

The analyzer requires sentence boundaries to give a score in context. Tokenization is needed to make sure tokens are within bounds. Transitivity requirements also apply.

The training data needs to consist of a column for normalized text and a label column (either "positive" or "negative").

For extended examples of usage, see the Spark NLP Workshop and the ViveknSentimentTestSpec.

Input Annotator Types: TOKEN, DOCUMENT

Output Annotator Type: SENTIMENT

API: ViveknSentimentApproach Source: ViveknSentimentApproach
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

document = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

token = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normal")

vivekn = ViveknSentimentApproach() \
    .setInputCols(["document", "normal"]) \
    .setSentimentCol("train_sentiment") \
    .setOutputCol("result_sentiment")

finisher = Finisher() \
    .setInputCols(["result_sentiment"]) \
    .setOutputCols("final_sentiment")

pipeline = Pipeline().setStages([document, token, normalizer, vivekn, finisher])

training = spark.createDataFrame([
    ("I really liked this movie!", "positive"),
    ("The cast was horrible", "negative"),
    ("Never going to watch this again or recommend it to anyone", "negative"),
    ("It's a waste of time", "negative"),
    ("I loved the protagonist", "positive"),
    ("The music was really really good", "positive")
]).toDF("text", "train_sentiment")
pipelineModel = pipeline.fit(training)

data = spark.createDataFrame([
    ["I recommend this movie"],
    ["Dont waste your time!!!"]
]).toDF("text")
result = pipelineModel.transform(data)

result.select("final_sentiment").show(truncate=False)
+---------------+
|final_sentiment|
+---------------+
|[positive]     |
|[negative]     |
+---------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.annotators.Normalizer
import com.johnsnowlabs.nlp.annotators.sda.vivekn.ViveknSentimentApproach
import com.johnsnowlabs.nlp.Finisher
import org.apache.spark.ml.Pipeline

val document = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val token = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val normalizer = new Normalizer()
  .setInputCols("token")
  .setOutputCol("normal")

val vivekn = new ViveknSentimentApproach()
  .setInputCols("document", "normal")
  .setSentimentCol("train_sentiment")
  .setOutputCol("result_sentiment")

val finisher = new Finisher()
  .setInputCols("result_sentiment")
  .setOutputCols("final_sentiment")

val pipeline = new Pipeline().setStages(Array(document, token, normalizer, vivekn, finisher))

val training = Seq(
  ("I really liked this movie!", "positive"),
  ("The cast was horrible", "negative"),
  ("Never going to watch this again or recommend it to anyone", "negative"),
  ("It's a waste of time", "negative"),
  ("I loved the protagonist", "positive"),
  ("The music was really really good", "positive")
).toDF("text", "train_sentiment")
val pipelineModel = pipeline.fit(training)

val data = Seq(
  "I recommend this movie",
  "Dont waste your time!!!"
).toDF("text")
val result = pipelineModel.transform(data)

result.select("final_sentiment").show(false)
+---------------+
|final_sentiment|
+---------------+
|[positive]     |
|[negative]     |
+---------------+

WordEmbeddings

Word Embeddings lookup annotator that maps tokens to vectors.

For instantiated/pretrained models, see WordEmbeddingsModel.

A custom token lookup dictionary for embeddings can be set with setStoragePath. Each line of the provided file needs to have a token, followed by its vector representation, delimited by spaces.

...
are 0.39658191506190343 0.630968081620067 0.5393722253731201 0.8428180123359783
were 0.7535235923631415 0.9699218875629833 0.10397182122983872 0.11833962569383116
stress 0.0492683418305907 0.9415954572751959 0.47624463167525755 0.16790967216778263
induced 0.1535748762292387 0.33498936903209897 0.9235178224122094 0.1158772920395934
...

If a token is not found in the dictionary, then the result will be a zero vector of the same dimension. Statistics about the rate of converted tokens can be retrieved with WordEmbeddingsModel.withCoverageColumn and WordEmbeddingsModel.overallCoverage.
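
A minimal sketch of checking coverage on a transformed DataFrame (assuming the Python API mirrors these helpers; the result and column names follow the example below):

coverage = WordEmbeddingsModel.overallCoverage(result, "embeddings")
print(coverage.percentage)
withCoverage = WordEmbeddingsModel.withCoverageColumn(result, "embeddings", "embeddings_coverage")
withCoverage.select("text", "embeddings_coverage").show(truncate=False)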

For extended examples of usage, see the Spark NLP Workshop and the WordEmbeddingsTestSpec.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: WORD_EMBEDDINGS

API: WordEmbeddings Source: WordEmbeddings
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline
# In this example, the file `random_embeddings_dim4.txt` has the form of the content above.

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = WordEmbeddings() \
    .setStoragePath("src/test/resources/random_embeddings_dim4.txt", ReadAs.TEXT) \
    .setStorageRef("glove_4d") \
    .setDimension(4) \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

embeddingsFinisher = EmbeddingsFinisher() \
    .setInputCols(["embeddings"]) \
    .setOutputCols("finished_embeddings") \
    .setOutputAsVector(True) \
    .setCleanAnnotations(False)

pipeline = Pipeline() \
    .setStages([
      documentAssembler,
      tokenizer,
      embeddings,
      embeddingsFinisher
    ])

data = spark.createDataFrame([["The patient was diagnosed with diabetes."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(truncate=False)
+----------------------------------------------------------------------------------+
|result                                                                            |
+----------------------------------------------------------------------------------+
|[0.9439099431037903,0.4707513153553009,0.806300163269043,0.16176554560661316]     |
|[0.7966810464859009,0.5551124811172485,0.8861005902290344,0.28284206986427307]    |
|[0.025029370561242104,0.35177749395370483,0.052506182342767715,0.1887107789516449]|
|[0.08617766946554184,0.8399239182472229,0.5395117998123169,0.7864698767662048]    |
|[0.6599600911140442,0.16109347343444824,0.6041093468666077,0.8913561105728149]    |
|[0.5955275893211365,0.01899011991918087,0.4397728443145752,0.8911281824111938]    |
|[0.9840458631515503,0.7599489092826843,0.9417727589607239,0.8624503016471863]     |
+----------------------------------------------------------------------------------+
// In this example, the file `random_embeddings_dim4.txt` has the form of the content above.
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.WordEmbeddings
import com.johnsnowlabs.nlp.util.io.ReadAs
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols(Array("document"))
  .setOutputCol("token")

val embeddings = new WordEmbeddings()
  .setStoragePath("src/test/resources/random_embeddings_dim4.txt", ReadAs.TEXT)
  .setStorageRef("glove_4d")
  .setDimension(4)
  .setInputCols("document", "token")
  .setOutputCol("embeddings")

val embeddingsFinisher = new EmbeddingsFinisher()
  .setInputCols("embeddings")
  .setOutputCols("finished_embeddings")
  .setOutputAsVector(true)
  .setCleanAnnotations(false)

val pipeline = new Pipeline()
  .setStages(Array(
    documentAssembler,
    tokenizer,
    embeddings,
    embeddingsFinisher
  ))

val data = Seq("The patient was diagnosed with diabetes.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(false)
+----------------------------------------------------------------------------------+
|result                                                                            |
+----------------------------------------------------------------------------------+
|[0.9439099431037903,0.4707513153553009,0.806300163269043,0.16176554560661316]     |
|[0.7966810464859009,0.5551124811172485,0.8861005902290344,0.28284206986427307]    |
|[0.025029370561242104,0.35177749395370483,0.052506182342767715,0.1887107789516449]|
|[0.08617766946554184,0.8399239182472229,0.5395117998123169,0.7864698767662048]    |
|[0.6599600911140442,0.16109347343444824,0.6041093468666077,0.8913561105728149]    |
|[0.5955275893211365,0.01899011991918087,0.4397728443145752,0.8911281824111938]    |
|[0.9840458631515503,0.7599489092826843,0.9417727589607239,0.8624503016471863]     |
+----------------------------------------------------------------------------------+

WordSegmenter

Trains a WordSegmenter, which tokenizes non-English or non-whitespace-separated texts.

Many languages, such as Chinese, Japanese or Korean, are not whitespace separated; their sentences are written as a continuous sequence of symbols. Without knowledge of the language, splitting such text into its corresponding tokens is impossible. The WordSegmenter is trained on these languages to split the text into semantically correct parts.

For instantiated/pretrained models, see WordSegmenterModel.

To train your own model, a training dataset consisting of Part-Of-Speech tags is required. The data has to be loaded into a DataFrame, where the tag column contains Annotations of type "POS". The name of this column is set with setPosColumn.

Tip: The helper class POS might be useful to read training data into data frames.

For extended examples of usage, see the Spark NLP Workshop and the WordSegmenterTest.

Input Annotator Types: DOCUMENT

Output Annotator Type: TOKEN

API: WordSegmenterApproach Source: WordSegmenterApproach
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline
# In this example, `"chinese_train.utf8"` is in the form of
#
# 十|LL 四|RR 不|LL 是|RR 四|LL 十|RR
#
# and is loaded with the `POS` class to create a dataframe of `"POS"` type Annotations.

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

wordSegmenter = WordSegmenterApproach() \
    .setInputCols(["document"]) \
    .setOutputCol("token") \
    .setPosColumn("tags") \
    .setNIterations(5)

pipeline = Pipeline().setStages([
    documentAssembler,
    wordSegmenter
])

trainingDataSet = POS().readDataset(
    spark,
    "src/test/resources/word-segmenter/chinese_train.utf8"
)

pipelineModel = pipeline.fit(trainingDataSet)
// In this example, `"chinese_train.utf8"` is in the form of
//
// 十|LL 四|RR 不|LL 是|RR 四|LL 十|RR
//
// and is loaded with the `POS` class to create a dataframe of `"POS"` type Annotations.
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.ws.WordSegmenterApproach
import com.johnsnowlabs.nlp.training.POS
import com.johnsnowlabs.nlp.util.io.ResourceHelper
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val wordSegmenter = new WordSegmenterApproach()
  .setInputCols("document")
  .setOutputCol("token")
  .setPosColumn("tags")
  .setNIterations(5)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  wordSegmenter
))

val trainingDataSet = POS().readDataset(
  ResourceHelper.spark,
  "src/test/resources/word-segmenter/chinese_train.utf8"
)

val pipelineModel = pipeline.fit(trainingDataSet)
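
The examples above stop at fit(). As a hedged follow-up sketch (the sample sentence and the segmentation in the comment are assumptions, not output generated from this training file), the fitted pipeline could then be used to segment new text:

import spark.implicits._

// Hypothetical usage of the fitted pipeline on unsegmented text
val testData = Seq("十四不是四十").toDF("text")
val segmented = pipelineModel.transform(testData)
segmented.selectExpr("token.result").show(false)
// Expected to yield the segmented words, e.g. [十四, 不是, 四十]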

Yake

Yake is an unsupervised, corpus-independent, domain- and language-independent, single-document keyword extraction algorithm.

Extracting keywords from texts has become a challenge for individuals and organizations as the information grows in complexity and size. The need to automate this task so that text can be processed in a timely and adequate manner has led to the emergence of automatic keyword extraction tools. Yake is a novel feature-based system for multi-lingual keyword extraction, which supports texts of different sizes, domains or languages. Unlike other approaches, Yake does not rely on dictionaries or thesauri, nor is it trained on any corpora. Instead, it follows an unsupervised approach which builds upon features extracted from the text, thus making it applicable to documents written in different languages without the need for further knowledge. This can be beneficial for a large number of tasks and a plethora of situations where access to training corpora is either limited or restricted. The algorithm makes use of the position of a sentence and token. Therefore, to use the annotator, the text should first be sent through a Sentence Boundary Detector and then a tokenizer.

Note that each keyword is given a score greater than 0 (the lower the score, the better the keyword). Therefore, to filter the keywords, an upper bound for the score can be set with setThreshold.

For extended examples of usage, see the Spark NLP Workshop and the YakeTestSpec.

Sources:

Campos, R., Mangaravite, V., Pasquali, A., Jorge, A., Nunes, C. and Jatowt, A. (2020). YAKE! Keyword Extraction from Single Documents using Multiple Local Features. In Information Sciences Journal. Elsevier, Vol 509, pp 257-289

Paper abstract:

As the amount of generated information grows, reading and summarizing texts of large collections turns into a challenging task. Many documents do not come with descriptive terms, thus requiring humans to generate keywords on-the-fly. The need to automate this kind of task demands the development of keyword extraction systems with the ability to automatically identify keywords within the text. One approach is to resort to machine-learning algorithms. These, however, depend on large annotated text corpora, which are not always available. An alternative solution is to consider an unsupervised approach. In this article, we describe YAKE!, a light-weight unsupervised automatic keyword extraction method which rests on statistical text features extracted from single documents to select the most relevant keywords of a text. Our system does not need to be trained on a particular set of documents, nor does it depend on dictionaries, external corpora, text size, language, or domain. To demonstrate the merits and significance of YAKE!, we compare it against ten state-of-the-art unsupervised approaches and one supervised method. Experimental results carried out on top of twenty datasets show that YAKE! significantly outperforms other unsupervised methods on texts of different sizes, languages, and domains.

Input Annotator Types: TOKEN

Output Annotator Type: KEYWORD

API: YakeModel Source: YakeModel
Show Example
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentenceDetector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

token = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token") \
    .setContextChars(["(", ")", "?", "!", ".", ","])

keywords = YakeModel() \
    .setInputCols(["token"]) \
    .setOutputCol("keywords") \
    .setThreshold(0.6) \
    .setMinNGrams(2) \
    .setNKeywords(10)

pipeline = Pipeline().setStages([
    documentAssembler,
    sentenceDetector,
    token,
    keywords
])

data = spark.createDataFrame([[
    "Sources tell us that Google is acquiring Kaggle, a platform that hosts data science and machine learning competitions. Details about the transaction remain somewhat vague, but given that Google is hosting its Cloud Next conference in San Francisco this week, the official announcement could come as early as tomorrow. Reached by phone, Kaggle co-founder CEO Anthony Goldbloom declined to deny that the acquisition is happening. Google itself declined 'to comment on rumors'. Kaggle, which has about half a million data scientists on its platform, was founded by Goldbloom  and Ben Hamner in 2010. The service got an early start and even though it has a few competitors like DrivenData, TopCoder and HackerRank, it has managed to stay well ahead of them by focusing on its specific niche. The service is basically the de facto home for running data science and machine learning competitions. With Kaggle, Google is buying one of the largest and most active communities for data scientists - and with that, it will get increased mindshare in this community, too (though it already has plenty of that thanks to Tensorflow and other projects). Kaggle has a bit of a history with Google, too, but that's pretty recent. Earlier this month, Google and Kaggle teamed up to host a $100,000 machine learning competition around classifying YouTube videos. That competition had some deep integrations with the Google Cloud Platform, too. Our understanding is that Google will keep the service running - likely under its current name. While the acquisition is probably more about Kaggle's community than technology, Kaggle did build some interesting tools for hosting its competition and 'kernels', too. On Kaggle, kernels are basically the source code for analyzing data sets and developers can share this code on the platform (the company previously called them 'scripts'). Like similar competition-centric sites, Kaggle also runs a job board, too. It's unclear what Google will do with that part of the service. According to Crunchbase, Kaggle raised $12.5 million (though PitchBook says it's $12.75) since its   launch in 2010. Investors in Kaggle include Index Ventures, SV Angel, Max Levchin, NaRavikant, Google chie economist Hal Varian, Khosla Ventures and Yuri Milner"
]]).toDF("text")
result = pipeline.fit(data).transform(data)

# combine the result and score (contained in keywords.metadata)
scores = result \
    .selectExpr("explode(arrays_zip(keywords.result, keywords.metadata)) as resultTuples") \
    .selectExpr("resultTuples['0'] as keyword", "resultTuples['1'].score as score")

# Order ascending, as lower scores means higher importance
scores.orderBy("score").show(5, truncate = False)
+---------------------+-------------------+
|keyword              |score              |
+---------------------+-------------------+
|google cloud         |0.32051516486864573|
|google cloud platform|0.37786450577630676|
|ceo anthony goldbloom|0.39922830978423146|
|san francisco        |0.40224744669493756|
|anthony goldbloom    |0.41584827825302534|
+---------------------+-------------------+
import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.{SentenceDetector, Tokenizer}
import com.johnsnowlabs.nlp.annotators.keyword.yake.YakeModel
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")

val token = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")
  .setContextChars(Array("(", ")", "?", "!", ".", ","))

val keywords = new YakeModel()
  .setInputCols("token")
  .setOutputCol("keywords")
  .setThreshold(0.6f)
  .setMinNGrams(2)
  .setNKeywords(10)

val pipeline = new Pipeline().setStages(Array(
  documentAssembler,
  sentenceDetector,
  token,
  keywords
))

val data = Seq(
  "Sources tell us that Google is acquiring Kaggle, a platform that hosts data science and machine learning competitions. Details about the transaction remain somewhat vague, but given that Google is hosting its Cloud Next conference in San Francisco this week, the official announcement could come as early as tomorrow. Reached by phone, Kaggle co-founder CEO Anthony Goldbloom declined to deny that the acquisition is happening. Google itself declined 'to comment on rumors'. Kaggle, which has about half a million data scientists on its platform, was founded by Goldbloom  and Ben Hamner in 2010. The service got an early start and even though it has a few competitors like DrivenData, TopCoder and HackerRank, it has managed to stay well ahead of them by focusing on its specific niche. The service is basically the de facto home for running data science and machine learning competitions. With Kaggle, Google is buying one of the largest and most active communities for data scientists - and with that, it will get increased mindshare in this community, too (though it already has plenty of that thanks to Tensorflow and other projects). Kaggle has a bit of a history with Google, too, but that's pretty recent. Earlier this month, Google and Kaggle teamed up to host a $100,000 machine learning competition around classifying YouTube videos. That competition had some deep integrations with the Google Cloud Platform, too. Our understanding is that Google will keep the service running - likely under its current name. While the acquisition is probably more about Kaggle's community than technology, Kaggle did build some interesting tools for hosting its competition and 'kernels', too. On Kaggle, kernels are basically the source code for analyzing data sets and developers can share this code on the platform (the company previously called them 'scripts'). Like similar competition-centric sites, Kaggle also runs a job board, too. It's unclear what Google will do with that part of the service. According to Crunchbase, Kaggle raised $12.5 million (though PitchBook says it's $12.75) since its   launch in 2010. Investors in Kaggle include Index Ventures, SV Angel, Max Levchin, Naval Ravikant, Google chief economist Hal Varian, Khosla Ventures and Yuri Milner"
).toDF("text")
val result = pipeline.fit(data).transform(data)

// combine the result and score (contained in keywords.metadata)
val scores = result
  .selectExpr("explode(arrays_zip(keywords.result, keywords.metadata)) as resultTuples")
  .select($"resultTuples.0" as "keyword", $"resultTuples.1.score")

// Order ascending, as lower scores means higher importance
scores.orderBy("score").show(5, truncate = false)
+---------------------+-------------------+
|keyword              |score              |
+---------------------+-------------------+
|google cloud         |0.32051516486864573|
|google cloud platform|0.37786450577630676|
|ceo anthony goldbloom|0.39922830978423146|
|san francisco        |0.40224744669493756|
|anthony goldbloom    |0.41584827825302534|
+---------------------+-------------------+