Annotators

 

How to read this section

All annotators in Spark NLP share a common interface:

  • Annotation: Annotation(annotatorType, begin, end, result, metadata, embeddings) (see the illustration after this list)
  • AnnotatorType: some annotators share a type. This is not only figurative; it also tells you about the structure of the metadata map in the Annotation. This is the type referred to in the inputs and output of annotators.
  • Inputs: Represents how many and which annotator types are expected in setInputCols(). These are the column names of the output of other annotators in the DataFrame.
  • Output: Represents the type of the output in the column set by setOutputCol().
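
For instance, a single token produced by a Tokenizer looks roughly like this (an illustration only; the exact metadata keys vary per annotator):

Annotation(token, 0, 4, Hello, {sentence -> 0}, [])

Here token is the annotatorType, 0 and 4 are the begin and end character offsets, Hello is the result, the map is the metadata, and the last field holds the embeddings (empty for non-embedding annotators).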

There are two types of Annotators:

  • Approach: AnnotatorApproach extends Spark ML Estimator; it is meant to be trained through fit()
  • Model: AnnotatorModel extends Spark ML Transformer; it is meant to transform DataFrames through transform()

The Model suffix is explicitly stated when the annotator is the result of a training process. Some annotators, such as Tokenizer, are transformers but do not contain the word Model since they are not trained annotators.

Model annotators have a pretrained() method on their static object to retrieve the public pre-trained version of a model.

  • pretrained(name, language, extra_location) -> by default, pretrained() brings a default model; sometimes we offer more than one model, in which case you may have to use name, language or extra_location to download the one you want.
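
For example, a minimal sketch of fetching a specific public model by name and language (the same model is used in the WordEmbeddings section below):

pretrained_embeddings = WordEmbeddingsModel.pretrained("glove_100d", "en") \
    .setInputCols("document", "token") \
    .setOutputCol("embeddings")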

The types are:

  • DOCUMENT = "document"
  • TOKEN = "token"
  • WORDPIECE = "wordpiece"
  • WORD_EMBEDDINGS = "word_embeddings"
  • SENTENCE_EMBEDDINGS = "sentence_embeddings"
  • CATEGORY = "category"
  • SENTIMENT = "sentiment"
  • POS = "pos"
  • NAMED_ENTITY = "named_entity"
  • DATE = "date"
  • ENTITY = "entity"
  • NEGEX = "negex"
  • DEPENDENCY = "dependency"
  • KEYWORD = "keyword"
  • LABELED_DEPENDENCY = "labeled_dependency"
  • LANGUAGE = "language"
  • CHUNK = "chunk"
| Annotator | Description | Version |
|---|---|---|
| Tokenizer | Identifies tokens with tokenization open standards | Opensource |
| WordSegmenter | Trainable annotator for word segmentation of languages without rule-based tokenization, such as Chinese, Japanese, or Korean | Opensource |
| Normalizer | Removes all dirty characters from text | Opensource |
| DocumentNormalizer | Cleans content from HTML or XML documents | Opensource |
| Stemmer | Returns hard-stems of words with the objective of retrieving the meaningful part of the word | Opensource |
| Lemmatizer | Retrieves lemmas of words with the objective of returning a base dictionary word | Opensource |
| StopWordsCleaner | Takes a sequence of strings (e.g. the output of a Tokenizer, Normalizer, Lemmatizer, or Stemmer) and drops all the stop words from the input sequences | Opensource |
| RegexMatcher | Uses a reference file to match a set of regular expressions and put them inside a provided key | Opensource |
| TextMatcher | Annotator to match entire phrases (by token) provided in a file against a Document | Opensource |
| Chunker | Matches a pattern of part-of-speech tags in order to return meaningful phrases from a document | Opensource |
| NGramGenerator | Integrates Spark ML's NGram function with a new cumulative feature to also generate range n-grams like the scikit-learn library | Opensource |
| DateMatcher | Reads different forms of date and time expressions and converts them to a provided date format | Opensource |
| MultiDateMatcher | Reads multiple different forms of date and time expressions and converts them to a provided date format | Opensource |
| SentenceDetector | Finds sentence bounds in raw text; applies rules from Pragmatic Segmenter | Opensource |
| POSTagger | Sets a part-of-speech tag to each word within a sentence | Opensource |
| ViveknSentimentDetector | Scores a sentence for a sentiment | Opensource |
| SentimentDetector | Scores a sentence for a sentiment | Opensource |
| WordEmbeddings | Word embeddings lookup annotator that maps tokens to vectors | Opensource |
| BertEmbeddings | BERT (Bidirectional Encoder Representations from Transformers) provides dense vector representations for natural language by using a deep, pre-trained neural network with the Transformer architecture | Opensource |
| BertSentenceEmbeddings | Generates sentence embeddings from all BERT models | Opensource |
| ElmoEmbeddings | Computes contextualized word representations using character-based word representations and bidirectional LSTMs | Opensource |
| AlbertEmbeddings | Computes contextualized word representations using an "A Lite" implementation of the BERT algorithm by applying parameter-reduction techniques | Opensource |
| XlnetEmbeddings | Computes contextualized word representations using a combination of an autoregressive language model and a permutation language model | Opensource |
| UniversalSentenceEncoder | Encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks | Opensource |
| SentenceEmbeddings | Uses WordEmbeddings or BertEmbeddings to generate sentence or document embeddings | Opensource |
| ChunkEmbeddings | Uses WordEmbeddings or BertEmbeddings to generate chunk embeddings from either Chunker, NGramGenerator, or NerConverter outputs | Opensource |
| ClassifierDL | Multi-class text classification. ClassifierDL uses the state-of-the-art Universal Sentence Encoder as an input for text classification; the annotator uses a deep learning model (DNNs) built inside TensorFlow and supports up to 100 classes | Opensource |
| MultiClassifierDL | Multi-label text classification. MultiClassifierDL uses a Bidirectional GRU with a convolution model built inside TensorFlow and supports up to 100 classes | Opensource |
| SentimentDL | Multi-class sentiment analysis annotator that comes with 2 available pre-trained models trained on the IMDB and Twitter datasets | Opensource |
| T5Transformer | Text-To-Text Transfer Transformer (Google T5) models achieve state-of-the-art results on multiple NLP tasks such as translation, summarization, question answering, and sentence similarity | Opensource |
| MarianTransformer | Neural machine translation based on MarianNMT models developed by the Microsoft Translator team | Opensource |
| LanguageDetectorDL | State-of-the-art language detection and identification annotator trained using TensorFlow/Keras neural networks | Opensource |
| YakeModel | Yake is an unsupervised, corpus-independent, domain- and language-independent, single-document keyword extraction algorithm | Opensource |
| NerDL | Named entity recognition annotator that allows a generic model to be trained using a deep learning algorithm (Char CNNs - BiLSTM - CRF - word embeddings) | Opensource |
| NerCrf | Named entity recognition annotator that allows a generic model to be trained using a CRF machine learning algorithm | Opensource |
| NorvigSweeting SpellChecker | Retrieves tokens and makes corrections automatically if they are not found in an English dictionary | Opensource |
| SymmetricDelete SpellChecker | Spell checker inspired by the Symmetric Delete algorithm | Opensource |
| Context SpellChecker | Implements the Noisy Channel Model spell-checking algorithm; correction candidates are extracted combining context information and word information | Opensource |
| DependencyParser | Unlabeled parser that finds grammatical relations between two words in a sentence | Opensource |
| TypedDependencyParser | Labeled parser that finds grammatical relations between two words in a sentence | Opensource |
| PubTator reader | Converts automatic annotations of biomedical datasets into a Spark DataFrame | Opensource |

Tokenizer

Identifies tokens with tokenization open standards. A few rules will help you customize it if the defaults do not fit your needs.

Output Annotator Type: Token

Input Annotator Types: Document

Note: all these APIs receive regular expressions so please make sure that you escape special characters according to Java conventions.

Example:

tokenizer = Tokenizer() \
    .setInputCols(["sentences"]) \
    .setOutputCol("token") \
    .setSplitChars(['-']) \
    .setContextChars(['(', ')', '?', '!']) \
    .setExceptions(["New York", "e-mail"]) \
    .setSplitPattern("'") \
    .setMinLength(0) \
    .setMaxLength(99999) \
    .setCaseSensitiveExceptions(False)
val tokenizer = new Tokenizer()
    .setInputCols("sentence")
    .setOutputCol("token")
    .setContextChars(Array("(", ")", "?", "!"))
    .setSplitChars(Array('-'))
    .setExceptions(Array("New York", "e-mail"))
    .setSplitPattern("'")
    .setMinLength(0)
    .setMaxLength(99999)
    .setCaseSensitiveExceptions(false)
API: Tokenizer
Source: Tokenizer

DocumentNormalizer (Text cleaning)

Annotator which normalizes raw, tagged text (e.g. scraped web pages or XML documents) from document type columns.

Output Annotator Type: Document

Input Annotator Types: Document

Example:

cleanUpPatterns = ["<[^>]*>"]  # regexes to remove, e.g. HTML/XML tags
removalPolicy = "pretty_all"

documentNormalizer = DocumentNormalizer() \
      .setInputCols("document") \
      .setOutputCol("normalizedDocument") \
      .setPatterns(cleanUpPatterns) \
      .setPolicy(removalPolicy)
    val cleanUpPatterns = Array("<[^>]*>") // regexes to remove, e.g. HTML/XML tags
    val removalPolicy = "pretty_all"

    val documentNormalizer = new DocumentNormalizer()
      .setInputCols("document")
      .setOutputCol("normalizedDocument")
      .setPatterns(cleanUpPatterns)
      .setPolicy(removalPolicy)
API: DocumentNormalizer
Source: DocumentNormalizer

Normalizer (Text cleaning)

Removes all dirty characters from text following a regex pattern and transforms words based on a provided dictionary

Output Annotator Type: Token

Input Annotator Types: Token

Example:

normalizer = Normalizer() \
    .setInputCols(["token"]) \
    .setOutputCol("normalized")\
    .setLowercase(True)\
    .setCleanupPatterns(["[^\w\d\s]"]) \
    .setSlangMatchCase(False)
val normalizer = new Normalizer()
    .setInputCols(Array("token"))
    .setOutputCol("normalized")
    .setLowercase(true)
    .setCleanupPatterns(Array("[^\\w\\d\\s]"))
    .setSlangMatchCase(false)
API: Normalizer
Source: Normalizer

Stemmer

Returns hard-stems out of words with the objective of retrieving the meaningful part of the word

Output Annotator Type: Token

Input Annotator Types: Token

Example:

stemmer = Stemmer() \
    .setInputCols(["token"]) \
    .setOutputCol("stem") \
    .setLanguage("English")
    
val stemmer = new Stemmer()
    .setInputCols(Array("token"))
    .setOutputCol("stem")
    .setLanguage("English") 
    
API: Stemmer
Source: Stemmer

Lemmatizer

Retrieves lemmas out of words with the objective of returning a base dictionary word

Output Annotator Type: Token

Input Annotator Types: Token

Example:

# Uncomment to Download the Dictionary
# !wget -q https://raw.githubusercontent.com/mahavivo/vocabulary/master/lemmas/AntBNC_lemmas_ver_001.txt

lemmatizer = Lemmatizer() \
    .setInputCols(["token"]) \
    .setOutputCol("lemma") \
    .setDictionary("./AntBNC_lemmas_ver_001.txt", value_delimiter ="\t", key_delimiter = "->")
    
// Uncomment to Download the Dictionary
// !wget -q https://raw.githubusercontent.com/mahavivo/vocabulary/master/lemmas/AntBNC_lemmas_ver_001.txt

val lemmatizer = new Lemmatizer()
    .setInputCols(Array("token"))
    .setOutputCol("lemma")
    .setDictionary("./AntBNC_lemmas_ver_001.txt", "->", "\t") // key delimiter "->", value delimiter "\t"
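
The dictionary is a plain-text file in which each line starts with a lemma, followed by the key delimiter ("->") and its inflected forms separated by the value delimiter (a tab here). An illustrative line (the gaps are tab characters):

abduct -> abducted	abducting	abducts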
    
API: Lemmatizer
Source: Lemmatizer

StopWordsCleaner

This annotator takes a sequence of strings (e.g. the output of a Tokenizer(), Normalizer(), Lemmatizer(), or Stemmer()) and drops all the stop words from the input sequences.

Output Annotator Type: token

Input Annotator Types: token

Example:

stop_words_cleaner = StopWordsCleaner() \
    .setInputCols(["token"]) \
    .setOutputCol("cleanTokens") \
    .setStopWords(["this", "is", "and"]) \
    .setCaseSensitive(False)
    
val stopWordsCleaner = new StopWordsCleaner()
    .setInputCols("token")
    .setOutputCol("cleanTokens")
    .setStopWords(Array("this", "is", "and"))
    .setCaseSensitive(false)
    

NOTE: If you need to setStopWords from a text file, you can first read it and convert it into an array of strings as follows.

# your stop words text file, each line is one stop word
stopwords = sc.textFile("/tmp/stopwords/english.txt").collect()

# simply use it in StopWordsCleaner
stopWordsCleaner = StopWordsCleaner()\
      .setInputCols("token")\
      .setOutputCol("cleanTokens")\
      .setStopWords(stopwords)\
      .setCaseSensitive(False)

# or you can use pretrained models for StopWordsCleaner
stopWordsCleaner = StopWordsCleaner.pretrained() \
      .setInputCols("token")\
      .setOutputCol("cleanTokens")\
      .setCaseSensitive(False)

// your stop words text file, each line is one stop word
val stopwords = sc.textFile("/tmp/stopwords/english.txt").collect()

// simply use it in StopWordsCleaner
val stopWordsCleaner = new StopWordsCleaner()
      .setInputCols("token")
      .setOutputCol("cleanTokens")
      .setStopWords(stopwords)
      .setCaseSensitive(false)

// or you can use pretrained models for StopWordsCleaner
val stopWordsCleaner = StopWordsCleaner.pretrained()
      .setInputCols("token")
      .setOutputCol("cleanTokens")
      .setCaseSensitive(false)      
API: StopWordsCleaner
Source: StopWordsCleaner

RegexMatcher

Uses a reference file to match a set of regular expressions and put them inside a provided key. File must be comma separated.

Output Annotator Type: Regex

Input Annotator Types: Document

Example:

# For example, here are some Regex Rules which you can write in regex_rules.txt
rules = '''
renal\s\w+, started with 'renal'
cardiac\s\w+, started with 'cardiac'
\w*ly\b, ending with 'ly'
\S*\d+\S*, match any word that contains numbers
(\d+).?(\d*)\s*(mg|ml|g), match medication metrics
'''

regex_matcher = RegexMatcher() \
    .setStrategy("MATCH_ALL") \
    .setInputCols("document") \
    .setOutputCol("regex") \
    .setExternalRules(path='./regex_rules.txt', delimiter=',')
    
val regexMatcher = new RegexMatcher()
    .setStrategy("MATCH_ALL")
    .setInputCols(Array("document"))
    .setOutputCol("regex")
    .setExternalRules("./regex_rules.txt", ",") // rules file and delimiter
    
API: RegexMatcher
Source: RegexMatcher RegexMatcherModel

TextMatcher (Phrase matching)

Annotator to match entire phrases (by token) provided in a file against a Document

Output Annotator Type: Entity

Input Annotator Types: Document, Token

Example:

# For example, here are some entities and they are stored in sport_entities.txt
entities = ['soccer', 'world cup', 'Messi', 'FC Barcelona', 'cricket', 'Dhoni']

entity_extractor = TextMatcher() \
    .setInputCols(["inputCol"]) \
    .setOutputCol("entity") \
    .setEntities("/path/to/file/sport_entities.txt") \
    .setEntityValue('sport_entity') \
    .setCaseSensitive(True) \
    .setMergeOverlapping(False)
// Assume the following are our entities and they are stored in sport_entities.txt
val entities = Array("soccer", "world cup", "Messi", "FC Barcelona", "cricket", "Dhoni")

val entityExtractor = new TextMatcher()
    .setInputCols("inputCol")
    .setOutputCol("entity")
    .setEntities("/path/to/file/myentities.txt")
    .setEntityValue("sport_entity")
    .setCaseSensitive(true)
    .setMergeOverlapping(false)
API: TextMatcher
Source: TextMatcher TextMatcherModel

Chunker

This annotator matches a pattern of part-of-speech tags in order to return meaningful phrases from document

Output Annotator Type: Chunk

Input Annotator Types: Document, POS

Example:

chunker = Chunker() \
    .setInputCols(["document", "pos"]) \
    .setOutputCol("chunk") \
    .setRegexParsers(["<NNP>+", "<DT|PP\\$>?<JJ>*<NN>"])
    
val chunker = new Chunker()
    .setInputCols(Array("document", "pos"))
    .setOutputCol("chunk")
    .setRegexParsers(Array("<NNP>+", "<DT|PP\\$>?<JJ>*<NN>"))
    
API: Chunker
Source: Chunker

NGramGenerator

The NGramGenerator annotator takes as input a sequence of strings (e.g. the output of a Tokenizer, Normalizer, Stemmer, Lemmatizer, or StopWordsCleaner). The parameter n is used to determine the number of terms in each n-gram. The output will consist of a sequence of n-grams where each n-gram is represented by a space-delimited string of n consecutive words, with annotatorType CHUNK, the same as the Chunker annotator.

Output Annotator Type: CHUNK

Input Annotator Types: TOKEN

Reference: NGramGenerator

Example:

ngrams_cum = NGramGenerator() \
            .setInputCols(["token"]) \
            .setOutputCol("ngrams") \
            .setN(2) \
            .setEnableCumulative(True) \
            .setDelimiter("_") # Default is space
val nGrams = new NGramGenerator()
        .setInputCols("token")
        .setOutputCol("ngrams")
        .setN(2)
        .setEnableCumulative(true)
        .setDelimiter("_") // Default is space
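
For instance, with n set to 2 and the cumulative flag enabled, the tokens ["the", "quick", "fox"] yield "the", "quick", "fox", "the_quick" and "quick_fox" (1-grams plus 2-grams, joined with the "_" delimiter configured above).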

DateMatcher

Reads from different forms of date and time expressions and converts them to a provided date format. Extracts only ONE date per sentence. Use with sentence detector for more matches.

Output Annotator Type: Date

Input Annotator Types: Document

Reads the following kind of dates:

| Format | Format | Format |
|---|---|---|
| 1978-01-28 | last wednesday | 5 am tomorrow |
| 1984/04/02 | today | 0600h |
| 1/02/1980 | tomorrow | 06:00 hours |
| 2/28/79 | yesterday | 6pm |
| The 31st of April in the year 2008 | next week at 7.30 | 5:30 a.m. |
| Fri, 21 Nov 1997 | next week | at 5 |
| Jan 21, '97 | next month | 12:59 |
| Sun, Nov 21 | next year | 1988/11/23 6pm |
| jan 1st | day after | 23:59 |
| next thursday | the day before | |

Example:

date_matcher = DateMatcher() \
    .setInputCols('document')\
    .setOutputCol("date") \
    .setDateFormat("yyyy/MM/dd")
val dateMatcher = new DateMatcher()
    .setInputCols("document")
    .setOutputCol("date")
    .setFormat("yyyyMM")
API: DateMatcher
Source: DateMatcher

MultiDateMatcher

Reads from multiple different forms of date and time expressions and converts them to a provided date format. Extracts multiple dates per sentence.

Output Annotator Type: Date

Input Annotator Types: Document

Reads the following kind of dates:

| Format | Format | Format |
|---|---|---|
| 1978-01-28 | jan 1st | day after |
| 1984/04/02 | next thursday | the day before |
| 1978-01-28 | last wednesday | 0600h |
| 1988/11/23 6pm | today | 06:00 hours |
| 1/02/1980 | tomorrow | 6pm |
| 2/28/79 | yesterday | 5:30 a.m. |
| The 31st of April in the year 2008 | at 5 | next week at 7.30 |
| Fri, 21 Nov 1997 | next week | 12:59 |
| Jan 21, '97 | next month | 23:59 |
| Sun, Nov 21 | next year | 5 am tomorrow |

Example:

date_matcher = MultiDateMatcher() \
    .setInputCols('document')\
    .setOutputCol("date") \
    .setDateFormat("yyyy/MM/dd")
val dateMatcher = new MultiDateMatcher()
    .setInputCols("document")
    .setOutputCol("date")
    .setFormat("yyyyMM")
API: MultiDateMatcher
Source: MultiDateMatcher

SentenceDetector

Finds sentence bounds in raw text. Applies rules from Pragmatic Segmenter.

Output Annotator Type: Sentence

Input Annotator Types: Document

Example:

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")
    
val sentenceDetector = new SentenceDetector()
    .setInputCols("document")
    .setOutputCol("sentence")
    
API: SentenceDetector

Source: SentenceDetector

POSTagger (Part of speech tagger)

Sets a Part-Of-Speech tag to each word within a sentence. Its training data (train_pos) is a Spark dataset of POS format values with Annotation columns.

Output Annotator Type: POS

Input Annotator Types: Document, Token

Example:

pos_tagger = PerceptronApproach() \
    .setInputCols(["token", "sentence"]) \
    .setOutputCol("pos") \
    .setNIterations(2) \
    .setFrequencyThreshold(30)
    
val posTagger = new PerceptronApproach()
    .setInputCols(Array("sentence", "token"))
    .setOutputCol("pos")
    .setNIterations(2)
    .setFrequencyThreshold(30)
    
API: PerceptronApproach
Source: PerceptronApproach PerceptronModel

ViveknSentimentDetector

Scores a sentence for a sentiment

Output Annotator Type: sentiment

Input Annotator Types: Document, Token

Example:

  • Train your own model
sentiment_detector = ViveknSentimentApproach() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("sentiment") \
    .setSentimentCol("sentiment_label") \
    .setCorpusPrune(0) \
    .setImportantFeatureRatio(16.66)
    
val sentimentDetector = new ViveknSentimentApproach()
    .setInputCols(Array("token", "sentence"))
    .setOutputCol("vivekn")
    .setSentimentCol("sentiment_label")
    .setCorpusPrune(0)
    .setImportantFeatureRatio(16.66)
    
  • Use a pretrained model
sentiment_detector = ViveknSentimentModel.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("sentiment")

val sentimentDetector = ViveknSentimentModel.pretrained()
    .setInputCols(Array("token", "sentence"))
    .setOutputCol("vivekn")
    
API: ViveknSentimentApproach
Source: ViveknSentimentApproach ViveknSentimentModel

SentimentDetector (Sentiment analysis)

Scores a sentence for a sentiment

Output Annotator Type: Sentiment

Input Annotator Types: Document, Token

Example:

sentiment_detector = SentimentDetector() \
    .setInputCols(["token", "sentence"]) \
    .setOutputCol("sentiment") \
    .setPositiveMultiplier(1.0)\
    .setNegativeMultiplier(-1.0)\
    .setIncrementMultiplier(2.0)\
    .setDecrementMultiplier(-2.0)\
    .setReverseMultiplier(-1.0)
val sentimentDetector = new SentimentDetector
    .setInputCols(Array("token", "sentence"))
    .setOutputCol("sentiment")
    .setPositiveMultiplier(1.0)
    .setNegativeMultiplier(-1.0)
    .setIncrementMultiplier(2.0)
    .setDecrementMultiplier(-2.0)
    .setReverseMultiplier(-1.0)
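
Note that this rule-based detector relies on a sentiment dictionary set via setDictionary. A minimal sketch of providing one (the file name is illustrative; each line holds a word,label pair such as good,positive or poor,negative):

sentiment_detector = SentimentDetector() \
    .setInputCols(["token", "sentence"]) \
    .setOutputCol("sentiment") \
    .setDictionary("sentiment-dict.txt", ",")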
API: SentimentDetector
Reference: SentimentDetector SentimentDetectorModel

WordEmbeddings

Word Embeddings lookup annotator that maps tokens to vectors

Output Annotator Type: Word_Embeddings

Input Annotator Types: Document, Token

Example:

embeddings = WordEmbeddings() \
    .setStoragePath("/tmp/glove.6B.100d.txt", "TEXT") \
    .setDimension(100) \
    .setStorageRef("glove_100d") \
    .setInputCols("document", "token") \
    .setOutputCol("embeddings")


# or you can use the pretrained models for WordEmbeddings
embeddings = WordEmbeddingsModel.pretrained() \
    .setInputCols("document", "token") \
    .setOutputCol("embeddings")

val embeddings = new WordEmbeddings()
    .setStoragePath("/tmp/glove.6B.100d.txt", "TEXT")
    .setDimension(100)
    .setStorageRef("glove_100d")
    .setInputCols("document", "token")
    .setOutputCol("embeddings")

// or you can use the pretrained models for WordEmbeddings
val embeddings = WordEmbeddingsModel.pretrained()
    .setInputCols("document", "token")
    .setOutputCol("embeddings")

There are also two convenient functions to retrieve the embeddings coverage with respect to the transformed dataset:

  • withCoverageColumn(dataset, embeddingsCol, outputCol): Adds a custom column with word coverage stats for the embedded field: (coveredWords, totalWords, coveragePercentage). This creates a new column with statistics for each row.
  • overallCoverage(dataset, embeddingsCol): Calculates overall word coverage for the whole data in the embedded field. This returns a single coverage object considering all rows in the field.
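
A minimal sketch of calling these from Python (assuming pipelineDF already contains the "embeddings" column produced above, and that the overall coverage object exposes its stats as fields):

cov_df = WordEmbeddingsModel.withCoverageColumn(pipelineDF, "embeddings", "embeddings_coverage")
cov_df.select("embeddings_coverage").show(truncate=False)

overall = WordEmbeddingsModel.overallCoverage(pipelineDF, "embeddings")
print(overall.percentage)  # fraction of tokens covered by the embeddings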
API: WordEmbeddings
Source: WordEmbeddings WordEmbeddingsModel

BertEmbeddings

BERT (Bidirectional Encoder Representations from Transformers) provides dense vector representations for natural language by using a deep, pre-trained neural network with the Transformer architecture

You can find the pre-trained models for BertEmbeddings in the Spark NLP Models repository

Output Annotator Type: Word_Embeddings

Input Annotator Types: Document, Token

Example:


bert = BertEmbeddings.pretrained() \
    .setInputCols("sentence", "token") \
    .setOutputCol("bert")
    
val bert = BertEmbeddings.pretrained()
    .setInputCols("sentence", "token")
    .setOutputCol("bert")
    
API: BertEmbeddings
Source: BertEmbeddings

BertSentenceEmbeddings

BERT (Bidirectional Encoder Representations from Transformers) provides dense vector representations for natural language by using a deep, pre-trained neural network with the Transformer architecture

You can find the pre-trained models for BertSentenceEmbeddings in the Spark NLP Models repository

Output Annotator Type: Sentence_Embeddings

Input Annotator Types: Document

Example:

How to use pretrained BertSentenceEmbeddings:


bert = BertSentenceEmbeddings.pretrained() \
    .setInputCols("document") \
    .setOutputCol("bert_sentence_embeddings")
    
val bert = BertSentenceEmbeddings.pretrained()
    .setInputCols("document")
    .setOutputCol("bert_sentence_embeddings")
                                                    
API: BertSentenceEmbeddings
Source: BertSentenceEmbeddings

ElmoEmbeddings

Computes contextualized word representations using character-based word representations and bidirectional LSTMs

You can find the pre-trained model for ElmoEmbeddings in the Spark NLP Models repository

Output Annotator Type: Word_Embeddings

Input Annotator Types: Document, Token

Example:

# Online - Download the pretrained model
elmo = ElmoEmbeddings.pretrained() \
        .setInputCols("sentence", "token") \
        .setOutputCol("elmo")
        

# Offline - Download the pretrained model manually and extract it
elmo = ElmoEmbeddings.load("/elmo_en_2.4.0_2.4_1580488815299") \
        .setInputCols("sentence", "token") \
        .setOutputCol("elmo")
        

val elmo = ElmoEmbeddings.pretrained()
        .setInputCols("sentence", "token")
        .setOutputCol("elmo")
        .setPoolingLayer("elmo") //  word_emb, lstm_outputs1, lstm_outputs2 or elmo
        
API: ElmoEmbeddings
Source: ElmoEmbeddings

AlbertEmbeddings

Computes contextualized word representations using “A Lite” implementation of BERT algorithm by applying parameter-reduction techniques

You can find the pre-trained model for AlbertEmbeddings in the Spark NLP Models repository

Output Annotator Type: Word_Embeddings

Input Annotator Types: Document, Token

Examples:

# Online - Download the pretrained model
albert = AlbertEmbeddings.pretrained() \
        .setInputCols("sentence", "token") \
        .setOutputCol("albert")
        

# Offline - Download the pretrained model manually and extract it
albert = AlbertEmbeddings.load("/albert_base_uncased_en_2.5.0_2.4_1588073363475") \
        .setInputCols("sentence", "token") \
        .setOutputCol("albert")
        

val albert = AlbertEmbeddings.pretrained()
        .setInputCols("sentence", "token")
        .setOutputCol("albert")
        
API: AlbertEmbeddings
Source: AlbertEmbeddings

XlnetEmbeddings

Computes contextualized word representations using combination of Autoregressive Language Model and Permutation Language Model

You can find the pre-trained model for XlnetEmbeddings in the Spark NLP Models repository

Output Annotator Type: Word_Embeddings

Input Annotator Types: Document, Token

Example:

How to use pretrained XlnetEmbeddings:

# Online - Download the pretrained model
xlnet = XlnetEmbeddings.pretrained() \
        .setInputCols("sentence", "token") \
        .setOutputCol("xlnet")
        

# Offline - Download the pretrained model manually and extract it
xlnet = XlnetEmbeddings.load("/xlnet_large_cased_en_2.5.0_2.4_1588074397954") \
        .setInputCols("sentence", "token") \
        .setOutputCol("xlnet")
        

val xlnet = XlnetEmbeddings.pretrained()
        .setInputCols("sentence", "token")
        .setOutputCol("xlnet")
API: XlnetEmbeddings
Source: XlnetEmbeddings

UniversalSentenceEncoder

The Universal Sentence Encoder encodes text into high dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks.

Output Annotator Type: SENTENCE_EMBEDDINGS

Input Annotator Types: Document

Example:

use = UniversalSentenceEncoder.pretrained() \
        .setInputCols("sentence") \
        .setOutputCol("use_embeddings")
val use = UniversalSentenceEncoder.pretrained()
        .setInputCols("document")
        .setOutputCol("use_embeddings")
API: UniversalSentenceEncoder
Source: UniversalSentenceEncoder

SentenceEmbeddings

This annotator converts the results from WordEmbeddings, BertEmbeddings, ElmoEmbeddings, AlbertEmbeddings, or XlnetEmbeddings into sentence or document embeddings by either summing up or averaging all the word embeddings in a sentence or a document (depending on the inputCols).

Output Annotator Type: SENTENCE_EMBEDDINGS

Input Annotator Types: Document, Word_Embeddings

Example:

sentence_embeddings = SentenceEmbeddings() \
        .setInputCols(["document", "embeddings"]) \
        .setOutputCol("sentence_embeddings") \
        .setPoolingStrategy("AVERAGE")
val embeddingsSentence = new SentenceEmbeddings()
        .setInputCols(Array("document", "embeddings"))
        .setOutputCol("sentence_embeddings")
        .setPoolingStrategy("AVERAGE")

NOTE: If you choose document as your input for Tokenizer, WordEmbeddings/BertEmbeddings, and SentenceEmbeddings, then it averages/sums all the embeddings into one array of embeddings. However, if you choose sentence as inputCols, then SentenceEmbeddings generates one array of embeddings for each sentence.

TIP: Here is how you can explode and convert these embeddings into Vectors, or what's known as a Feature column, so they can be used in Spark ML regression or clustering functions:

from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import explode, udf

# Let's create a UDF to take an array of embeddings and output Vectors
@udf(returnType=VectorUDT())
def convertToVectorUDF(matrix):
    return Vectors.dense(matrix)


# Now let's explode the sentence_embeddings column and have a new feature column for Spark ML
pipelineDF.select(explode("sentence_embeddings.embeddings").alias("sentence_embedding")) \
    .withColumn("features", convertToVectorUDF("sentence_embedding"))
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{explode, udf}

// Let's create a UDF to take array of embeddings and output Vectors
val convertToVectorUDF = udf((matrix : Seq[Float]) => {
    Vectors.dense(matrix.toArray.map(_.toDouble))
})

// Now let's explode the sentence_embeddings column and have a new feature column for Spark ML
pipelineDF.select(explode($"sentence_embeddings.embeddings").as("sentence_embedding"))
.withColumn("features", convertToVectorUDF($"sentence_embedding"))
API: SentenceEmbeddings
Source: SentenceEmbeddings

ChunkEmbeddings

This annotator utilizes WordEmbeddings or BertEmbeddings to generate chunk embeddings from either Chunker, NGramGenerator, or NerConverter outputs.

Output Annotator Type: Word_Embeddings

Input Annotator Types: CHUNK, Word_Embeddings

Example:

chunk_embeddings = ChunkEmbeddings() \
        .setInputCols(["chunk", "embeddings"]) \
        .setOutputCol("chunk_embeddings") \
        .setPoolingStrategy("AVERAGE")
val chunkSentence = new ChunkEmbeddings()
        .setInputCols(Array("chunk", "embeddings"))
        .setOutputCol("chunk_embeddings")
        .setPoolingStrategy("AVERAGE")

TIP: Here is how you can explode and convert these embeddings into Vectors, or what's known as a Feature column, so they can be used in Spark ML regression or clustering functions:

from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import explode, udf

# Let's create a UDF to take an array of embeddings and output Vectors
@udf(returnType=VectorUDT())
def convertToVectorUDF(matrix):
    return Vectors.dense(matrix)

# Now let's explode the chunk_embeddings column and have a new feature column for Spark ML
pipelineDF.select(explode("chunk_embeddings.embeddings").alias("chunk_embeddings_exploded")) \
    .withColumn("features", convertToVectorUDF("chunk_embeddings_exploded"))

import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.{explode, udf}

// Let's create a UDF to take array of embeddings and output Vectors
val convertToVectorUDF = udf((matrix : Seq[Float]) => {
    Vectors.dense(matrix.toArray.map(_.toDouble))
})

// Now let's explode the sentence_embeddings column and have a new feature column for Spark ML
pipelineDF.select(explode($"chunk_embeddings.embeddings").as("chunk_embeddings_exploded"))
.withColumn("features", convertToVectorUDF($"chunk_embeddings_exploded"))
API: ChunkEmbeddings
Source: ChunkEmbeddings

ClassifierDL (Multi-class Text Classification)

ClassifierDL is a generic multi-class text classification annotator. It uses the state-of-the-art Universal Sentence Encoder as an input for text classification. The ClassifierDL annotator uses a deep learning model (DNNs) that we have built inside TensorFlow and supports up to 100 classes.

Output Annotator Type: CATEGORY

Input Annotator Types: SENTENCE_EMBEDDINGS

NOTE: This annotator accepts a label column of a single item of type String, Int, Float, or Double.

NOTE: UniversalSentenceEncoder, BertSentenceEmbeddings, or SentenceEmbeddings can be used for the inputCol

Example:

docClassifier = ClassifierDLApproach()\
        .setInputCols("sentence_embeddings")\
        .setOutputCol("category")\
        .setLabelColumn("label")\
        .setBatchSize(64)\
        .setMaxEpochs(20)\
        .setLr(0.5)\
        .setDropout(0.5)
val docClassifier = new ClassifierDLApproach()
        .setInputCols("sentence_embeddings")
        .setOutputCol("category")
        .setLabelColumn("label")
        .setBatchSize(64)
        .setMaxEpochs(20)
        .setLr(5e-3f)
        .setDropout(0.5f)
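
For context, a minimal end-to-end training sketch in Python (trainDataset is assumed to be a DataFrame with "text" and "label" columns):

from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import UniversalSentenceEncoder, ClassifierDLApproach

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

useEmbeddings = UniversalSentenceEncoder.pretrained() \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")

docClassifier = ClassifierDLApproach() \
    .setInputCols("sentence_embeddings") \
    .setOutputCol("category") \
    .setLabelColumn("label")

pipeline = Pipeline(stages=[documentAssembler, useEmbeddings, docClassifier])
model = pipeline.fit(trainDataset)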

Please refer to existing notebooks for more examples.

API: ClassifierDLApproach
Source: ClassifierDLApproach ClassifierDLModel

MultiClassifierDL (Multi-label Text Classification)

MultiClassifierDL is a multi-label text classification annotator. It uses a Bidirectional GRU with Convolution model that we have built inside TensorFlow and supports up to 100 classes. The input to MultiClassifierDL is sentence embeddings such as the state-of-the-art UniversalSentenceEncoder, BertSentenceEmbeddings, or SentenceEmbeddings.

Output Annotator Type: CATEGORY

Input Annotator Types: SENTENCE_EMBEDDINGS

NOTE: Unlike ClassifierDL, this annotator expects a label column containing an array of strings, since each row can carry multiple labels.

NOTE: UniversalSentenceEncoder, BertSentenceEmbeddings, or SentenceEmbeddings can be used for the inputCol

Example:

docMultiClassifier = MultiClassifierDLApproach()\
        .setInputCols("sentence_embeddings")\
        .setOutputCol("category")\
        .setLabelColumn("label")\
        .setBatchSize(64)\
        .setMaxEpochs(20)\
        .setLr(0.5)
val docMultiClassifier = new MultiClassifierDLApproach()
        .setInputCols("sentence_embeddings")
        .setOutputCol("category")
        .setLabelColumn("label")
        .setBatchSize(64)
        .setMaxEpochs(20)
        .setLr(5e-3f)

Please refer to existing notebooks for more examples.

API: MultiClassifierDLApproach
Source: MultiClassifierDLApproach MultiClassifierDLModel

SentimentDL (Multi-class Sentiment Analysis annotator)

SentimentDL is an annotator for multi-class sentiment analysis. This annotator comes with 2 available pre-trained models trained on IMDB and Twitter datasets

Output Annotator Type: CATEGORY

Input Annotator Types: SENTENCE_EMBEDDINGS

NOTE: This annotator accepts a label column of a single item of type String, Int, Float, or Double.

NOTE: UniversalSentenceEncoder, BertSentenceEmbeddings, or SentenceEmbeddings can be used for the inputCol

Example:

sentimentClassifier = SentimentDLApproach()\
        .setInputCols("sentence_embeddings")\
        .setOutputCol("category")\
        .setLabelColumn("label")\
        .setBatchSize(64)\
        .setMaxEpochs(20)\
        .setLr(0.5)\
        .setDropout(0.5)
val sentimentClassifier = new SentimentDLApproach()
        .setInputCols("sentence_embeddings")
        .setOutputCol("category")
        .setLabelColumn("label")
        .setBatchSize(64)
        .setMaxEpochs(20)
        .setLr(5e-3f)
        .setDropout(0.5f)

Please refer to existing notebooks for more examples.

API: SentimentDLApproach
Source: SentimentDLApproach SentimentDLModel

LanguageDetectorDL (Language Detection and Identification)

LanguageDetectorDL is a state-of-the-art language detection and identification annotator trained by using TensorFlow/keras neural networks.

Output Annotator Type: LANGUAGE

Input Annotator Types: DOCUMENT or SENTENCE

Example:

languageDetector = LanguageDetectorDL.pretrained("ld_wiki_20") \
        .setInputCols("document")\
        .setOutputCol("language")\
        .setThreshold(0.3)\
        .setCoalesceSentences(True)
 val languageDetector = LanguageDetectorDL.pretrained("ld_wiki_20")
        .setInputCols("document")
        .setOutputCol("language")
        .setThreshold(0.3f)
        .setCoalesceSentences(true)
API: LanguageDetectorDL
Source: LanguageDetectorDL

YakeModel (Keywords Extraction)

Yake is an Unsupervised, Corpus-Independent, Domain and Language-Independent and Single-Document keyword extraction algorithm.

Extracting keywords from texts has become a challenge for individuals and organizations as information grows in complexity and size. The need to automate this task so that text can be processed in a timely and adequate manner has led to the emergence of automatic keyword extraction tools. Yake is a novel feature-based system for multi-lingual keyword extraction which supports texts of different sizes, domains, or languages. Unlike other approaches, Yake does not rely on dictionaries or thesauri, nor is it trained against any corpora. Instead, it follows an unsupervised approach which builds upon features extracted from the text, making it applicable to documents written in different languages without the need for further knowledge. This can be beneficial for a large number of tasks and a plethora of situations where access to training corpora is either limited or restricted.

The algorithm makes use of the position of a sentence and token. Therefore, to use the annotator, the text should be first sent through a Sentence Boundary Detector and then a tokenizer.

You can tweak its parameters (shown in the example below) to get the best results from the annotator.

Output Annotator Type: KEYWORD

Input Annotator Types: TOKEN

Example:

keywords = YakeModel() \
        .setInputCols("token") \
        .setOutputCol("keywords") \
        .setMinNGrams(1) \
        .setMaxNGrams(3)\
        .setNKeywords(20)\
        .setStopWords(stopwords)
 val keywords = new YakeModel()
        .setInputCols("token")
        .setOutputCol("keywords")
        .setMinNGrams(1)
        .setMaxNGrams(3)
        .setNKeywords(20)
        .setStopWords(stopwords)
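
For context, a minimal pipeline sketch in Python (data is assumed to be a DataFrame with a "text" column; the stop word list is illustrative):

from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, YakeModel

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentences = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
tokens = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")

stopwords = ["a", "an", "the", "and", "of"]  # illustrative; use a full stop word list in practice
keywords = YakeModel() \
    .setInputCols(["token"]) \
    .setOutputCol("keywords") \
    .setStopWords(stopwords)

pipeline = Pipeline(stages=[document, sentences, tokens, keywords])
result = pipeline.fit(data).transform(data)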
API: YakeModel
Source: YakeModel

NER CRF (Named Entity Recognition CRF annotator)

This Named Entity recognition annotator allows for a generic model to be trained by utilizing a CRF machine learning algorithm. Its train data (train_ner) is either a labeled or an external CoNLL 2003 IOB based spark dataset with Annotations columns. Also the user has to provide word embeddings annotation column.
Optionally the user can provide an entity dictionary file for better accuracy

Output Annotator Type: Named_Entity

Input Annotator Types: Document, Token, POS, Word_Embeddings

Example:

nerTagger = NerCrfApproach()\
        .setInputCols(["sentence", "token", "pos", "embeddings"])\
        .setLabelColumn("label")\
        .setOutputCol("ner")\
        .setMinEpochs(1)\
        .setMaxEpochs(20)\
        .setLossEps(1e-3)\
        .setDicts(["ner-corpus/dict.txt"])\
        .setL2(1)\
        .setC0(1250000)\
        .setRandomSeed(0)\
        .setVerbose(2)
val nerTagger = new NerCrfApproach()
        .setInputCols("sentence", "token", "pos", "embeddings")
        .setLabelColumn("label")
        .setMinEpochs(1)
        .setMaxEpochs(3)
        .setC0(34)
        .setL2(3.0)
        .setOutputCol("ner")
API: NerCrfApproach
Source: NerCrfApproach NerCrfModel

NER DL (Named Entity Recognition Deep Learning annotator)

This named entity recognition annotator allows a generic NER model to be trained based on neural networks. Its training data (train_ner) is either a labeled Spark dataset or an external CoNLL 2003 IOB-based dataset with Annotation columns. The user also has to provide a word embeddings annotation column.
The neural network architecture is Char CNNs - BiLSTM - CRF, which achieves state-of-the-art results on most datasets.

Output Annotator Type: Named_Entity

Input Annotator Types: Document, Token, Word_Embeddings

Note: Please check here in case you get an IllegalArgumentException error with a description such as:

Graph [parameter] should be [value]: Could not find a suitable tensorflow graph for embeddings dim: [value] tags: [value] nChars: [value]. Generate graph by python code in python/tensorflow/ner/create_models before usage and use setGraphFolder Param to point to output.
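
If you do generate a custom graph, a hedged sketch of pointing the annotator at it via the graph folder parameter (the folder path is illustrative):

nerTagger = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setGraphFolder("path/to/custom/graphs")  # folder containing the generated TensorFlow graph files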

Example:

nerTagger = NerDLApproach()\
        .setInputCols(["sentence", "token", "embeddings"])\
        .setLabelColumn("label")\
        .setOutputCol("ner")\
        .setMaxEpochs(10)\
        .setRandomSeed(0)\
        .setVerbose(2)
val nerTagger = new NerDLApproach()
        .setInputCols("sentence", "token", "embeddings")
        .setOutputCol("ner")
        .setLabelColumn("label")
        .setMaxEpochs(120)
        .setRandomSeed(0)
        .setPo(0.03f)
        .setLr(0.2f)
        .setDropout(0.5f)
        .setBatchSize(9)
        .setVerbose(Verbose.Epochs)
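
For context, a minimal training sketch in Python: read a CoNLL 2003 IOB file with the CoNLL helper (the path is illustrative), add the word embeddings column, and fit the approach defined above:

from sparknlp.training import CoNLL
from sparknlp.annotator import WordEmbeddingsModel

# yields document, sentence, token, pos and label columns
training_data = CoNLL().readDataset(spark, "path/to/eng.train")

embeddings = WordEmbeddingsModel.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

ner_model = nerTagger.fit(embeddings.transform(training_data))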
API: NerDLApproach
Source: NerDLApproach NerDLModel

NER Converter (Converts IOB or IOB2 representation of NER to a user-friendly format)

The NER Converter is used to finalize the work of NER annotators. It combines entities with IOB tags (B-, I-, etc.) into Chunks, with the named entity type in the metadata field (if a LightPipeline is used, it can be extracted after fullAnnotate()).

This NER converter can be used to convert the output of a NER model into the NER chunk format.

Output Annotator Type: Chunk

Input Annotator Types: Document, Token, Named_Entity

Example:

nerConverter = NerConverter()\
        .setInputCols(["sentence", "token", "ner_src"])\
        .setOutputCol("ner_chunk")
val nerConverter = new NerConverter()
        .setInputCols("sentence", "token", "ner_src")
        .setOutputCol("ner_chunk")
API: NerConverter
Source: NerConverter

Norvig SpellChecker

This annotator retrieves tokens and makes corrections automatically if they are not found in an English dictionary

Output Annotator Type: Token

Input Annotator Types: Token

Inputs: Any text for the corpus. A list of words for the dictionary. A comma-separated custom dictionary.

Train Data: train_corpus is a Spark dataset of text content

Example:

spell_checker = NorvigSweetingApproach() \
        .setInputCols(["token"]) \
        .setOutputCol("checked") \
        .setDictionary("coca2017.txt", "[a-zA-Z]+")
val symSpellChecker = new NorvigSweetingApproach()
        .setInputCols("token")
        .setOutputCol("checked")
        .setDictionary("coca2017.txt", "[a-zA-Z]+")
API: NorvigSweetingApproach
Source: NorvigSweetingApproach NorvigSweetingModel

Symmetric SpellChecker

This spell checker is inspired by the Symmetric Delete algorithm. It retrieves tokens and utilizes distance metrics to compute possible derived words

Output Annotator Type: Token

Input Annotator Types: Token

Inputs: Any text for the corpus. A list of words for the dictionary. A comma-separated custom dictionary.

Train Data: train_corpus is a Spark dataset of text content

Example:

spell_checker = SymmetricDeleteApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("spell")
val spellChecker = new SymmetricDeleteApproach()
    .setInputCols(Array("normalized"))
    .setOutputCol("spell")
API: SymmetricDeleteApproach
Source: SymmetricDeleteApproach SymmetricDeleteModel

Context SpellChecker

Implements the Noisy Channel Model spell-checking algorithm. Correction candidates are extracted by combining context information and word information

Output Annotator Type: Token

Input Annotator Types: Token

Inputs: Any text for the corpus. A list of words for the dictionary. A comma-separated custom dictionary.

Train Data: train_corpus is a Spark dataset of text content

Example:

spell_checker = ContextSpellCheckerApproach() \
    .setInputCols(["token"]) \
    .setOutputCol("spell") \
    .setErrorThreshold(4.0) \
    .setTradeoff(6.0) \
    .fit(train_corpus)
val spellChecker = new ContextSpellCheckerApproach()
    .setInputCols(Array("token"))
    .setOutputCol("spell")
    .setErrorThreshold(4.0f)
    .setTradeoff(6.0f)
    .fit(trainCorpus)
API: ContextSpellCheckerApproach
Source: ContextSpellCheckerApproach ContextSpellCheckerModel

Dependency Parsers

Dependency parser provides information about word relationship. For example, dependency parsing can tell you what the subjects and objects of a verb are, as well as which words are modifying (describing) the subject. This can help you find precise answers to specific questions. The following diagram illustrates a dependency-style analysis using the standard graphical method favored in the dependency-parsing community.

[Figure: Dependency Parser]

Relations among the words are illustrated above the sentence with directed, labeled arcs from heads to dependents. We call this a typed dependency structure because the labels are drawn from a fixed inventory of grammatical relations. It also includes a root node that explicitly marks the root of the tree, the head of the entire structure. [1]

Untyped Dependency Parser (Unlabeled grammatical relation)

Unlabeled parser that finds a grammatical relation between two words in a sentence. Its input is a directory with dependency treebank files.

Output Annotator Type: Dependency

Input Annotator Types: Document, POS, Token

Example:

dependency_parser = DependencyParserApproach() \
        .setInputCols(["sentence", "pos", "token"]) \
        .setOutputCol("dependency") \
        .setDependencyTreeBank("file://parser/dependency_treebank") \
        .setNumberOfIterations(10)
val dependencyParser = new DependencyParserApproach()
        .setInputCols(Array("sentence", "pos", "token"))
        .setOutputCol("dependency")
        .setDependencyTreeBank("parser/dependency_treebank")
        .setNumberOfIterations(10)
API: DependencyParserApproach
Source: DependencyParserApproach DependencyParserModel

Typed Dependency Parser (Labeled grammatical relation)

Labeled parser that finds a grammatical relation between two words in a sentence. Its input is a CoNLL2009 or ConllU dataset.

Output Annotator Type: Labeled Dependency

Input Annotator Types: Token, POS, Dependency

Example:

typed_dependency_parser = TypedDependencyParserApproach() \
        .setInputCols(["token", "pos", "dependency"]) \
        .setOutputCol("labdep") \
        .setConll2009("file://conll2009/eng.train") \
        .setNumberOfIterations(10)
val typedDependencyParser = new TypedDependencyParserApproach()
        .setInputCols(Array("token", "pos", "dependency"))
        .setOutputCol("labdep")
        .setConll2009("conll2009/eng.train")
        
API: TypedDependencyParserApproach
Source: TypedDependencyParserApproach TypedDependencyParserModel

References

[1] Speech and Language Processing. Daniel Jurafsky & James H. Martin. 2018
