sparknlp.annotator

Module containing all available Annotators of Spark NLP and their base classes.

Classes

AlbertEmbeddings

ALBERT: A Lite Bert For Self-Supervised Learning Of Language Representations - Google Research, Toyota Technological Institute at Chicago

AlbertForTokenClassification

AlbertForTokenClassification can load ALBERT models with a token classification head on top (a linear layer on top of the hidden-states output).

BertEmbeddings

Token-level embeddings using BERT.

BertForTokenClassification

BertForTokenClassification can load BERT models with a token classification head on top (a linear layer on top of the hidden-states output).

BertSentenceEmbeddings

Sentence-level embeddings using BERT.

BigTextMatcher

Annotator to match exact phrases (by token) provided in a file against a Document.

BigTextMatcherModel

Instantiated model of the BigTextMatcher.

ChunkEmbeddings

This annotator utilizes WordEmbeddings, BertEmbeddings etc.

ChunkTokenizer

Tokenizes and flattens extracted NER chunks.

ChunkTokenizerModel

Instantiated model of the ChunkTokenizer.

Chunker

This annotator matches a pattern of part-of-speech tags in order to return meaningful phrases from a document.
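
The idea of matching a pattern over part-of-speech tags can be illustrated in plain Python. This is only a sketch of the concept, not Spark NLP's implementation: the function name `chunk_by_pos` and the `<TAG>` pattern syntax used here are assumptions made for the example.

```python
import re

def chunk_by_pos(tokens, tags, pattern):
    """Return token phrases whose POS tags match `pattern`, e.g. "<DT>?<JJ>*<NN>+"."""
    # Wrap every <TAG> in a non-capturing group so ?, * and + apply to whole tags.
    regex = re.compile(
        re.sub(r"<[^>]+>", lambda m: "(?:" + re.escape(m.group(0)) + ")", pattern)
    )
    # Encode the tag sequence as one string and remember where each tag starts.
    boundaries, encoded = {}, ""
    for index, tag in enumerate(tags):
        boundaries[len(encoded)] = index
        encoded += "<" + tag + ">"
    boundaries[len(encoded)] = len(tags)
    phrases = []
    for match in regex.finditer(encoded):
        if match.end() > match.start():  # ignore empty matches
            first, last = boundaries[match.start()], boundaries[match.end()]
            phrases.append(" ".join(tokens[first:last]))
    return phrases

phrases = chunk_by_pos(
    ["the", "quick", "fox", "jumps"],
    ["DT", "JJ", "NN", "VBZ"],
    "<DT>?<JJ>*<NN>+",
)
```

Here the optional determiner, any adjectives and one or more nouns are grouped into a single phrase, while the verb is left out.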

ClassifierDLApproach

Trains a ClassifierDL for generic Multi-class Text Classification.

ClassifierDLModel

ClassifierDL for generic Multi-class Text Classification.

ContextSpellCheckerApproach

Trains a deep-learning based Noisy Channel Model Spell Algorithm.

ContextSpellCheckerModel

Implements a deep-learning based Noisy Channel Model Spell Algorithm.

DateMatcher

Matches standard date formats into a provided format. Reads different forms of date and time expressions and converts them to a provided date format.

DateMatcherUtils

Base class for DateMatcher Annotators.

DependencyParserApproach

Trains an unlabeled parser that finds a grammatical relation between two words in a sentence.

DependencyParserModel

Unlabeled parser that finds a grammatical relation between two words in a sentence.

DistilBertEmbeddings

DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base.

DistilBertForTokenClassification

DistilBertForTokenClassification can load DistilBERT models with a token classification head on top (a linear layer on top of the hidden-states output).

DocumentNormalizer

Annotator which normalizes raw text from tagged text.

ElmoEmbeddings

Word embeddings from ELMo (Embeddings from Language Models), a language model trained on the 1 Billion Word Benchmark.

EntityRulerApproach

Fits an Annotator to match exact strings or regex patterns provided in a file against a Document and assigns them a named entity.

EntityRulerModel

Instantiated model of the EntityRulerApproach.

GraphExtraction

Extracts a dependency graph between entities.

LanguageDetectorDL

Language Identification and Detection by using CNN and RNN architectures in TensorFlow.

Lemmatizer

Class to find lemmas out of words with the objective of returning a base dictionary word.

LemmatizerModel

Instantiated Model of the Lemmatizer.

LongformerEmbeddings

Longformer is a transformer model for long documents.

LongformerForTokenClassification

LongformerForTokenClassification can load Longformer models with a token classification head on top (a linear layer on top of the hidden-states output).

MarianTransformer

MarianTransformer: Fast Neural Machine Translation

MultiClassifierDLApproach

Trains a MultiClassifierDL for Multi-label Text Classification.

MultiClassifierDLModel

MultiClassifierDL for Multi-label Text Classification.

MultiDateMatcher

Matches standard date formats into a provided format.

NGramGenerator

A feature transformer that converts the input array of strings (annotatorType TOKEN) into an array of n-grams (annotatorType CHUNK).
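
As a rough illustration of what n-gram generation means, the following plain-Python sketch builds n-grams from a list of tokens. It does not use Spark NLP; the function name `ngrams` and its `delimiter` argument are hypothetical names chosen for this example.

```python
def ngrams(tokens, n, delimiter=" "):
    """Return all contiguous n-grams of `tokens`, joined by `delimiter`."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of width n over the token list.
    return [delimiter.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams(["This", "is", "my", "sentence"], 2)
```

An input shorter than `n` yields an empty list, since no full window fits.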

NerApproach

Base class for Ner*Approach Annotators.

NerConverter

Converts an IOB or IOB2 representation of NER to a user-friendly one, by associating the tokens of recognized entities and their label.
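
The conversion from IOB tags to entity chunks can be sketched in plain Python as follows. This is a simplified illustration of the idea rather than Spark NLP's code; `iob_to_chunks` is a hypothetical helper, and the sketch assumes tags of the form `B-LABEL`, `I-LABEL` and `O`.

```python
def iob_to_chunks(tokens, tags):
    """Group tokens into (text, label) chunks from IOB or IOB2 tags."""
    chunks = []
    words, label = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, entity = tag.partition("-")
        # An I- tag with the same label continues the entity being built.
        if prefix == "I" and entity == label:
            words.append(token)
            continue
        # Anything else closes the entity in progress, if there is one.
        if words:
            chunks.append((" ".join(words), label))
        if prefix in ("B", "I"):
            # B- always starts an entity; in IOB an I- tag may start one too.
            words, label = [token], entity
        else:
            words, label = [], None
    if words:
        chunks.append((" ".join(words), label))
    return chunks

chunks = iob_to_chunks(
    ["John", "Smith", "lives", "in", "New", "York"],
    ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"],
)
```

Each resulting chunk pairs the joined entity text with its label, which is the "user-friendly" form the description refers to.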

NerCrfApproach

Algorithm for training a Named Entity Recognition Model.

NerCrfModel

Extracts Named Entities based on a CRF Model.

NerDLApproach

This Named Entity Recognition annotator trains a generic NER model based on neural networks.

NerDLModel

This Named Entity recognition annotator is a generic NER model based on Neural Networks.

NerOverwriter

Overwrites entities of specified strings.

Normalizer

Annotator that cleans out tokens.

NormalizerModel

Instantiated Model of the Normalizer.

NorvigSweetingApproach

Trains an annotator that retrieves tokens and corrects them automatically if they are not found in an English dictionary.

NorvigSweetingModel

This annotator retrieves tokens and corrects them automatically if they are not found in an English dictionary.

PerceptronApproach

Trains an averaged Perceptron model to tag words with their part of speech.

PerceptronModel

Averaged Perceptron model to tag words with their part of speech.

RecursiveTokenizer

Tokenizes raw text recursively based on a handful of definable rules.

RecursiveTokenizerModel

Instantiated model of the RecursiveTokenizer.

RegexMatcher

Uses a reference file to match a set of regular expressions and associate them with a provided identifier.

RegexMatcherModel

Instantiated model of the RegexMatcher.

RegexTokenizer

A tokenizer that splits text by a regex pattern.

RoBertaEmbeddings

Creates word embeddings using RoBERTa.

RoBertaForTokenClassification

RoBertaForTokenClassification can load RoBERTa models with a token classification head on top (a linear layer on top of the hidden-states output).

RoBertaSentenceEmbeddings

Sentence-level embeddings using RoBERTa.

SentenceDetector

Annotator that detects sentence boundaries using any provided approach.

SentenceDetectorDLApproach

Trains an annotator that detects sentence boundaries using a deep learning approach.

SentenceDetectorDLModel

Annotator that detects sentence boundaries using a deep learning approach.

SentenceDetectorParams

Base class for SentenceDetector parameters.

SentenceEmbeddings

Converts the results from WordEmbeddings, BertEmbeddings, or other word embeddings into sentence or document embeddings by either summing up or averaging all the word embeddings in a sentence or a document (depending on the inputCols).
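
Summing versus averaging word vectors can be shown with a small plain-Python sketch. It is an illustration of the pooling idea only, not Spark NLP's implementation; the function `pool_embeddings` and the strategy names `"SUM"` and `"AVERAGE"` are assumptions made for this example.

```python
def pool_embeddings(word_vectors, strategy="AVERAGE"):
    """Combine equal-length word vectors into one sentence vector."""
    if not word_vectors:
        return []
    # Add the vectors dimension by dimension.
    summed = [sum(dimension) for dimension in zip(*word_vectors)]
    if strategy == "SUM":
        return summed
    if strategy == "AVERAGE":
        return [value / len(word_vectors) for value in summed]
    raise ValueError("strategy must be 'SUM' or 'AVERAGE'")

sentence_vector = pool_embeddings([[1.0, 2.0], [3.0, 4.0]])
```

Averaging keeps the result on the same scale regardless of sentence length, whereas summing lets longer inputs produce larger vectors.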

SentimentDLApproach

Trains a SentimentDL, an annotator for multi-class sentiment analysis.

SentimentDLModel

SentimentDL, an annotator for multi-class sentiment analysis.

SentimentDetector

Trains a rule based sentiment detector, which calculates a score based on predefined keywords.

SentimentDetectorModel

Rule based sentiment detector, which calculates a score based on predefined keywords.

Stemmer

Returns hard-stems out of words with the objective of retrieving the meaningful part of the word.

StopWordsCleaner

This annotator takes a sequence of strings and drops the stop words from it.

SymmetricDeleteApproach

Trains a Symmetric Delete spelling correction algorithm.

SymmetricDeleteModel

Symmetric Delete spelling correction algorithm.

T5Transformer

T5: the Text-To-Text Transfer Transformer

TextMatcher

Annotator to match exact phrases (by token) provided in a file against a Document.

TextMatcherModel

Instantiated model of the TextMatcher.

Token2Chunk

Converts TOKEN type Annotations to CHUNK type.

Tokenizer

Tokenizes raw text in document type columns into TokenizedSentence.

TokenizerModel

Tokenizes raw text into word pieces, tokens.

TypedDependencyParserApproach

Labeled parser that finds a grammatical relation between two words in a sentence.

TypedDependencyParserModel

Labeled parser that finds a grammatical relation between two words in a sentence.

UniversalSentenceEncoder

The Universal Sentence Encoder encodes text into high dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks.

ViveknSentimentApproach

Trains a sentiment analyser inspired by the algorithm by Vivek Narayanan.

ViveknSentimentModel

Sentiment analyser inspired by the algorithm by Vivek Narayanan.

WordEmbeddings

Word Embeddings lookup annotator that maps tokens to vectors.

WordEmbeddingsModel

Word Embeddings lookup annotator that maps tokens to vectors.

WordSegmenterApproach

Trains a WordSegmenter which tokenizes non-English or non-whitespace-separated texts.

WordSegmenterModel

WordSegmenter which tokenizes non-English or non-whitespace-separated texts.

XlmRoBertaEmbeddings

The XLM-RoBERTa model was proposed in Unsupervised Cross-lingual Representation Learning at Scale by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzman, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov.

XlmRoBertaForTokenClassification

XlmRoBertaForTokenClassification can load XLM-RoBERTa models with a token classification head on top (a linear layer on top of the hidden-states output).

XlmRoBertaSentenceEmbeddings

Sentence-level embeddings using XLM-RoBERTa.

XlnetEmbeddings

XlnetEmbeddings (XLNet): Generalized Autoregressive Pretraining for Language Understanding

XlnetForTokenClassification

XlnetForTokenClassification can load XLNet models with a token classification head on top (a linear layer on top of the hidden-states output).

YakeKeywordExtraction

YAKE is an unsupervised, corpus-independent, domain- and language-independent, single-document keyword extraction algorithm.