Spark NLP - Transformers

AlbertEmbeddings

ALBERT: A LITE BERT FOR SELF-SUPERVISED LEARNING OF LANGUAGE REPRESENTATIONS - Google Research, Toyota Technological Institute at Chicago

These word embeddings represent the outputs generated by the ALBERT model. All official ALBERT releases by Google on TF-Hub are supported by this ALBERT wrapper:

Ported TF-Hub Models:

Spark NLP Model	TF-Hub Model	Model Properties
`"albert_base_uncased"`	albert_base	768-embed-dim, 12-layer, 12-heads, 12M parameters
`"albert_large_uncased"`	albert_large	1024-embed-dim, 24-layer, 16-heads, 18M parameters
`"albert_xlarge_uncased"`	albert_xlarge	2048-embed-dim, 24-layer, 32-heads, 60M parameters
`"albert_xxlarge_uncased"`	albert_xxlarge	4096-embed-dim, 12-layer, 64-heads, 235M parameters

This model requires input tokenization with a SentencePiece model, which is provided by Spark NLP (see the tokenizers package).

Pretrained models can be loaded with pretrained of the companion object:

val embeddings = AlbertEmbeddings.pretrained()
 .setInputCols("sentence", "token")
 .setOutputCol("embeddings")

# Offline - Download the pretrained model manually and extract it
albert = AlbertEmbeddings.load("/albert_base_uncased_en_2.5.0_2.4_1588073363475") \
        .setInputCols("sentence", "token") \
        .setOutputCol("albert")

The default model is "albert_base_uncased", if no name is provided.

For extended examples of usage, see the Examples and the AlbertEmbeddingsTestSpec.

Sources:

ALBERT: A LITE BERT FOR SELF-SUPERVISED LEARNING OF LANGUAGE REPRESENTATIONS

https://github.com/google-research/ALBERT

https://tfhub.dev/s?q=albert

Paper abstract:

Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations and longer training times. To address these problems, we present two parameter reduction techniques to lower memory consumption and increase the training speed of BERT (Devlin et al., 2019). Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large.

Tips: ALBERT uses repeating layers, which results in a small memory footprint. However, the computational cost remains similar to that of a BERT-like architecture with the same number of hidden layers, because the input still has to pass through the same number of (repeating) layers.
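The trade-off in the tip above can be illustrated with a rough count. The sketch below assumes the common approximation of about 12·h² weights per Transformer encoder layer (4·h² for the attention projections and 8·h² for a feed-forward block with a 4·h inner size) and ignores embeddings, biases, and layer norms. The function names are hypothetical and not part of Spark NLP.

```python
def encoder_weights(hidden_size: int, num_layers: int, share_layers: bool) -> int:
    """Approximate encoder weight count (embeddings, biases, layer norms ignored)."""
    per_layer = 12 * hidden_size * hidden_size  # 4*h^2 attention + 8*h^2 feed-forward
    # With cross-layer sharing (ALBERT-style), one set of weights is reused by every layer.
    return per_layer if share_layers else per_layer * num_layers


def layer_passes(num_layers: int) -> int:
    """Each input still passes through every layer, whether weights are shared or not."""
    return num_layers


hidden, layers = 768, 12
shared = encoder_weights(hidden, layers, share_layers=True)
unshared = encoder_weights(hidden, layers, share_layers=False)
print(f"shared: {shared:,} weights, unshared: {unshared:,} weights")
print(f"layer passes per input in both cases: {layer_passes(layers)}")
```

Under these assumptions the stored weights shrink by a factor equal to the number of layers, while the number of layer passes per input stays the same.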

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: WORD_EMBEDDINGS

Python API: AlbertEmbeddings

Scala API: AlbertEmbeddings

Source: AlbertEmbeddings

Show Examples

This example shows how to predict classes by using the embeddings generated by the Transformer.

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Start a Spark session with Spark NLP
spark = sparknlp.start()

# First extract the prerequisites for the NerDLModel
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Use the transformer embeddings
embeddings = AlbertEmbeddings.pretrained("albert_base_uncased") \
    .setInputCols(['document', 'token']) \
    .setOutputCol('embeddings')

# This pretrained model requires those specific transformer embeddings
ner_model = NerDLModel.pretrained("ner_conll_albert_base_uncased", "en") \
    .setInputCols(["document", "token", "embeddings"]) \
    .setOutputCol("ner")

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    embeddings,
    ner_model
])

data = spark.createDataFrame([["U.N. official Ekeus heads for Baghdad."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.select("ner.result").show(truncate=False)
+------------------------------------+
|result                              |
+------------------------------------+
|[I-LOC, O, O, I-PER, O, O, I-LOC, O]|
+------------------------------------+

AlbertForQuestionAnswering

AlbertForQuestionAnswering can load ALBERT Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits).
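To make the role of the span start and span end logits concrete, here is a minimal sketch of how an answer span could be chosen from them. The logits are made-up values, and `best_span` and `max_answer_len` are hypothetical names used for illustration only; the annotator's actual decoding may differ.

```python
from typing import List, Tuple


def best_span(start_logits: List[float], end_logits: List[float],
              max_answer_len: int = 15) -> Tuple[int, int]:
    """Return (start, end) token indices maximizing start_logit + end_logit."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        # The end may not precede the start, and the span length is capped.
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


# Hypothetical logits for the tokens of a short context sentence.
tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
start_logits = [0.1, 0.0, 0.2, 5.0, 0.1, 0.3, 0.0, 0.1, 2.0, 0.0]
end_logits = [0.0, 0.1, 0.1, 4.5, 0.2, 0.1, 0.0, 0.1, 2.5, 0.0]

start, end = best_span(start_logits, end_logits)
print(" ".join(tokens[start:end + 1]))  # prints "Clara"
```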

Pretrained models can be loaded with pretrained of the companion object:

val spanClassifier = AlbertForQuestionAnswering.pretrained()
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")

The default model is "albert_base_qa_squad2", if no name is provided.

For available pretrained models please see the Models Hub.

To see which models are compatible and how to import them, see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669. For extended examples, see the AlbertForQuestionAnsweringTestSpec.

Input Annotator Types: DOCUMENT, DOCUMENT

Output Annotator Type: CHUNK

Python API: AlbertForQuestionAnswering

Scala API: AlbertForQuestionAnswering

Source: AlbertForQuestionAnswering

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Start a Spark session with Spark NLP
spark = sparknlp.start()

documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = AlbertForQuestionAnswering.pretrained() \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(False)

pipeline = Pipeline().setStages([
    documentAssembler,
    spanClassifier
])

data = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)

result.select("answer.result").show(truncate=False)
+--------------------+
|result              |
+--------------------+
|[Clara]             |
+--------------------+

AlbertForSequenceClassification

AlbertForSequenceClassification can load ALBERT models with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for document classification tasks.

For multi-class, use setActivation("softmax"). For multi-label, use setActivation("sigmoid").
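The difference between the two activations can be seen on a toy example. The sketch below uses plain Python and hypothetical logits; it only illustrates the math behind "softmax" and "sigmoid" and does not call Spark NLP.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Multi-class: scores compete with each other and sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [v / total for v in exps]


def sigmoid(logits: List[float]) -> List[float]:
    """Multi-label: each score is squashed independently into (0, 1)."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


logits = [2.0, 1.0, -1.0]  # hypothetical classifier-head outputs
print([round(p, 3) for p in softmax(logits)])
print([round(p, 3) for p in sigmoid(logits)])
```

With softmax exactly one class tends to dominate, whereas with sigmoid several labels can receive a high score at the same time.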

Pretrained models can be loaded with pretrained of the companion object:

val sequenceClassifier = AlbertForSequenceClassification.pretrained()
  .setInputCols("token", "document")
  .setOutputCol("label")

The default model is "albert_base_sequence_classifier_imdb", if no name is provided.

For available pretrained models please see the Models Hub.

Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them, see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669. For extended examples, see AlbertForSequenceClassification.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: CATEGORY

Python API: AlbertForSequenceClassification

Scala API: AlbertForSequenceClassification

Source: AlbertForSequenceClassification

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Start a Spark session with Spark NLP
spark = sparknlp.start()

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

sequenceClassifier = AlbertForSequenceClassification.pretrained() \
    .setInputCols(["token", "document"]) \
    .setOutputCol("label") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    sequenceClassifier
])

# One row per sentence, each with a single "text" column
data = spark.createDataFrame([["I loved this movie when I was a child."], ["It was pretty boring."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("label.result").show(truncate=False)
+------+
|result|
+------+
|[pos] |
|[neg] |
+------+

AlbertForTokenClassification

AlbertForTokenClassification can load ALBERT Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.

Pretrained models can be loaded with pretrained of the companion object:

val tokenClassifier = AlbertForTokenClassification.pretrained()
  .setInputCols("token", "document")
  .setOutputCol("label")

The default model is "albert_base_token_classifier_conll03", if no name is provided.

For available pretrained models please see the Models Hub.

Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them, see Import Transformers into Spark NLP 🚀. For extended examples, see the AlbertForTokenClassificationTestSpec.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: NAMED_ENTITY

Python API: AlbertForTokenClassification

Scala API: AlbertForTokenClassification

Source: AlbertForTokenClassification

Show Examples

This example shows how to predict classes by using the embeddings generated by the Transformer.

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Start a Spark session with Spark NLP
spark = sparknlp.start()

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

tokenClassifier = AlbertForTokenClassification.pretrained() \
    .setInputCols(["token", "document"]) \
    .setOutputCol("label") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    tokenClassifier
])

data = spark.createDataFrame([["John Lenon was born in London and lived in Paris. My name is Sarah and I live in London"]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.select("label.result").show(truncate=False)
+------------------------------------------------------------------------------------+
|result                                                                              |
+------------------------------------------------------------------------------------+
|[B-PER, I-PER, O, O, O, B-LOC, O, O, O, B-LOC, O, O, O, O, B-PER, O, O, O, O, B-LOC]|
+------------------------------------------------------------------------------------+
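The tag list in the output above is aligned one-to-one with the tokens. As a minimal sketch of how such IOB tags could be merged into whole entities, the plain-Python helper below (a hypothetical `group_entities`, not a Spark NLP function) walks over tokens and tags together; the token list is written out by hand for illustration.

```python
from typing import List, Tuple


def group_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge B-/I- tagged tokens into (entity text, entity type) pairs."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "I" and current and label == current_type:
            current.append(token)  # continue the open entity
            continue
        if current:  # any other tag closes the open entity
            entities.append((" ".join(current), current_type))
            current, current_type = [], None
        if prefix in ("B", "I"):  # start a new entity (a stray I- tag is treated like B-)
            current, current_type = [token], label
    if current:
        entities.append((" ".join(current), current_type))
    return entities


tokens = ["John", "Lenon", "was", "born", "in", "London", "and", "lived", "in", "Paris",
          ".", "My", "name", "is", "Sarah", "and", "I", "live", "in", "London"]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC", "O", "O", "O", "B-LOC",
        "O", "O", "O", "O", "B-PER", "O", "O", "O", "O", "B-LOC"]

for text, entity_type in group_entities(tokens, tags):
    print(f"{entity_type}: {text}")
```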

BartForZeroShotClassification

BartForZeroShotClassification uses a ModelForSequenceClassification trained on NLI (natural language inference) tasks. It is the equivalent of BartForSequenceClassification models, but it does not require a hardcoded number of potential classes; they can be chosen at runtime. This usually makes it slower, but much more flexible.

Note that the model will loop through all provided labels. So the more labels you have, the longer this process will take.

Any combination of sequences and labels can be passed and each combination will be posed as a premise/hypothesis pair and passed to the pretrained model.
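The pairing described above can be sketched in a few lines of plain Python. The hypothesis template "This example is {}." is an assumption borrowed from common zero-shot setups, and `build_nli_pairs` is a hypothetical helper, not part of the Spark NLP API.

```python
from itertools import product
from typing import List, Tuple


def build_nli_pairs(sequences: List[str], labels: List[str],
                    template: str = "This example is {}.") -> List[Tuple[str, str]]:
    """Pair every sequence (premise) with one hypothesis per candidate label."""
    return [(seq, template.format(label)) for seq, label in product(sequences, labels)]


sequences = ["I loved this movie when I was a child.", "It was pretty boring."]
labels = ["positive", "negative"]  # candidate labels chosen at runtime

pairs = build_nli_pairs(sequences, labels)
for premise, hypothesis in pairs:
    print(f"{premise!r} -> {hypothesis!r}")
```

The NLI model would run once per pair, so the work grows with len(sequences) * len(labels), which is why more labels mean longer processing.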

Pretrained models can be loaded with pretrained of the companion object:

val sequenceClassifier = BartForZeroShotClassification.pretrained()
  .setInputCols("token", "document")
  .setOutputCol("label")

The default model is "bart_large_zero_shot_classifier_mnli", if no name is provided.

For available pretrained models please see the Models Hub.

To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669.

Input Annotator Types: TOKEN, DOCUMENT

Output Annotator Type: CATEGORY

Python API: BartForZeroShotClassification

Scala API: BartForZeroShotClassification

Source: BartForZeroShotClassification

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Start a Spark session with Spark NLP
spark = sparknlp.start()

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

sequenceClassifier = BartForZeroShotClassification.pretrained() \
    .setInputCols(["token", "document"]) \
    .setOutputCol("label") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    sequenceClassifier
])

# One row per sentence, each with a single "text" column
data = spark.createDataFrame([["I loved this movie when I was a child."], ["It was pretty boring."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("label.result").show(truncate=False)
+------+
|result|
+------+
|[pos] |
|[neg] |
+------+

BartTransformer

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension Transformer

The Facebook BART (Bidirectional and Auto-Regressive Transformer) model is a state-of-the-art language generation model that was introduced by Facebook AI in 2019. It is based on the transformer architecture and is designed to handle a wide range of natural language processing tasks such as text generation, summarization, and machine translation.

BART combines a bidirectional encoder with a left-to-right auto-regressive decoder. The encoder captures contextual information from both preceding and following tokens in the input, while the decoder generates the output one token at a time, resulting in more accurate and natural language generation.

The model was trained on a large corpus of text data using a combination of unsupervised and supervised learning techniques. It incorporates pretraining and fine-tuning phases, where the model is first trained on a large unlabeled corpus of text, and then fine-tuned on specific downstream tasks.

BART has achieved state-of-the-art performance on a wide range of NLP tasks, including summarization, question-answering, and language translation. Its ability to handle multiple tasks and its high performance on each of these tasks make it a versatile and valuable tool for natural language processing applications.

Pretrained models can be loaded with pretrained of the companion object:

val bart = BartTransformer.pretrained()
  .setInputCols("document")
  .setOutputCol("generation")

The default model is "bart_large_cnn", if no name is provided. For available pretrained models please see the Models Hub.

For extended examples of usage, see BartTestSpec.

References:

Paper Abstract:

We present BART, a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Transformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and other recent pretraining schemes. We evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of sentences and using a novel in-filling scheme, where spans of text are replaced with a single mask token. BART is particularly effective when fine tuned for text generation but also works well for comprehension tasks. It matches the performance of RoBERTa on GLUE and SQuAD, and achieves new state-of-the-art results on a range of abstractive dialogue, question answering, and summarization tasks, with gains of up to 3.5 ROUGE. BART also provides a 1.1 BLEU increase over a back-translation system for machine translation, with only target language pretraining. We also replicate other pretraining schemes within the BART framework, to understand their effect on end-task performance.
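The two noising operations named in the abstract, sentence shuffling and text in-filling, can be sketched on a toy input. This is a simplified illustration in plain Python, not the pretraining code: the span positions and sentence order are passed in explicitly rather than sampled, and `infill`, `permute_sentences`, and the `<mask>` string are names chosen for this example.

```python
from typing import List, Tuple

MASK = "<mask>"


def infill(tokens: List[str], spans: List[Tuple[int, int]]) -> List[str]:
    """Replace each [start, end) token span with a single mask token.

    Spans must be sorted and non-overlapping.
    """
    out, cursor = [], 0
    for start, end in spans:
        out.extend(tokens[cursor:start])  # copy the untouched tokens before the span
        out.append(MASK)                  # the whole span collapses to one mask
        cursor = end
    out.extend(tokens[cursor:])
    return out


def permute_sentences(sentences: List[str], order: List[int]) -> List[str]:
    """Reorder sentences; during pretraining the order would be random."""
    return [sentences[i] for i in order]


tokens = "BART is trained by corrupting text and reconstructing it".split()
print(" ".join(infill(tokens, [(2, 5), (7, 8)])))
# prints "BART is <mask> text and <mask> it"
```

Because a multi-token span becomes a single mask, the model also has to work out how many tokens are missing when it reconstructs the original text.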

Note:

This is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

Input Annotator Types: DOCUMENT

Output Annotator Type: DOCUMENT

Python API: BartTransformer

Scala API: BartTransformer

Source: BartTransformer

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Start a Spark session with Spark NLP
spark = sparknlp.start()

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")
bart = BartTransformer.pretrained("bart_large_cnn") \
    .setTask("summarize:") \
    .setInputCols(["documents"]) \
    .setMaxOutputLength(200) \
    .setOutputCol("summaries")

pipeline = Pipeline().setStages([documentAssembler, bart])

data = spark.createDataFrame([[
    "Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a " +
    "downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness" +
    " of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this " +
    "paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework " +
    "that converts all text-based language problems into a text-to-text format. Our systematic study compares " +
    "pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens " +
    "of language understanding tasks. By combining the insights from our exploration with scale and our new " +
    "Colossal Clean Crawled Corpus, we achieve state-of-the-art results on many benchmarks covering " +
    "summarization, question answering, text classification, and more. To facilitate future work on transfer " +
    "learning for NLP, we release our data set, pre-trained models, and code."
]]).toDF("text")

result = pipeline.fit(data).transform(data)
result.select("summaries.result").show(truncate=False)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                        |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[transfer learning has emerged as a powerful technique in natural language processing (NLP) the effectiveness of transfer learning has given rise to a diversity of approaches, methodologies, and practice .]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

BertEmbeddings

Token-level embeddings using BERT. BERT (Bidirectional Encoder Representations from Transformers) provides dense vector representations for natural language by using a deep, pre-trained neural network with the Transformer architecture.

Pretrained models can be loaded with pretrained of the companion object:

val embeddings = BertEmbeddings.pretrained()
  .setInputCols("token", "document")
  .setOutputCol("bert_embeddings")

The default model is "small_bert_L2_768", if no name is provided.

For available pretrained models please see the Models Hub.

For extended examples of usage, see the Examples and the BertEmbeddingsTestSpec.

Sources:

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

https://github.com/google-research/bert

Paper abstract

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: WORD_EMBEDDINGS

Python API: BertEmbeddings

Scala API: BertEmbeddings

Source: BertEmbeddings

Show Examples

This example shows how to predict classes by using the embeddings generated by the Transformer.

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Start a Spark session with Spark NLP
spark = sparknlp.start()

# First extract the prerequisites for the NerDLModel
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# Use the transformer embeddings
embeddings = BertEmbeddings.pretrained(name='bert_base_cased', lang='en') \
    .setInputCols(['document', 'token']) \
    .setOutputCol('embeddings')

# This pretrained model requires those specific transformer embeddings
ner_model = NerDLModel.pretrained("ner_dl_bert", "en") \
    .setInputCols(["document", "token", "embeddings"]) \
    .setOutputCol("ner")

pipeline = Pipeline().setStages([
    documentAssembler,
    sentence,
    tokenizer,
    embeddings,
    ner_model
])

data = spark.createDataFrame([["U.N. official Ekeus heads for Baghdad."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.select("ner.result").show(truncate=False)
+------------------------------------+
|result                              |
+------------------------------------+
|[I-LOC, O, O, I-PER, O, O, I-LOC, O]|
+------------------------------------+

BertForQuestionAnswering

BertForQuestionAnswering can load Bert Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits).

Pretrained models can be loaded with pretrained of the companion object:

val spanClassifier = BertForQuestionAnswering.pretrained()
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")

The default model is "bert_base_cased_qa_squad2", if no name is provided.

For available pretrained models please see the Models Hub.

Input Annotator Types: DOCUMENT, DOCUMENT

Output Annotator Type: CHUNK

Python API: BertForQuestionAnswering

Scala API: BertForQuestionAnswering

Source: BertForQuestionAnswering

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Start a Spark session with Spark NLP
spark = sparknlp.start()

documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = BertForQuestionAnswering.pretrained() \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(False)

pipeline = Pipeline().setStages([
    documentAssembler,
    spanClassifier
])

data = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)

result.select("answer.result").show(truncate=False)
+--------------------+
|result              |
+--------------------+
|[Clara]             |
+--------------------+

BertForSequenceClassification

BertForSequenceClassification can load BERT models with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for document classification tasks.

For multi-class, use setActivation("softmax"). For multi-label, use setActivation("sigmoid").
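The choice of activation also changes how scores are read. The sketch below shows one way the two cases could be decoded in plain Python; the score dictionaries, the helper names, and the 0.5 threshold are assumptions for illustration and are not taken from Spark NLP.

```python
from typing import Dict, List


def decode_multi_class(probs: Dict[str, float]) -> str:
    """Softmax-style scores: return the single highest-scoring label."""
    return max(probs, key=probs.get)


def decode_multi_label(probs: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Sigmoid-style scores: return every label whose score clears the threshold."""
    return [label for label, p in probs.items() if p >= threshold]


softmax_scores = {"pos": 0.7, "neg": 0.2, "neutral": 0.1}          # sums to 1
sigmoid_scores = {"sports": 0.9, "politics": 0.6, "weather": 0.1}  # independent scores

print(decode_multi_class(softmax_scores))  # prints "pos"
print(decode_multi_label(sigmoid_scores))  # prints ['sports', 'politics']
```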

Pretrained models can be loaded with pretrained of the companion object:

val sequenceClassifier = BertForSequenceClassification.pretrained()
  .setInputCols("token", "document")
  .setOutputCol("label")

The default model is "bert_base_sequence_classifier_imdb", if no name is provided.

For available pretrained models please see the Models Hub.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: CATEGORY

Python API: BertForSequenceClassification

Scala API: BertForSequenceClassification

Source: BertForSequenceClassification

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Start a Spark session with Spark NLP
spark = sparknlp.start()

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

sequenceClassifier = BertForSequenceClassification.pretrained() \
    .setInputCols(["token", "document"]) \
    .setOutputCol("label") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    sequenceClassifier
])

# One row per sentence, each with a single "text" column
data = spark.createDataFrame([["I loved this movie when I was a child."], ["It was pretty boring."]]).toDF("text")

result = pipeline.fit(data).transform(data)
result.select("label.result").show(truncate=False)
+------+
|result|
+------+
|[pos] |
|[neg] |
+------+

BertForTokenClassification

BertForTokenClassification can load Bert Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.

Pretrained models can be loaded with pretrained of the companion object:

val labels = BertForTokenClassification.pretrained()
  .setInputCols("token", "document")
  .setOutputCol("label")

The default model is "bert_base_token_classifier_conll03", if no name is provided.

For available pretrained models please see the Models Hub.

Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them, see Import Transformers into Spark NLP 🚀. For extended examples, see the BertForTokenClassificationTestSpec.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: NAMED_ENTITY

Python API: BertForTokenClassification

Scala API: BertForTokenClassification

Source: BertForTokenClassification

Show Examples

This example shows how to predict classes by using the embeddings generated by the Transformer.

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Start a Spark session with Spark NLP
spark = sparknlp.start()

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

tokenClassifier = BertForTokenClassification.pretrained() \
    .setInputCols(["token", "document"]) \
    .setOutputCol("label") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    tokenClassifier
])

data = spark.createDataFrame([["John Lenon was born in London and lived in Paris. My name is Sarah and I live in London"]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.select("label.result").show(truncate=False)
+------------------------------------------------------------------------------------+
|result                                                                              |
+------------------------------------------------------------------------------------+
|[B-PER, I-PER, O, O, O, B-LOC, O, O, O, B-LOC, O, O, O, O, B-PER, O, O, O, O, B-LOC]|
+------------------------------------------------------------------------------------+

BertForZeroShotClassification

BertForZeroShotClassification uses a ModelForSequenceClassification trained on NLI (natural language inference) tasks. It is the equivalent of BertForSequenceClassification models, but it does not require a hardcoded number of potential classes; they can be chosen at runtime. This usually makes it slower, but much more flexible.

Note that the model will loop through all provided labels. So the more labels you have, the longer this process will take.

Any combination of sequences and labels can be passed and each combination will be posed as a premise/hypothesis pair and passed to the pretrained model.
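Once every premise/hypothesis pair has been scored, the per-label results still have to be turned into a ranking. One common convention, assumed here from the Hugging Face zero-shot pipeline rather than confirmed for this annotator, is to take the entailment logit for each label and normalize across labels with a softmax. The logits and the `rank_labels` helper below are hypothetical.

```python
import math
from typing import Dict


def rank_labels(entailment_logits: Dict[str, float]) -> Dict[str, float]:
    """Softmax over per-label entailment logits, sorted from best to worst."""
    shift = max(entailment_logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(v - shift) for label, v in entailment_logits.items()}
    total = sum(exps.values())
    scores = {label: v / total for label, v in exps.items()}
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))


# Hypothetical entailment logits, one NLI forward pass per candidate label.
logits = {"sport": 3.1, "politics": -0.4, "technology": 0.2}
ranked = rank_labels(logits)
print(next(iter(ranked)))  # prints "sport"
```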

Pretrained models can be loaded with pretrained of the companion object:

val sequenceClassifier = BertForZeroShotClassification.pretrained()
  .setInputCols("token", "document")
  .setOutputCol("label")

The default model is "bert_base_cased_zero_shot_classifier_xnli", if no name is provided.

For available pretrained models please see the Models Hub.

To see which models are compatible and how to import them, see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669. For more extended examples, see BertForZeroShotClassification.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: CATEGORY

Python API: BertForZeroShotClassification

Scala API: BertForZeroShotClassification

Source: BertForZeroShotClassification

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Start a Spark session with Spark NLP
spark = sparknlp.start()

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

sequenceClassifier = BertForZeroShotClassification.pretrained() \
    .setInputCols(["token", "document"]) \
    .setOutputCol("label") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    sequenceClassifier
])

# One row per sentence, each with a single "text" column
data = spark.createDataFrame([["I loved this movie when I was a child."], ["It was pretty boring."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("label.result").show(truncate=False)
+------+
|result|
+------+
|[pos] |
|[neg] |
+------+

BertSentenceEmbeddings

Sentence-level embeddings using BERT. BERT (Bidirectional Encoder Representations from Transformers) provides dense vector representations for natural language by using a deep, pre-trained neural network with the Transformer architecture.

Pretrained models can be loaded with pretrained of the companion object:

val embeddings = BertSentenceEmbeddings.pretrained()
  .setInputCols("sentence")
  .setOutputCol("sentence_bert_embeddings")

The default model is "sent_small_bert_L2_768", if no name is provided.

For available pretrained models please see the Models Hub.

For extended examples of usage, see the Examples and the BertSentenceEmbeddingsTestSpec.

Sources:

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

https://github.com/google-research/bert

Paper abstract

Input Annotator Types: DOCUMENT

Output Annotator Type: SENTENCE_EMBEDDINGS

Python API: BertSentenceEmbeddings

Scala API: BertSentenceEmbeddings

Source: BertSentenceEmbeddings

Show Examples

This example shows how to predict classes by using the embeddings generated by the Transformer.

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# Start a Spark session with Spark NLP
spark = sparknlp.start()

# First extract the prerequisites for the ClassifierDLModel
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Use the transformer embeddings
embeddings = BertSentenceEmbeddings\
  .pretrained('sent_bert_multi_cased', 'xx') \
  .setInputCols(["document"]) \
  .setOutputCol("sentence_embeddings")

# This pretrained model requires those specific transformer embeddings
document_classifier = ClassifierDLModel.pretrained("classifierdl_bert_news", "de") \
  .setInputCols(["document", "sentence_embeddings"]) \
  .setOutputCol("class")

pipeline = Pipeline().setStages([
    documentAssembler,
    embeddings,
    document_classifier
])

data = spark.createDataFrame([["Dressurreiterin Jessica von Bredow-Werndl hat ihr zweites Olympia-Gold gewonnen"]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.select("class.result").show(truncate=False)
+-------+
|result |
+-------+
|[Sport]|
+-------+

CLIPForZeroShotClassification

Zero Shot Image Classifier based on CLIP.

CLIP (Contrastive Language-Image Pre-Training) is a neural network that was trained on image and text pairs. It can classify images without being trained on a fixed set of hard-coded labels. This makes it very flexible, as the candidate labels can be provided at inference time. This is similar to the zero-shot capabilities of the GPT-2 and GPT-3 models.
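
To make the zero-shot idea concrete, here is a minimal pure-Python sketch of how candidate labels can be scored against an image once both have been embedded in a shared vector space. It does not use the Spark NLP annotator: the toy vectors, the `zero_shot_classify` helper, and the `scale` value are assumptions chosen only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_vec, label_vecs, scale=100.0):
    """Score every candidate label against the image and return the best one.

    `scale` sharpens the softmax; the value used here is an assumption.
    """
    labels = list(label_vecs)
    sims = [scale * cosine(image_vec, label_vecs[name]) for name in labels]
    probs = softmax(sims)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], dict(zip(labels, probs))

# Made-up 3-dimensional vectors standing in for real image/text encoder outputs
image_vec = [0.9, 0.1, 0.2]
label_vecs = {
    "a photo of a cat": [1.0, 0.0, 0.1],
    "a photo of a dog": [0.1, 1.0, 0.0],
    "a photo of a hen": [0.0, 0.2, 1.0],
}

best_label, probs = zero_shot_classify(image_vec, label_vecs)
print(best_label)
```

Because the labels are only inputs to the scoring step, swapping in a different list of candidate labels requires no retraining.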

Pretrained models can be loaded with pretrained of the companion object:

val imageClassifier = CLIPForZeroShotClassification.pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("label")

The default model is "zero_shot_classifier_clip_vit_base_patch32", if no name is provided.

For available pretrained models please see the Models Hub.

Input Annotator Types: IMAGE

Output Annotator Type: CATEGORY

Python API: CLIPForZeroShotClassification

Scala API: CLIPForZeroShotClassification

Source: CLIPForZeroShotClassification

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

imageDF = spark.read \
    .format("image") \
    .option("dropInvalid", value = True) \
    .load("src/test/resources/image/")

imageAssembler: ImageAssembler = ImageAssembler() \
    .setInputCol("image") \
    .setOutputCol("image_assembler")

candidateLabels = [
    "a photo of a bird",
    "a photo of a cat",
    "a photo of a dog",
    "a photo of a hen",
    "a photo of a hippo",
    "a photo of a room",
    "a photo of a tractor",
    "a photo of an ostrich",
    "a photo of an ox"]

imageClassifier = CLIPForZeroShotClassification \
    .pretrained() \
    .setInputCols(["image_assembler"]) \
    .setOutputCol("label") \
    .setCandidateLabels(candidateLabels)

pipeline = Pipeline().setStages([imageAssembler, imageClassifier])
pipelineDF = pipeline.fit(imageDF).transform(imageDF)
pipelineDF \
  .selectExpr("reverse(split(image.origin, '/'))[0] as image_name", "label.result") \
  .show(truncate=False)

CamemBertEmbeddings

The CamemBERT model was proposed in CamemBERT: a Tasty French Language Model by Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suárez, Yoann Dupont, Laurent Romary, Éric Villemonte de la Clergerie, Djamé Seddah, and Benoît Sagot. It is based on Facebook’s RoBERTa model released in 2019. It is a model trained on 138GB of French text.

Pretrained models can be loaded with pretrained of the companion object:

val embeddings = CamemBertEmbeddings.pretrained()
  .setInputCols("token", "document")
  .setOutputCol("camembert_embeddings")

The default model is "camembert_base", if no name is provided.

For available pretrained models please see the Models Hub.

For extended examples of usage, see the Examples and the CamemBertEmbeddingsTestSpec. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669.

Sources:

CamemBERT: a Tasty French Language Model

https://huggingface.co/camembert

Paper abstract

Pretrained language models are now ubiquitous in Natural Language Processing. Despite their success, most available models have either been trained on English data or on the concatenation of data in multiple languages. This makes practical use of such models –in all languages except English– very limited. In this paper, we investigate the feasibility of training monolingual Transformer-based language models for other languages, taking French as an example and evaluating our language models on part-of-speech tagging, dependency parsing, named entity recognition and natural language inference tasks. We show that the use of web crawled data is preferable to the use of Wikipedia data. More surprisingly, we show that a relatively small web crawled dataset (4GB) leads to results that are as good as those obtained using larger datasets (130+GB). Our best performing model CamemBERT reaches or improves the state of the art in all four downstream tasks.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: WORD_EMBEDDINGS

Python API: CamemBertEmbeddings

Scala API: CamemBertEmbeddings

Source: CamemBertEmbeddings

Show Examples

This example shows how to predict classes by using the embeddings generated by the Transformer.

# Coming Soon!

This example shows how to train an Approach Annotator by using the embeddings generated by the Transformer.

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

# First extract the prerequisites for the NerDLApproach
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# Use the transformer embeddings
embeddings = CamemBertEmbeddings.pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings") \
    .setCaseSensitive(True)

# Then the training can start with the transformer embeddings
nerTagger = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setMaxEpochs(1) \
    .setVerbose(0)

pipeline = Pipeline().setStages([
    documentAssembler,
    sentence,
    tokenizer,
    embeddings,
    nerTagger
])

This example shows how to extract the embeddings generated by the Transformer.

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")
embeddings = CamemBertEmbeddings.pretrained() \
    .setInputCols(["token", "document"]) \
    .setOutputCol("camembert_embeddings")
embeddingsFinisher = EmbeddingsFinisher() \
    .setInputCols(["camembert_embeddings"]) \
    .setOutputCols("finished_embeddings") \
    .setOutputAsVector(True)
pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    embeddings,
    embeddingsFinisher
])

data = spark.createDataFrame([["C'est une phrase."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[0.08442357927560806,-0.12863239645957947,-0.03835778683423996,0.200479581952...|
|[0.048462312668561935,0.12637358903884888,-0.27429091930389404,-0.07516729831...|
|[0.02690504491329193,0.12104076147079468,0.012526623904705048,-0.031543646007...|
|[0.05877285450696945,-0.08773420006036758,-0.06381352990865707,0.122621834278...|
+--------------------------------------------------------------------------------+

CamemBertForQuestionAnswering

CamemBertForQuestionAnswering can load CamemBERT Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits).
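
As a rough illustration of how start and end logits turn into an answer, the sketch below picks the span whose start and end scores sum to the highest value. It is plain Python, not the annotator's actual decoding code; the `best_span` helper, the `max_answer_len` limit, and the logit values are all hypothetical.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end, score) maximizing start + end logit with start <= end."""
    best = None
    for s, s_logit in enumerate(start_logits):
        # Only consider spans that start at s and are at most max_answer_len long
        last = min(len(end_logits), s + max_answer_len)
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if best is None or score > best[2]:
                best = (s, e, score)
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
# Made-up logits, one pair per context token
start_logits = [0.1, 0.0, 0.2, 4.5, 0.1, 0.0, 0.1, 0.3, 2.0, 0.0]
end_logits = [0.0, 0.1, 0.1, 4.0, 0.2, 0.0, 0.1, 0.1, 2.5, 0.1]

s, e, score = best_span(start_logits, end_logits)
answer = " ".join(tokens[s:e + 1])
print(answer)
```

Restricting the search to `start <= end` and to a maximum length keeps the result a valid, reasonably short span of the context.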

Pretrained models can be loaded with pretrained of the companion object:

val spanClassifier = CamemBertForQuestionAnswering.pretrained()
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")

The default model is "camembert_base_qa_fquad", if no name is provided.

For available pretrained models please see the Models Hub.

To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see CamemBertForQuestionAnsweringTestSpec.

Input Annotator Types: DOCUMENT, DOCUMENT

Output Annotator Type: CHUNK

Python API: CamemBertForQuestionAnswering

Scala API: CamemBertForQuestionAnswering

Source: CamemBertForQuestionAnswering

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])
spanClassifier = CamemBertForQuestionAnswering.pretrained() \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(False)
pipeline = Pipeline().setStages([
    documentAssembler,
    spanClassifier
])

data = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
result.select("answer.result").show(truncate=False)
+--------------------+
|result              |
+--------------------+
|[Clara]             |
+--------------------+

CamemBertForSequenceClassification

CamemBertForSequenceClassification can load CamemBERT Models with sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for document classification tasks.

For multi-class, use setActivation("softmax"). For multi-label, use setActivation("sigmoid").
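
The difference between the two activations can be sketched in plain Python. This is an illustration only: the label names, the logits, and the 0.5 decision threshold are assumptions, not values taken from the annotator.

```python
import math

def softmax(logits):
    """Probabilities that sum to 1: exactly one class is expected to win."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]

def sigmoid(logits):
    """Independent probability per label: several labels may be active at once."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]

labels = ["pos", "neg", "neutral"]  # hypothetical label set
logits = [2.0, 1.5, -3.0]           # made-up classifier outputs

# Multi-class: pick the single most probable class
multi_class = softmax(logits)
predicted_class = labels[multi_class.index(max(multi_class))]

# Multi-label: keep every label whose own probability clears the threshold
multi_label = sigmoid(logits)
predicted_labels = [name for name, p in zip(labels, multi_label) if p >= 0.5]

print(predicted_class)
print(predicted_labels)
```

With softmax the scores compete with each other, while with sigmoid each label is judged on its own, which is why the second result can contain more than one label.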

Pretrained models can be loaded with pretrained of the companion object:

val sequenceClassifier = CamemBertForSequenceClassification.pretrained()
  .setInputCols("token", "document")
  .setOutputCol("label")

The default model is "camembert_base_sequence_classifier_allocine", if no name is provided.

For available pretrained models please see the Models Hub.

To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see CamemBertForSequenceClassification.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: CATEGORY

Python API: CamemBertForSequenceClassification

Scala API: CamemBertForSequenceClassification

Source: CamemBertForSequenceClassification

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")
sequenceClassifier = CamemBertForSequenceClassification.pretrained() \
    .setInputCols(["token", "document"]) \
    .setOutputCol("label") \
    .setCaseSensitive(True)
pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    sequenceClassifier
])
# One row per input text, matching the single "text" column
data = spark.createDataFrame([["j'ai adoré ce film lorsque j'étais enfant."], ["Je déteste ça."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("label.result").show(truncate=False)
+------+
|result|
+------+
|[pos] |
|[neg] |
+------+

CamemBertForTokenClassification

CamemBertForTokenClassification can load CamemBERT Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.

Pretrained models can be loaded with pretrained of the companion object:

val tokenClassifier = CamemBertForTokenClassification.pretrained()
  .setInputCols("token", "document")
  .setOutputCol("label")

The default model is "camembert_base_token_classifier_wikiner", if no name is provided.

For available pretrained models please see the Models Hub.

For extended examples of usage, see the Examples and the CamemBertForTokenClassificationTestSpec. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: NAMED_ENTITY

Python API: CamemBertForTokenClassification

Scala API: CamemBertForTokenClassification

Source: CamemBertForTokenClassification

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")
tokenClassifier = CamemBertForTokenClassification.pretrained() \
    .setInputCols(["token", "document"]) \
    .setOutputCol("label") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    tokenClassifier
])

data = spark.createDataFrame([["george washington est allé à washington"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("label.result").show(truncate=False)
+------------------------------+
|result                        |
+------------------------------+
|[I-PER, I-PER, O, O, O, I-LOC]|
+------------------------------+

ConvNextForImageClassification

ConvNextForImageClassification is an image classifier based on ConvNet models.

The ConvNeXT model was proposed in A ConvNet for the 2020s by Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie. ConvNeXT is a pure convolutional model (ConvNet), inspired by the design of Vision Transformers, that claims to outperform them.

Pretrained models can be loaded with pretrained of the companion object:

val imageClassifier = ConvNextForImageClassification.pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("class")

The default model is "image_classifier_convnext_tiny_224_local", if no name is provided.

For available pretrained models please see the Models Hub.

References:

A ConvNet for the 2020s

Paper Abstract:

The “Roaring 20s” of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions. In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually “modernize” a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.

Input Annotator Types: IMAGE

Output Annotator Type: CATEGORY

Python API: ConvNextForImageClassification

Scala API: ConvNextForImageClassification

Source: ConvNextForImageClassification

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
imageDF = spark.read \
    .format("image") \
    .option("dropInvalid", value = True) \
    .load("src/test/resources/image/")
imageAssembler = ImageAssembler() \
    .setInputCol("image") \
    .setOutputCol("image_assembler")
imageClassifier = ConvNextForImageClassification \
    .pretrained() \
    .setInputCols(["image_assembler"]) \
    .setOutputCol("class")
pipeline = Pipeline().setStages([imageAssembler, imageClassifier])
pipelineDF = pipeline.fit(imageDF).transform(imageDF)
pipelineDF \
  .selectExpr("reverse(split(image.origin, '/'))[0] as image_name", "class.result") \
  .show(truncate=False)
+-----------------+----------------------------------------------------------+
|image_name       |result                                                    |
+-----------------+----------------------------------------------------------+
|bluetick.jpg     |[bluetick]                                                |
|chihuahua.jpg    |[Chihuahua]                                               |
|egyptian_cat.jpeg|[tabby, tabby cat]                                        |
|hen.JPEG         |[hen]                                                     |
|hippopotamus.JPEG|[hippopotamus, hippo, river horse, Hippopotamus amphibius]|
|junco.JPEG       |[junco, snowbird]                                         |
|ostrich.JPEG     |[ostrich, Struthio camelus]                               |
|ox.JPEG          |[ox]                                                      |
|palace.JPEG      |[palace]                                                  |
|tractor.JPEG     |[thresher, thrasher, threshing machine]                   |
+-----------------+----------------------------------------------------------+

DeBertaEmbeddings

The DeBERTa model was proposed in DeBERTa: Decoding-enhanced BERT with Disentangled Attention by Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen. It is based on Google’s BERT model released in 2018 and Facebook’s RoBERTa model released in 2019.

This model requires input tokenization with SentencePiece model, which is provided by Spark NLP (See tokenizers package).

Pretrained models can be loaded with pretrained of the companion object:

val embeddings = DeBertaEmbeddings.pretrained()
 .setInputCols("sentence", "token")
 .setOutputCol("embeddings")

The default model is "deberta_v3_base", if no name is provided.

For extended examples see DeBertaEmbeddingsTestSpec. Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669.

It builds on RoBERTa with disentangled attention and an enhanced mask decoder, and is trained with half of the data used for RoBERTa.

Sources:

https://github.com/microsoft/DeBERTa

https://www.microsoft.com/en-us/research/blog/microsoft-deberta-surpasses-human-performance-on-the-superglue-benchmark/

Paper abstract:

Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with disentangled attention) that improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions. Second, an enhanced mask decoder is used to replace the output softmax layer to predict the masked tokens for model pretraining. We show that these two techniques significantly improve the efficiency of model pretraining and performance of downstream tasks. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). The DeBERTa code and pre-trained models will be made publicly available at https://github.com/microsoft/DeBERTa.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: WORD_EMBEDDINGS

Python API: DeBertaEmbeddings

Scala API: DeBertaEmbeddings

Source: DeBertaEmbeddings

Show Examples

This example shows how to predict classes by using the embeddings generated by the Transformer.

# Coming Soon!

This example shows how to train an Approach Annotator by using the embeddings generated by the Transformer.

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

# First extract the prerequisites for the NerDLApproach
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

embeddings = DeBertaEmbeddings.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

# Then the training can start with the transformer embeddings
nerTagger = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setMaxEpochs(1) \
    .setVerbose(0)

pipeline = Pipeline().setStages([
    documentAssembler,
    sentence,
    tokenizer,
    embeddings,
    nerTagger
])

# We use the text and labels from the CoNLL dataset
conll = CoNLL()
trainingData = conll.readDataset(spark, "eng.train")

pipelineModel = pipeline.fit(trainingData)

This example shows how to extract the embeddings generated by the Transformer.

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

embeddings = DeBertaEmbeddings.pretrained() \
    .setInputCols(["token", "document"]) \
    .setOutputCol("embeddings")

embeddingsFinisher = EmbeddingsFinisher() \
    .setInputCols(["embeddings"]) \
    .setOutputCols("finished_embeddings") \
    .setOutputAsVector(True) \
    .setCleanAnnotations(False)

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    embeddings,
    embeddingsFinisher
])

data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[1.1342473030090332,-1.3855540752410889,0.9818322062492371,-0.784737348556518...|
|[0.847029983997345,-1.047153353691101,-0.1520637571811676,-0.6245765686035156...|
|[-0.009860038757324219,-0.13450059294700623,2.707749128341675,1.2916892766952...|
|[-0.04192575812339783,-0.5764210224151611,-0.3196685314178467,-0.527840495109...|
|[0.15583214163780212,-0.1614152491092682,-0.28423872590065,-0.135491415858268...|
+--------------------------------------------------------------------------------+
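
Each row above is the vector of one token. If a single vector for the whole text is needed, one common option is to average the token vectors. The sketch below shows that averaging in plain Python on made-up 3-dimensional vectors; the `mean_pool` helper is hypothetical and is not part of the pipeline above.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one vector."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    if any(len(v) != dim for v in token_vectors):
        raise ValueError("all token vectors must have the same length")
    n = len(token_vectors)
    # Component-wise mean across all tokens
    return [sum(v[i] for v in token_vectors) / n for i in range(dim)]

# Toy stand-ins for the per-token vectors in finished_embeddings
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
]

doc_vector = mean_pool(token_vectors)
print(doc_vector)  # [2.0, 2.0, 1.0]
```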

DeBertaForQuestionAnswering

DeBertaForQuestionAnswering can load DeBERTa Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits).

Pretrained models can be loaded with pretrained of the companion object:

val spanClassifier = DeBertaForQuestionAnswering.pretrained()
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")

The default model is "deberta_v3_xsmall_qa_squad2", if no name is provided.

For available pretrained models please see the Models Hub.

To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669. For extended examples, see the DeBertaForQuestionAnsweringTestSpec.

Input Annotator Types: DOCUMENT, DOCUMENT

Output Annotator Type: CHUNK

Python API: DeBertaForQuestionAnswering

Scala API: DeBertaForQuestionAnswering

Source: DeBertaForQuestionAnswering

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = DeBertaForQuestionAnswering.pretrained() \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(False)

pipeline = Pipeline().setStages([
    documentAssembler,
    spanClassifier
])

data = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)

result.select("answer.result").show(truncate=False)
+--------------------+
|result              |
+--------------------+
|[Clara]             |
+--------------------+

DeBertaForSequenceClassification

DeBertaForSequenceClassification can load DeBerta v2 & v3 Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks.

Pretrained models can be loaded with pretrained of the companion object:

val sequenceClassifier = DeBertaForSequenceClassification.pretrained()
  .setInputCols("token", "document")
  .setOutputCol("label")

The default model is "deberta_v3_xsmall_sequence_classifier_imdb", if no name is provided.

For available pretrained models please see the Models Hub.

To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see DeBertaForSequenceClassification.

Input Annotator Types: TOKEN, DOCUMENT

Output Annotator Type: CATEGORY

Python API: DeBertaForSequenceClassification

Scala API: DeBertaForSequenceClassification

Source: DeBertaForSequenceClassification

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")
sequenceClassifier = DeBertaForSequenceClassification.pretrained() \
    .setInputCols(["token", "document"]) \
    .setOutputCol("label") \
    .setCaseSensitive(True)
pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    sequenceClassifier
])

# One row per input text, matching the single "text" column
data = spark.createDataFrame([["I loved this movie when I was a child."], ["It was pretty boring."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("label.result").show(truncate=False)
+------+
|result|
+------+
|[pos] |
|[neg] |
+------+

DeBertaForTokenClassification

DeBertaForTokenClassification can load DeBERTa v2 and v3 models with a token classification head on top (a linear layer on top of the hidden-states output), e.g. for Named-Entity-Recognition (NER) tasks.

Pretrained models can be loaded with pretrained of the companion object:

val tokenClassifier = DeBertaForTokenClassification.pretrained()
  .setInputCols("token", "document")
  .setOutputCol("label")

The default model is "deberta_v3_xsmall_token_classifier_conll03", if no name is provided.

For available pretrained models please see the Models Hub.

For extended examples of usage, see the Examples and the DeBertaForTokenClassificationTestSpec. Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669.

Input Annotator Types: TOKEN, DOCUMENT

Output Annotator Type: NAMED_ENTITY

Python API: DeBertaForTokenClassification

Scala API: DeBertaForTokenClassification

Source: DeBertaForTokenClassification

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")
tokenClassifier = DeBertaForTokenClassification.pretrained() \
    .setInputCols(["token", "document"]) \
    .setOutputCol("label") \
    .setCaseSensitive(True)
pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    tokenClassifier
])

data = spark.createDataFrame([["John Lenon was born in London and lived in Paris. My name is Sarah and I live in London"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("label.result").show(truncate=False)
+------------------------------------------------------------------------------------+
|result                                                                              |
+------------------------------------------------------------------------------------+
|[B-PER, I-PER, O, O, O, B-LOC, O, O, O, B-LOC, O, O, O, O, B-PER, O, O, O, O, B-LOC]|
+------------------------------------------------------------------------------------+
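
The per-token tags above are often easier to use once consecutive tokens are merged into entities. The following plain-Python sketch does that for IOB-style tags. The `group_entities` helper is hypothetical, and the token list assumes the sentence was split on whitespace and punctuation, which may differ from the Tokenizer's real output.

```python
def group_entities(tokens, tags):
    """Merge IOB-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "I" and label == current_type:
            # Continuation of the entity that is currently open
            current_tokens.append(token)
            continue
        # Any other tag closes the open entity, if there is one
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
        if prefix in ("B", "I"):
            # An "I-" tag with no matching open entity is treated as a start
            current_tokens, current_type = [token], label
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities

tokens = ["John", "Lenon", "was", "born", "in", "London", "and", "lived", "in",
          "Paris", ".", "My", "name", "is", "Sarah", "and", "I", "live", "in", "London"]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC", "O", "O", "O", "B-LOC",
        "O", "O", "O", "O", "B-PER", "O", "O", "O", "O", "B-LOC"]

entities = group_entities(tokens, tags)
print(entities)
```

Treating a stray `I-` tag as the start of an entity also lets the same helper handle models that emit only `I-` and `O` tags.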

DistilBertEmbeddings

DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than bert-base-uncased and runs 60% faster, while preserving over 95% of BERT’s performance as measured on the GLUE language understanding benchmark.

Pretrained models can be loaded with pretrained of the companion object:

val embeddings = DistilBertEmbeddings.pretrained()
  .setInputCols("document", "token")
  .setOutputCol("embeddings")

The default model is "distilbert_base_cased", if no name is provided. For available pretrained models please see the Models Hub.

For extended examples of usage, see the Examples and the DistilBertEmbeddingsTestSpec. Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see Import Transformers into Spark NLP 🚀.

The DistilBERT model was proposed in the paper DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.

Paper Abstract:

As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pretraining phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pretraining, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.
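
The triple loss mentioned in the abstract can be sketched on toy numbers as follows. This is only an illustration of the three terms, not the authors' training code: the temperature, the equal weights, and all helper names are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with an optional temperature that softens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student against the teacher's softened distribution."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))

def mlm_loss(student_logits, target_index):
    """Language modeling term: cross-entropy against the true masked token."""
    return -math.log(softmax(student_logits)[target_index])

def cosine_loss(student_hidden, teacher_hidden):
    """1 - cosine similarity between student and teacher hidden states."""
    dot = sum(a * b for a, b in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(a * a for a in student_hidden))
    norm_t = math.sqrt(sum(b * b for b in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)

def triple_loss(student_logits, teacher_logits, target_index,
                student_hidden, teacher_hidden, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms; equal weights are an assumption."""
    w_distill, w_mlm, w_cos = weights
    return (w_distill * distillation_loss(student_logits, teacher_logits)
            + w_mlm * mlm_loss(student_logits, target_index)
            + w_cos * cosine_loss(student_hidden, teacher_hidden))

# Made-up logits over a 3-word vocabulary and 2-dimensional hidden states
teacher_logits = [3.0, 1.0, 0.2]
student_logits = [2.0, 1.5, 0.1]
loss = triple_loss(student_logits, teacher_logits, target_index=0,
                   student_hidden=[0.5, 1.0], teacher_hidden=[0.6, 0.9])
print(loss)
```

The distillation term pulls the student's output distribution toward the teacher's, while the cosine term aligns the direction of their hidden states.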

Tips:

DistilBERT doesn’t have token_type_ids, so you don’t need to indicate which token belongs to which segment. Just separate your segments with the separation token tokenizer.sep_token (or [SEP]).
DistilBERT doesn’t have options to select the input positions (position_ids input). This could be added if necessary though, just let us know if you need this option.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: WORD_EMBEDDINGS

Python API: DistilBertEmbeddings

Scala API: DistilBertEmbeddings

Source: DistilBertEmbeddings

Show Examples

This example shows how to predict classes by using the embeddings generated by the Transformer.

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# First extract the prerequisites for the NerDLModel
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# Use the transformer embeddings
embeddings = DistilBertEmbeddings\
      .pretrained('distilbert_base_cased', 'en') \
      .setInputCols(["token", "document"]) \
      .setOutputCol("embeddings")

# This pretrained model requires those specific transformer embeddings
ner_model = NerDLModel.pretrained('ner_mit_movie_complex_distilbert_base_cased', 'en') \
    .setInputCols(['document', 'token', 'embeddings']) \
    .setOutputCol('ner')

pipeline = Pipeline().setStages([
    documentAssembler,
    sentence,
    tokenizer,
    embeddings,
    ner_model
])

data = spark.createDataFrame([["The Grand Budapest Hotel is a 2014 comedy-drama film written and directed by Wes Anderson"]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.select("ner.result").show(truncate=False)
+----------------------------------------------------------------------------------------------+
|result                                                                                        |
+----------------------------------------------------------------------------------------------+
|[I-Plot, I-Plot, I-Plot, I-Plot, O, O, B-Year, B-Genre, O, O, O, O, O, B-Director, I-Director]|
+----------------------------------------------------------------------------------------------+

DistilBertForQuestionAnswering

DistilBertForQuestionAnswering can load DistilBert Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits).

Pretrained models can be loaded with pretrained of the companion object:

val spanClassifier = DistilBertForQuestionAnswering.pretrained()
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")

The default model is "distilbert_base_cased_qa_squad2", if no name is provided.

For available pretrained models please see the Models Hub.

To see which models are compatible and how to import them, see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669. For extended examples of usage, see the DistilBertForQuestionAnsweringTestSpec.

Input Annotator Types: DOCUMENT, DOCUMENT

Output Annotator Type: CHUNK

Python API: DistilBertForQuestionAnswering

Scala API: DistilBertForQuestionAnswering

Source: DistilBertForQuestionAnswering

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = DistilBertForQuestionAnswering.pretrained() \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(False)

pipeline = Pipeline().setStages([
    documentAssembler,
    spanClassifier
])

data = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)

result.select("answer.result").show(truncate=False)
+--------------------+
|result              |
+--------------------+
|[Clara]             |
+--------------------+

DistilBertForSequenceClassification

DistilBertForSequenceClassification can load DistilBERT Models with sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for document classification tasks.

For multi-class, use setActivation("softmax"). For multi-label, use setActivation("sigmoid").
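
To make the difference between the two activations concrete, here is a minimal sketch in plain Python. The labels, logits, and the 0.5 threshold are hypothetical values chosen for illustration; the annotator's own thresholding may differ.

```python
import math

def softmax(logits):
    """Scores that sum to 1: suitable when exactly one class applies."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    """Independent per-class scores: suitable when several labels may apply."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]

labels = ["sports", "politics", "tech"]  # hypothetical label set
logits = [2.0, 1.5, -1.0]                # hypothetical classifier outputs

multi_class = softmax(logits)
multi_label = sigmoid(logits)

# Multi-class: pick the single best label.
top_label = labels[multi_class.index(max(multi_class))]
# Multi-label: keep every label whose score clears the (assumed) 0.5 threshold.
active_labels = [l for l, p in zip(labels, multi_label) if p >= 0.5]
```

With softmax the scores compete with each other, so a second strong label is suppressed; with sigmoid each label is scored on its own.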

Pretrained models can be loaded with pretrained of the companion object:

val sequenceClassifier = DistilBertForSequenceClassification.pretrained()
  .setInputCols("token", "document")
  .setOutputCol("label")

The default model is "distilbert_base_sequence_classifier_imdb", if no name is provided.

For available pretrained models please see the Models Hub.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: CATEGORY

Python API: DistilBertForSequenceClassification

Scala API: DistilBertForSequenceClassification

Source: DistilBertForSequenceClassification

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

sequenceClassifier = DistilBertForSequenceClassification.pretrained() \
    .setInputCols(["token", "document"]) \
    .setOutputCol("label") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    sequenceClassifier
])
data = spark.createDataFrame([["I loved this movie when I was a child."], ["It was pretty boring."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("label.result").show(truncate=False)
+------+
|result|
+------+
|[pos] |
|[neg] |
+------+

DistilBertForTokenClassification

DistilBertForTokenClassification can load DistilBERT Models with a token classification head on top (a linear layer on top of the hidden-states output), e.g. for Named-Entity-Recognition (NER) tasks.

Pretrained models can be loaded with pretrained of the companion object:

val labels = DistilBertForTokenClassification.pretrained()
  .setInputCols("token", "document")
  .setOutputCol("label")

The default model is "distilbert_base_token_classifier_conll03", if no name is provided.

For available pretrained models please see the Models Hub.

For extended examples of usage, see the DistilBertForTokenClassificationTestSpec. Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see Import Transformers into Spark NLP 🚀.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: NAMED_ENTITY

Python API: DistilBertForTokenClassification

Scala API: DistilBertForTokenClassification

Source: DistilBertForTokenClassification

Show Examples

This example shows how to tag named entities directly with the token classification head, using the default pretrained model.

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Token classifier loaded with the default pretrained model
tokenClassifier = DistilBertForTokenClassification.pretrained() \
    .setInputCols(["token", "document"]) \
    .setOutputCol("label") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    tokenClassifier
])

data = spark.createDataFrame([["John Lennon was born in London and lived in Paris. My name is Sarah and I live in London"]]).toDF("text")
result = pipeline.fit(data).transform(data)

# One predicted tag per token
result.select("label.result").show(truncate=False)

DistilBertForZeroShotClassification

DistilBertForZeroShotClassification uses a ModelForSequenceClassification trained on NLI (natural language inference) tasks. It is the equivalent of DistilBertForSequenceClassification models, but these models don't require a hardcoded number of potential classes; the classes can be chosen at runtime. This usually makes it slower, but much more flexible.

Note that the model will loop through all provided labels. So the more labels you have, the longer this process will take.

Any combination of sequences and labels can be passed and each combination will be posed as a premise/hypothesis pair and passed to the pretrained model.
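
The premise/hypothesis loop described above can be sketched as follows. Everything here is illustrative: `entailment_score` is a toy word-overlap stand-in for a real NLI model, and the hypothesis template is an assumption, not necessarily the one Spark NLP uses.

```python
def entailment_score(premise, hypothesis):
    """Toy stand-in for an NLI model: fraction of hypothesis words found in the premise."""
    premise_words = set(premise.lower().replace(".", "").split())
    hypothesis_words = hypothesis.lower().replace(".", "").split()
    return sum(w in premise_words for w in hypothesis_words) / len(hypothesis_words)

def zero_shot_classify(text, candidate_labels, template="This text is about {}."):
    """Score each label by posing (text, templated label) as a premise/hypothesis pair."""
    scores = {}
    for label in candidate_labels:  # one model call per label
        hypothesis = template.format(label)
        scores[label] = entailment_score(text, hypothesis)
    best = max(scores, key=scores.get)
    return best, scores

label, scores = zero_shot_classify(
    "I loved this movie when I was a child.",
    ["movie", "sport", "weather"],
)
print(label, scores)
```

Because the scorer runs once per candidate label, the cost grows linearly with the number of labels, which is why long label lists slow the annotator down.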

Pretrained models can be loaded with pretrained of the companion object:

val sequenceClassifier = DistilBertForZeroShotClassification.pretrained()
  .setInputCols("token", "document")
  .setOutputCol("label")

The default model is "distilbert_base_zero_shot_classifier_uncased_mnli", if no name is provided.

For available pretrained models please see the Models Hub.

To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: CATEGORY

Python API: DistilBertForZeroShotClassification

Scala API: DistilBertForZeroShotClassification

Source: DistilBertForZeroShotClassification

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

sequenceClassifier = DistilBertForZeroShotClassification.pretrained() \
    .setInputCols(["token", "document"]) \
    .setOutputCol("label") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    sequenceClassifier
])

data = spark.createDataFrame([["I loved this movie when I was a child."], ["It was pretty boring."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("label.result").show(truncate=False)
+------+
|result|
+------+
|[pos] |
|[neg] |
+------+

E5Embeddings

Sentence embeddings using E5.

E5 is a general-purpose text embedding model, trained in a contrastive manner with weak supervision, that can be used for any task requiring a single-vector representation of text (e.g., classification, retrieval, clustering).

Note that this annotator is only supported for Spark Versions 3.4 and up.

Pretrained models can be loaded with pretrained of the companion object:

val embeddings = E5Embeddings.pretrained()
  .setInputCols("document")
  .setOutputCol("e5_embeddings")

The default model is "e5_small", if no name is provided.

For available pretrained models please see the Models Hub.

For extended examples of usage, see E5EmbeddingsTestSpec.

Sources:

Text Embeddings by Weakly-Supervised Contrastive Pre-training

E5 Github Repository

Paper abstract:

This paper presents E5, a family of state-of-the-art text embeddings that transfer well to a wide range of tasks. The model is trained in a contrastive manner with weak supervision signals from our curated large-scale text pair dataset (called CCPairs). E5 can be readily used as a general-purpose embedding model for any tasks requiring a single-vector representation of texts such as retrieval, clustering, and classification, achieving strong performance in both zero-shot and fine-tuned settings. We conduct extensive evaluations on 56 datasets from the BEIR and MTEB benchmarks. For zero-shot settings, E5 is the first model that outperforms the strong BM25 baseline on the BEIR retrieval benchmark without using any labeled data. When fine-tuned, E5 obtains the best results on the MTEB benchmark, beating existing embedding models with 40× more parameters.

Input Annotator Types: DOCUMENT

Output Annotator Type: SENTENCE_EMBEDDINGS

Python API: E5Embeddings

Scala API: E5Embeddings

Source: E5Embeddings

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
embeddings = E5Embeddings.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("e5_embeddings")
embeddingsFinisher = EmbeddingsFinisher() \
    .setInputCols(["e5_embeddings"]) \
    .setOutputCols("finished_embeddings") \
    .setOutputAsVector(True)
pipeline = Pipeline().setStages([
    documentAssembler,
    embeddings,
    embeddingsFinisher
])

# One row per text, so that the single "text" column matches the data
data = spark.createDataFrame([
    ["query: how much protein should a female eat"],
    ["passage: As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. " +
     "But, as you can see from this chart, you'll need to increase that if you're expecting or training for a " +
     "marathon. Check out the chart below to see how much protein you should be eating each day."],
]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[[8.0190285E-4, -0.005974853, -0.072875895, 0.007944068, 0.026059335, -0.0080...|
|[[0.050514214, 0.010061974, -0.04340176, -0.020937217, 0.05170225, 0.01157857...|
+--------------------------------------------------------------------------------+
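
The example above embeds one "query: " text and one "passage: " text. A common next step in retrieval is to rank passages by cosine similarity to the query. The sketch below assumes the embeddings have already been collected into plain Python lists; the helper names and the toy 3-dimensional vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_passages(query_vec, passage_vecs):
    """Return passage indices ordered from most to least similar to the query."""
    scored = [(cosine_similarity(query_vec, v), idx) for idx, v in enumerate(passage_vecs)]
    return [idx for _, idx in sorted(scored, reverse=True)]

# Toy vectors standing in for real sentence embeddings.
query_vec = [0.1, 0.9, 0.2]
passage_vecs = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.3],
    [-0.1, -0.9, -0.2],
]
ranking = rank_passages(query_vec, passage_vecs)
print(ranking)
```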

ElmoEmbeddings

Word embeddings from ELMo (Embeddings from Language Models), a language model trained on the 1 Billion Word Benchmark.

Note that this is a very computationally expensive module compared to word embedding modules that only perform embedding lookups. The use of an accelerator is recommended.

Pretrained models can be loaded with pretrained of the companion object:

val embeddings = ElmoEmbeddings.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("elmo_embeddings")

# Offline - Download the pretrained model manually and extract it
elmo = ElmoEmbeddings.load("/elmo_en_2.4.0_2.4_1580488815299") \
        .setInputCols("sentence", "token") \
        .setOutputCol("elmo")

The default model is "elmo", if no name is provided.

For available pretrained models please see the Models Hub.

The pooling layer can be set with setPoolingLayer to the following values:

"word_emb": the character-based word representations with shape [batch_size, max_length, 512].
"lstm_outputs1": the first LSTM hidden state with shape [batch_size, max_length, 1024].
"lstm_outputs2": the second LSTM hidden state with shape [batch_size, max_length, 1024].
"elmo": the weighted sum of the 3 layers, where the weights are trainable. This tensor has shape [batch_size, max_length, 1024].

For extended examples of usage, see the Examples and the ElmoEmbeddingsTestSpec.
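
To illustrate what the "elmo" pooling option listed above computes, the following sketch mixes three layer vectors for a single token using softmax-normalized scalar weights and a scale factor, as described in the ELMo paper. The function name `scalar_mix` and the toy 4-dimensional vectors are assumptions; real layers are much larger, and how the 512-dimensional word_emb is aligned with the 1024-dimensional LSTM states is not covered here.

```python
import math

def scalar_mix(layers, raw_weights, gamma=1.0):
    """Combine per-token layer vectors with softmax-normalized scalar weights."""
    m = max(raw_weights)
    exps = [math.exp(w - m) for w in raw_weights]
    total = sum(exps)
    norm = [e / total for e in exps]
    dim = len(layers[0])
    # Weighted sum over layers, computed independently for each dimension.
    return [gamma * sum(s * layer[d] for s, layer in zip(norm, layers)) for d in range(dim)]

# Toy vectors for one token (hypothetical values, equal sizes for simplicity).
word_emb = [1.0, 0.0, 2.0, 1.0]
lstm_outputs1 = [0.0, 3.0, 1.0, 1.0]
lstm_outputs2 = [2.0, 0.0, 0.0, 1.0]

# Equal raw weights give a plain average of the three layers.
mixed = scalar_mix([word_emb, lstm_outputs1, lstm_outputs2], raw_weights=[0.0, 0.0, 0.0])
print(mixed)
```

In a trained model the raw weights and the scale factor are learned, so a downstream task can emphasize whichever layer is most useful.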

Sources:

https://tfhub.dev/google/elmo/3

Deep contextualized word representations

Paper abstract:

We introduce a new type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). Our word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. We show that these representations can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment and sentiment analysis. We also present an analysis showing that exposing the deep internals of the pre-trained network is crucial, allowing downstream models to mix different types of semi-supervision signals.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: WORD_EMBEDDINGS

Python API: ElmoEmbeddings

Scala API: ElmoEmbeddings

Source: ElmoEmbeddings

Show Examples

This example shows how to predict classes by using the embeddings generated by the Transformer.

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# First extract the prerequisites for the NerDLModel
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Use the transformer embeddings
embeddings = ElmoEmbeddings.pretrained() \
    .setInputCols(['document', 'token']) \
    .setOutputCol('embeddings')

# This pretrained model requires those specific transformer embeddings
ner_model = NerDLModel.pretrained("ner_conll_elmo", "en") \
    .setInputCols(["document", "token", "embeddings"]) \
    .setOutputCol("ner")

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    embeddings,
    ner_model
])

data = spark.createDataFrame([["U.N. official Ekeus heads for Baghdad."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.select("ner.result").show(truncate=False)
+------------------------------------+
|result                              |
+------------------------------------+
|[I-LOC, O, O, I-PER, O, O, I-LOC, O]|
+------------------------------------+

GPT2Transformer

GPT-2: the OpenAI Text-To-Text Transformer

GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data.

GPT-2 displays a broad set of capabilities, including the ability to generate conditional synthetic text samples of unprecedented quality, where we prime the model with an input and have it generate a lengthy continuation. In addition, GPT-2 outperforms other language models trained on specific domains (like Wikipedia, news, or books) without needing to use these domain-specific training datasets. On language tasks like question answering, reading comprehension, summarization, and translation, GPT-2 begins to learn these tasks from the raw text, using no task-specific training data. While scores on these downstream tasks are far from state-of-the-art, they suggest that the tasks can benefit from unsupervised techniques, given sufficient (unlabeled) data and compute.
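
The next-word objective described above can be pictured with a toy greedy generation loop. This is a simplified sketch: the lookup table `toy_model` is hypothetical and conditions only on the last token, whereas GPT-2 conditions on all previous tokens; `max_output_length` here merely mirrors the idea of a length limit.

```python
def generate(prompt_tokens, next_token_probs, max_output_length=8, eos="<eos>"):
    """Greedily append the most probable next token until EOS or the length limit."""
    tokens = list(prompt_tokens)
    while len(tokens) < max_output_length:
        dist = next_token_probs.get(tokens[-1])
        if not dist:
            break  # the toy model knows no continuation
        next_token = max(dist, key=dist.get)
        if next_token == eos:
            break
        tokens.append(next_token)
    return tokens

# Hypothetical next-token probabilities keyed by the previous token.
toy_model = {
    "My": {"name": 0.9, "dog": 0.1},
    "name": {"is": 0.95, "was": 0.05},
    "is": {"Leonardo": 0.6, "unknown": 0.4},
    "Leonardo": {".": 0.8, "da": 0.2},
    ".": {"<eos>": 1.0},
}

out = generate(["My"], toy_model)
print(" ".join(out))
```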

Pretrained models can be loaded with pretrained of the companion object:

val gpt2 = GPT2Transformer.pretrained()
  .setInputCols("document")
  .setOutputCol("generation")

The default model is "gpt2", if no name is provided. For available pretrained models please see the Models Hub.

For extended examples of usage, see GPT2TestSpec.

Sources:

Language Models are Unsupervised Multitask Learners

https://github.com/openai/gpt-2

Paper Abstract:

Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets. We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach 55 F1 on the CoQA dataset - matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples. The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

Note:

This is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

Input Annotator Types: DOCUMENT

Output Annotator Type: DOCUMENT

Python API: GPT2Transformer

Scala API: GPT2Transformer

Source: GPT2Transformer

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")

gpt2 = GPT2Transformer.pretrained("gpt2") \
    .setInputCols(["documents"]) \
    .setMaxOutputLength(50) \
    .setOutputCol("generation")

pipeline = Pipeline().setStages([documentAssembler, gpt2])
data = spark.createDataFrame([["My name is Leonardo."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("generation.result").show(truncate=False)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                              |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[My name is Leonardo. I am a man of letters. I have been a man for many years. I was born in the year 1776. I came to the United States in 1776, and I have lived in the United Kingdom since 1776.]|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

HubertForCTC

Hubert Model with a language modeling head on top for Connectionist Temporal Classification (CTC). Hubert was proposed in HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units by Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed.
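
For readers unfamiliar with CTC, the sketch below shows greedy CTC decoding: take the most likely symbol per audio frame, collapse consecutive repeats, then drop the blank symbol. The vocabulary, frame predictions, and function name are hypothetical; this illustrates the general technique rather than the annotator's internal decoder.

```python
from itertools import groupby

def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame predictions, then remove CTC blanks."""
    collapsed = [key for key, _ in groupby(frame_ids)]
    return "".join(id_to_char[i] for i in collapsed if i != blank_id)

# Hypothetical vocabulary and per-frame argmax predictions.
vocab = {0: "<blank>", 1: "H", 2: "E", 3: "L", 4: "O"}
frames = [1, 1, 0, 2, 2, 3, 0, 3, 3, 4, 0, 0]

text = ctc_greedy_decode(frames, vocab)
print(text)
```

The blank between the two runs of "L" is what lets CTC emit a doubled letter; without it the repeats would collapse into one.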

The annotator takes audio files and transcribes them as text. The audio needs to be provided pre-processed as an array of floats.
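
One possible way to obtain such an array of floats, using only the Python standard library, is sketched below. It assumes 16-bit mono PCM WAV input and normalizes samples to the range [-1.0, 1.0); the helper name `wav_to_floats`, the 16 kHz sample rate, and the normalization are assumptions, so check the requirements of the model you load.

```python
import io
import struct
import wave

def wav_to_floats(wav_file):
    """Read a 16-bit mono PCM WAV file and return its samples as floats in [-1.0, 1.0)."""
    with wave.open(wav_file, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("expected 16-bit mono PCM audio")
        frames = wav.readframes(wav.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [s / 32768.0 for s in samples]

# Build a tiny in-memory WAV so the sketch runs without an audio file on disk.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(16000)
    out.writeframes(struct.pack("<4h", 0, 16384, -16384, 32767))
buffer.seek(0)

raw_floats = wav_to_floats(buffer)
print(raw_floats)
```

In practice the same function can be given a file path instead of the in-memory buffer.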

Note that this annotator is currently not supported on Apple Silicon processors such as the M1. This is due to the processor not supporting instructions for XLA.

Pretrained models can be loaded with pretrained of the companion object:

val speechToText = HubertForCTC.pretrained()
  .setInputCols("audio_assembler")
  .setOutputCol("text")

The default model is "asr_hubert_large_ls960", if no name is provided.

For available pretrained models please see the Models Hub.

To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see HubertForCTCTestSpec.

References:

HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units

Paper Abstract:

Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training phase, and (3) sound units have variable lengths with no explicit segmentation. To deal with these three problems, we propose the Hidden-Unit BERT (HuBERT) approach for self-supervised speech representation learning, which utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss. A key ingredient of our approach is applying the prediction loss over the masked regions only, which forces the model to learn a combined acoustic and language model over the continuous inputs. HuBERT relies primarily on the consistency of the unsupervised clustering step rather than the intrinsic quality of the assigned cluster labels. Starting with a simple k-means teacher of 100 clusters, and using two iterations of clustering, the HuBERT model either matches or improves upon the state-of-the-art wav2vec 2.0 performance on the Librispeech (960h) and Libri-light (60,000h) benchmarks with 10min, 1h, 10h, 100h, and 960h fine-tuning subsets. Using a 1B parameter model, HuBERT shows up to 19% and 13% relative WER reduction on the more challenging dev-other and test-other evaluation subsets.

Input Annotator Types: AUDIO

Output Annotator Type: DOCUMENT

Python API: HubertForCTC

Scala API: HubertForCTC

Source: HubertForCTC

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

audioAssembler = AudioAssembler() \
    .setInputCol("audio_content") \
    .setOutputCol("audio_assembler")

speechToText = HubertForCTC \
    .pretrained() \
    .setInputCols(["audio_assembler"]) \
    .setOutputCol("text")

pipeline = Pipeline().setStages([audioAssembler, speechToText])
# rawFloats is assumed to hold the pre-processed audio, loaded beforehand as a list of floats
processedAudioFloats = spark.createDataFrame([[rawFloats]]).toDF("audio_content")

result = pipeline.fit(processedAudioFloats).transform(processedAudioFloats)

result.select("text.result").show(truncate = False)
+------------------------------------------------------------------------------------------+
|result                                                                                    |
+------------------------------------------------------------------------------------------+
|[MISTER QUILTER IS THE APOSTLE OF THE MIDLE CLASES AND WE ARE GLAD TO WELCOME HIS GOSPEL ]|
+------------------------------------------------------------------------------------------+

InstructorEmbeddings

Sentence embeddings using INSTRUCTOR.

Instructor👨‍🏫 is an instruction-finetuned text embedding model that can generate text embeddings tailored to any task (e.g., classification, retrieval, clustering, text evaluation, etc.) and domain (e.g., science, finance, etc.) by simply providing the task instruction, without any finetuning. Instructor👨‍🏫 achieves state-of-the-art results on 70 diverse embedding tasks.

Pretrained models can be loaded with pretrained of the companion object:

val embeddings = InstructorEmbeddings.pretrained()
  .setInputCols("document")
  .setOutputCol("instructor_embeddings")

The default model is "instructor_base", if no name is provided.

For available pretrained models please see the Models Hub.

For extended examples of usage, see InstructorEmbeddingsTestSpec.

Sources:

One Embedder, Any Task: Instruction-Finetuned Text Embeddings

INSTRUCTOR Github Repository

Paper abstract:

We introduce INSTRUCTOR, a new method for computing text embeddings given task instructions: every text input is embedded together with instructions explaining the use case (e.g., task and domain descriptions). Unlike encoders from prior work that are more specialized, INSTRUCTOR is a single embedder that can generate text embeddings tailored to different downstream tasks and domains, without any further training. We first annotate instructions for 330 diverse tasks and train INSTRUCTOR on this multitask mixture with a contrastive loss. We evaluate INSTRUCTOR on 70 embedding evaluation tasks (66 of which are unseen during training), ranging from classification and information retrieval to semantic textual similarity and text generation evaluation. INSTRUCTOR, while having an order of magnitude fewer parameters than the previous best model, achieves state-of-the-art performance, with an average improvement of 3.4% compared to the previous best results on the 70 diverse datasets. Our analysis suggests that INSTRUCTOR is robust to changes in instructions, and that instruction finetuning mitigates the challenge of training a single model on diverse datasets. Our model, code, and data are available at https://instructor-embedding.github.io/.

Input Annotator Types: DOCUMENT

Output Annotator Type: SENTENCE_EMBEDDINGS

Python API: InstructorEmbeddings

Scala API: InstructorEmbeddings

Source: InstructorEmbeddings

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
embeddings = InstructorEmbeddings.pretrained() \
    .setInputCols(["document"]) \
    .setInstruction("Represent the Medicine sentence for clustering: ") \
    .setOutputCol("instructor_embeddings")
embeddingsFinisher = EmbeddingsFinisher() \
    .setInputCols(["instructor_embeddings"]) \
    .setOutputCols("finished_embeddings") \
    .setOutputAsVector(True)
pipeline = Pipeline().setStages([
    documentAssembler,
    embeddings,
    embeddingsFinisher
])

data = spark.createDataFrame([["Dynamical Scalar Degree of Freedom in Horava-Lifshitz Gravity"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[-2.3497989177703857,0.480538547039032,-0.3238905668258667,-1.612930893898010...|
+--------------------------------------------------------------------------------+

LongformerEmbeddings

Longformer is a transformer model for long documents. The Longformer model was presented in Longformer: The Long-Document Transformer by Iz Beltagy, Matthew E. Peters, Arman Cohan. longformer-base-4096 is a BERT-like model started from the RoBERTa checkpoint and pretrained for MLM on long documents. It supports sequences of length up to 4,096.

Pretrained models can be loaded with pretrained of the companion object:

val embeddings = LongformerEmbeddings.pretrained()
  .setInputCols("document", "token")
  .setOutputCol("embeddings")

The default model is "longformer_base_4096", if no name is provided. For available pretrained models please see the Models Hub.

For some examples of usage, see LongformerEmbeddingsTestSpec. Models from the HuggingFace 🤗 Transformers library are compatible with Spark NLP 🚀. To see which models are compatible and how to import them see Import Transformers into Spark NLP 🚀.

Paper Abstract:

Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. Longformer’s attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA. We finally introduce the Longformer-Encoder-Decoder (LED), a Longformer variant for supporting long document generative sequence-to-sequence tasks, and demonstrate its effectiveness on the arXiv summarization dataset.

The original code can be found here https://github.com/allenai/longformer.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: WORD_EMBEDDINGS

Python API: LongformerEmbeddings

Scala API: LongformerEmbeddings

Source: LongformerEmbeddings

Show Examples

This example shows how to predict classes by using the embeddings generated by the Transformer.

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# First extract the prerequisites for the NerDLModel
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Use the transformer embeddings
embeddings = LongformerEmbeddings \
      .pretrained("longformer_large_4096") \
      .setInputCols(['document', 'token']) \
      .setOutputCol("embeddings") \
      .setCaseSensitive(True) \
      .setMaxSentenceLength(4096)

# This pretrained model requires those specific transformer embeddings
ner_model = NerDLModel.pretrained('ner_conll_longformer_large_4096', 'en') \
    .setInputCols(['document', 'token', 'embeddings']) \
    .setOutputCol('ner')

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    embeddings,
    ner_model
])

data = spark.createDataFrame([["U.N. official Ekeus heads for Baghdad."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.select("ner.result").show(truncate=False)
+------------------------------------+
|result                              |
+------------------------------------+
|[B-ORG, O, O, B-PER, O, O, B-LOC, O]|
+------------------------------------+

LongformerForQuestionAnswering

LongformerForQuestionAnswering can load Longformer Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits).
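The span head described above yields one start logit and one end logit per context token. A minimal sketch of how an answer span could be chosen from them is shown here; the logits, the token list and the `max_answer_len` cap are hypothetical values for illustration, and the annotator's internal decoding may differ.

```python
def best_answer_span(start_logits, end_logits, max_answer_len=15):
    """Pick the (start, end) token indices with the highest combined score."""
    best = None
    best_score = float("-inf")
    for start, s_score in enumerate(start_logits):
        # The end index may not precede the start, and the span length is capped.
        last = min(len(end_logits), start + max_answer_len)
        for end in range(start, last):
            score = s_score + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
# Hypothetical logits, one value per context token.
start_logits = [0.1, 0.2, 0.0, 5.0, 0.1, 0.3, 0.1, 0.2, 1.5, 0.0]
end_logits = [0.0, 0.1, 0.2, 4.5, 0.1, 0.0, 0.2, 0.1, 2.0, 0.3]

start, end = best_answer_span(start_logits, end_logits)
answer = " ".join(tokens[start:end + 1])
print(answer)
```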

Pretrained models can be loaded with pretrained of the companion object:

val spanClassifier = LongformerForQuestionAnswering.pretrained()
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")

The default model is "longformer_base_base_qa_squad2", if no name is provided.

For available pretrained models please see the Models Hub.

To see which models are compatible and how to import them, see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669. For extended examples of usage, see the LongformerForQuestionAnsweringTestSpec.

Input Annotator Types: DOCUMENT, DOCUMENT

Output Annotator Type: CHUNK

Python API: LongformerForQuestionAnswering

Scala API: LongformerForQuestionAnswering

Source: LongformerForQuestionAnswering

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = LongformerForQuestionAnswering.pretrained() \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(False)

pipeline = Pipeline().setStages([
    documentAssembler,
    spanClassifier
])

data = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)

result.select("answer.result").show(truncate=False)
+--------------------+
|result              |
+--------------------+
|[Clara]             |
+--------------------+

LongformerForSequenceClassification

LongformerForSequenceClassification can load Longformer Models with sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for document classification tasks.

For multi-class, use setActivation("softmax"). For multi-label, use setActivation("sigmoid").
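The difference between the two activations can be sketched in plain Python. The label names, the logits and the 0.5 threshold are assumptions chosen for illustration; they are not values produced by any pretrained model.

```python
import math


def softmax(logits):
    """Normalise logits into probabilities that sum to 1 (one winner)."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Score each label independently in (0, 1) (several labels may fire)."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


labels = ["pos", "neg", "neutral"]  # hypothetical label set
logits = [2.0, 1.5, -1.0]           # hypothetical classifier outputs

# Multi-class: probabilities compete, so exactly one label is picked.
multi_class = softmax(logits)
best_label = labels[multi_class.index(max(multi_class))]

# Multi-label: every label above an assumed 0.5 threshold is kept.
multi_label = sigmoid(logits)
selected = [label for label, p in zip(labels, multi_label) if p >= 0.5]

print(best_label, selected)
```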

Pretrained models can be loaded with pretrained of the companion object:

val sequenceClassifier = LongformerForSequenceClassification.pretrained()
  .setInputCols("token", "document")
  .setOutputCol("label")

The default model is "longformer_base_sequence_classifier_imdb", if no name is provided.

For available pretrained models please see the Models Hub.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: CATEGORY

Python API: LongformerForSequenceClassification

Scala API: LongformerForSequenceClassification

Source: LongformerForSequenceClassification

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

sequenceClassifier = LongformerForSequenceClassification.pretrained() \
    .setInputCols(["token", "document"]) \
    .setOutputCol("label") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    sequenceClassifier
])

data = spark.createDataFrame([["I loved this movie when I was a child."], ["It was pretty boring."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("label.result").show(truncate=False)
+------+
|result|
+------+
|[pos] |
|[neg] |
+------+

LongformerForTokenClassification

LongformerForTokenClassification can load Longformer Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.

Pretrained models can be loaded with pretrained of the companion object:

val tokenClassifier = LongformerForTokenClassification.pretrained()
  .setInputCols("token", "document")
  .setOutputCol("label")

The default model is "longformer_base_token_classifier_conll03", if no name is provided.

For available pretrained models please see the Models Hub.

For extended examples of usage, see the LongformerForTokenClassificationTestSpec. Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see Import Transformers into Spark NLP 🚀.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: NAMED_ENTITY

Python API: LongformerForTokenClassification

Scala API: LongformerForTokenClassification

Source: LongformerForTokenClassification

Show Examples

This example shows how to predict a named-entity label for each token with the pretrained token classifier.

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

tokenClassifier = LongformerForTokenClassification.pretrained() \
    .setInputCols(["token", "document"]) \
    .setOutputCol("label") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    tokenClassifier
])

data = spark.createDataFrame([["John Lenon was born in London and lived in Paris. My name is Sarah and I live in London"]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.select("label.result").show(truncate=False)
+------------------------------------------------------------------------------------+
|result                                                                              |
+------------------------------------------------------------------------------------+
|[B-PER, I-PER, O, O, O, B-LOC, O, O, O, B-LOC, O, O, O, O, B-PER, O, O, O, O, B-LOC]|
+------------------------------------------------------------------------------------+
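The per-token tags above follow the B-/I-/O scheme, so consecutive tokens usually need to be merged into entities. Below is a minimal sketch of that grouping, assuming the sentence is tokenized into the 20 tokens listed (with the period after "Paris" as its own token). The helper `group_entities` is hypothetical; within a pipeline, Spark NLP's NerConverter annotator serves a similar purpose.

```python
def group_entities(tokens, tags):
    """Merge B-/I- tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close the one in progress, if any.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the current entity.
            current_tokens.append(token)
        else:
            # "O" (or an I- tag that does not match) ends the current entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


tokens = ["John", "Lenon", "was", "born", "in", "London", "and", "lived", "in", "Paris", ".",
          "My", "name", "is", "Sarah", "and", "I", "live", "in", "London"]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC", "O", "O", "O", "B-LOC", "O",
        "O", "O", "O", "B-PER", "O", "O", "O", "O", "B-LOC"]

entities = group_entities(tokens, tags)
for text, entity_type in entities:
    print(entity_type, text)
```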

MPNetEmbeddings

Sentence embeddings using MPNet.

The MPNet model was proposed in MPNet: Masked and Permuted Pre-training for Language Understanding by Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, Tie-Yan Liu. MPNet adopts a novel pre-training method, named masked and permuted language modeling, to inherit the advantages of masked language modeling and permuted language modeling for natural language understanding.

Note that this annotator is only supported for Spark Versions 3.4 and up.

Pretrained models can be loaded with pretrained of the companion object:

val embeddings = MPNetEmbeddings.pretrained()
  .setInputCols("document")
  .setOutputCol("mpnet_embeddings")

The default model is "all_mpnet_base_v2", if no name is provided.

For available pretrained models please see the Models Hub.

For extended examples of usage, see MPNetEmbeddingsTestSpec.

Sources:

MPNet: Masked and Permuted Pre-training for Language Understanding

MPNet Github Repository

Paper abstract:

BERT adopts masked language modeling (MLM) for pre-training and is one of the most successful pre-training models. Since BERT neglects dependency among predicted tokens, XLNet introduces permuted language modeling (PLM) for pre-training to address this problem. However, XLNet does not leverage the full position information of a sentence and thus suffers from position discrepancy between pre-training and fine-tuning. In this paper, we propose MPNet, a novel pre-training method that inherits the advantages of BERT and XLNet and avoids their limitations. MPNet leverages the dependency among predicted tokens through permuted language modeling (vs. MLM in BERT), and takes auxiliary position information as input to make the model see a full sentence and thus reducing the position discrepancy (vs. PLM in XLNet). We pre-train MPNet on a large-scale dataset (over 160GB text corpora) and fine-tune on a variety of down-streaming tasks (GLUE, SQuAD, etc). Experimental results show that MPNet outperforms MLM and PLM by a large margin, and achieves better results on these tasks compared with previous state-of-the-art pre-trained methods (e.g., BERT, XLNet, RoBERTa) under the same model setting.

Input Annotator Types: DOCUMENT

Output Annotator Type: SENTENCE_EMBEDDINGS

Python API: MPNetEmbeddings

Scala API: MPNetEmbeddings

Source: MPNetEmbeddings

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
embeddings = MPNetEmbeddings.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("mpnet_embeddings")
embeddingsFinisher = EmbeddingsFinisher() \
    .setInputCols(["mpnet_embeddings"]) \
    .setOutputCols("finished_embeddings") \
    .setOutputAsVector(True)
pipeline = Pipeline().setStages([
    documentAssembler,
    embeddings,
    embeddingsFinisher
])

data = spark.createDataFrame([["This is an example sentence"], ["Each sentence is converted"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[[0.022502584, -0.078291744, -0.023030775, -0.0051000593, -0.080340415, 0.039...|
|[[0.041702367, 0.0010974605, -0.015534201, 0.07092203, -0.0017729357, 0.04661...|
+--------------------------------------------------------------------------------+
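Sentence embeddings such as the ones above are commonly compared with cosine similarity. The sketch below uses short, made-up 4-dimensional vectors rather than real model output, which has many more dimensions. The helper name is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical, shortened embedding vectors for illustration only.
v1 = [0.02, -0.08, -0.02, 0.04]
v2 = [0.04, 0.00, -0.02, 0.07]
v3 = [-0.02, 0.08, 0.02, -0.04]  # points in the opposite direction of v1

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
```

Values close to 1 indicate vectors pointing in nearly the same direction, values close to -1 the opposite direction.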

MPNetForQuestionAnswering

MPNetForQuestionAnswering can load MPNet Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits).

Pretrained models can be loaded with pretrained of the companion object:

val spanClassifier = MPNetForQuestionAnswering.pretrained()
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")

The default model is "mpnet_base_question_answering_squad2", if no name is provided.

For available pretrained models please see the Models Hub.

To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see MPNetForQuestionAnsweringTestSpec.

Input Annotator Types: DOCUMENT, DOCUMENT

Output Annotator Type: CHUNK

Python API: MPNetForQuestionAnswering

Scala API: MPNetForQuestionAnswering

Source: MPNetForQuestionAnswering

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = MPNetForQuestionAnswering.pretrained() \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(False)

pipeline = Pipeline().setStages([
    documentAssembler,
    spanClassifier
])

data = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")
result = pipeline.fit(data).transform(data)
result.select("answer.result").show(truncate=False)
+---------------------+
|result               |
+---------------------+
|[Clara]              |
+---------------------+

MPNetForSequenceClassification

MPNetForSequenceClassification can load MPNet Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks.

Note that currently, only SetFit models can be imported.

Pretrained models can be loaded with pretrained of the companion object:

val sequenceClassifier = MPNetForSequenceClassification.pretrained()
  .setInputCols("token", "document")
  .setOutputCol("label")

The default model is "mpnet_sequence_classifier_ukr_message", if no name is provided.

For available pretrained models please see the Models Hub.

To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see MPNetForSequenceClassificationTestSpec.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: CATEGORY

Python API: MPNetForSequenceClassification

Scala API: MPNetForSequenceClassification

Source: MPNetForSequenceClassification

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

document = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

sequenceClassifier = MPNetForSequenceClassification \
    .pretrained() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("label")

data = spark.createDataFrame([
    ["I love driving my car."],
    ["The next bus will arrive in 20 minutes."],
    ["pineapple on pizza is the worst 🤮"],
]).toDF("text")

pipeline = Pipeline().setStages([document, tokenizer, sequenceClassifier])
pipelineModel = pipeline.fit(data)
results = pipelineModel.transform(data)
results.select("label.result").show()
+--------------------+
|              result|
+--------------------+
|     [TRANSPORT/CAR]|
|[TRANSPORT/MOVEMENT]|
|              [FOOD]|
+--------------------+

MarianTransformer

MarianTransformer: Fast Neural Machine Translation

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. MarianTransformer uses the models trained by MarianNMT.

It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is being deployed by many companies, organizations and research projects.

Note that this model only supports inputs up to 512 tokens. If you are working with longer inputs, consider splitting them first. For example, you can use the SentenceDetectorDL annotator to split longer texts into sentences.
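As a rough illustration of the splitting advice, the sketch below greedily packs consecutive sentences into chunks that stay under a token budget. It counts whitespace-separated words, which is only an approximation: the model's own subword tokenizer will generally produce a different, often larger, count. The helper `pack_sentences` is hypothetical and not part of Spark NLP.

```python
def pack_sentences(sentences, max_tokens=512):
    """Greedily group consecutive sentences so each group stays within max_tokens."""
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())  # whitespace word count, an approximation
        if current and current_len + n > max_tokens:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(" ".join(current))
            current, current_len = [], 0
        # A single sentence longer than max_tokens still becomes its own chunk.
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


sentences = ["What is the capital of France?", "We should know this in french."]
print(pack_sentences(sentences, max_tokens=8))
print(pack_sentences(sentences, max_tokens=512))
```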

Pretrained models can be loaded with pretrained of the companion object:

val marian = MarianTransformer.pretrained()
  .setInputCols("sentence")
  .setOutputCol("translation")

The default model is "opus_mt_en_fr", default language is "xx" (meaning multi-lingual), if no values are provided. For available pretrained models please see the Models Hub.

For extended examples of usage, see the Examples and the MarianTransformerTestSpec.

Sources:

MarianNMT at GitHub

Marian: Fast Neural Machine Translation in C++

Paper Abstract:

We present Marian, an efficient and self-contained Neural Machine Translation framework with an integrated automatic differentiation engine based on dynamic computation graphs. Marian is written entirely in C++. We describe the design of the encoder-decoder framework and demonstrate that a research-friendly toolkit can achieve high training and translation speed.

Note:

This is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

Input Annotator Types: DOCUMENT

Output Annotator Type: DOCUMENT

Python API: MarianTransformer

Scala API: MarianTransformer

Source: MarianTransformer

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
    .setInputCols("document") \
    .setOutputCol("sentence")

marian = MarianTransformer.pretrained() \
    .setInputCols("sentence") \
    .setOutputCol("translation") \
    .setMaxInputLength(30)

pipeline = Pipeline() \
    .setStages([
      documentAssembler,
      sentence,
      marian
    ])

data = spark.createDataFrame([["What is the capital of France? We should know this in french."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("explode(translation.result) as result").show(truncate=False)
+-------------------------------------+
|result                               |
+-------------------------------------+
|Quelle est la capitale de la France ?|
|On devrait le savoir en français.    |
+-------------------------------------+

OpenAICompletion

Transformer that makes a request to the OpenAI Completion API for each executor.

See the OpenAI API Doc for reference.

Input Annotator Types: DOCUMENT

Output Annotator Type: DOCUMENT

Python API: OpenAICompletion

Scala API: OpenAICompletion

Source: OpenAICompletion

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.common import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
openai_completion = OpenAICompletion() \
    .setInputCols("document") \
    .setOutputCol("completion") \
    .setModel("text-davinci-003") \
    .setMaxTokens(100)
pipeline = Pipeline().setStages([
    documentAssembler,
    openai_completion
])

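# The pipeline has nothing to train, so it can be fitted on an empty DataFrame
empty_df = spark.createDataFrame([[""]], ["text"])
sample_text = [["Generate a restaurant review."], ["Write a review for a local eatery."], ["Create a JSON with a review of a dining experience."]]
sample_df = spark.createDataFrame(sample_text).toDF("text")
result = pipeline.fit(empty_df).transform(sample_df)
result.select("completion").show(truncate=False)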
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|completion                                                                                                                                                                                                                                                                                        |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{document, 0, 258, \n\nI had the pleasure of dining at La Fiorita recently, and it was a truly delightful experience! The menu boasted a wonderful selection of classic Italian dishes, all exquisitely prepared and presented. The service staff was friendly and attentive and really, {}, []}]|
|[{document, 0, 227, \n\nI recently visited Barbecue Joe's for dinner and it was amazing! The menu had so many items to choose from including pulled pork, smoked turkey, brisket, pork ribs, and sandwiches. I opted for the pulled pork sandwich and let, {}, []}]                               |
|[{document, 0, 172, \n\n{ \n   "review": { \n      "overallRating": 4, \n      "reviewBody": "I enjoyed my meal at this restaurant. The food was flavourful, well-prepared and beautifully presented., {}, []}]                                                                                   |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

RoBertaEmbeddings

The RoBERTa model was proposed in RoBERTa: A Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. It is based on Google’s BERT model released in 2018.

It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates.

Pretrained models can be loaded with pretrained of the companion object:

val embeddings = RoBertaEmbeddings.pretrained()
  .setInputCols("document", "token")
  .setOutputCol("embeddings")

The default model is "roberta_base", if no name is provided. For available pretrained models please see the Models Hub.

For extended examples of usage, see the Examples and the RoBertaEmbeddingsTestSpec. Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see Import Transformers into Spark NLP 🚀.

Paper Abstract:

Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.

Tips:

RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a different pretraining scheme.
RoBERTa doesn’t have token_type_ids, so you don’t need to indicate which token belongs to which segment. Just separate your segments with the separation token tokenizer.sep_token (or </s>).
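The second tip can be illustrated with a small string-level sketch. It follows the pair layout used by the original RoBERTa tokenizer (`<s> A </s></s> B </s>`). The helper `join_segments` is hypothetical and works on raw strings only, whereas a real tokenizer inserts these markers as token ids.

```python
def join_segments(segments, bos="<s>", sep="</s>"):
    """Build a RoBERTa-style input string from one or more text segments."""
    # Segments are divided by a doubled separator; no segment ids are involved.
    body = (sep + sep).join(segments)
    return bos + body + sep


single = join_segments(["My name is Clara."])
pair = join_segments(["What's my name?", "My name is Clara."])
print(single)
print(pair)
```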

The original code can be found here https://github.com/pytorch/fairseq/tree/master/examples/roberta.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: WORD_EMBEDDINGS

Python API: RoBertaEmbeddings

Scala API: RoBertaEmbeddings

Source: RoBertaEmbeddings

Show Examples

This example shows how to predict classes by using the embeddings generated by the Transformer.

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# First extract the prerequisites for the NerDLModel
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# Use the transformer embeddings
embeddings = RoBertaEmbeddings.pretrained('roberta_base', 'en') \
      .setInputCols(["sentence", "token"]) \
      .setOutputCol("embeddings")

# This pretrained model requires those specific transformer embeddings
ner_model = NerDLModel.pretrained('ner_conll_roberta_base', 'en') \
    .setInputCols(['sentence', 'token', 'embeddings']) \
    .setOutputCol('ner')

pipeline = Pipeline().setStages([
    documentAssembler,
    sentence,
    tokenizer,
    embeddings,
    ner_model
])

data = spark.createDataFrame([["U.N. official Ekeus heads for Baghdad."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.select("ner.result").show(truncate=False)
+------------------------------------+
|result                              |
+------------------------------------+
|[B-ORG, O, O, B-PER, O, O, B-LOC, O]|
+------------------------------------+

RoBertaForQuestionAnswering

RoBertaForQuestionAnswering can load RoBERTa Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits).

Pretrained models can be loaded with pretrained of the companion object:

val spanClassifier = RoBertaForQuestionAnswering.pretrained()
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")

The default model is "roberta_base_qa_squad2", if no name is provided.

For available pretrained models please see the Models Hub.

To see which models are compatible and how to import them, see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669. For extended examples of usage, see the RoBertaForQuestionAnsweringTestSpec.

Input Annotator Types: DOCUMENT, DOCUMENT

Output Annotator Type: CHUNK

Python API: RoBertaForQuestionAnswering

Scala API: RoBertaForQuestionAnswering

Source: RoBertaForQuestionAnswering

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = RoBertaForQuestionAnswering.pretrained() \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(False)

pipeline = Pipeline().setStages([
    documentAssembler,
    spanClassifier
])

data = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)

result.select("answer.result").show(truncate=False)
+--------------------+
|result              |
+--------------------+
|[Clara]             |
+--------------------+

RoBertaForSequenceClassification

RoBertaForSequenceClassification can load RoBERTa Models with sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for document classification tasks.

For multi-class, use setActivation("softmax"). For multi-label, use setActivation("sigmoid").

Pretrained models can be loaded with pretrained of the companion object:

val sequenceClassifier = RoBertaForSequenceClassification.pretrained()
  .setInputCols("token", "document")
  .setOutputCol("label")

The default model is "roberta_base_sequence_classifier_imdb", if no name is provided.

For available pretrained models please see the Models Hub.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: CATEGORY

Python API: RoBertaForSequenceClassification

Scala API: RoBertaForSequenceClassification

Source: RoBertaForSequenceClassification

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")
sequenceClassifier = RoBertaForSequenceClassification.pretrained() \
    .setInputCols(["token", "document"]) \
    .setOutputCol("label") \
    .setCaseSensitive(True)
pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    sequenceClassifier
])
data = spark.createDataFrame([["I loved this movie when I was a child."], ["It was pretty boring."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("label.result").show(truncate=False)
+------+
|result|
+------+
|[pos] |
|[neg] |
+------+

RoBertaForTokenClassification

RoBertaForTokenClassification can load RoBERTa Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.

Pretrained models can be loaded with pretrained of the companion object:

val tokenClassifier = RoBertaForTokenClassification.pretrained()
  .setInputCols("token", "document")
  .setOutputCol("label")

The default model is "roberta_base_token_classifier_conll03", if no name is provided.

For available pretrained models please see the Models Hub.

For extended examples of usage, see the RoBertaForTokenClassificationTestSpec. Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see Import Transformers into Spark NLP 🚀.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: NAMED_ENTITY

Python API: RoBertaForTokenClassification

Scala API: RoBertaForTokenClassification

Source: RoBertaForTokenClassification

Show Examples

This example shows how to predict a named-entity label for each token with the pretrained token classifier.

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

tokenClassifier = RoBertaForTokenClassification.pretrained() \
    .setInputCols(["token", "document"]) \
    .setOutputCol("label") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    tokenClassifier
])

data = spark.createDataFrame([["John Lenon was born in London and lived in Paris. My name is Sarah and I live in London"]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.select("label.result").show(truncate=False)
+------------------------------------------------------------------------------------+
|result                                                                              |
+------------------------------------------------------------------------------------+
|[B-PER, I-PER, O, O, O, B-LOC, O, O, O, B-LOC, O, O, O, O, B-PER, O, O, O, O, B-LOC]|
+------------------------------------------------------------------------------------+

RoBertaForZeroShotClassification

RoBertaForZeroShotClassification uses a ModelForSequenceClassification trained on NLI (natural language inference) tasks. It is the equivalent of RoBertaForSequenceClassification models, except that it doesn’t require a hardcoded number of potential classes; they can be chosen at runtime. This usually makes it slower, but much more flexible.

Note that the model will loop through all provided labels. So the more labels you have, the longer this process will take.

Any combination of sequences and labels can be passed and each combination will be posed as a premise/hypothesis pair and passed to the pretrained model.
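
The pairing described above can be sketched in plain Python. This is an illustration only: the hypothesis template `"This example is {}."` is an assumption, `build_nli_pairs` and `pick_label` are hypothetical helpers, and `toy_score` is a word-overlap stand-in for the entailment score a real NLI model would return.

```python
def build_nli_pairs(sequences, labels, template="This example is {}."):
    """Return one (premise, hypothesis, label) triple per sequence/label combination."""
    return [(seq, template.format(label), label)
            for seq in sequences for label in labels]


def pick_label(triples, entailment_score):
    """For each premise, keep the label whose hypothesis scores highest."""
    best = {}
    for premise, hypothesis, label in triples:
        score = entailment_score(premise, hypothesis)
        if premise not in best or score > best[premise][1]:
            best[premise] = (label, score)
    return {premise: label for premise, (label, _) in best.items()}


def toy_score(premise, hypothesis):
    """Stand-in for the NLI model: number of words shared by both texts."""
    def words(text):
        return set(text.lower().strip(".").split())
    return len(words(premise) & words(hypothesis))


sequences = ["The movie was boring.", "The team won the match."]
labels = ["movie", "sport match"]
triples = build_nli_pairs(sequences, labels)

print(len(triples))  # one pair per sequence/label combination
print(pick_label(triples, toy_score))
```

The number of pairs grows as sequences × labels, which is why adding labels makes inference slower.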

Pretrained models can be loaded with pretrained of the companion object:

val sequenceClassifier = RoBertaForZeroShotClassification.pretrained()
  .setInputCols("token", "document")
  .setOutputCol("label")

The default model is "roberta_base_zero_shot_classifier_nli", if no name is provided.

For available pretrained models please see the Models Hub.

To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: CATEGORY

Python API: RoBertaForZeroShotClassification

Scala API: RoBertaForZeroShotClassification

Source: RoBertaForZeroShotClassification

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

sequenceClassifier = RoBertaForZeroShotClassification.pretrained() \
    .setInputCols(["token", "document"]) \
    .setOutputCol("label") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    sequenceClassifier
])

data = spark.createDataFrame([["I loved this movie when I was a child."], ["It was pretty boring."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("label.result").show(truncate=False)
+------+
|result|
+------+
|[pos] |
|[neg] |
+------+

RoBertaSentenceEmbeddings

Sentence-level embeddings using RoBERTa. The RoBERTa model was proposed in RoBERTa: A Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. It is based on Google’s BERT model released in 2018.

It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates.

Pretrained models can be loaded with pretrained of the companion object:

val embeddings = RoBertaSentenceEmbeddings.pretrained()
  .setInputCols("sentence")
  .setOutputCol("sentence_embeddings")

The default model is "sent_roberta_base", if no name is provided. For available pretrained models please see the Models Hub.

Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see Import Transformers into Spark NLP 🚀.

Tips:

RoBERTa has the same architecture as BERT, but uses a byte-level BPE as a tokenizer (same as GPT-2) and uses a different pretraining scheme.
RoBERTa doesn’t have `token_type_ids`, so you don’t need to indicate which token belongs to which segment. Just separate your segments with the separation token `tokenizer.sep_token` (or `</s>`).

The original code can be found here https://github.com/pytorch/fairseq/tree/master/examples/roberta.
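
To make the second tip concrete, here is a small sketch of how one or two segments are laid out for RoBERTa at the text level. `join_segments` is a hypothetical helper, and the `<s>` and `</s>` strings are assumed to be the special tokens of the original RoBERTa vocabulary; the annotator is expected to add them itself, so this is for illustration rather than something to call in a pipeline.

```python
def join_segments(first, second=None, bos="<s>", sep="</s>"):
    """Lay out one or two text segments with RoBERTa-style special tokens."""
    if second is None:
        return f"{bos} {first} {sep}"
    # A pair is split by two consecutive separator tokens; no segment ids are used.
    return f"{bos} {first} {sep}{sep} {second} {sep}"


print(join_segments("How old are you?"))
print(join_segments("How old are you?", "I am six."))
```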

Input Annotator Types: DOCUMENT

Output Annotator Type: SENTENCE_EMBEDDINGS

Python API: RoBertaSentenceEmbeddings

Scala API: RoBertaSentenceEmbeddings

Source: RoBertaSentenceEmbeddings

Show Examples

This example shows how to predict classes by using the embeddings generated by the Transformer.

# Coming Soon!

This example shows how to train an Approach Annotator by using the embeddings generated by the Transformer.

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

smallCorpus = spark.read.option("header","True").csv("sentiment.csv")

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = RoBertaSentenceEmbeddings.pretrained() \
  .setInputCols(["document"])\
  .setOutputCol("sentence_embeddings")

# Then the training can start with the transformer embeddings
docClassifier = ClassifierDLApproach() \
    .setInputCols("sentence_embeddings") \
    .setOutputCol("category") \
    .setLabelColumn("label") \
    .setBatchSize(64) \
    .setMaxEpochs(20) \
    .setLr(5e-3) \
    .setDropout(0.5)

pipeline = Pipeline().setStages([
    documentAssembler,
    embeddings,
    docClassifier
])

pipelineModel = pipeline.fit(smallCorpus)

This example shows how to extract the embeddings generated by the Transformer.

import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

sentenceEmbeddings = RoBertaSentenceEmbeddings.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings") \
    .setCaseSensitive(True)

# you can either use the output to train ClassifierDL, SentimentDL, or MultiClassifierDL
# or you can use EmbeddingsFinisher to prepare the results for Spark ML functions

embeddingsFinisher = EmbeddingsFinisher() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCols("finished_embeddings") \
    .setOutputAsVector(True) \
    .setCleanAnnotations(False)

pipeline = Pipeline() \
    .setStages([
      documentAssembler,
      tokenizer,
      sentenceEmbeddings,
      embeddingsFinisher
    ])

data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[0.18792399764060974,-0.14591649174690247,0.20547787845134735,0.1468472778797...|
|[0.22845706343650818,0.18073144555091858,0.09725798666477203,-0.0417917296290...|
|[0.07037967443466187,-0.14801117777824402,-0.03603338822722435,-0.17893412709...|
|[-0.08734266459941864,0.2486150562763214,-0.009067727252840996,-0.24408400058...|
|[0.22409197688102722,-0.4312366545200348,0.1401449590921402,0.356410235166549...|
+--------------------------------------------------------------------------------+

SpanBertCoref

A coreference resolution model based on SpanBert.

A coreference resolution model identifies expressions which refer to the same entity in a text. For example, given a sentence “John told Mary he would like to borrow a book from her.” the model will link “he” to “John” and “her” to “Mary”.

This model is based on SpanBert, which is fine-tuned on the OntoNotes 5.0 data set.

Pretrained models can be loaded with pretrained of the companion object:

val corefResolution = SpanBertCorefModel.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("corefs")

The default model is "spanbert_base_coref", if no name is provided. For available pretrained models please see the Models Hub.

References: https://github.com/mandarjoshi90/coref

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: DEPENDENCY

Python API: SpanBertCorefModel

Scala API: SpanBertCorefModel

Source: SpanBertCorefModel

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

corefResolution = SpanBertCorefModel.pretrained() \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("corefs")

pipeline = Pipeline().setStages([
    documentAssembler,
    sentence,
    tokenizer,
    corefResolution
])

data = spark.createDataFrame([
    ["John told Mary he would like to borrow a book from her."]
]).toDF("text")
results = pipeline.fit(data).transform(data)
results \
    .selectExpr("explode(corefs) AS coref") \
    .selectExpr("coref.result as token", "coref.metadata") \
    .show(truncate=False)
+-----+------------------------------------------------------------------------------------+
|token|metadata                                                                            |
+-----+------------------------------------------------------------------------------------+
|John |{head.sentence -> -1, head -> ROOT, head.begin -> -1, head.end -> -1, sentence -> 0}|
|he   |{head.sentence -> 0, head -> John, head.begin -> 0, head.end -> 3, sentence -> 0}   |
|Mary |{head.sentence -> -1, head -> ROOT, head.begin -> -1, head.end -> -1, sentence -> 0}|
|her  |{head.sentence -> 0, head -> Mary, head.begin -> 10, head.end -> 13, sentence -> 0} |
+-----+------------------------------------------------------------------------------------+
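
Each row above points to its antecedent through the `head` metadata key, with `ROOT` marking the first mention of an entity. A minimal sketch of turning such rows into clusters follows. The rows are copied by hand from the table, and `build_clusters` is a hypothetical helper; it keys clusters on the head text only, so real code would likely also use `head.begin` and `head.end` to tell identical strings apart.

```python
from collections import defaultdict


def build_clusters(rows):
    """Group mentions under their head mention; ROOT rows start a new cluster."""
    clusters = defaultdict(list)
    for token, metadata in rows:
        head = token if metadata["head"] == "ROOT" else metadata["head"]
        clusters[head].append(token)
    return dict(clusters)


# (token, metadata) pairs taken from the output table above, reduced to the "head" key.
rows = [
    ("John", {"head": "ROOT"}),
    ("he", {"head": "John"}),
    ("Mary", {"head": "ROOT"}),
    ("her", {"head": "Mary"}),
]

print(build_clusters(rows))
```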

SwinForImageClassification

SwinForImageClassification is an image classifier based on Swin.

The Swin Transformer was proposed in Swin Transformer: Hierarchical Vision Transformer using Shifted Windows by Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo.

It is basically a hierarchical Transformer whose representation is computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection.
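
A rough back-of-the-envelope sketch of why local windows help. The values 224, 4 and 7 are read off the default model name in this section (`..._patch_4_window_7_224`) and are assumed to mean image size, patch size and window size; the count only covers query/key pairs in the first stage and ignores heads, channels and the later, coarser stages.

```python
def attention_pairs(image_size=224, patch_size=4, window_size=7):
    """Count query/key pairs for global vs. windowed self-attention."""
    side = image_size // patch_size            # patches per image side
    tokens = side * side                       # total patch tokens
    windows = (side // window_size) ** 2       # non-overlapping windows
    tokens_per_window = window_size ** 2
    global_pairs = tokens ** 2                 # every token attends to every token
    windowed_pairs = windows * tokens_per_window ** 2
    return tokens, windows, global_pairs, windowed_pairs


tokens, windows, global_pairs, windowed_pairs = attention_pairs()
print(tokens, windows)                 # 3136 64
print(global_pairs, windowed_pairs)    # 9834496 153664
# With a fixed window size the windowed cost is tokens * window_size**2,
# so it grows linearly with the number of tokens.
print(windowed_pairs == tokens * 7 ** 2)
```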

Pretrained models can be loaded with pretrained of the companion object:

val imageClassifier = SwinForImageClassification.pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("class")

The default model is "image_classifier_swin_base_patch_4_window_7_224", if no name is provided.

For available pretrained models please see the Models Hub.

References:

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Paper Abstract:

This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with Shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection. This hierarchical architecture has the flexibility to model at various scales and has linear computational complexity with respect to image size. These qualities of Swin Transformer make it compatible with a broad range of vision tasks, including image classification (87.3 top-1 accuracy on ImageNet-1K) and dense prediction tasks such as object detection (58.7 box AP and 51.1 mask AP on COCO test-dev) and semantic segmentation (53.5 mIoU on ADE20K val). Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones. The hierarchical design and the shifted window approach also prove beneficial for all-MLP architectures.

Input Annotator Types: IMAGE

Output Annotator Type: CATEGORY

Python API: SwinForImageClassification

Scala API: SwinForImageClassification

Source: SwinForImageClassification

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

imageDF = spark.read \
    .format("image") \
    .option("dropInvalid", value=True) \
    .load("src/test/resources/image/")

imageAssembler = ImageAssembler() \
    .setInputCol("image") \
    .setOutputCol("image_assembler")

imageClassifier = SwinForImageClassification \
    .pretrained() \
    .setInputCols(["image_assembler"]) \
    .setOutputCol("class")

pipeline = Pipeline().setStages([imageAssembler, imageClassifier])
pipelineDF = pipeline.fit(imageDF).transform(imageDF)

pipelineDF \
    .selectExpr("reverse(split(image.origin, '/'))[0] as image_name", "class.result") \
    .show(truncate=False)
+-----------------+----------------------------------------------------------+
|image_name       |result                                                    |
+-----------------+----------------------------------------------------------+
|palace.JPEG      |[palace]                                                  |
|egyptian_cat.jpeg|[tabby, tabby cat]                                        |
|hippopotamus.JPEG|[hippopotamus, hippo, river horse, Hippopotamus amphibius]|
|hen.JPEG         |[hen]                                                     |
|ostrich.JPEG     |[ostrich, Struthio camelus]                               |
|junco.JPEG       |[junco, snowbird]                                         |
|bluetick.jpg     |[bluetick]                                                |
|chihuahua.jpg    |[Chihuahua]                                               |
|tractor.JPEG     |[tractor]                                                 |
|ox.JPEG          |[ox]                                                      |
+-----------------+----------------------------------------------------------+

T5Transformer

T5: the Text-To-Text Transfer Transformer

T5 reconsiders all NLP tasks into a unified text-to-text-format where the input and output are always text strings, in contrast to BERT-style models that can only output either a class label or a span of the input. The text-to-text framework is able to use the same model, loss function, and hyper-parameters on any NLP task, including machine translation, document summarization, question answering, and classification tasks (e.g., sentiment analysis). T5 can even apply to regression tasks by training it to predict the string representation of a number instead of the number itself.
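
The text-to-text idea can be sketched in a few lines of plain Python. Both helpers are hypothetical and only show the string formatting. The single space between prefix and text is an assumption, and the 0.2 rounding step follows the regression example in the T5 paper rather than anything configured in Spark NLP.

```python
def to_text_to_text(task, text):
    """Build the model input by prepending the task prefix to the raw text."""
    return f"{task} {text}".strip()


def score_to_target(score, step=0.2):
    """Regression as text: round to the nearest step and emit the number as a string."""
    return f"{round(score / step) * step:.1f}"


print(to_text_to_text("summarize:", "Transfer learning has emerged as a powerful technique."))
print(to_text_to_text("translate English to German:", "That is good."))
print(score_to_target(2.57))
```

Because every task is reduced to "string in, string out", switching tasks only changes the prefix, not the model or the loss.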

Pretrained models can be loaded with pretrained of the companion object:

val t5 = T5Transformer.pretrained()
  .setTask("summarize:")
  .setInputCols("document")
  .setOutputCol("summaries")

The default model is "t5_small", if no name is provided. For available pretrained models please see the Models Hub.

For extended examples of usage, see the Examples and the T5TestSpec.

Paper Abstract:

Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new Colossal Clean Crawled Corpus, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.

Note:

This is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

Input Annotator Types: DOCUMENT

Output Annotator Type: DOCUMENT

Python API: T5Transformer

Scala API: T5Transformer

Source: T5Transformer

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("documents")

t5 = T5Transformer.pretrained("t5_small") \
    .setTask("summarize:") \
    .setInputCols(["documents"]) \
    .setMaxOutputLength(200) \
    .setOutputCol("summaries")

pipeline = Pipeline().setStages([documentAssembler, t5])

data = spark.createDataFrame([[
    "Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a " +
      "downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness" +
      " of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this " +
      "paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework " +
      "that converts all text-based language problems into a text-to-text format. Our systematic study compares " +
      "pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens " +
      "of language understanding tasks. By combining the insights from our exploration with scale and our new " +
      "Colossal Clean Crawled Corpus, we achieve state-of-the-art results on many benchmarks covering " +
      "summarization, question answering, text classification, and more. To facilitate future work on transfer " +
      "learning for NLP, we release our data set, pre-trained models, and code."
]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.select("summaries.result").show(truncate=False)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                        |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[transfer learning has emerged as a powerful technique in natural language processing (NLP) the effectiveness of transfer learning has given rise to a diversity of approaches, methodologies, and practice .]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

TapasForQuestionAnswering

TapasForQuestionAnswering is an implementation of TaPas, a BERT-based model specifically designed for answering questions about tabular data. It takes TABLE and DOCUMENT annotations as input and tries to answer the questions in the document by using the data from the table. The model is based on BertForQuestionAnswering and shares all its parameters with it.

Pretrained models can be loaded with pretrained of the companion object:

val tapas = TapasForQuestionAnswering.pretrained()
  .setInputCols(Array("document_question", "table"))
  .setOutputCol("answer")

The default model is "table_qa_tapas_base_finetuned_wtq", if no name is provided.

For available pretrained models please see the Models Hub.

Input Annotator Types: TABLE, DOCUMENT

Output Annotator Type: CHUNK

Python API: TapasForQuestionAnswering

Scala API: TapasForQuestionAnswering

Source: TapasForQuestionAnswering

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

document_assembler = MultiDocumentAssembler()\
    .setInputCols("table_json", "questions")\
    .setOutputCols("document_table", "document_questions")

sentence_detector = SentenceDetector()\
    .setInputCols(["document_questions"])\
    .setOutputCol("questions")

table_assembler = TableAssembler()\
    .setInputCols(["document_table"])\
    .setOutputCol("table")

tapas = TapasForQuestionAnswering\
    .pretrained()\
    .setInputCols(["questions", "table"])\
    .setOutputCol("answers")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    table_assembler,
    tapas])

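json_data = """
{
    "header": ["name", "money", "age"],
    "rows": [
    ["Donald Trump", "$100,000,000", "75"],
    ["Elon Musk", "$20,000,000,000,000", "55"]
    ]
}
"""
# The questions are the ones listed in the output table below;
# SentenceDetector splits them into one annotation per question.
questions = "Who earns 100,000,000? Who has more money? How much they all earn? How old are they?"
data = spark.createDataFrame([[json_data, questions]]).toDF("table_json", "questions")

model = pipeline.fit(data)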
model\
    .transform(data)\
    .selectExpr("explode(answers) AS answer")\
    .select("answer.metadata.question", "answer.result")\
    .show(truncate=False)
+-----------------------+----------------------------------------+
|question               |result                                  |
+-----------------------+----------------------------------------+
|Who earns 100,000,000? |Donald Trump                            |
|Who has more money?    |Elon Musk                               |
|How much they all earn?|COUNT($100,000,000, $20,000,000,000,000)|
|How old are they?      |AVERAGE(75, 55)                         |
+-----------------------+----------------------------------------+
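
For aggregation questions the result is an operator applied to the selected cells, such as `AVERAGE(75, 55)`, not a computed number. A minimal sketch of evaluating such strings afterwards follows. The `OP(cell, cell, ...)` format is inferred from the output above, the set of operators handled is an assumption, and `evaluate_answer` is a hypothetical helper.

```python
import re


def evaluate_answer(result):
    """Evaluate an 'OP(cell, cell, ...)' answer string; return plain answers unchanged."""
    match = re.fullmatch(r"(COUNT|SUM|AVERAGE)\((.*)\)", result)
    if match is None:
        return result
    operator = match.group(1)
    # Cells are separated by ", "; commas inside a cell have no following space.
    cells = match.group(2).split(", ")
    if operator == "COUNT":
        return len(cells)
    # Strip currency signs and thousands separators before converting to numbers.
    values = [float(re.sub(r"[^0-9.\-]", "", cell)) for cell in cells]
    total = sum(values)
    return total if operator == "SUM" else total / len(values)


print(evaluate_answer("AVERAGE(75, 55)"))                           # 65.0
print(evaluate_answer("COUNT($100,000,000, $20,000,000,000,000)"))  # 2
print(evaluate_answer("Elon Musk"))                                 # Elon Musk
```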

UAEEmbeddings

Sentence embeddings using Universal AnglE Embedding (UAE).

UAE is a novel angle-optimized text embedding model, designed to improve semantic textual similarity tasks, which are crucial for Large Language Model (LLM) applications. By introducing angle optimization in a complex space, AnglE effectively mitigates saturation of the cosine similarity function.
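
The saturation mentioned above can be seen directly from the derivative of the cosine. The sketch below only illustrates that effect: as the angle between two vectors approaches 0 or 180 degrees, the gradient of cos(theta) with respect to theta goes to zero. It does not reproduce AnglE's complex-space objective, and `cosine_gradient` is a hypothetical helper.

```python
import math


def cosine_gradient(theta):
    """Derivative of cos(theta) with respect to the angle theta (in radians)."""
    return -math.sin(theta)


for degrees in (0, 1, 45, 90, 179, 180):
    theta = math.radians(degrees)
    print(f"{degrees:>3} deg  cos={math.cos(theta):+.4f}  "
          f"d/dtheta={cosine_gradient(theta):+.4f}")
```

Near the two ends of the range the cosine is almost flat, so an objective built only on cosine similarity passes very little gradient for pairs that are already nearly aligned or nearly opposite.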

Pretrained models can be loaded with pretrained of the companion object:

val embeddings = UAEEmbeddings.pretrained()
  .setInputCols("document")
  .setOutputCol("UAE_embeddings")

The default model is "uae_large_v1", if no name is provided.

For available pretrained models please see the Models Hub.

For extended examples of usage, see UAEEmbeddingsTestSpec.

Sources:

AnglE-optimized Text Embeddings

UAE Github Repository

Paper abstract:

High-quality text embedding is pivotal in improving semantic textual similarity (STS) tasks, which are crucial components in Large Language Model (LLM) applications. However, a common challenge existing text embedding models face is the problem of vanishing gradients, primarily due to their reliance on the cosine function in the optimization objective, which has saturation zones. To address this issue, this paper proposes a novel angle-optimized text embedding model called AnglE. The core idea of AnglE is to introduce angle optimization in a complex space. This novel approach effectively mitigates the adverse effects of the saturation zone in the cosine function, which can impede gradient and hinder optimization processes. To set up a comprehensive STS evaluation, we experimented on existing short-text STS datasets and a newly collected long-text STS dataset from GitHub Issues. Furthermore, we examine domain-specific STS scenarios with limited labeled data and explore how AnglE works with LLM-annotated data. Extensive experiments were conducted on various tasks including short-text STS, long-text STS, and domain-specific STS tasks. The results show that AnglE outperforms the state-of-the-art (SOTA) STS models that ignore the cosine saturation zone. These findings demonstrate the ability of AnglE to generate high-quality text embeddings and the usefulness of angle optimization in STS.

Input Annotator Types: DOCUMENT

Output Annotator Type: SENTENCE_EMBEDDINGS

Python API: UAEEmbeddings

Scala API: UAEEmbeddings

Source: UAEEmbeddings

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
embeddings = UAEEmbeddings.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("embeddings")
embeddingsFinisher = EmbeddingsFinisher() \
    .setInputCols("embeddings") \
    .setOutputCols("finished_embeddings") \
    .setOutputAsVector(True)
pipeline = Pipeline().setStages([
    documentAssembler,
    embeddings,
    embeddingsFinisher
])

data = spark.createDataFrame([["hello world"], ["hello moon"]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[0.50387806, 0.5861606, 0.35129607, -0.76046336, -0.32446072, -0.117674336, 0...|
|[0.6660665, 0.961762, 0.24854276, -0.1018044, -0.6569202, 0.027635604, 0.1915...|
+--------------------------------------------------------------------------------+

UniversalSentenceEncoder

The Universal Sentence Encoder encodes text into high dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks.
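
As a sketch of the semantic-similarity use case, the snippet below ranks candidate sentences by cosine similarity to a query vector. The three-dimensional vectors are made up for illustration (real sentence embeddings have hundreds of dimensions), and `cosine_similarity` and `most_similar` are hypothetical helpers, not Spark NLP functions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def most_similar(query, candidates):
    """Return the key of the candidate vector closest to the query."""
    return max(candidates, key=lambda name: cosine_similarity(query, candidates[name]))


# Made-up toy vectors standing in for sentence embeddings.
vectors = {
    "a great film": [0.9, 0.1, 0.0],
    "stock prices fell": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # stands in for the embedding of a movie review

print(most_similar(query, vectors))
```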

Pretrained models can be loaded with pretrained of the companion object:

val useEmbeddings = UniversalSentenceEncoder.pretrained()
  .setInputCols("sentence")
  .setOutputCol("sentence_embeddings")

The default model is "tfhub_use", if no name is provided. For available pretrained models please see the Models Hub.

For extended examples of usage, see the Examples and the UniversalSentenceEncoderTestSpec.

Sources:

Universal Sentence Encoder

https://tfhub.dev/google/universal-sentence-encoder/2

Paper abstract:

We present models for encoding sentences into embedding vectors that specifically target transfer learning to other NLP tasks. The models are efficient and result in accurate performance on diverse transfer tasks. Two variants of the encoding models allow for trade-offs between accuracy and compute resources. For both variants, we investigate and report the relationship between model complexity, resource consumption, the availability of transfer task training data, and task performance. Comparisons are made with baselines that use word level transfer learning via pretrained word embeddings as well as baselines do not use any transfer learning. We find that transfer learning using sentence embeddings tends to outperform word level transfer. With transfer learning via sentence embeddings, we observe surprisingly good performance with minimal amounts of supervised training data for a transfer task. We obtain encouraging results on Word Embedding Association Tests (WEAT) targeted at detecting model bias. Our pre-trained sentence encoding models are made freely available for download and on TF Hub.

Input Annotator Types: DOCUMENT

Output Annotator Type: SENTENCE_EMBEDDINGS

Python API: UniversalSentenceEncoder

Scala API: UniversalSentenceEncoder

Source: UniversalSentenceEncoder

Show Examples

This example shows how to predict classes by using the embeddings generated by the Transformer.

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# Use the transformer embeddings
embeddings = UniversalSentenceEncoder.pretrained('tfhub_use', lang="en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

# This pretrained model requires those specific transformer embeddings
classifier = SentimentDLModel.pretrained('sentimentdl_use_imdb') \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("sentiment")

pipeline = Pipeline().setStages([
    documentAssembler,
    embeddings,
    classifier
])

data = spark.createDataFrame([["That was a fantastic movie!"]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.select("sentiment.result").show(truncate=False)
+------+
|result|
+------+
|[pos] |
+------+

ViTForImageClassification

Vision Transformer (ViT) for image classification.

ViT is a transformer-based alternative to the convolutional neural networks usually used for image recognition tasks.

Pretrained models can be loaded with pretrained of the companion object:

val imageClassifier = ViTForImageClassification.pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("class")

The default model is "image_classifier_vit_base_patch16_224", if no name is provided.

For available pretrained models please see the Models Hub.

References:

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Paper Abstract:

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
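
The "sequences of image patches" in the abstract can be illustrated with plain Python lists. The 224 and 16 defaults are read off the default model name in this section (`..._patch16_224`) and are assumed to be the image and patch sizes; `patch_grid` and `split_into_patches` are hypothetical helpers that only show the slicing, not the learned projection that follows it.

```python
def patch_grid(image_size=224, patch_size=16):
    """Return (patches per side, total patches) for a square image."""
    per_side = image_size // patch_size
    return per_side, per_side * per_side


def split_into_patches(image, patch_size):
    """Split a 2-D list (height x width) into a row-major list of square patches."""
    height, width = len(image), len(image[0])
    return [
        [row[left:left + patch_size] for row in image[top:top + patch_size]]
        for top in range(0, height, patch_size)
        for left in range(0, width, patch_size)
    ]


print(patch_grid())  # (14, 196)

# Tiny 4x4 "image" whose pixel values are their row-major index.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, 2)
print(len(patches), patches[0], patches[3])
```

Each patch is then flattened and treated like one token, so a 224x224 image becomes a sequence of 196 patch tokens under these assumptions.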

Input Annotator Types: IMAGE

Output Annotator Type: CATEGORY

Python API: ViTForImageClassification

Scala API: ViTForImageClassification

Source: ViTForImageClassification

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

imageDF = spark.read \
    .format("image") \
    .option("dropInvalid", value=True) \
    .load("src/test/resources/image/")
imageAssembler = ImageAssembler() \
    .setInputCol("image") \
    .setOutputCol("image_assembler")
imageClassifier = ViTForImageClassification \
    .pretrained() \
    .setInputCols(["image_assembler"]) \
    .setOutputCol("class")

pipeline = Pipeline().setStages([imageAssembler, imageClassifier])
pipelineDF = pipeline.fit(imageDF).transform(imageDF)

pipelineDF \
    .selectExpr("reverse(split(image.origin, '/'))[0] as image_name", "class.result") \
    .show(truncate=False)
+-----------------+----------------------------------------------------------+
|image_name       |result                                                    |
+-----------------+----------------------------------------------------------+
|palace.JPEG      |[palace]                                                  |
|egyptian_cat.jpeg|[Egyptian cat]                                            |
|hippopotamus.JPEG|[hippopotamus, hippo, river horse, Hippopotamus amphibius]|
|hen.JPEG         |[hen]                                                     |
|ostrich.JPEG     |[ostrich, Struthio camelus]                               |
|junco.JPEG       |[junco, snowbird]                                         |
|bluetick.jpg     |[bluetick]                                                |
|chihuahua.jpg    |[Chihuahua]                                               |
|tractor.JPEG     |[tractor]                                                 |
|ox.JPEG          |[ox]                                                      |
+-----------------+----------------------------------------------------------+

VisionEncoderDecoderForImageCaptioning

VisionEncoderDecoder model that converts images into text captions. It allows for the use of pretrained vision auto-encoding models, such as ViT, BEiT, or DeiT as the encoder, in combination with pretrained language models, like RoBERTa, GPT2, or BERT as the decoder.

Pretrained models can be loaded with pretrained of the companion object:

val imageCaptioning = VisionEncoderDecoderForImageCaptioning.pretrained()
  .setInputCols("image_assembler")
  .setOutputCol("caption")

The default model is "image_captioning_vit_gpt2", if no name is provided.

For available pretrained models please see the Models Hub.

Note:

This is a very computationally expensive module, especially with larger batch sizes. The use of an accelerator such as a GPU is recommended.

Input Annotator Types: IMAGE

Output Annotator Type: DOCUMENT

Python API: VisionEncoderDecoderForImageCaptioning

Scala API: VisionEncoderDecoderForImageCaptioning

Source: VisionEncoderDecoderForImageCaptioning

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
imageDF = spark.read \
    .format("image") \
    .option("dropInvalid", value = True) \
    .load("src/test/resources/image/")
imageAssembler = ImageAssembler() \
    .setInputCol("image") \
    .setOutputCol("image_assembler")
imageCaptioning = VisionEncoderDecoderForImageCaptioning \
    .pretrained() \
    .setBeamSize(2) \
    .setDoSample(False) \
    .setInputCols(["image_assembler"]) \
    .setOutputCol("caption")
pipeline = Pipeline().setStages([imageAssembler, imageCaptioning])
pipelineDF = pipeline.fit(imageDF).transform(imageDF)
pipelineDF \
    .selectExpr("reverse(split(image.origin, '/'))[0] as image_name", "caption.result") \
    .show(truncate = False)
+-----------------+---------------------------------------------------------+
|image_name       |result                                                   |
+-----------------+---------------------------------------------------------+
|palace.JPEG      |[a large room filled with furniture and a large window]  |
|egyptian_cat.jpeg|[a cat laying on a couch next to another cat]            |
|hippopotamus.JPEG|[a brown bear in a body of water]                        |
|hen.JPEG         |[a flock of chickens standing next to each other]        |
|ostrich.JPEG     |[a large bird standing on top of a lush green field]     |
|junco.JPEG       |[a small bird standing on a wet ground]                  |
|bluetick.jpg     |[a small dog standing on a wooden floor]                 |
|chihuahua.jpg    |[a small brown dog wearing a blue sweater]               |
|tractor.JPEG     |[a man is standing in a field with a tractor]            |
|ox.JPEG          |[a large brown cow standing on top of a lush green field]|
+-----------------+---------------------------------------------------------+

Wav2Vec2ForCTC

Wav2Vec2 Model with a language modeling head on top for Connectionist Temporal Classification (CTC). Wav2Vec2 was proposed in wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli.
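To give a rough idea of what CTC decoding involves, here is a minimal sketch of greedy decoding: take the most likely symbol for each audio frame, merge consecutive repeats, then drop the blank symbol. This is an illustration only, not Spark NLP's implementation; the toy vocabulary, the blank id, and the helper name `ctc_greedy_decode` are assumptions.

```python
from itertools import groupby

def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated frame predictions, then drop CTC blanks."""
    # Merge runs of identical ids: [1, 1, 0, 1] -> [1, 0, 1]
    collapsed = [key for key, _ in groupby(frame_ids)]
    # Remove the blank symbol, which separates genuine repeated characters
    return "".join(id_to_char[i] for i in collapsed if i != blank_id)

# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank.
vocab = {0: "<blank>", 1: "H", 2: "E", 3: "L", 4: "O"}
# Made-up per-frame argmax ids; the blank between the two L runs keeps both.
frames = [1, 1, 0, 2, 2, 3, 0, 3, 3, 4, 0]
print(ctc_greedy_decode(frames, vocab))  # HELLO
```

The blank is what allows a double letter to survive the merge step: without it, the two runs of `L` would collapse into one.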

The annotator takes audio files and transcribes them as text. The audio needs to be provided pre-processed as an array of floats.
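One possible way to obtain such an array of floats, sketched with the Python standard library only: read 16-bit mono PCM samples from a WAV file and scale them to the range [-1.0, 1.0). The helper name `wav_to_floats`, the in-memory toy file, and its 16 kHz sample rate are assumptions for illustration; the sample rate and scaling a given model expects should be checked against its model card.

```python
import io
import struct
import wave

def wav_to_floats(wav_file):
    """Read 16-bit mono PCM WAV data and scale the samples to [-1.0, 1.0)."""
    with wave.open(wav_file, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("this sketch only handles 16-bit mono PCM")
        frames = wav.readframes(wav.getnframes())
    count = len(frames) // 2
    # WAV PCM samples are little-endian signed 16-bit integers
    samples = struct.unpack("<%dh" % count, frames)
    return [s / 32768.0 for s in samples]

# Build a tiny WAV file in memory so the sketch runs without any audio file.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(16000)  # assumed sample rate for the toy file
    out.writeframes(struct.pack("<4h", 0, 16384, -16384, 32767))
buffer.seek(0)

rawFloats = wav_to_floats(buffer)
print(rawFloats[:3])  # [0.0, 0.5, -0.5]
```

A list like `rawFloats` is the kind of value that is wrapped in a single-column DataFrame before being passed to the AudioAssembler.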

Note that this annotator is currently not supported on Apple Silicon processors such as the M1. This is due to the processor not supporting instructions for XLA.

Pretrained models can be loaded with pretrained of the companion object:

val speechToText = Wav2Vec2ForCTC.pretrained()
  .setInputCols("audio_assembler")
  .setOutputCol("text")

The default model is "asr_wav2vec2_base_960h", if no name is provided.

For available pretrained models please see the Models Hub.

To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see Wav2Vec2ForCTCTestSpec.

Input Annotator Types: AUDIO

Output Annotator Type: DOCUMENT

Python API: Wav2Vec2ForCTC

Scala API: Wav2Vec2ForCTC

Source: Wav2Vec2ForCTC

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

audioAssembler = AudioAssembler() \
    .setInputCol("audio_content") \
    .setOutputCol("audio_assembler")
speechToText = Wav2Vec2ForCTC \
    .pretrained() \
    .setInputCols(["audio_assembler"]) \
    .setOutputCol("text")

pipeline = Pipeline().setStages([audioAssembler, speechToText])
# rawFloats is assumed to be defined already: the pre-processed audio as a list of floats
processedAudioFloats = spark.createDataFrame([[rawFloats]]).toDF("audio_content")
result = pipeline.fit(processedAudioFloats).transform(processedAudioFloats)
result.select("text.result").show(truncate = False)
+------------------------------------------------------------------------------------------+
|result                                                                                    |
+------------------------------------------------------------------------------------------+
|[MISTER QUILTER IS THE APOSTLE OF THE MIDLE CLASES AND WE ARE GLAD TO WELCOME HIS GOSPEL ]|
+------------------------------------------------------------------------------------------+

WhisperForCTC

Whisper Model with a language modeling head on top for Connectionist Temporal Classification (CTC).

Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. It can transcribe speech in multiple languages, as well as translate from those languages into English.

The audio needs to be provided pre-processed as an array of floats.

Note that, at the moment, this annotator supports only greedy search, and only Spark versions 3.4 and up are supported.

For multilingual models, the language and the task (transcribe or translate) can be set with setLanguage and setTask.

Pretrained models can be loaded with pretrained of the companion object:

val speechToText = WhisperForCTC.pretrained()
  .setInputCols("audio_assembler")
  .setOutputCol("text")

The default model is "asr_whisper_tiny_opt", if no name is provided.

For available pretrained models please see the Models Hub.

To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see WhisperForCTCTestSpec.

References:

Robust Speech Recognition via Large-Scale Weak Supervision

Paper Abstract:

We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.

Input Annotator Types: AUDIO

Output Annotator Type: DOCUMENT

Python API: WhisperForCTC

Scala API: WhisperForCTC

Source: WhisperForCTC

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

audioAssembler = AudioAssembler() \
    .setInputCol("audio_content") \
    .setOutputCol("audio_assembler")

speechToText = WhisperForCTC.pretrained() \
    .setInputCols(["audio_assembler"]) \
    .setOutputCol("text")

pipeline = Pipeline().setStages([audioAssembler, speechToText])
# rawFloats is assumed to be defined already: the pre-processed audio as a list of floats
processedAudioFloats = spark.createDataFrame([[rawFloats]]).toDF("audio_content")
result = pipeline.fit(processedAudioFloats).transform(processedAudioFloats)
result.select("text.result").show(truncate = False)
+------------------------------------------------------------------------------------------+
|result                                                                                    |
+------------------------------------------------------------------------------------------+
|[ Mr. Quilter is the apostle of the middle classes and we are glad to welcome his gospel.]|
+------------------------------------------------------------------------------------------+

XlmRoBertaEmbeddings

The XLM-RoBERTa model was proposed in Unsupervised Cross-lingual Representation Learning at Scale by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. It is based on Facebook’s RoBERTa model released in 2019. It is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl data.

Pretrained models can be loaded with pretrained of the companion object:

val embeddings = XlmRoBertaEmbeddings.pretrained()
  .setInputCols("document", "token")
  .setOutputCol("embeddings")

The default model is "xlm_roberta_base", default language is "xx" (meaning multi-lingual), if no values are provided. For available pretrained models please see the Models Hub.

For extended examples of usage, see the Examples and the XlmRoBertaEmbeddingsTestSpec. Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see Import Transformers into Spark NLP 🚀.

Paper Abstract:

This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +13.8% average accuracy on XNLI, +12.3% average F1 score on MLQA, and +2.1% average F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 11.8% in XNLI accuracy for Swahili and 9.2% for Urdu over the previous XLM model. We also present a detailed empirical evaluation of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make XLM-R code, data, and models publicly available.

Tips:

XLM-RoBERTa is a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does not require a lang parameter to understand which language is used, and should be able to determine the correct language from the input ids.
This implementation is the same as RoBERTa. Refer to the RoBertaEmbeddings for usage examples as well as the information relative to the inputs and outputs.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: WORD_EMBEDDINGS

Python API: XlmRoBertaEmbeddings

Scala API: XlmRoBertaEmbeddings

Source: XlmRoBertaEmbeddings

Show Examples

This example shows how to predict classes by using the embeddings generated by the Transformer.

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# First extract the prerequisites for the NerDLModel
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# Use the transformer embeddings
embeddings = XlmRoBertaEmbeddings.pretrained('xlm_roberta_base', 'xx') \
    .setInputCols(["token", "document"]) \
    .setOutputCol("embeddings")

# This pretrained model requires those specific transformer embeddings
ner_model = NerDLModel.pretrained('ner_conll_xlm_roberta_base', 'en') \
    .setInputCols(['document', 'token', 'embeddings']) \
    .setOutputCol('ner')

pipeline = Pipeline().setStages([
    documentAssembler,
    sentence,
    tokenizer,
    embeddings,
    ner_model
])

data = spark.createDataFrame([["U.N. official Ekeus heads for Baghdad."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.select("ner.result").show(truncate=False)
+------------------------------------+
|result                              |
+------------------------------------+
|[B-ORG, O, O, B-PER, O, O, B-LOC, O]|
+------------------------------------+

XlmRoBertaForQuestionAnswering

XlmRoBertaForQuestionAnswering can load XLM-RoBERTa Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits).
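The span start and span end logits mentioned above can be turned into an answer by picking the pair of positions with the highest combined score, subject to the start not coming after the end. The following is a minimal sketch of that selection step with made-up logits; the function name `best_span` and the `max_answer_len` limit are assumptions, not part of the Spark NLP API.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return the (start, end) pair with the highest summed logit, start <= end."""
    best, best_score = (0, 0), float("-inf")
    for i, start_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit
        for j in range(i, min(i + max_answer_len, len(end_logits))):
            score = start_score + end_logits[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best

tokens = ["My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", "."]
# Made-up logits for illustration; a real model produces one pair per token.
start_logits = [0.1, 0.0, 0.2, 4.0, 0.1, 0.0, 0.1, 0.2, 1.5, 0.0]
end_logits = [0.0, 0.1, 0.1, 3.5, 0.2, 0.0, 0.1, 0.1, 2.0, 0.1]

start, end = best_span(start_logits, end_logits)
print(tokens[start:end + 1])  # ['Clara']
```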

Pretrained models can be loaded with pretrained of the companion object:

val spanClassifier = XlmRoBertaForQuestionAnswering.pretrained()
  .setInputCols(Array("document_question", "document_context"))
  .setOutputCol("answer")

The default model is "xlm_roberta_base_qa_squad2", if no name is provided.

For available pretrained models please see the Models Hub.

To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669. For more extended examples, see the XlmRoBertaForQuestionAnsweringTestSpec.

Input Annotator Types: DOCUMENT, DOCUMENT

Output Annotator Type: CHUNK

Python API: XlmRoBertaForQuestionAnswering

Scala API: XlmRoBertaForQuestionAnswering

Source: XlmRoBertaForQuestionAnswering

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = MultiDocumentAssembler() \
    .setInputCols(["question", "context"]) \
    .setOutputCols(["document_question", "document_context"])

spanClassifier = XlmRoBertaForQuestionAnswering.pretrained() \
    .setInputCols(["document_question", "document_context"]) \
    .setOutputCol("answer") \
    .setCaseSensitive(False)

pipeline = Pipeline().setStages([
    documentAssembler,
    spanClassifier
])

data = spark.createDataFrame([["What's my name?", "My name is Clara and I live in Berkeley."]]).toDF("question", "context")

result = pipeline.fit(data).transform(data)

result.select("answer.result").show(truncate=False)
+--------------------+
|result              |
+--------------------+
|[Clara]             |
+--------------------+

XlmRoBertaForSequenceClassification

XlmRoBertaForSequenceClassification can load XLM-RoBERTa Models with sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for document classification tasks.

For multi-class, use setActivation("softmax"). For multi-label, use setActivation("sigmoid").
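The difference between the two activations can be sketched in plain Python: softmax makes the scores compete so that exactly one label wins, while sigmoid scores each label independently so that several can pass a threshold. The labels, logits, and the 0.5 threshold below are made up for illustration and are not taken from any pretrained model.

```python
import math

def softmax(logits):
    """Multi-class: scores compete and sum to 1, so one label wins."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    """Multi-label: each score is squashed independently into (0, 1)."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]

labels = ["sports", "politics", "finance"]  # hypothetical labels
logits = [2.0, 1.0, -1.0]                   # made-up classifier outputs

probs = softmax(logits)
single = labels[probs.index(max(probs))]

threshold = 0.5  # assumed cut-off for the multi-label case
multi = [label for label, p in zip(labels, sigmoid(logits)) if p >= threshold]

print(single)  # sports
print(multi)   # ['sports', 'politics']
```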

Pretrained models can be loaded with pretrained of the companion object:

val sequenceClassifier = XlmRoBertaForSequenceClassification.pretrained()
  .setInputCols("token", "document")
  .setOutputCol("label")

The default model is "xlm_roberta_base_sequence_classifier_imdb", if no name is provided.

For available pretrained models please see the Models Hub.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: CATEGORY

Python API: XlmRoBertaForSequenceClassification

Scala API: XlmRoBertaForSequenceClassification

Source: XlmRoBertaForSequenceClassification

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

sequenceClassifier = XlmRoBertaForSequenceClassification.pretrained() \
    .setInputCols(["token", "document"]) \
    .setOutputCol("label") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    sequenceClassifier
])

data = spark.createDataFrame([["I loved this movie when I was a child."], ["It was pretty boring."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("label.result").show(truncate=False)
+------+
|result|
+------+
|[pos] |
|[neg] |
+------+

XlmRoBertaForTokenClassification

XlmRoBertaForTokenClassification can load XLM-RoBERTa Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.

Pretrained models can be loaded with pretrained of the companion object:

val tokenClassifier = XlmRoBertaForTokenClassification.pretrained()
  .setInputCols("token", "document")
  .setOutputCol("label")

The default model is "xlm_roberta_base_token_classifier_conll03", if no name is provided.

For available pretrained models please see the Models Hub.

For extended examples of usage, see the XlmRoBertaForTokenClassificationTestSpec. Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see Import Transformers into Spark NLP 🚀.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: NAMED_ENTITY

Python API: XlmRoBertaForTokenClassification

Scala API: XlmRoBertaForTokenClassification

Source: XlmRoBertaForTokenClassification

Show Examples

This example shows how to predict classes by using the embeddings generated by the Transformer.

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

tokenClassifier = XlmRoBertaForTokenClassification.pretrained() \
    .setInputCols(["token", "document"]) \
    .setOutputCol("label") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    tokenClassifier
])

data = spark.createDataFrame([["John Lenon was born in London and lived in Paris. My name is Sarah and I live in London"]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.select("label.result").show(truncate=False)
+------------------------------------------------------------------------------------+
|result                                                                              |
+------------------------------------------------------------------------------------+
|[B-PER, I-PER, O, O, O, B-LOC, O, O, O, B-LOC, O, O, O, O, B-PER, O, O, O, O, B-LOC]|
+------------------------------------------------------------------------------------+
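The per-token IOB tags in the output above are usually grouped into entity spans before further use. Within a pipeline, Spark NLP's NerConverter annotator serves this purpose; the plain-Python sketch below only illustrates the grouping logic, and the helper name `group_entities` is an assumption.

```python
def group_entities(tokens, tags):
    """Merge B-/I- tagged tokens into (text, label) entity spans."""
    entities, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if starts_new:
            # Close any open entity, then start a new one
            if current:
                entities.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-"):
            current.append(token)  # continue the open entity
        else:
            # "O" tag: close any open entity
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities

tokens = ["John", "Lenon", "was", "born", "in", "London", "and", "lived",
          "in", "Paris", ".", "My", "name", "is", "Sarah", "and", "I",
          "live", "in", "London"]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC", "O", "O", "O", "B-LOC",
        "O", "O", "O", "O", "B-PER", "O", "O", "O", "O", "B-LOC"]

print(group_entities(tokens, tags))
```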

XlmRoBertaForZeroShotClassification

XlmRoBertaForZeroShotClassification uses a ModelForSequenceClassification trained on NLI (natural language inference) tasks. It is the equivalent of XlmRoBertaForSequenceClassification models, but it does not require a hardcoded number of potential classes; they can be chosen at runtime. This usually makes it slower, but much more flexible.

Note that the model will loop through all provided labels. So the more labels you have, the longer this process will take.

Any combination of sequences and labels can be passed and each combination will be posed as a premise/hypothesis pair and passed to the pretrained model.
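The pairing described above can be sketched as follows: every sequence is combined with every candidate label, and each label is turned into a hypothesis sentence. This is a minimal illustration, not the annotator's code; the function name `build_nli_pairs`, the candidate labels, and the hypothesis template "This example is {}." are assumptions.

```python
from itertools import product

def build_nli_pairs(sequences, candidate_labels, template="This example is {}."):
    """Return one (premise, label, hypothesis) triple per sequence/label pair."""
    return [
        (premise, label, template.format(label))
        for premise, label in product(sequences, candidate_labels)
    ]

sequences = ["I loved this movie when I was a child.", "It was pretty boring."]
candidate_labels = ["positive", "negative"]  # hypothetical labels

pairs = build_nli_pairs(sequences, candidate_labels)
for premise, label, hypothesis in pairs:
    print(f"{premise!r} -> {hypothesis!r}")
print(len(pairs))  # 4
```

The number of pairs grows as sequences times labels, which is why adding labels lengthens the run time.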

Pretrained models can be loaded with pretrained of the companion object:

val sequenceClassifier = XlmRoBertaForZeroShotClassification.pretrained()
  .setInputCols("token", "document")
  .setOutputCol("label")

The default model is "xlm_roberta_large_zero_shot_classifier_xnli_anli", if no name is provided.

For available pretrained models please see the Models Hub.

To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669.

Input Annotator Types: TOKEN, DOCUMENT

Output Annotator Type: CATEGORY

Python API: XlmRoBertaForZeroShotClassification

Scala API: XlmRoBertaForZeroShotClassification

Source: XlmRoBertaForZeroShotClassification

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")
sequenceClassifier = XlmRoBertaForZeroShotClassification.pretrained() \
    .setInputCols(["token", "document"]) \
    .setOutputCol("label") \
    .setCaseSensitive(True)
pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    sequenceClassifier
])

data = spark.createDataFrame([["I loved this movie when I was a child."], ["It was pretty boring."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("label.result").show(truncate=False)
+------+
|result|
+------+
|[pos] |
|[neg] |
+------+

XlmRoBertaSentenceEmbeddings

Sentence-level embeddings using XLM-RoBERTa. The XLM-RoBERTa model was proposed in Unsupervised Cross-lingual Representation Learning at Scale by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. It is based on Facebook’s RoBERTa model released in 2019. It is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl data.

Pretrained models can be loaded with pretrained of the companion object:

val embeddings = XlmRoBertaSentenceEmbeddings.pretrained()
  .setInputCols("document")
  .setOutputCol("sentence_embeddings")

The default model is "sent_xlm_roberta_base", default language is "xx" (meaning multi-lingual), if no values are provided. For available pretrained models please see the Models Hub.

Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see Import Transformers into Spark NLP 🚀.

Tips:

XLM-RoBERTa is a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does not require a lang parameter to understand which language is used, and should be able to determine the correct language from the input ids.
This implementation is the same as RoBERTa. Refer to the RoBertaEmbeddings for usage examples as well as the information relative to the inputs and outputs.

Input Annotator Types: DOCUMENT

Output Annotator Type: SENTENCE_EMBEDDINGS

Python API: XlmRoBertaSentenceEmbeddings

Scala API: XlmRoBertaSentenceEmbeddings

Source: XlmRoBertaSentenceEmbeddings

Show Examples

This example shows how to predict classes by using the embeddings generated by the Transformer.

# Coming Soon!

This example shows how to train an Approach Annotator by using the embeddings generated by the Transformer.

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

smallCorpus = spark.read.option("header","True").csv("sentiment.csv")

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

embeddings = XlmRoBertaSentenceEmbeddings.pretrained() \
  .setInputCols(["document"])\
  .setOutputCol("sentence_embeddings")

# Then the training can start with the transformer embeddings
docClassifier = ClassifierDLApproach() \
    .setInputCols("sentence_embeddings") \
    .setOutputCol("category") \
    .setLabelColumn("label") \
    .setBatchSize(64) \
    .setMaxEpochs(20) \
    .setLr(5e-3) \
    .setDropout(0.5)

pipeline = Pipeline().setStages([
    documentAssembler,
    embeddings,
    docClassifier
])

pipelineModel = pipeline.fit(smallCorpus)

This example shows how to extract the embeddings generated by the Transformer.

import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

sentenceEmbeddings = XlmRoBertaSentenceEmbeddings.pretrained() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings") \
    .setCaseSensitive(True)

# you can either use the output to train ClassifierDL, SentimentDL, or MultiClassifierDL
# or you can use EmbeddingsFinisher to prepare the results for Spark ML functions

embeddingsFinisher = EmbeddingsFinisher() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCols("finished_embeddings") \
    .setOutputAsVector(True) \
    .setCleanAnnotations(False)

pipeline = Pipeline() \
    .setStages([
      documentAssembler,
      tokenizer,
      sentenceEmbeddings,
      embeddingsFinisher
    ])

data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[-0.05969233065843582,-0.030789051204919815,0.04443822056055069,0.09564960747...|
|[-0.038839809596538544,0.011712731793522835,0.019954433664679527,0.0667808502...|
|[-0.03952755779027939,-0.03455188870429993,0.019103847444057465,0.04311436787...|
|[-0.09579929709434509,0.02494969218969345,-0.014753809198737144,0.10259044915...|
|[0.004710011184215546,-0.022148698568344116,0.011723337695002556,-0.013356896...|
+--------------------------------------------------------------------------------+
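Once embeddings are available as plain vectors, a common next step is to compare them. The sketch below computes cosine similarity in plain Python; the three-dimensional vectors are made-up stand-ins for real sentence embeddings, and the helper name `cosine_similarity` is an assumption.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

# Toy vectors standing in for sentence embeddings
sentence_a = [0.2, -0.1, 0.4]
sentence_b = [0.4, -0.2, 0.8]   # same direction as sentence_a
sentence_c = [-0.2, 0.1, -0.4]  # opposite direction

print(cosine_similarity(sentence_a, sentence_b))
print(cosine_similarity(sentence_a, sentence_c))
```

Vectors pointing in the same direction score close to 1, and opposite directions score close to -1.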

XlnetEmbeddings

XlnetEmbeddings (XLNet): Generalized Autoregressive Pretraining for Language Understanding

XLNet is a new unsupervised language representation learning method based on a novel generalized permutation language modeling objective. Additionally, XLNet employs Transformer-XL as the backbone model, exhibiting excellent performance for language tasks involving long context. Overall, XLNet achieves state-of-the-art (SOTA) results on various downstream language tasks including question answering, natural language inference, sentiment analysis, and document ranking.

These word embeddings represent the outputs generated by the XLNet models.

Note that this is a very computationally expensive module compared to word embedding modules that only perform embedding lookups. The use of an accelerator is recommended.

Spark NLP Model	Google Model	Model Properties
`"xlnet_large_cased"`	XLNet-Large	24-layer, 1024-hidden, 16-heads
`"xlnet_base_cased"`	XLNet-Base	12-layer, 768-hidden, 12-heads. This model is trained on full data (different from the one in the paper).

Pretrained models can be loaded with pretrained of the companion object:

val embeddings = XlnetEmbeddings.pretrained()
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")

# Offline - Download the pretrained model manually and extract it
xlnet = XlnetEmbeddings.load("/xlnet_large_cased_en_2.5.0_2.4_1588074397954") \
        .setInputCols("sentence", "token") \
        .setOutputCol("xlnet")

The default model is "xlnet_base_cased", if no name is provided.

For extended examples of usage, see the Examples and the XlnetEmbeddingsTestSpec.

Sources:

XLNet: Generalized Autoregressive Pretraining for Language Understanding

https://github.com/zihangdai/xlnet

Paper abstract:

With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: WORD_EMBEDDINGS

Python API: XlnetEmbeddings

Scala API: XlnetEmbeddings

Source: XlnetEmbeddings

Show Examples

This example shows how to predict classes by using the embeddings generated by the Transformer.

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

# First extract the prerequisites for the NerDLModel
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Use the transformer embeddings
embeddings = XlnetEmbeddings.pretrained("xlnet_base_cased") \
    .setInputCols(['document', 'token']) \
    .setOutputCol('embeddings')

# This pretrained model requires those specific transformer embeddings
ner_model = NerDLModel.pretrained("ner_conll_xlnet_base_cased", "en") \
    .setInputCols(["document", "token", "embeddings"]) \
    .setOutputCol("ner")

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    embeddings,
    ner_model
])

data = spark.createDataFrame([["U.N. official Ekeus heads for Baghdad."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.select("ner.result").show(truncate=False)
+------------------------------------+
|result                              |
+------------------------------------+
|[I-LOC, O, O, I-PER, O, O, I-LOC, O]|
+------------------------------------+

XlnetForSequenceClassification

XlnetForSequenceClassification can load XLNet Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks.

Pretrained models can be loaded with pretrained of the companion object:

val sequenceClassifier = XlnetForSequenceClassification.pretrained()
  .setInputCols("token", "document")
  .setOutputCol("label")

The default model is "xlnet_base_sequence_classifier_imdb", if no name is provided.

For available pretrained models please see the Models Hub.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: CATEGORY

Python API: XlnetForSequenceClassification

Scala API: XlnetForSequenceClassification

Source: XlnetForSequenceClassification

Show Example

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline
documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

sequenceClassifier = XlnetForSequenceClassification.pretrained() \
    .setInputCols(["token", "document"]) \
    .setOutputCol("label") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    sequenceClassifier
])

data = spark.createDataFrame([["I loved this movie when I was a child."], ["It was pretty boring."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.select("label.result").show(truncate=False)
+------+
|result|
+------+
|[pos] |
|[neg] |
+------+

XlnetForTokenClassification

XlnetForTokenClassification can load XLNet Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.

Pretrained models can be loaded with pretrained of the companion object:

val tokenClassifier = XlnetForTokenClassification.pretrained()
  .setInputCols("token", "document")
  .setOutputCol("label")

The default model is "xlnet_base_token_classifier_conll03", if no name is provided.

For available pretrained models please see the Models Hub.

For extended examples of usage, see the XlnetForTokenClassificationTestSpec. Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them see Import Transformers into Spark NLP 🚀.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: NAMED_ENTITY

Python API: XlnetForTokenClassification

Scala API: XlnetForTokenClassification

Source: XlnetForTokenClassification

This example shows how to predict classes by using the embeddings generated by the Transformer.

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

tokenClassifier = XlnetForTokenClassification.pretrained() \
    .setInputCols(["token", "document"]) \
    .setOutputCol("label") \
    .setCaseSensitive(True)

pipeline = Pipeline().setStages([
    documentAssembler,
    tokenizer,
    tokenClassifier
])

data = spark.createDataFrame([["John Lenon was born in London and lived in Paris. My name is Sarah and I live in London"]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.select("label.result").show(truncate=False)
+------------------------------------------------------------------------------------+
|result                                                                              |
+------------------------------------------------------------------------------------+
|[B-PER, I-PER, O, O, O, B-LOC, O, O, O, B-LOC, O, O, O, O, B-PER, O, O, O, O, B-LOC]|
+------------------------------------------------------------------------------------+
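
The `label.result` column holds one IOB tag per token, so downstream code usually has to regroup the tags into entity spans. Inside a pipeline, Spark NLP's NerConverter annotator serves this purpose; the plain-Python sketch below only illustrates the grouping logic on the output shown above. The helper name `iob_to_spans` is hypothetical and not part of Spark NLP, and the token list is assumed to match the tokenizer output.

```python
def iob_to_spans(tokens, tags):
    """Group IOB tags (B-X, I-X, O) into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []

    def flush():
        # Close the span that is currently open, if any.
        nonlocal current_type, current_tokens
        if current_tokens:
            spans.append((current_type, " ".join(current_tokens)))
        current_type, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity.
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity:
            # close the span and skip this token.
            flush()
    flush()
    return spans


# Tokens assumed to mirror the Tokenizer output for the example sentence.
tokens = ("John Lenon was born in London and lived in Paris . "
          "My name is Sarah and I live in London").split()
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC", "O", "O", "O", "B-LOC",
        "O", "O", "O", "O", "B-PER", "O", "O", "O", "O", "B-LOC"]

spans = iob_to_spans(tokens, tags)
print(spans)
```

Tokens tagged `O` are dropped, and consecutive `B-`/`I-` tags of the same type are joined with a space, which is a simplification of how the original text would be reconstructed.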

ZeroShotNer

ZeroShotNerModel implements zero-shot named entity recognition by utilizing RoBERTa transformer models fine-tuned on a question answering task.

Its input is a list of document annotations, and it automatically generates questions which are used to recognize entities. The entity definitions are given by a dictionary structure that specifies a set of questions for each entity. The model is based on RoBertaForQuestionAnswering.
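
To make the shape of the entity definitions concrete, the sketch below flattens such a dictionary into (entity, question) pairs, the kind of list a question answering model could be queried with for each document. This is only an illustration in plain Python: `expand_entity_definitions` is a hypothetical helper, and how ZeroShotNerModel builds its questions internally is not shown here.

```python
def expand_entity_definitions(definitions):
    """Flatten {entity: [questions]} into a list of (entity, question) pairs."""
    pairs = []
    for entity, questions in definitions.items():
        if not questions:
            # An entity without questions could never be recognized.
            raise ValueError(f"entity {entity!r} needs at least one question")
        for question in questions:
            pairs.append((entity, question))
    return pairs


# Same structure as the dictionary passed to setEntityDefinitions.
definitions = {
    "NAME": ["What is his name?", "What is my name?", "What is her name?"],
    "CITY": ["Which city?", "Which is the city?"],
}

pairs = expand_entity_definitions(definitions)
for entity, question in pairs:
    print(entity, "<-", question)
```

Each answer found for a question would then be labeled with the entity on the left-hand side of its pair, which is why several phrasings per entity can improve recall.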

For more extended examples see the Examples.

Pretrained models can be loaded with pretrained of the companion object:

val zeroShotNer = ZeroShotNerModel.pretrained()
  .setInputCols("document", "token")
  .setOutputCol("zero_shot_ner")

For available pretrained models please see the Models Hub.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: NAMED_ENTITY

Python API: ZeroShotNerModel

Scala API: ZeroShotNerModel

Source: ZeroShotNerModel

import sparknlp
from sparknlp.base import *
from sparknlp.annotator import *
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# Each entity is defined by the questions used to find it
zero_shot_ner = ZeroShotNerModel.pretrained() \
    .setEntityDefinitions(
        {
            "NAME": ["What is his name?", "What is my name?", "What is her name?"],
            "CITY": ["Which city?", "Which is the city?"]
        }) \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("zero_shot_ner")

data = spark.createDataFrame(
        [["My name is Clara, I live in New York and Hellen lives in Paris."]]
    ).toDF("text")

# One output row per recognized entity, with the question that produced it
Pipeline() \
    .setStages([document_assembler, sentence_detector, tokenizer, zero_shot_ner]) \
    .fit(data) \
    .transform(data) \
    .selectExpr("document", "explode(zero_shot_ner) AS entity") \
    .select(
        "document.result",
        "entity.result",
        "entity.metadata.word",
        "entity.metadata.confidence",
        "entity.metadata.question") \
    .show(truncate=False)

Import Transformers into Spark NLP

Overview

Since 3.1.0, we have extended support for models exported from HuggingFace 🤗 and TF Hub to their equivalent Spark NLP 🚀 annotators. Starting with that release, you can use the saved_model feature in HuggingFace to import, within a few lines of code, any BERT, DistilBERT, CamemBERT, RoBERTa, DeBERTa, XLM-RoBERTa, Longformer, BertForTokenClassification, DistilBertForTokenClassification, AlbertForTokenClassification, RoBertaForTokenClassification, DeBertaForTokenClassification, XlmRoBertaForTokenClassification, XlnetForTokenClassification, LongformerForTokenClassification, CamemBertForTokenClassification, CamemBertForSequenceClassification, CamemBertForQuestionAnswering, BertForSequenceClassification, DistilBertForSequenceClassification, AlbertForSequenceClassification, RoBertaForSequenceClassification, DeBertaForSequenceClassification, XlmRoBertaForSequenceClassification, XlnetForSequenceClassification, LongformerForSequenceClassification, AlbertForQuestionAnswering, BertForQuestionAnswering, DeBertaForQuestionAnswering, DistilBertForQuestionAnswering, LongformerForQuestionAnswering, RoBertaForQuestionAnswering, XlmRoBertaForQuestionAnswering, TapasForQuestionAnswering, Vision Transformers (ViT), HubertForCTC, SwinForImageClassification, and ConvNextForImageClassification models into Spark NLP. We will work on the remaining annotators and extend this support with each release 😊
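
As a rough illustration of the saved_model workflow mentioned above, the sketch below exports a BERT model from HuggingFace and loads it into BertEmbeddings. It assumes a transformers installation with TensorFlow support and a running Spark NLP session; the helper names (`saved_model_dir`, `export_bert_for_spark_nlp`, `load_into_spark_nlp`) and the `_tokenizer` folder suffix are hypothetical choices for this example. Other annotators may need different asset files, so treat this as a starting point rather than a general recipe.

```python
import os
import shutil


def saved_model_dir(export_dir):
    """Folder where save_pretrained(..., saved_model=True) writes the TF SavedModel."""
    return os.path.join(export_dir, "saved_model", "1")


def export_bert_for_spark_nlp(model_name, export_dir):
    """Export a HuggingFace BERT model and place its vocabulary in the assets folder."""
    # Imported lazily so that saved_model_dir can be used without TensorFlow installed.
    from transformers import BertTokenizer, TFBertModel

    tokenizer_dir = export_dir + "_tokenizer"  # hypothetical folder name
    BertTokenizer.from_pretrained(model_name).save_pretrained(tokenizer_dir)
    TFBertModel.from_pretrained(model_name).save_pretrained(export_dir, saved_model=True)

    # The vocabulary has to sit next to the model, in the SavedModel assets folder.
    assets_dir = os.path.join(saved_model_dir(export_dir), "assets")
    os.makedirs(assets_dir, exist_ok=True)
    shutil.copy(os.path.join(tokenizer_dir, "vocab.txt"), assets_dir)
    return saved_model_dir(export_dir)


def load_into_spark_nlp(spark, model_dir):
    """Wrap the exported SavedModel in a Spark NLP BertEmbeddings annotator."""
    from sparknlp.annotator import BertEmbeddings

    return (
        BertEmbeddings.loadSavedModel(model_dir, spark)
        .setInputCols(["document", "token"])
        .setOutputCol("embeddings")
    )
```

With an active session, a call such as `load_into_spark_nlp(spark, export_bert_for_spark_nlp("bert-base-cased", "export"))` would return an annotator that can be used in a pipeline or saved for later reuse; the model name here is only an example.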

Compatibility

Spark NLP: The equivalent annotator in Spark NLP
TF Hub: Models from TF Hub
HuggingFace: Models from HuggingFace
ONNX: Models from HuggingFace in ONNX format
Model Architecture: Which architecture is compatible with that annotator

Flags:

Fully supported ✅
Partially supported (requires workarounds) ✔️
Under development ❎
Not supported ❌

Spark NLP	TF Hub	HuggingFace	ONNX	Model Architecture
AlbertEmbeddings	✅	✅	✅	ALBERT
AlbertForQuestionAnswering		✅	❎	TFAlbertForQuestionAnswering
AlbertForSequenceClassification		✅	❎	TFAlbertForSequenceClassification
AlbertForTokenClassification		✅	❎	TFAlbertForTokenClassification
Automatic Speech Recognition (Wav2Vec2ForCTC)		❎	❎	TFWav2Vec2ForCTC
BartForZeroShotClassification		✅	❎	TFBartForSequenceClassification
BartTransformer		✅	❎	TFBartForConditionalGeneration
BertEmbeddings	✅	✅	✅	BERT - Small BERT - ELECTRA
BertForQuestionAnswering		✅	✅	TFBertForQuestionAnswering
BertForSequenceClassification		✅	✅	TFBertForSequenceClassification
BertForTokenClassification		✅	✅	TFBertForTokenClassification
BertForZeroShotClassification		✅	❎	TFBertForSequenceClassification
BertSentenceEmbeddings	✅	✅	❎	BERT - Small BERT - ELECTRA
CamemBertEmbeddings		✅	✅	CamemBERT
CamemBertForQuestionAnswering		✅	❎	TFCamembertForQuestionAnswering
CamemBertForSequenceClassification		✅	❎	TFCamembertForSequenceClassification
CamemBertForTokenClassification		✅	❎	TFCamembertForTokenClassification
CLIPForZeroShotClassification		✅	✅	CLIP
ConvNextForImageClassification		❎	❎	TFConvNextForImageClassification
DeBertaEmbeddings		✅	✅	DeBERTa-v2 - DeBERTa-v3
DeBertaForQuestionAnswering		✅	❎	TFDebertaV2ForQuestionAnswering
DeBertaForSequenceClassification		✅	❎	TFDebertaV2ForSequenceClassification
DeBertaForTokenClassification		✅	❎	TFDebertaV2ForTokenClassification
DistilBertEmbeddings		✅	✅	DistilBERT
DistilBertForQuestionAnswering		✅	✅	TFDistilBertForQuestionAnswering
DistilBertForSequenceClassification		✅	✅	TFDistilBertForSequenceClassification
DistilBertForTokenClassification		✅	✅	TFDistilBertForTokenClassification
DistilBertForZeroShotClassification		✅	❎	TFDistilBertForSequenceClassification
E5Embeddings		✅	✅	SentenceTransformer
ElmoEmbeddings	❎		❌
HubertForCTC		❎	❎	TFHubertForCTC
InstructorEmbeddings		✅	❎	INSTRUCTOR
LongformerEmbeddings		✅	❌	Longformer
LongformerForQuestionAnswering		✅	❎	TFLongformerForQuestionAnswering
LongformerForSequenceClassification		✅	❎	TFLongformerForSequenceClassification
LongformerForTokenClassification		✅	❎	TFLongformerForTokenClassification
MarianTransformer		❌	❎
MPNetEmbeddings		✅	✅	SentenceTransformer
OpenAI GPT2		❌	❎
RoBertaEmbeddings		✅	✅	RoBERTa - DistilRoBERTa
RoBertaForQuestionAnswering		✅	❎	TFRobertaForQuestionAnswering
RoBertaForSequenceClassification		✅	❎	TFRobertaForSequenceClassification
RoBertaForTokenClassification		✅	❎	TFRobertaForTokenClassification
RoBertaForZeroShotClassification		✅	❎	TFRobertaForSequenceClassification
RoBertaSentenceEmbeddings		✅	✅	RoBERTa - DistilRoBERTa
SwinForImageClassification		❎	❎	TFSwinForImageClassification
T5Transformer		❌	❎
TapasForQuestionAnswering		❎	❎	TFTapasForQuestionAnswering
UniversalSentenceEncoder	❎		❌
VisionEncoderDecoderForImageCaptioning		✅	❎	VisionEncoderDecoderModel
ViTForImageClassification	❌	✅	❎	TFViTForImageClassification
WhisperForCTC		✅	✅	WhisperForConditionalGeneration
XlmRoBertaEmbeddings		✅	✅	XLM-RoBERTa
XlmRoBertaForQuestionAnswering		✅	❎	TFXLMRobertaForQuestionAnswering
XlmRoBertaForSequenceClassification		✅	❎	TFXLMRobertaForSequenceClassification
XlmRoBertaForTokenClassification		✅	❎	TFXLMRobertaForTokenClassification
XlmRoBertaForZeroShotClassification		✅	❎	TFXLMRobertaForSequenceClassification
XlmRoBertaSentenceEmbeddings		✅	❎	SentenceTransformer
XlnetEmbeddings		✅	❌	XLNet
XlnetForSequenceClassification		✅	❎	TFXLNetForSequenceClassification
XlnetForTokenClassification		✅	❎	TFXLNetForTokenClassification
ZeroShotNerModel		✅	❎	TFRobertaForQuestionAnswering

Example Notebooks

HuggingFace, Optimum, PyTorch, and ONNX Runtime to Spark NLP (ONNX)

Spark NLP	Notebooks	Colab
AlbertForQuestionAnswering	HuggingFace ONNX in Spark NLP AlbertForQuestionAnswering
AlbertForSequenceClassification	HuggingFace ONNX in Spark NLP AlbertForSequenceClassification
AlbertForTokenClassification	HuggingFace ONNX in Spark NLP AlbertForTokenClassification
BertEmbeddings	HuggingFace ONNX in Spark NLP BERT
BertForQuestionAnswering	HuggingFace ONNX in Spark NLP BertForQuestionAnswering
BertForSequenceClassification	HuggingFace ONNX in Spark NLP BertForSequenceClassification
BertForTokenClassification	HuggingFace ONNX in Spark NLP BertForTokenClassification
BertSentenceEmbeddings	HuggingFace ONNX in Spark NLP BertSentenceEmbeddings
CLIPForZeroShotClassification	HuggingFace ONNX in Spark NLP CLIP
DeBertaEmbeddings	HuggingFace ONNX in Spark NLP DeBERTa
DistilBertEmbeddings	HuggingFace ONNX in Spark NLP DistilBERT
DistilBertForQuestionAnswering	HuggingFace ONNX in Spark NLP DistilBertForQuestionAnswering
DistilBertForSequenceClassification	HuggingFace ONNX in Spark NLP DistilBertForSequenceClassification
DistilBertForTokenClassification	HuggingFace ONNX in Spark NLP DistilBertForTokenClassification
E5Embeddings	HuggingFace ONNX in Spark NLP E5
MarianTransformer	HuggingFace ONNX in Spark NLP Marian
MPNetEmbeddings	HuggingFace ONNX in Spark NLP MPNet
RoBertaEmbeddings	HuggingFace ONNX in Spark NLP RoBERTa
RoBertaForQuestionAnswering	HuggingFace ONNX in Spark NLP RoBertaForQuestionAnswering
RoBertaForSequenceClassification	HuggingFace ONNX in Spark NLP RoBertaForSequenceClassification
RoBertaForTokenClassification	HuggingFace ONNX in Spark NLP RoBertaForTokenClassification
T5Transformer	HuggingFace ONNX in Spark NLP T5
WhisperForCTC	HuggingFace ONNX in Spark NLP WhisperForCTC
XlmRoBertaSentenceEmbeddings	HuggingFace ONNX in Spark NLP XlmRoBertaSentenceEmbeddings

HuggingFace to Spark NLP (TensorFlow)

Spark NLP	Notebooks	Colab
AlbertEmbeddings	HuggingFace in Spark NLP - ALBERT
AlbertForQuestionAnswering	HuggingFace in Spark NLP - AlbertForQuestionAnswering
AlbertForSequenceClassification	HuggingFace in Spark NLP - AlbertForSequenceClassification
AlbertForTokenClassification	HuggingFace in Spark NLP - AlbertForTokenClassification
BertEmbeddings	HuggingFace in Spark NLP - BERT
BertForQuestionAnswering	HuggingFace in Spark NLP - BertForQuestionAnswering
BertForSequenceClassification	HuggingFace in Spark NLP - BertForSequenceClassification
BertForTokenClassification	HuggingFace in Spark NLP - BertForTokenClassification
BertForZeroShotClassification	HuggingFace in Spark NLP - BertForZeroShotClassification
BertSentenceEmbeddings	HuggingFace in Spark NLP - BERT Sentence
BertSentenceEmbeddings - Fine Tuned	Fine Tuned Sentence Bert in Spark NLP
CamemBertEmbeddings	HuggingFace in Spark NLP - CamemBERT
CamemBertForQuestionAnswering	HuggingFace in Spark NLP - CamemBertForQuestionAnswering
CamemBertForSequenceClassification	HuggingFace in Spark NLP - CamemBertForSequenceClassification
CamemBertForTokenClassification	HuggingFace in Spark NLP - CamemBertForTokenClassification
ConvNextForImageClassification	HuggingFace in Spark NLP - ConvNextForImageClassification
DeBertaEmbeddings	HuggingFace in Spark NLP - DeBERTa
DeBertaForQuestionAnswering	HuggingFace in Spark NLP - DeBertaForQuestionAnswering
DistilBertEmbeddings	HuggingFace in Spark NLP - DistilBERT
DistilBertForQuestionAnswering	HuggingFace in Spark NLP - DistilBertForQuestionAnswering
DistilBertForSequenceClassification	HuggingFace in Spark NLP - DistilBertForSequenceClassification
DistilBertForTokenClassification	HuggingFace in Spark NLP - DistilBertForTokenClassification
DistilBertForZeroShotClassification	HuggingFace in Spark NLP - DistilBertForZeroShotClassification
LongformerEmbeddings	HuggingFace in Spark NLP - Longformer
LongformerForQuestionAnswering	HuggingFace in Spark NLP - LongformerForQuestionAnswering
LongformerForSequenceClassification	HuggingFace in Spark NLP - LongformerForSequenceClassification
RoBertaEmbeddings	HuggingFace in Spark NLP - RoBERTa
RoBertaForQuestionAnswering	HuggingFace in Spark NLP - RoBertaForQuestionAnswering
RoBertaForSequenceClassification	HuggingFace in Spark NLP - RoBertaForSequenceClassification
RoBertaForTokenClassification	HuggingFace in Spark NLP - RoBertaForTokenClassification
RoBertaForZeroShotClassification	HuggingFace in Spark NLP - RoBertaForZeroShotClassification
SwinForImageClassification	HuggingFace in Spark NLP - SwinForImageClassification
ViTForImageClassification	HuggingFace in Spark NLP - ViTForImageClassification
WhisperForCTC	HuggingFace in Spark NLP - WhisperForCTC
XlmRoBertaEmbeddings	HuggingFace in Spark NLP - XLM-RoBERTa
XlmRoBertaForQuestionAnswering	HuggingFace in Spark NLP - XlmRoBertaForQuestionAnswering
XlmRoBertaForSequenceClassification	HuggingFace in Spark NLP - XlmRoBertaForSequenceClassification
XlmRoBertaForTokenClassification	HuggingFace in Spark NLP - XlmRoBertaForTokenClassification
XlnetEmbeddings	HuggingFace in Spark NLP - XLNet
XlnetForSequenceClassification	HuggingFace in Spark NLP - XlnetForSequenceClassification
T5Transformer	HuggingFace in Spark NLP - T5

TF Hub to Spark NLP

Spark NLP	TF Hub Notebooks	Colab
AlbertEmbeddings	TF Hub in Spark NLP - ALBERT
BertEmbeddings	TF Hub in Spark NLP - BERT
BertSentenceEmbeddings	TF Hub in Spark NLP - BERT Sentence
