# Transformers

These are the Transformers that are available in Spark NLP. The created embeddings can be combined with DL-based annotators, such as ClassifierDL and NerDL, to achieve state-of-the-art results in a wide range of language tasks.

## Overview

Spark NLP 🚀 3.1.0 is out! We have extended support for HuggingFace 🤗 exported models in equivalent Spark NLP annotators. Starting with this release, you can easily use the saved_model feature of HuggingFace within a few lines of code and import any BERT, DistilBERT, RoBERTa, or XLM-RoBERTa model into Spark NLP. We will extend this support to the remaining annotators with each release 😊
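
The sketch below outlines that workflow for a BERT model. It follows the export/import steps shown in the workshop notebooks, but the model name, file paths, and the step of copying the tokenizer vocabulary into the SavedModel assets folder are illustrative assumptions and may need adjusting for your model.

```python
# Sketch: export a HuggingFace TF model and import it into the equivalent Spark NLP annotator.
from transformers import TFBertModel, BertTokenizer
import sparknlp
from sparknlp.annotator import BertEmbeddings

MODEL_NAME = "bert-base-cased"  # illustrative model name

# 1. Export the HuggingFace model as a TensorFlow SavedModel.
TFBertModel.from_pretrained(MODEL_NAME).save_pretrained(MODEL_NAME, saved_model=True)
# The tokenizer vocabulary is expected inside the SavedModel assets folder.
BertTokenizer.from_pretrained(MODEL_NAME).save_vocabulary(f"{MODEL_NAME}/saved_model/1/assets")

# 2. Load the SavedModel into Spark NLP and save it as a Spark NLP model.
spark = sparknlp.start()
bert = BertEmbeddings.loadSavedModel(f"{MODEL_NAME}/saved_model/1", spark) \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")
bert.write().overwrite().save(f"{MODEL_NAME}_spark_nlp")
```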

## Compatibility

• Spark NLP: the equivalent annotator in Spark NLP
• TF Hub: models from TF Hub
• HuggingFace: models from HuggingFace
• Model Architecture: which architecture is compatible with that annotator

Flags:

• Fully supported ✅
• Partially supported (requires workarounds) ✔️
• Under development ❎
• Not supported ❌

| Spark NLP | TF Hub | HuggingFace | Model Architecture |
|---|---|---|---|
| BertEmbeddings | | | BERT - Small BERT - ELECTRA |
| BertSentenceEmbeddings | | | BERT - Small BERT - ELECTRA |
| DistilBertEmbeddings | | | DistilBERT |
| RoBertaEmbeddings | | | RoBERTa - DistilRoBERTa |
| XlmRoBertaEmbeddings | | | XLM-RoBERTa |
| AlbertEmbeddings | | | |
| XlnetEmbeddings | | | |
| ElmoEmbeddings | | | |
| UniversalSentenceEncoder | | | |
| T5Transformer | | | |
| MarianTransformer | | | |

## Notebooks

| Spark NLP | HuggingFace Notebooks |
|---|---|
| BertEmbeddings | HuggingFace in Spark NLP - BERT |
| BertSentenceEmbeddings | HuggingFace in Spark NLP - BERT Sentence |
| DistilBertEmbeddings | HuggingFace in Spark NLP - DistilBERT |
| RoBertaEmbeddings | HuggingFace in Spark NLP - RoBERTa |
| XlmRoBertaEmbeddings | HuggingFace in Spark NLP - XLM-RoBERTa |
| AlbertEmbeddings | HuggingFace in Spark NLP - ALBERT |
| XlnetEmbeddings | HuggingFace in Spark NLP - XLNet |

| Spark NLP | TF Hub Notebooks |
|---|---|
| BertEmbeddings | TF Hub in Spark NLP - BERT |
| BertSentenceEmbeddings | TF Hub in Spark NLP - BERT Sentence |
| AlbertEmbeddings | TF Hub in Spark NLP - ALBERT |

## Limitations

• If you are importing models from HuggingFace to use as embeddings, they must be exported for the Fill-Mask task. This means you cannot use a model in BertEmbeddings if it was trained or fine-tuned on a token/text classification task in HuggingFace, because those models have a different architecture.
• There is a 2 GB size limit when loading a TF SavedModel in Spark NLP. If your model is larger than 2 GB, you will see the following error: Required array size too large. (We are working on getting around this Java limitation; for the time being, some models are over 2 GB and are not compatible.) A rough way to check the exported size beforehand is sketched below.
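
The helper below is purely illustrative (it is not part of Spark NLP); it simply sums the files of an exported SavedModel directory, using the example path from the Overview sketch, to estimate whether an import is likely to hit the 2 GB limit.

```python
import os

# Illustrative helper (not part of Spark NLP): rough size check of an exported
# TensorFlow SavedModel directory against the ~2 GB loading limit.
def saved_model_size_gb(path: str) -> float:
    total = sum(
        os.path.getsize(os.path.join(root, name))
        for root, _, files in os.walk(path)
        for name in files
    )
    return total / (1024 ** 3)

print(saved_model_size_gb("bert-base-cased/saved_model/1"))  # should stay below ~2 GB
```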

## AlbertEmbeddings

ALBERT: A LITE BERT FOR SELF-SUPERVISED LEARNING OF LANGUAGE REPRESENTATIONS - Google Research, Toyota Technological Institute at Chicago

These word embeddings represent the outputs generated by the ALBERT model. All official ALBERT releases by Google on TF Hub are supported with this ALBERT wrapper:

Ported TF-Hub Models:

 "albert_base_uncased" albert_base 768-embed-dim, 12-layer, 12-heads, 12M parameters
 "albert_large_uncased" albert_large 1024-embed-dim, 24-layer, 16-heads, 18M parameters
 "albert_xlarge_uncased" albert_xlarge 2048-embed-dim, 24-layer, 32-heads, 60M parameters
 "albert_xxlarge_uncased" albert_xxlarge 4096-embed-dim, 12-layer, 64-heads, 235M parameters

This model requires input tokenization with a SentencePiece model, which is provided by Spark NLP (see the tokenizers package).

Pretrained models can be loaded with pretrained of the companion object:

val embeddings = AlbertEmbeddings.pretrained()
.setInputCols("sentence", "token")
.setOutputCol("embeddings")

.setInputCols("sentence", "token") \
.setOutputCol("albert")


The default model is "albert_base_uncased", if no name is provided.

For extended examples of usage, see the Spark NLP Workshop and the AlbertEmbeddingsTestSpec.

Sources:

ALBERT: A LITE BERT FOR SELF-SUPERVISED LEARNING OF LANGUAGE REPRESENTATIONS

https://tfhub.dev/s?q=albert

Paper abstract:

Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations and longer training times. To address these problems, we present two parameter reduction techniques to lower memory consumption and increase the training speed of BERT (Devlin et al., 2019). Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT. We also use a self-supervised loss that focuses on modeling inter-sentence coherence, and show it consistently helps downstream tasks with multi-sentence inputs. As a result, our best model establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters compared to BERT-large.

Tips: ALBERT uses repeating layers, which results in a small memory footprint. However, the computational cost remains similar to a BERT-like architecture with the same number of hidden layers, since it has to iterate through the same number of (repeating) layers.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: WORD_EMBEDDINGS

 API: AlbertEmbeddings Source: AlbertEmbeddings
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")

tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")

embeddings = AlbertEmbeddings.pretrained() \
.setInputCols(["token", "document"]) \
.setOutputCol("embeddings")

embeddingsFinisher = EmbeddingsFinisher() \
.setInputCols(["embeddings"]) \
.setOutputCols("finished_embeddings") \
.setOutputAsVector(True) \
.setCleanAnnotations(False)

pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
embeddings,
embeddingsFinisher
])

data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[1.1342473030090332,-1.3855540752410889,0.9818322062492371,-0.784737348556518...|
|[0.847029983997345,-1.047153353691101,-0.1520637571811676,-0.6245765686035156...|
|[-0.009860038757324219,-0.13450059294700623,2.707749128341675,1.2916892766952...|
|[-0.04192575812339783,-0.5764210224151611,-0.3196685314178467,-0.527840495109...|
|[0.15583214163780212,-0.1614152491092682,-0.28423872590065,-0.135491415858268...|
+--------------------------------------------------------------------------------+

import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.AlbertEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")

val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")

val embeddings = AlbertEmbeddings.pretrained()
.setInputCols("token", "document")
.setOutputCol("embeddings")

val embeddingsFinisher = new EmbeddingsFinisher()
.setInputCols("embeddings")
.setOutputCols("finished_embeddings")
.setOutputAsVector(true)
.setCleanAnnotations(false)

val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
embeddings,
embeddingsFinisher
))

val data = Seq("This is a sentence.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[1.1342473030090332,-1.3855540752410889,0.9818322062492371,-0.784737348556518...|
|[0.847029983997345,-1.047153353691101,-0.1520637571811676,-0.6245765686035156...|
|[-0.009860038757324219,-0.13450059294700623,2.707749128341675,1.2916892766952...|
|[-0.04192575812339783,-0.5764210224151611,-0.3196685314178467,-0.527840495109...|
|[0.15583214163780212,-0.1614152491092682,-0.28423872590065,-0.135491415858268...|
+--------------------------------------------------------------------------------+


## BertEmbeddings

Token-level embeddings using BERT. BERT (Bidirectional Encoder Representations from Transformers) provides dense vector representations for natural language by using a deep, pre-trained neural network with the Transformer architecture.

Pretrained models can be loaded with pretrained of the companion object:

val embeddings = BertEmbeddings.pretrained()
.setInputCols("token", "document")
.setOutputCol("bert_embeddings")


The default model is "small_bert_L2_768", if no name is provided.

For available pretrained models please see the Models Hub.

For extended examples of usage, see the Spark NLP Workshop and the BertEmbeddingsTestSpec.

Sources :

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Paper abstract

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: WORD_EMBEDDINGS

 API: BertEmbeddings Source: BertEmbeddings
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")

tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")

embeddings = BertEmbeddings.pretrained("small_bert_L2_128", "en") \
.setInputCols(["token", "document"]) \
.setOutputCol("bert_embeddings")

embeddingsFinisher = EmbeddingsFinisher() \
.setInputCols(["bert_embeddings"]) \
.setOutputCols("finished_embeddings") \
.setOutputAsVector(True)

pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
embeddings,
embeddingsFinisher
])

data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[-2.3497989177703857,0.480538547039032,-0.3238905668258667,-1.612930893898010...|
|[-2.1357314586639404,0.32984697818756104,-0.6032363176345825,-1.6791689395904...|
|[-1.8244884014129639,-0.27088963985443115,-1.059438943862915,-0.9817547798156...|
|[-1.1648050546646118,-0.4725411534309387,-0.5938255786895752,-1.5780693292617...|
|[-0.9125322699546814,0.4563939869403839,-0.3975459933280945,-1.81611204147338...|
+--------------------------------------------------------------------------------+

import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.BertEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")

val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")

val embeddings = BertEmbeddings.pretrained("small_bert_L2_128", "en")
.setInputCols("token", "document")
.setOutputCol("bert_embeddings")

val embeddingsFinisher = new EmbeddingsFinisher()
.setInputCols("bert_embeddings")
.setOutputCols("finished_embeddings")
.setOutputAsVector(true)

val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
embeddings,
embeddingsFinisher
))

val data = Seq("This is a sentence.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[-2.3497989177703857,0.480538547039032,-0.3238905668258667,-1.612930893898010...|
|[-2.1357314586639404,0.32984697818756104,-0.6032363176345825,-1.6791689395904...|
|[-1.8244884014129639,-0.27088963985443115,-1.059438943862915,-0.9817547798156...|
|[-1.1648050546646118,-0.4725411534309387,-0.5938255786895752,-1.5780693292617...|
|[-0.9125322699546814,0.4563939869403839,-0.3975459933280945,-1.81611204147338...|
+--------------------------------------------------------------------------------+


## BertSentenceEmbeddings

Sentence-level embeddings using BERT. BERT (Bidirectional Encoder Representations from Transformers) provides dense vector representations for natural language by using a deep, pre-trained neural network with the Transformer architecture.

Pretrained models can be loaded with pretrained of the companion object:

val embeddings = BertSentenceEmbeddings.pretrained()
.setInputCols("sentence")
.setOutputCol("sentence_bert_embeddings")


The default model is "sent_small_bert_L2_768", if no name is provided.

For available pretrained models please see the Models Hub.

For extended examples of usage, see the Spark NLP Workshop and the BertSentenceEmbeddingsTestSpec.

Sources :

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Paper abstract

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

Input Annotator Types: DOCUMENT

Output Annotator Type: SENTENCE_EMBEDDINGS

 API: BertSentenceEmbeddings Source: BertSentenceEmbeddings
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")

sentence = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")

embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L2_128") \
.setInputCols(["sentence"]) \
.setOutputCol("sentence_bert_embeddings")

embeddingsFinisher = EmbeddingsFinisher() \
.setInputCols(["sentence_bert_embeddings"]) \
.setOutputCols("finished_embeddings") \
.setOutputAsVector(True)

pipeline = Pipeline().setStages([
documentAssembler,
sentence,
embeddings,
embeddingsFinisher
])

data = spark.createDataFrame([["John loves apples. Mary loves oranges. John loves Mary."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[-0.8951074481010437,0.13753940165042877,0.3108254075050354,-1.65693199634552...|
|[-0.6180210709571838,-0.12179657071828842,-0.191165953874588,-1.4497021436691...|
|[-0.822715163230896,0.7568016648292542,-0.1165061742067337,-1.59048593044281,...|
+--------------------------------------------------------------------------------+

import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.embeddings.BertSentenceEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")

val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")

val embeddings = BertSentenceEmbeddings.pretrained("sent_small_bert_L2_128")
.setInputCols("sentence")
.setOutputCol("sentence_bert_embeddings")

val embeddingsFinisher = new EmbeddingsFinisher()
.setInputCols("sentence_bert_embeddings")
.setOutputCols("finished_embeddings")
.setOutputAsVector(true)

val pipeline = new Pipeline().setStages(Array(
documentAssembler,
sentence,
embeddings,
embeddingsFinisher
))

val data = Seq("John loves apples. Mary loves oranges. John loves Mary.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[-0.8951074481010437,0.13753940165042877,0.3108254075050354,-1.65693199634552...|
|[-0.6180210709571838,-0.12179657071828842,-0.191165953874588,-1.4497021436691...|
|[-0.822715163230896,0.7568016648292542,-0.1165061742067337,-1.59048593044281,...|
+--------------------------------------------------------------------------------+


## DistilBertEmbeddings

DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than bert-base-uncased and runs 60% faster, while preserving over 95% of BERT’s performance as measured on the GLUE language understanding benchmark.

Pretrained models can be loaded with pretrained of the companion object:

val embeddings = DistilBertEmbeddings.pretrained()
.setInputCols("document", "token")
.setOutputCol("embeddings")


The default model is "distilbert_base_cased", if no name is provided. For available pretrained models please see the Models Hub.

For extended examples of usage, see the Spark NLP Workshop and the DistilBertEmbeddingsTestSpec. Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. The Spark NLP Workshop example shows how to import them.

The DistilBERT model was proposed in the paper DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.

Paper Abstract:

As Transfer Learning from large-scale pre-trained models becomes more prevalent in Natural Language Processing (NLP), operating these large models in on-the-edge and/or under constrained computational training or inference budgets remains challenging. In this work, we propose a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts. While most prior work investigated the use of distillation for building task-specific models, we leverage knowledge distillation during the pretraining phase and show that it is possible to reduce the size of a BERT model by 40%, while retaining 97% of its language understanding capabilities and being 60% faster. To leverage the inductive biases learned by larger models during pretraining, we introduce a triple loss combining language modeling, distillation and cosine-distance losses. Our smaller, faster and lighter model is cheaper to pre-train and we demonstrate its capabilities for on-device computations in a proof-of-concept experiment and a comparative on-device study.

Tips:

• DistilBERT doesn’t have token_type_ids, so you don’t need to indicate which token belongs to which segment. Just separate your segments with the separation token tokenizer.sep_token (or [SEP]).
• DistilBERT doesn’t have options to select the input positions (position_ids input). This could be added if necessary; just let us know if you need this option.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: WORD_EMBEDDINGS

 API: DistilBertEmbeddings Source: DistilBertEmbeddings
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")

tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")

embeddings = DistilBertEmbeddings.pretrained() \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)

embeddingsFinisher = EmbeddingsFinisher() \
.setInputCols(["embeddings"]) \
.setOutputCols("finished_embeddings") \
.setOutputAsVector(True) \
.setCleanAnnotations(False)

pipeline = Pipeline() \
.setStages([
documentAssembler,
tokenizer,
embeddings,
embeddingsFinisher
])

data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[0.1127224713563919,-0.1982710212469101,0.5360898375511169,-0.272536993026733...|
|[0.35534414649009705,0.13215228915214539,0.40981462597846985,0.14036104083061...|
|[0.328085333108902,-0.06269335001707077,-0.017595693469047546,-0.024373905733...|
|[0.15617232024669647,0.2967822253704071,0.22324979305267334,-0.04568954557180...|
|[0.45411425828933716,0.01173491682857275,0.190129816532135,0.1178255230188369...|
+--------------------------------------------------------------------------------+

import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.DistilBertEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")

val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")

val embeddings = DistilBertEmbeddings.pretrained()
.setInputCols("document", "token")
.setOutputCol("embeddings")
.setCaseSensitive(true)

val embeddingsFinisher = new EmbeddingsFinisher()
.setInputCols("embeddings")
.setOutputCols("finished_embeddings")
.setOutputAsVector(true)
.setCleanAnnotations(false)

val pipeline = new Pipeline()
.setStages(Array(
documentAssembler,
tokenizer,
embeddings,
embeddingsFinisher
))

val data = Seq("This is a sentence.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[0.1127224713563919,-0.1982710212469101,0.5360898375511169,-0.272536993026733...|
|[0.35534414649009705,0.13215228915214539,0.40981462597846985,0.14036104083061...|
|[0.328085333108902,-0.06269335001707077,-0.017595693469047546,-0.024373905733...|
|[0.15617232024669647,0.2967822253704071,0.22324979305267334,-0.04568954557180...|
|[0.45411425828933716,0.01173491682857275,0.190129816532135,0.1178255230188369...|
+--------------------------------------------------------------------------------+


## ElmoEmbeddings

Word embeddings from ELMo (Embeddings from Language Models), a language model trained on the 1 Billion Word Benchmark.

Note that this is a very computationally expensive module compared to word embedding modules that only perform embedding lookups. The use of an accelerator is recommended.

Pretrained models can be loaded with pretrained of the companion object:

val embeddings = ElmoEmbeddings.pretrained()
.setInputCols("sentence", "token")
.setOutputCol("elmo_embeddings")

.setInputCols("sentence", "token") \
.setOutputCol("elmo")


The default model is "elmo", if no name is provided.

For available pretrained models please see the Models Hub.

The pooling layer can be set with setPoolingLayer to the following values:

• "word_emb": the character-based word representations with shape [batch_size, max_length, 512].
• "lstm_outputs1": the first LSTM hidden state with shape [batch_size, max_length, 1024].
• "lstm_outputs2": the second LSTM hidden state with shape [batch_size, max_length, 1024].
• "elmo": the weighted sum of the 3 layers, where the weights are trainable. This tensor has shape [batch_size, max_length, 1024].

For extended examples of usage, see the Spark NLP Workshop and the ElmoEmbeddingsTestSpec.

Sources:

Deep contextualized word representations

Paper abstract:

We introduce a new type of deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). Our word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. We show that these representations can be easily added to existing models and significantly improve the state of the art across six challenging NLP problems, including question answering, textual entailment and sentiment analysis. We also present an analysis showing that exposing the deep internals of the pre-trained network is crucial, allowing downstream models to mix different types of semi-supervision signals.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: WORD_EMBEDDINGS

 API: ElmoEmbeddings Source: ElmoEmbeddings
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")

tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")

embeddings = ElmoEmbeddings.pretrained() \
.setPoolingLayer("word_emb") \
.setInputCols(["token", "document"]) \
.setOutputCol("embeddings")

embeddingsFinisher = EmbeddingsFinisher() \
.setInputCols(["embeddings"]) \
.setOutputCols("finished_embeddings") \
.setOutputAsVector(True) \
.setCleanAnnotations(False)

pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
embeddings,
embeddingsFinisher
])

data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[6.662458181381226E-4,-0.2541114091873169,-0.6275503039360046,0.5787073969841...|
|[0.19154725968837738,0.22998669743537903,-0.2894386649131775,0.21524395048618...|
|[0.10400570929050446,0.12288510054349899,-0.07056470215320587,-0.246389418840...|
|[0.49932169914245605,-0.12706467509269714,0.30969417095184326,0.2643227577209...|
|[-0.8871506452560425,-0.20039963722229004,-1.0601330995559692,0.0348707810044...|
+--------------------------------------------------------------------------------+

import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.ElmoEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")

val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")

val embeddings = ElmoEmbeddings.pretrained()
.setPoolingLayer("word_emb")
.setInputCols("token", "document")
.setOutputCol("embeddings")

val embeddingsFinisher = new EmbeddingsFinisher()
.setInputCols("embeddings")
.setOutputCols("finished_embeddings")
.setOutputAsVector(true)
.setCleanAnnotations(false)

val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
embeddings,
embeddingsFinisher
))

val data = Seq("This is a sentence.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[6.662458181381226E-4,-0.2541114091873169,-0.6275503039360046,0.5787073969841...|
|[0.19154725968837738,0.22998669743537903,-0.2894386649131775,0.21524395048618...|
|[0.10400570929050446,0.12288510054349899,-0.07056470215320587,-0.246389418840...|
|[0.49932169914245605,-0.12706467509269714,0.30969417095184326,0.2643227577209...|
|[-0.8871506452560425,-0.20039963722229004,-1.0601330995559692,0.0348707810044...|
+--------------------------------------------------------------------------------+


## MarianTransformer

MarianTransformer: Fast Neural Machine Translation

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. MarianTransformer uses the models trained by MarianNMT.

It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.

Pretrained models can be loaded with pretrained of the companion object:

val marian = MarianTransformer.pretrained()
.setInputCols("sentence")
.setOutputCol("translation")


The default model is "opus_mt_en_fr", default language is "xx" (meaning multi-lingual), if no values are provided. For available pretrained models please see the Models Hub.

For extended examples of usage, see the Spark NLP Workshop and the MarianTransformerTestSpec.

Sources :

MarianNMT at GitHub

Marian: Fast Neural Machine Translation in C++

Paper Abstract:

We present Marian, an efficient and self-contained Neural Machine Translation framework with an integrated automatic differentiation engine based on dynamic computation graphs. Marian is written entirely in C++. We describe the design of the encoder-decoder framework and demonstrate that a research-friendly toolkit can achieve high training and translation speed.

Note:

This is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.
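
One way to make a GPU available is to start the session with the gpu flag, as in the sketch below (this assumes the spark-nlp-gpu package and a CUDA-capable machine):

```python
import sparknlp

# Start a Spark session backed by the GPU build of Spark NLP.
spark = sparknlp.start(gpu=True)
```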

Input Annotator Types: DOCUMENT

Output Annotator Type: DOCUMENT

 API: MarianTransformer Source: MarianTransformer
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")

sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
.setInputCols("document") \
.setOutputCol("sentence")

marian = MarianTransformer.pretrained() \
.setInputCols("sentence") \
.setOutputCol("translation") \
.setMaxInputLength(30)

pipeline = Pipeline() \
.setStages([
documentAssembler,
sentence,
marian
])

data = spark.createDataFrame([["What is the capital of France? We should know this in french."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("explode(translation.result) as result").show(truncate=False)
+-------------------------------------+
|result                               |
+-------------------------------------+
|Quelle est la capitale de la France ?|
|On devrait le savoir en français.    |
+-------------------------------------+

import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetectorDLModel
import com.johnsnowlabs.nlp.annotators.seq2seq.MarianTransformer
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")

val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
.setInputCols("document")
.setOutputCol("sentence")

val marian = MarianTransformer.pretrained()
.setInputCols("sentence")
.setOutputCol("translation")
.setMaxInputLength(30)

val pipeline = new Pipeline()
.setStages(Array(
documentAssembler,
sentence,
marian
))

val data = Seq("What is the capital of France? We should know this in french.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(translation.result) as result").show(false)
+-------------------------------------+
|result                               |
+-------------------------------------+
|Quelle est la capitale de la France ?|
|On devrait le savoir en français.    |
+-------------------------------------+


## RoBertaEmbeddings

The RoBERTa model was proposed in RoBERTa: A Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. It is based on Google’s BERT model released in 2018.

It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates.

Pretrained models can be loaded with pretrained of the companion object:

val embeddings = RoBertaEmbeddings.pretrained()
.setInputCols("document", "token")
.setOutputCol("embeddings")


The default model is "roberta_base", if no name is provided. For available pretrained models please see the Models Hub.

For extended examples of usage, see the Spark NLP Workshop and the RoBertaEmbeddingsTestSpec. Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. The Spark NLP Workshop example shows how to import them.

Paper Abstract:

Language model pretraining has led to significant performance gains but careful comparison between different approaches is challenging. Training is computationally expensive, often done on private datasets of different sizes, and, as we will show, hyperparameter choices have significant impact on the final results. We present a replication study of BERT pretraining (Devlin et al., 2019) that carefully measures the impact of many key hyperparameters and training data size. We find that BERT was significantly undertrained, and can match or exceed the performance of every model published after it. Our best model achieves state-of-the-art results on GLUE, RACE and SQuAD. These results highlight the importance of previously overlooked design choices, and raise questions about the source of recently reported improvements. We release our models and code.

Tips:

• RoBERTa has the same architecture as BERT, but uses byte-level BPE as a tokenizer (the same as GPT-2) and uses a different pretraining scheme.
• RoBERTa doesn’t have token_type_ids, so you don’t need to indicate which token belongs to which segment. Just separate your segments with the separation token tokenizer.sep_token (or </s>).

The original code can be found here https://github.com/pytorch/fairseq/tree/master/examples/roberta.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: WORD_EMBEDDINGS

 API: RoBertaEmbeddings Source: RoBertaEmbeddings
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")

tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")

embeddings = RoBertaEmbeddings.pretrained() \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)

embeddingsFinisher = EmbeddingsFinisher() \
.setInputCols(["embeddings"]) \
.setOutputCols("finished_embeddings") \
.setOutputAsVector(True) \
.setCleanAnnotations(False)

pipeline = Pipeline() \
.setStages([
documentAssembler,
tokenizer,
embeddings,
embeddingsFinisher
])

data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[0.18792399764060974,-0.14591649174690247,0.20547787845134735,0.1468472778797...|
|[0.22845706343650818,0.18073144555091858,0.09725798666477203,-0.0417917296290...|
|[0.07037967443466187,-0.14801117777824402,-0.03603338822722435,-0.17893412709...|
|[-0.08734266459941864,0.2486150562763214,-0.009067727252840996,-0.24408400058...|
|[0.22409197688102722,-0.4312366545200348,0.1401449590921402,0.356410235166549...|
+--------------------------------------------------------------------------------+

import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.RoBertaEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")

val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")

val embeddings = RoBertaEmbeddings.pretrained()
.setInputCols("document", "token")
.setOutputCol("embeddings")
.setCaseSensitive(true)

val embeddingsFinisher = new EmbeddingsFinisher()
.setInputCols("embeddings")
.setOutputCols("finished_embeddings")
.setOutputAsVector(true)
.setCleanAnnotations(false)

val pipeline = new Pipeline()
.setStages(Array(
documentAssembler,
tokenizer,
embeddings,
embeddingsFinisher
))

val data = Seq("This is a sentence.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[0.18792399764060974,-0.14591649174690247,0.20547787845134735,0.1468472778797...|
|[0.22845706343650818,0.18073144555091858,0.09725798666477203,-0.0417917296290...|
|[0.07037967443466187,-0.14801117777824402,-0.03603338822722435,-0.17893412709...|
|[-0.08734266459941864,0.2486150562763214,-0.009067727252840996,-0.24408400058...|
|[0.22409197688102722,-0.4312366545200348,0.1401449590921402,0.356410235166549...|
+--------------------------------------------------------------------------------+


## T5Transformer

T5: the Text-To-Text Transfer Transformer

T5 recasts all NLP tasks into a unified text-to-text format where the input and output are always text strings, in contrast to BERT-style models that can only output either a class label or a span of the input. The text-to-text framework makes it possible to use the same model, loss function, and hyper-parameters on any NLP task, including machine translation, document summarization, question answering, and classification tasks (e.g., sentiment analysis). T5 can even be applied to regression tasks by training it to predict the string representation of a number instead of the number itself.
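
In Spark NLP, the task a pretrained T5 model performs is selected with setTask, which prepends the corresponding task prefix to the input. A minimal sketch (assuming the "t5_small" model) switching from summarization to translation:

```python
from sparknlp.annotator import T5Transformer

# Same pretrained model, different task: a translation prefix instead of "summarize:".
t5_translate = T5Transformer.pretrained("t5_small") \
    .setTask("translate English to German:") \
    .setInputCols(["documents"]) \
    .setOutputCol("translation")
```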

Pretrained models can be loaded with pretrained of the companion object:

val t5 = T5Transformer.pretrained()
.setInputCols("document")
.setOutputCol("summaries")


The default model is "t5_small", if no name is provided. For available pretrained models please see the Models Hub.

For extended examples of usage, see the Spark NLP Workshop and the T5TestSpec.

Sources:

Paper Abstract:

Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new Colossal Clean Crawled Corpus, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.

Note:

This is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

Input Annotator Types: DOCUMENT

Output Annotator Type: DOCUMENT

 API: T5Transformer Source: T5Transformer
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("documents")

t5 = T5Transformer.pretrained("t5_small") \
.setTask("summarize:") \
.setInputCols(["documents"]) \
.setMaxOutputLength(200) \
.setOutputCol("summaries")

pipeline = Pipeline().setStages([documentAssembler, t5])

data = spark.createDataFrame([[
"Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a " +
"downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness" +
" of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this " +
"paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework " +
"that converts all text-based language problems into a text-to-text format. Our systematic study compares " +
"pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens " +
"of language understanding tasks. By combining the insights from our exploration with scale and our new " +
"Colossal Clean Crawled Corpus, we achieve state-of-the-art results on many benchmarks covering " +
"summarization, question answering, text classification, and more. To facilitate future work on transfer " +
"learning for NLP, we release our data set, pre-trained models, and code."
]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.select("summaries.result").show(truncate=False)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                        |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[transfer learning has emerged as a powerful technique in natural language processing (NLP) the effectiveness of transfer learning has given rise to a diversity of approaches, methodologies, and practice .]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.seq2seq.T5Transformer
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("documents")

val t5 = T5Transformer.pretrained("t5_small")
.setTask("summarize:")
.setInputCols(Array("documents"))
.setMaxOutputLength(200)
.setOutputCol("summaries")

val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))

val data = Seq(
"Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a " +
"downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness" +
" of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this " +
"paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework " +
"that converts all text-based language problems into a text-to-text format. Our systematic study compares " +
"pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens " +
"of language understanding tasks. By combining the insights from our exploration with scale and our new " +
"Colossal Clean Crawled Corpus, we achieve state-of-the-art results on many benchmarks covering " +
"summarization, question answering, text classification, and more. To facilitate future work on transfer " +
"learning for NLP, we release our data set, pre-trained models, and code."
).toDF("text")
val result = pipeline.fit(data).transform(data)

result.select("summaries.result").show(false)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|result                                                                                                                                                                                                        |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[transfer learning has emerged as a powerful technique in natural language processing (NLP) the effectiveness of transfer learning has given rise to a diversity of approaches, methodologies, and practice .]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+


## UniversalSentenceEncoder

The Universal Sentence Encoder encodes text into high dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks.

Pretrained models can be loaded with pretrained of the companion object:

val useEmbeddings = UniversalSentenceEncoder.pretrained()
.setInputCols("sentence")
.setOutputCol("sentence_embeddings")


The default model is "tfhub_use", if no name is provided. For available pretrained models please see the Models Hub.

For extended examples of usage, see the Spark NLP Workshop and the UniversalSentenceEncoderTestSpec.

Sources:

Universal Sentence Encoder

Paper abstract:

We present models for encoding sentences into embedding vectors that specifically target transfer learning to other NLP tasks. The models are efficient and result in accurate performance on diverse transfer tasks. Two variants of the encoding models allow for trade-offs between accuracy and compute resources. For both variants, we investigate and report the relationship between model complexity, resource consumption, the availability of transfer task training data, and task performance. Comparisons are made with baselines that use word level transfer learning via pretrained word embeddings as well as baselines that do not use any transfer learning. We find that transfer learning using sentence embeddings tends to outperform word level transfer. With transfer learning via sentence embeddings, we observe surprisingly good performance with minimal amounts of supervised training data for a transfer task. We obtain encouraging results on Word Embedding Association Tests (WEAT) targeted at detecting model bias. Our pre-trained sentence encoding models are made freely available for download and on TF Hub.

Input Annotator Types: DOCUMENT

Output Annotator Type: SENTENCE_EMBEDDINGS

 API: UniversalSentenceEncoder Source: UniversalSentenceEncoder
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")

sentence = SentenceDetector() \
.setInputCols(["document"]) \
.setOutputCol("sentence")

embeddings = UniversalSentenceEncoder.pretrained() \
.setInputCols(["sentence"]) \
.setOutputCol("sentence_embeddings")

embeddingsFinisher = EmbeddingsFinisher() \
.setInputCols(["sentence_embeddings"]) \
.setOutputCols("finished_embeddings") \
.setOutputAsVector(True) \
.setCleanAnnotations(False)

pipeline = Pipeline() \
.setStages([
documentAssembler,
sentence,
embeddings,
embeddingsFinisher
])

data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[0.04616805538535118,0.022307956591248512,-0.044395286589860916,-0.0016493503...|
+--------------------------------------------------------------------------------+

import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotator.SentenceDetector
import com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")

val sentence = new SentenceDetector()
.setInputCols("document")
.setOutputCol("sentence")

val embeddings = UniversalSentenceEncoder.pretrained()
.setInputCols("sentence")
.setOutputCol("sentence_embeddings")

val embeddingsFinisher = new EmbeddingsFinisher()
.setInputCols("embeddings")
.setOutputCols("finished_embeddings")
.setOutputAsVector(true)
.setCleanAnnotations(false)

val pipeline = new Pipeline()
.setStages(Array(
documentAssembler,
sentence,
embeddings,
embeddingsFinisher
))

val data = Seq("This is a sentence.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[0.04616805538535118,0.022307956591248512,-0.044395286589860916,-0.0016493503...|
+--------------------------------------------------------------------------------+


## XlmRoBertaEmbeddings

The XLM-RoBERTa model was proposed in Unsupervised Cross-lingual Representation Learning at Scale by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer and Veselin Stoyanov. It is based on Facebook’s RoBERTa model released in 2019. It is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl data.

Pretrained models can be loaded with pretrained of the companion object:

val embeddings = XlmRoBertaEmbeddings.pretrained()
.setInputCols("document", "token")
.setOutputCol("embeddings")


The default model is "xlm_roberta_base", default language is "xx" (meaning multi-lingual), if no values are provided. For available pretrained models please see the Models Hub.

For extended examples of usage, see the Spark NLP Workshop and the XlmRoBertaEmbeddingsTestSpec. Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. The Spark NLP Workshop example shows how to import them.

Paper Abstract:

This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +13.8% average accuracy on XNLI, +12.3% average F1 score on MLQA, and +2.1% average F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 11.8% in XNLI accuracy for Swahili and 9.2% for Urdu over the previous XLM model. We also present a detailed empirical evaluation of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make XLM-R code, data, and models publicly available.

Tips:

• XLM-RoBERTa is a multilingual model trained on 100 different languages. Unlike some XLM multilingual models, it does not require a lang parameter to understand which language is used, and should be able to determine the correct language from the input ids.
• This implementation is the same as RoBERTa. Refer to the RoBertaEmbeddings for usage examples as well as the information relative to the inputs and outputs.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: WORD_EMBEDDINGS

 API: XlmRoBertaEmbeddings Source: XlmRoBertaEmbeddings
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")

tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")

embeddings = XlmRoBertaEmbeddings.pretrained() \
.setInputCols(["document", "token"]) \
.setOutputCol("embeddings") \
.setCaseSensitive(True)

embeddingsFinisher = EmbeddingsFinisher() \
.setInputCols(["embeddings"]) \
.setOutputCols("finished_embeddings") \
.setOutputAsVector(True) \
.setCleanAnnotations(False)

pipeline = Pipeline() \
.setStages([
documentAssembler,
tokenizer,
embeddings,
embeddingsFinisher
])

data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[-0.05969233065843582,-0.030789051204919815,0.04443822056055069,0.09564960747...|
|[-0.038839809596538544,0.011712731793522835,0.019954433664679527,0.0667808502...|
|[-0.03952755779027939,-0.03455188870429993,0.019103847444057465,0.04311436787...|
|[-0.09579929709434509,0.02494969218969345,-0.014753809198737144,0.10259044915...|
|[0.004710011184215546,-0.022148698568344116,0.011723337695002556,-0.013356896...|
+--------------------------------------------------------------------------------+

import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.XlmRoBertaEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")

val tokenizer = new Tokenizer()
.setInputCols(Array("document"))
.setOutputCol("token")

val embeddings = XlmRoBertaEmbeddings.pretrained()
.setInputCols("document", "token")
.setOutputCol("embeddings")
.setCaseSensitive(true)

val embeddingsFinisher = new EmbeddingsFinisher()
.setInputCols("embeddings")
.setOutputCols("finished_embeddings")
.setOutputAsVector(true)
.setCleanAnnotations(false)

val pipeline = new Pipeline()
.setStages(Array(
documentAssembler,
tokenizer,
embeddings,
embeddingsFinisher
))

val data = Seq("This is a sentence.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[-0.05969233065843582,-0.030789051204919815,0.04443822056055069,0.09564960747...|
|[-0.038839809596538544,0.011712731793522835,0.019954433664679527,0.0667808502...|
|[-0.03952755779027939,-0.03455188870429993,0.019103847444057465,0.04311436787...|
|[-0.09579929709434509,0.02494969218969345,-0.014753809198737144,0.10259044915...|
|[0.004710011184215546,-0.022148698568344116,0.011723337695002556,-0.013356896...|
+--------------------------------------------------------------------------------+


## XlnetEmbeddings

XlnetEmbeddings (XLNet): Generalized Autoregressive Pretraining for Language Understanding

XLNet is a new unsupervised language representation learning method based on a novel generalized permutation language modeling objective. Additionally, XLNet employs Transformer-XL as the backbone model, exhibiting excellent performance for language tasks involving long context. Overall, XLNet achieves state-of-the-art (SOTA) results on various downstream language tasks including question answering, natural language inference, sentiment analysis, and document ranking.

These word embeddings represent the outputs generated by the XLNet models.

Note that this is a very computationally expensive module compared to word embedding modules that only perform embedding lookups. The use of an accelerator is recommended.

 "xlnet_large_cased" = XLNet-Large 24-layer, 1024-hidden, 16-heads
 "xlnet_base_cased" = XLNet-Base 12-layer, 768-hidden, 12-heads. This model is trained on full data (different from the one in the paper).

Pretrained models can be loaded with pretrained of the companion object:

val embeddings = XlnetEmbeddings.pretrained()
.setInputCols("sentence", "token")
.setOutputCol("embeddings")

.setInputCols("sentence", "token") \
.setOutputCol("xlnet")


The default model is "xlnet_base_cased", if no name is provided.

For extended examples of usage, see the Spark NLP Workshop and the XlnetEmbeddingsTestSpec.

Sources:

XLNet: Generalized Autoregressive Pretraining for Language Understanding

https://github.com/zihangdai/xlnet

Paper abstract:

With the capability of modeling bidirectional contexts, denoising autoencoding based pretraining like BERT achieves better performance than pretraining approaches based on autoregressive language modeling. However, relying on corrupting the input with masks, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy. In light of these pros and cons, we propose XLNet, a generalized autoregressive pretraining method that (1) enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and (2) overcomes the limitations of BERT thanks to its autoregressive formulation. Furthermore, XLNet integrates ideas from Transformer-XL, the state-of-the-art autoregressive model, into pretraining. Empirically, under comparable experiment settings, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking.

Input Annotator Types: DOCUMENT, TOKEN

Output Annotator Type: WORD_EMBEDDINGS

 API: XlnetEmbeddings Source: XlnetEmbeddings
import sparknlp
from sparknlp.base import *
from sparknlp.common import *
from sparknlp.annotator import *
from sparknlp.training import *
from pyspark.ml import Pipeline

documentAssembler = DocumentAssembler() \
.setInputCol("text") \
.setOutputCol("document")

tokenizer = Tokenizer() \
.setInputCols(["document"]) \
.setOutputCol("token")

embeddings = XlnetEmbeddings.pretrained() \
.setInputCols(["token", "document"]) \
.setOutputCol("embeddings")

embeddingsFinisher = EmbeddingsFinisher() \
.setInputCols(["embeddings"]) \
.setOutputCols("finished_embeddings") \
.setOutputAsVector(True) \
.setCleanAnnotations(False)

pipeline = Pipeline().setStages([
documentAssembler,
tokenizer,
embeddings,
embeddingsFinisher
])

data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[-0.6287205219268799,-0.4865287244319916,-0.186111718416214,0.234187275171279...|
|[-1.1967450380325317,0.2746637463569641,0.9481253027915955,0.3431355059146881...|
|[-1.0777631998062134,-2.092679977416992,-1.5331977605819702,-1.11190271377563...|
|[-0.8349916934967041,-0.45627787709236145,-0.7890847325325012,-1.028069257736...|
|[-0.134845569729805,-0.11672890186309814,0.4945235550403595,-0.66587203741073...|
+--------------------------------------------------------------------------------+

import spark.implicits._
import com.johnsnowlabs.nlp.base.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.XlnetEmbeddings
import com.johnsnowlabs.nlp.EmbeddingsFinisher
import org.apache.spark.ml.Pipeline

val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("document")

val tokenizer = new Tokenizer()
.setInputCols("document")
.setOutputCol("token")

val embeddings = XlnetEmbeddings.pretrained()
.setInputCols("token", "document")
.setOutputCol("embeddings")

val embeddingsFinisher = new EmbeddingsFinisher()
.setInputCols("embeddings")
.setOutputCols("finished_embeddings")
.setOutputAsVector(true)
.setCleanAnnotations(false)

val pipeline = new Pipeline().setStages(Array(
documentAssembler,
tokenizer,
embeddings,
embeddingsFinisher
))

val data = Seq("This is a sentence.").toDF("text")
val result = pipeline.fit(data).transform(data)

result.selectExpr("explode(finished_embeddings) as result").show(5, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[-0.6287205219268799,-0.4865287244319916,-0.186111718416214,0.234187275171279...|
|[-1.1967450380325317,0.2746637463569641,0.9481253027915955,0.3431355059146881...|
|[-1.0777631998062134,-2.092679977416992,-1.5331977605819702,-1.11190271377563...|
|[-0.8349916934967041,-0.45627787709236145,-0.7890847325325012,-1.028069257736...|
|[-0.134845569729805,-0.11672890186309814,0.4945235550403595,-0.66587203741073...|
+--------------------------------------------------------------------------------+