Packages

package seq2seq

Type Members

  1. class GPT2Transformer extends AnnotatorModel[GPT2Transformer] with HasBatchedAnnotate[GPT2Transformer] with ParamsAndFeaturesWritable with WriteTensorflowModel

    GPT-2: the OpenAI Text-To-Text Transformer

    GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains. GPT-2 is a direct scale-up of GPT, with more than 10X the parameters and trained on more than 10X the amount of data.

    GPT-2 displays a broad set of capabilities, including the ability to generate conditional synthetic text samples of unprecedented quality, where we prime the model with an input and have it generate a lengthy continuation. In addition, GPT-2 outperforms other language models trained on specific domains (like Wikipedia, news, or books) without needing to use these domain-specific training datasets. On language tasks like question answering, reading comprehension, summarization, and translation, GPT-2 begins to learn these tasks from the raw text, using no task-specific training data. While scores on these downstream tasks are far from state-of-the-art, they suggest that the tasks can benefit from unsupervised techniques, given sufficient (unlabeled) data and compute.
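
    The next-word objective described above also drives generation: the model is asked repeatedly for the next token, given everything produced so far. A minimal sketch of that loop in plain Scala follows. The names `generate`, `NextToken`, and `toyModel` are hypothetical stand-ins for illustration and are not part of Spark NLP or GPT-2.

```scala
object AutoregressiveSketch {
  // Hypothetical stand-in for a language model: given all previous tokens,
  // return the next token, or None to stop generating.
  type NextToken = Vector[String] => Option[String]

  // Greedy autoregressive loop: append one token at a time, always
  // conditioning on the full sequence generated so far.
  def generate(prompt: Vector[String], nextToken: NextToken, maxNewTokens: Int): Vector[String] = {
    @annotation.tailrec
    def loop(tokens: Vector[String], remaining: Int): Vector[String] =
      if (remaining <= 0) tokens
      else nextToken(tokens) match {
        case Some(tok) => loop(tokens :+ tok, remaining - 1)
        case None      => tokens
      }
    loop(prompt, maxNewTokens)
  }

  // Toy "model" that only looks at the last token. A real model such as
  // GPT-2 would score every vocabulary entry using the whole context.
  private val table = Map("My" -> "name", "name" -> "is", "is" -> "Leonardo")
  val toyModel: NextToken = tokens => tokens.lastOption.flatMap(table.get)
}
```

    Calling `AutoregressiveSketch.generate(Vector("My"), AutoregressiveSketch.toyModel, 10)` extends the prompt token by token until the toy model has no continuation or the token budget runs out.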

    Pretrained models can be loaded with the pretrained method of the companion object:

    val gpt2 = GPT2Transformer.pretrained()
      .setInputCols("document")
      .setOutputCol("generation")

    If no name is provided, the default model is "gpt2". For available pretrained models, see the Models Hub.

    For extended examples of usage, see GPT2TestSpec.

    References:

    Paper Abstract:

    Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets. We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach F1 on the CoQA dataset - matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples. The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

    Note:

    This is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.seq2seq.GPT2Transformer
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("documents")
    
    val gpt2 = GPT2Transformer.pretrained("gpt2")
      .setInputCols(Array("documents"))
      .setMinOutputLength(10)
      .setMaxOutputLength(50)
      .setDoSample(false)
      .setTopK(50)
      .setNoRepeatNgramSize(3)
      .setOutputCol("generation")
    
    val pipeline = new Pipeline().setStages(Array(documentAssembler, gpt2))
    
    val data = Seq(
      "My name is Leonardo."
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("generation.result").show(truncate = false)
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |result                                                                                                                                                                                              |
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |[ My name is Leonardo. I am a man of letters. I have been a man for many years. I was born in the year 1776. I came to the United States in 1776, and I have lived in the United Kingdom since 1776]|
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
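
    The example above sets setNoRepeatNgramSize(3). As a rough illustration of what a no-repeat n-gram constraint generally does during decoding, the sketch below returns the tokens that would recreate an n-gram already present in the generated sequence. It shows the general technique under our own assumptions, not Spark NLP's internal implementation, and the helper `bannedNextTokens` is hypothetical.

```scala
object NoRepeatNgramSketch {
  // Tokens that must not be generated next, because appending them would
  // repeat an n-gram of size `n` that already occurs in `generated`.
  // Hypothetical helper for illustration only.
  def bannedNextTokens(generated: Vector[String], n: Int): Set[String] =
    if (n <= 0 || generated.length < n) Set.empty
    else {
      // The last n-1 tokens form the prefix of the n-gram being completed.
      val prefix = generated.takeRight(n - 1)
      generated
        .sliding(n)
        .collect { case gram if gram.init == prefix => gram.last }
        .toSet
    }
}
```

    With the tokens "I was born in I was" and n = 3, the only banned continuation is "born", since "I was born" has already been generated. A decoder would typically exclude such tokens before picking the next one.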
  2. class MarianTransformer extends AnnotatorModel[MarianTransformer] with HasBatchedAnnotate[MarianTransformer] with WriteTensorflowModel with WriteSentencePieceModel

    MarianTransformer: Fast Neural Machine Translation

    Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. MarianTransformer uses the models trained by MarianNMT.

    It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.

    Pretrained models can be loaded with the pretrained method of the companion object:

    val marian = MarianTransformer.pretrained()
      .setInputCols("sentence")
      .setOutputCol("translation")

    If no values are provided, the default model is "opus_mt_en_fr" and the default language is "xx" (meaning multi-lingual). For available pretrained models, see the Models Hub.

    For extended examples of usage, see the Spark NLP Workshop and the MarianTransformerTestSpec.

    Sources:

    MarianNMT at GitHub

    Marian: Fast Neural Machine Translation in C++

    Paper Abstract:

    We present Marian, an efficient and self-contained Neural Machine Translation framework with an integrated automatic differentiation engine based on dynamic computation graphs. Marian is written entirely in C++. We describe the design of the encoder-decoder framework and demonstrate that a research-friendly toolkit can achieve high training and translation speed.

    Note:

    This is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.SentenceDetectorDLModel
    import com.johnsnowlabs.nlp.annotators.seq2seq.MarianTransformer
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx")
      .setInputCols("document")
      .setOutputCol("sentence")
    
    val marian = MarianTransformer.pretrained()
      .setInputCols("sentence")
      .setOutputCol("translation")
      .setMaxInputLength(30)
    
    val pipeline = new Pipeline()
      .setStages(Array(
        documentAssembler,
        sentence,
        marian
      ))
    
    val data = Seq("What is the capital of France? We should know this in french.").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("explode(translation.result) as result").show(false)
    +-------------------------------------+
    |result                               |
    +-------------------------------------+
    |Quelle est la capitale de la France ?|
    |On devrait le savoir en français.    |
    +-------------------------------------+
  3. trait ReadGPT2TransformerTensorflowModel extends ReadTensorflowModel
  4. trait ReadMarianMTTensorflowModel extends ReadTensorflowModel with ReadSentencePieceModel
  5. trait ReadT5TransformerTensorflowModel extends ReadTensorflowModel with ReadSentencePieceModel
  6. trait ReadablePretrainedGPT2TransformerModel extends ParamsAndFeaturesReadable[GPT2Transformer] with HasPretrained[GPT2Transformer]
  7. trait ReadablePretrainedMarianMTModel extends ParamsAndFeaturesReadable[MarianTransformer] with HasPretrained[MarianTransformer]
  8. trait ReadablePretrainedT5TransformerModel extends ParamsAndFeaturesReadable[T5Transformer] with HasPretrained[T5Transformer]
  9. class T5Transformer extends AnnotatorModel[T5Transformer] with HasBatchedAnnotate[T5Transformer] with ParamsAndFeaturesWritable with WriteTensorflowModel with WriteSentencePieceModel

    T5: the Text-To-Text Transfer Transformer

    T5 reframes all NLP tasks in a unified text-to-text format where the input and output are always text strings, in contrast to BERT-style models that can only output either a class label or a span of the input. The text-to-text framework is able to use the same model, loss function, and hyper-parameters on any NLP task, including machine translation, document summarization, question answering, and classification tasks (e.g., sentiment analysis). T5 can even be applied to regression tasks by training it to predict the string representation of a number instead of the number itself.
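
    To make the text-to-text idea concrete, the following plain-Scala sketch renders three kinds of tasks as pairs of an input string and a target string. The object and method names are hypothetical. The task prefixes follow the conventions described in the T5 paper, and rounding the regression score to the nearest 0.2 mirrors how the paper treats the STS-B similarity task. It is an illustration only, not Spark NLP code.

```scala
import java.util.Locale

object TextToTextSketch {
  // Every task becomes a pair of plain strings: model input and training target.
  final case class TextPair(input: String, target: String)

  def summarization(document: String, summary: String): TextPair =
    TextPair(s"summarize: $document", summary)

  def translation(english: String, german: String): TextPair =
    TextPair(s"translate English to German: $english", german)

  // Regression as text: the similarity score is rounded to the nearest 0.2
  // and emitted as a string such as "2.6" instead of a number.
  def similarity(sentence1: String, sentence2: String, score: Double): TextPair = {
    val rounded = math.round(score * 5) / 5.0
    TextPair(
      s"stsb sentence1: $sentence1 sentence2: $sentence2",
      "%.1f".formatLocal(Locale.ROOT, rounded)
    )
  }
}
```

    Because every pair has the same shape, one model and one loss function can in principle be trained on all of them; only the prefix tells the model which task is being requested.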

    Pretrained models can be loaded with the pretrained method of the companion object:

    val t5 = T5Transformer.pretrained()
      .setTask("summarize:")
      .setInputCols("document")
      .setOutputCol("summaries")

    If no name is provided, the default model is "t5_small". For available pretrained models, see the Models Hub.

    For extended examples of usage, see the Spark NLP Workshop and the T5TestSpec.

    References:

    Paper Abstract:

    Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new Colossal Clean Crawled Corpus, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.

    Note:

    This is a very computationally expensive module, especially on longer sequences. The use of an accelerator such as a GPU is recommended.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.seq2seq.T5Transformer
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("documents")
    
    val t5 = T5Transformer.pretrained("t5_small")
      .setTask("summarize:")
      .setInputCols(Array("documents"))
      .setMaxOutputLength(200)
      .setOutputCol("summaries")
    
    val pipeline = new Pipeline().setStages(Array(documentAssembler, t5))
    
    val data = Seq(
      "Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a " +
        "downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness" +
        " of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this " +
        "paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework " +
        "that converts all text-based language problems into a text-to-text format. Our systematic study compares " +
        "pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens " +
        "of language understanding tasks. By combining the insights from our exploration with scale and our new " +
        "Colossal Clean Crawled Corpus, we achieve state-of-the-art results on many benchmarks covering " +
        "summarization, question answering, text classification, and more. To facilitate future work on transfer " +
        "learning for NLP, we release our data set, pre-trained models, and code."
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("summaries.result").show(false)
    +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |result                                                                                                                                                                                                        |
    +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |[transfer learning has emerged as a powerful technique in natural language processing (NLP) the effectiveness of transfer learning has given rise to a diversity of approaches, methodologies, and practice .]|
    +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Value Members

  1. object GPT2Transformer extends ReadablePretrainedGPT2TransformerModel with ReadGPT2TransformerTensorflowModel with Serializable
  2. object MarianTransformer extends ReadablePretrainedMarianMTModel with ReadMarianMTTensorflowModel with ReadSentencePieceModel with Serializable

    This is the companion object of MarianTransformer. Please refer to that class for the documentation.

  3. object T5Transformer extends ReadablePretrainedT5TransformerModel with ReadT5TransformerTensorflowModel with ReadSentencePieceModel with Serializable

    This is the companion object of T5Transformer. Please refer to that class for the documentation.
