Packages

package dl

Type Members

  1. class AlbertForQuestionAnswering extends AnnotatorModel[AlbertForQuestionAnswering] with HasBatchedAnnotate[AlbertForQuestionAnswering] with WriteTensorflowModel with WriteSentencePieceModel with HasCaseSensitiveProperties with HasEngine

    AlbertForQuestionAnswering can load ALBERT Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits).
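
    As a rough illustration of how span start logits and span end logits can be turned into an answer, the standalone Scala sketch below picks the pair of token positions with the highest combined score. It is not Spark NLP's implementation: the names SpanSelectionSketch, bestSpan and maxAnswerLen are hypothetical, and the logits are made-up values rather than model output.

```scala
object SpanSelectionSketch {

  /** Returns (start, end) maximising startLogits(start) + endLogits(end),
    * subject to start <= end and a span of at most maxAnswerLen tokens.
    * Hypothetical helper for illustration only.
    */
  def bestSpan(
      startLogits: Array[Float],
      endLogits: Array[Float],
      maxAnswerLen: Int): (Int, Int) = {
    require(startLogits.nonEmpty, "logits must not be empty")
    require(startLogits.length == endLogits.length, "logit arrays must have equal length")
    require(maxAnswerLen >= 1, "maxAnswerLen must be at least 1")

    var best = (0, 0)
    var bestScore = Float.NegativeInfinity
    for {
      s <- startLogits.indices
      e <- s until math.min(s + maxAnswerLen, endLogits.length)
    } {
      val score = startLogits(s) + endLogits(e)
      if (score > bestScore) {
        bestScore = score
        best = (s, e)
      }
    }
    best
  }

  def main(args: Array[String]): Unit = {
    val tokens = Array("My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", ".")
    // Made-up logits, one value per context token (assumption, not real model output).
    val startLogits = Array(0.1f, 0.0f, 0.2f, 4.0f, 0.1f, 0.0f, 0.1f, 0.3f, 1.5f, 0.0f)
    val endLogits = Array(0.0f, 0.1f, 0.1f, 3.5f, 0.2f, 0.0f, 0.1f, 0.1f, 2.0f, 0.1f)

    val (start, end) = bestSpan(startLogits, endLogits, maxAnswerLen = 5)
    println(tokens.slice(start, end + 1).mkString(" "))
  }
}
```

    The length limit keeps the search from pairing a confident start with a far-away end; real decoders typically add further constraints, such as restricting the span to the context part of the input.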

    Pretrained models can be loaded with pretrained of the companion object:

    val spanClassifier = AlbertForQuestionAnswering.pretrained()
      .setInputCols(Array("document_question", "document_context"))
      .setOutputCol("answer")

    The default model is "albert_base_qa_squad2", if no name is provided.

    For available pretrained models please see the Models Hub.

    To see which models are compatible and how to import them, see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669. For more extended examples, see AlbertForQuestionAnsweringTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val document = new MultiDocumentAssembler()
      .setInputCols("question", "context")
      .setOutputCols("document_question", "document_context")
    
    val questionAnswering = AlbertForQuestionAnswering.pretrained()
      .setInputCols(Array("document_question", "document_context"))
      .setOutputCol("answer")
      .setCaseSensitive(false)
    
    val pipeline = new Pipeline().setStages(Array(
      document,
      questionAnswering
    ))
    
    // One row with two columns, so the input must be a Seq of tuples
    val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
    val result = pipeline.fit(data).transform(data)
    
    // The output column configured above is "answer"
    result.select("answer.result").show(false)
    +---------------------+
    |result               |
    +---------------------+
    |[Clara]              |
    +---------------------+
    See also

    AlbertForSequenceClassification for sequence-level classification

    Annotators Main Page for a list of transformer based classifiers

  2. class AlbertForSequenceClassification extends AnnotatorModel[AlbertForSequenceClassification] with HasBatchedAnnotate[AlbertForSequenceClassification] with WriteTensorflowModel with WriteSentencePieceModel with HasCaseSensitiveProperties with HasClassifierActivationProperties with HasEngine

    AlbertForSequenceClassification can load ALBERT Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks.
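
    To make the classification head more concrete, the standalone sketch below shows how per-class scores (logits) produced by such a linear layer could be converted into probabilities and mapped to a label. It assumes a softmax activation; the names SoftmaxSketch, softmax and predict, as well as the logits, are invented for illustration and are not part of Spark NLP.

```scala
object SoftmaxSketch {

  /** Numerically stable softmax: subtract the largest logit before exponentiating. */
  def softmax(logits: Array[Double]): Array[Double] = {
    require(logits.nonEmpty, "logits must not be empty")
    val maxLogit = logits.reduce((a, b) => math.max(a, b))
    val exps = logits.map(x => math.exp(x - maxLogit))
    val total = exps.sum
    exps.map(_ / total)
  }

  /** Returns the most probable label together with its probability. */
  def predict(labels: Array[String], logits: Array[Double]): (String, Double) = {
    require(labels.length == logits.length, "one logit per label is expected")
    val probs = softmax(logits)
    // Index of the largest probability; the first one wins on ties.
    val best = probs.indices.reduce((i, j) => if (probs(j) > probs(i)) j else i)
    (labels(best), probs(best))
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical label set and logits for a single document.
    val labels = Array("neg", "pos")
    val logits = Array(-1.2, 2.3)
    val (label, probability) = predict(labels, logits)
    println(f"$label ($probability%.3f)")
  }
}
```

    Softmax suits single-label tasks because the probabilities sum to one; a multi-label setup would instead score each class independently, for example with a sigmoid.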

    Pretrained models can be loaded with pretrained of the companion object:

    val sequenceClassifier = AlbertForSequenceClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")

    The default model is "albert_base_sequence_classifier_imdb", if no name is provided.

    For available pretrained models please see the Models Hub.

    To see which models are compatible and how to import them, see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669. For more extended examples, see AlbertForSequenceClassificationTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val sequenceClassifier = AlbertForSequenceClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")
      .setCaseSensitive(true)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      sequenceClassifier
    ))
    
    val data = Seq("I loved this movie when I was a child.", "It was pretty boring.").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("label.result").show(false)
    +------+
    |result|
    +------+
    |[pos] |
    |[neg] |
    +------+
    See also

    AlbertForSequenceClassification for sequence-level classification

    Annotators Main Page for a list of transformer based classifiers

  3. class AlbertForTokenClassification extends AnnotatorModel[AlbertForTokenClassification] with HasBatchedAnnotate[AlbertForTokenClassification] with WriteTensorflowModel with WriteSentencePieceModel with HasCaseSensitiveProperties with HasEngine

    AlbertForTokenClassification can load ALBERT Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.

    Pretrained models can be loaded with pretrained of the companion object:

    val tokenClassifier = AlbertForTokenClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")

    The default model is "albert_base_token_classifier_conll03", if no name is provided.

    For available pretrained models please see the Models Hub.

    To see which models are compatible and how to import them, see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669. For more extended examples, see AlbertForTokenClassificationTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val tokenClassifier = AlbertForTokenClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")
      .setCaseSensitive(true)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      tokenClassifier
    ))
    
    val data = Seq("John Lenon was born in London and lived in Paris. My name is Sarah and I live in London").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("label.result").show(false)
    +------------------------------------------------------------------------------------+
    |result                                                                              |
    +------------------------------------------------------------------------------------+
    |[B-PER, I-PER, O, O, O, B-LOC, O, O, O, B-LOC, O, O, O, O, B-PER, O, O, O, O, B-LOC]|
    +------------------------------------------------------------------------------------+
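
    The per-token labels above use B-/I-/O prefixes. As a rough sketch of how such labels could be grouped into entity mentions downstream, the standalone code below merges consecutive tokens of the same type. The names BioDecodeSketch, Entity and decode are hypothetical and not part of Spark NLP, and the token list is written out by hand to match the 20 labels shown.

```scala
import scala.collection.mutable.ListBuffer

object BioDecodeSketch {

  final case class Entity(label: String, text: String)

  /** Groups B-/I-/O tagged tokens into entities. An I- tag with no matching
    * open entity starts a new one, so IO-style tags are handled as well.
    */
  def decode(tokens: Seq[String], tags: Seq[String]): List[Entity] = {
    require(tokens.length == tags.length, "one tag per token is expected")
    val done = ListBuffer.empty[Entity]
    var current: Option[(String, Vector[String])] = None

    // Closes the entity currently being built, if any.
    def flush(): Unit = {
      current.foreach { case (label, words) => done += Entity(label, words.mkString(" ")) }
      current = None
    }

    tokens.zip(tags).foreach { case (token, tag) =>
      tag.split("-", 2) match {
        // I-X continues the open entity when the type matches
        case Array("I", label) if current.exists(_._1 == label) =>
          current = current.map { case (l, words) => (l, words :+ token) }
        // B-X, or an I-X without a matching open entity, starts a new one
        case Array("B" | "I", label) =>
          flush()
          current = Some((label, Vector(token)))
        // "O" or anything unrecognised closes the open entity
        case _ =>
          flush()
      }
    }
    flush()
    done.toList
  }

  def main(args: Array[String]): Unit = {
    val tokens = Seq("John", "Lenon", "was", "born", "in", "London", "and", "lived", "in", "Paris", ".",
      "My", "name", "is", "Sarah", "and", "I", "live", "in", "London")
    val tags = Seq("B-PER", "I-PER", "O", "O", "O", "B-LOC", "O", "O", "O", "B-LOC", "O",
      "O", "O", "O", "B-PER", "O", "O", "O", "O", "B-LOC")
    decode(tokens, tags).foreach(e => println(s"${e.label}: ${e.text}"))
  }
}
```
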
    See also

    AlbertForTokenClassification for token-level classification

    Annotators Main Page for a list of transformer based classifiers

  4. class BertForQuestionAnswering extends AnnotatorModel[BertForQuestionAnswering] with HasBatchedAnnotate[BertForQuestionAnswering] with WriteTensorflowModel with HasCaseSensitiveProperties with HasEngine

    BertForQuestionAnswering can load Bert Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits).

    Pretrained models can be loaded with pretrained of the companion object:

    val spanClassifier = BertForQuestionAnswering.pretrained()
      .setInputCols(Array("document_question", "document_context"))
      .setOutputCol("answer")

    The default model is "bert_base_cased_qa_squad2", if no name is provided.

    For available pretrained models please see the Models Hub.

    Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see how to import them, see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669. For more extended examples, see BertForQuestionAnsweringTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val document = new MultiDocumentAssembler()
      .setInputCols("question", "context")
      .setOutputCols("document_question", "document_context")
    
    val questionAnswering = BertForQuestionAnswering.pretrained()
      .setInputCols(Array("document_question", "document_context"))
      .setOutputCol("answer")
      .setCaseSensitive(true)
    
    val pipeline = new Pipeline().setStages(Array(
      document,
      questionAnswering
    ))
    
    // One row with two columns, so the input must be a Seq of tuples
    val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
    val result = pipeline.fit(data).transform(data)
    
    // The output column configured above is "answer"
    result.select("answer.result").show(false)
    +---------------------+
    |result               |
    +---------------------+
    |[Clara]              |
    +---------------------+
    See also

    BertForSequenceClassification for sequence-level classification

    Annotators Main Page for a list of transformer based classifiers

  5. class BertForSequenceClassification extends AnnotatorModel[BertForSequenceClassification] with HasBatchedAnnotate[BertForSequenceClassification] with WriteTensorflowModel with HasCaseSensitiveProperties with HasClassifierActivationProperties with HasEngine

    BertForSequenceClassification can load Bert Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks.

    Pretrained models can be loaded with pretrained of the companion object:

    val sequenceClassifier = BertForSequenceClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")

    The default model is "bert_base_sequence_classifier_imdb", if no name is provided.

    For available pretrained models please see the Models Hub.

    To see which models are compatible and how to import them, see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669. For more extended examples, see BertForSequenceClassificationTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val sequenceClassifier = BertForSequenceClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")
      .setCaseSensitive(true)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      sequenceClassifier
    ))
    
    val data = Seq("I loved this movie when I was a child.", "It was pretty boring.").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("label.result").show(false)
    +------+
    |result|
    +------+
    |[pos] |
    |[neg] |
    +------+
    See also

    BertForSequenceClassification for sequence-level classification

    Annotators Main Page for a list of transformer based classifiers

  6. class BertForTokenClassification extends AnnotatorModel[BertForTokenClassification] with HasBatchedAnnotate[BertForTokenClassification] with WriteTensorflowModel with HasCaseSensitiveProperties with HasEngine

    BertForTokenClassification can load Bert Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.

    Pretrained models can be loaded with pretrained of the companion object:

    val tokenClassifier = BertForTokenClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")

    The default model is "bert_base_token_classifier_conll03", if no name is provided.

    For available pretrained models please see the Models Hub.

    To see which models are compatible and how to import them, see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669. For more extended examples, see BertForTokenClassificationTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val tokenClassifier = BertForTokenClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")
      .setCaseSensitive(true)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      tokenClassifier
    ))
    
    val data = Seq("John Lenon was born in London and lived in Paris. My name is Sarah and I live in London").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("label.result").show(false)
    +------------------------------------------------------------------------------------+
    |result                                                                              |
    +------------------------------------------------------------------------------------+
    |[B-PER, I-PER, O, O, O, B-LOC, O, O, O, B-LOC, O, O, O, O, B-PER, O, O, O, O, B-LOC]|
    +------------------------------------------------------------------------------------+
    See also

    BertForTokenClassification for token-level classification

    Annotators Main Page for a list of transformer based classifiers

  7. class CamemBertForSequenceClassification extends AnnotatorModel[CamemBertForSequenceClassification] with HasBatchedAnnotate[CamemBertForSequenceClassification] with WriteTensorflowModel with WriteSentencePieceModel with HasCaseSensitiveProperties with HasClassifierActivationProperties with HasEngine

    CamemBertForSequenceClassification can load CamemBERT Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks.

    Pretrained models can be loaded with pretrained of the companion object:

    val sequenceClassifier = CamemBertForSequenceClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")

    The default model is "camembert_base_sequence_classifier_allocine", if no name is provided.

    For available pretrained models please see the Models Hub.

    To see which models are compatible and how to import them, see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669. For more extended examples, see CamemBertForSequenceClassificationTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val sequenceClassifier = CamemBertForSequenceClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")
      .setCaseSensitive(true)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      sequenceClassifier
    ))
    
    val data = Seq("j'ai adoré ce film lorsque j'étais enfant.", "Je déteste ça.").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("label.result").show(false)
    +------+
    |result|
    +------+
    |[pos] |
    |[neg] |
    +------+
    See also

    CamemBertForSequenceClassification for sequence-level classification

    Annotators Main Page for a list of transformer based classifiers

  8. class CamemBertForTokenClassification extends AnnotatorModel[CamemBertForTokenClassification] with HasBatchedAnnotate[CamemBertForTokenClassification] with WriteTensorflowModel with WriteSentencePieceModel with HasCaseSensitiveProperties with HasEngine

    CamemBertForTokenClassification can load CamemBERT Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.

    Pretrained models can be loaded with pretrained of the companion object:

    val tokenClassifier = CamemBertForTokenClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")

    The default model is "camembert_base_token_classifier_wikiner", if no name is provided.

    For available pretrained models please see the Models Hub.

    To see which models are compatible and how to import them, see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669. For more extended examples, see CamemBertForTokenClassificationTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val tokenClassifier = CamemBertForTokenClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")
      .setCaseSensitive(true)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      tokenClassifier
    ))
    
    val data = Seq("george washington est allé à washington").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("label.result").show(false)
    +------------------------------+
    |result                        |
    +------------------------------+
    |[I-PER, I-PER, O, O, O, I-LOC]|
    +------------------------------+
    See also

    CamemBertForTokenClassification for token-level classification

    Annotators Main Page for a list of transformer based classifiers

  9. class ClassifierDLApproach extends AnnotatorApproach[ClassifierDLModel] with ParamsAndFeaturesWritable with ClassifierEncoder

    Trains a ClassifierDL for generic Multi-class Text Classification.

    ClassifierDL uses the state-of-the-art Universal Sentence Encoder as an input for text classifications. The ClassifierDL annotator uses a deep learning model (DNNs) we have built inside TensorFlow and supports up to 100 classes.

    For instantiated/pretrained models, see ClassifierDLModel.

    Notes:

    Setting a test dataset to monitor model metrics can be done with .setTestDataset. The method expects a path to a parquet file containing a dataframe that has the same required columns as the training dataframe. The pre-processing steps for the training dataframe should also be applied to the test dataframe. The following example will show how to create the test dataset:

    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val embeddings = UniversalSentenceEncoder.pretrained()
      .setInputCols("document")
      .setOutputCol("sentence_embeddings")
    
    val preProcessingPipeline = new Pipeline().setStages(Array(documentAssembler, embeddings))
    
    val Array(train, test) = data.randomSplit(Array(0.8, 0.2))
    preProcessingPipeline
      .fit(test)
      .transform(test)
      .write
      .mode("overwrite")
      .parquet("test_data")
    
    val classifier = new ClassifierDLApproach()
      .setInputCols("sentence_embeddings")
      .setOutputCol("category")
      .setLabelColumn("label")
      .setTestDataset("test_data")

    For extended examples of usage, see the Spark NLP Workshop and the ClassifierDLTestSpec.

    Example

    In this example, the training data "sentiment.csv" has the form of

    text,label
    This movie is the best movie I have wached ever! In my opinion this movie can win an award.,0
    This was a terrible movie! The acting was bad really bad!,1
    ...

    Then training can be done like so:

    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder
    import com.johnsnowlabs.nlp.annotators.classifier.dl.ClassifierDLApproach
    import org.apache.spark.ml.Pipeline
    
    val smallCorpus = spark.read.option("header","true").csv("src/test/resources/classifier/sentiment.csv")
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val useEmbeddings = UniversalSentenceEncoder.pretrained()
      .setInputCols("document")
      .setOutputCol("sentence_embeddings")
    
    val docClassifier = new ClassifierDLApproach()
      .setInputCols("sentence_embeddings")
      .setOutputCol("category")
      .setLabelColumn("label")
      .setBatchSize(64)
      .setMaxEpochs(20)
      .setLr(5e-3f)
      .setDropout(0.5f)
    
    val pipeline = new Pipeline()
      .setStages(
        Array(
          documentAssembler,
          useEmbeddings,
          docClassifier
        )
      )
    
    val pipelineModel = pipeline.fit(smallCorpus)
    See also

    MultiClassifierDLApproach for multi-label classification

    SentimentDLApproach for sentiment analysis

  10. class ClassifierDLModel extends AnnotatorModel[ClassifierDLModel] with HasSimpleAnnotate[ClassifierDLModel] with WriteTensorflowModel with HasStorageRef with ParamsAndFeaturesWritable with HasEngine

    ClassifierDL for generic Multi-class Text Classification.

    ClassifierDL uses the state-of-the-art Universal Sentence Encoder as an input for text classifications. The ClassifierDL annotator uses a deep learning model (DNNs) we have built inside TensorFlow and supports up to 100 classes.

    This is the instantiated model of the ClassifierDLApproach. For training your own model, please see the documentation of that class.

    Pretrained models can be loaded with pretrained of the companion object:

    val classifierDL = ClassifierDLModel.pretrained()
      .setInputCols("sentence_embeddings")
      .setOutputCol("classification")

    The default model is "classifierdl_use_trec6", if no name is provided. It uses embeddings from the UniversalSentenceEncoder and is trained on the TREC-6 dataset. For available pretrained models please see the Models Hub.

    For extended examples of usage, see the Spark NLP Workshop and the ClassifierDLTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.SentenceDetector
    import com.johnsnowlabs.nlp.annotators.classifier.dl.ClassifierDLModel
    import com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val sentence = new SentenceDetector()
      .setInputCols("document")
      .setOutputCol("sentence")
    
    val useEmbeddings = UniversalSentenceEncoder.pretrained()
      .setInputCols("document")
      .setOutputCol("sentence_embeddings")
    
    val sarcasmDL = ClassifierDLModel.pretrained("classifierdl_use_sarcasm")
      .setInputCols("sentence_embeddings")
      .setOutputCol("sarcasm")
    
    val pipeline = new Pipeline()
      .setStages(Array(
        documentAssembler,
        sentence,
        useEmbeddings,
        sarcasmDL
      ))
    
    val data = Seq(
      "I'm ready!",
      "If I could put into words how much I love waking up at 6 am on Mondays I would."
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.selectExpr("explode(arrays_zip(sentence, sarcasm)) as out")
      .selectExpr("out.sentence.result as sentence", "out.sarcasm.result as sarcasm")
      .show(false)
    +-------------------------------------------------------------------------------+-------+
    |sentence                                                                       |sarcasm|
    +-------------------------------------------------------------------------------+-------+
    |I'm ready!                                                                     |normal |
    |If I could put into words how much I love waking up at 6 am on Mondays I would.|sarcasm|
    +-------------------------------------------------------------------------------+-------+
    See also

    MultiClassifierDLModel for multi-label classification

    SentimentDLModel for sentiment analysis

  11. trait ClassifierEncoder extends EvaluationDLParams
  12. trait ClassifierMetrics extends Logging
  13. class DeBertaForQuestionAnswering extends AnnotatorModel[DeBertaForQuestionAnswering] with HasBatchedAnnotate[DeBertaForQuestionAnswering] with WriteTensorflowModel with WriteSentencePieceModel with HasCaseSensitiveProperties with HasEngine

    DeBertaForQuestionAnswering can load DeBERTa Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits).

    Pretrained models can be loaded with pretrained of the companion object:

    val spanClassifier = DeBertaForQuestionAnswering.pretrained()
      .setInputCols(Array("document_question", "document_context"))
      .setOutputCol("answer")

    The default model is "deberta_v3_xsmall_qa_squad2", if no name is provided.

    For available pretrained models please see the Models Hub.

    To see which models are compatible and how to import them, see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669. For more extended examples, see DeBertaForQuestionAnsweringTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val document = new MultiDocumentAssembler()
      .setInputCols("question", "context")
      .setOutputCols("document_question", "document_context")
    
    val questionAnswering = DeBertaForQuestionAnswering.pretrained()
      .setInputCols(Array("document_question", "document_context"))
      .setOutputCol("answer")
      .setCaseSensitive(true)
    
    val pipeline = new Pipeline().setStages(Array(
      document,
      questionAnswering
    ))
    
    // One row with two columns, so the input must be a Seq of tuples
    val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
    val result = pipeline.fit(data).transform(data)
    
    // The output column configured above is "answer"
    result.select("answer.result").show(false)
    +---------------------+
    |result               |
    +---------------------+
    |[Clara]              |
    +---------------------+
    See also

    DeBertaForQuestionAnswering for span-level classification

    Annotators Main Page for a list of transformer based classifiers

  14. class DeBertaForSequenceClassification extends AnnotatorModel[DeBertaForSequenceClassification] with HasBatchedAnnotate[DeBertaForSequenceClassification] with WriteTensorflowModel with WriteSentencePieceModel with HasCaseSensitiveProperties with HasClassifierActivationProperties with HasEngine

    DeBertaForSequenceClassification can load DeBERTa v2 and v3 models with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for multi-class document classification tasks.

    Pretrained models can be loaded with pretrained of the companion object:

    val sequenceClassifier = DeBertaForSequenceClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")

    The default model is "deberta_v3_xsmall_sequence_classifier_imdb", if no name is provided.

    For available pretrained models please see the Models Hub.

    To see which models are compatible and how to import them, see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669. For more extended examples, see DeBertaForSequenceClassificationTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val sequenceClassifier = DeBertaForSequenceClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")
      .setCaseSensitive(true)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      sequenceClassifier
    ))
    
    val data = Seq("I loved this movie when I was a child.", "It was pretty boring.").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("label.result").show(false)
    +------+
    |result|
    +------+
    |[pos] |
    |[neg] |
    +------+
    See also

    DeBertaForSequenceClassification for sequence-level classification

    Annotators Main Page for a list of transformer based classifiers

  15. class DeBertaForTokenClassification extends AnnotatorModel[DeBertaForTokenClassification] with HasBatchedAnnotate[DeBertaForTokenClassification] with WriteTensorflowModel with WriteSentencePieceModel with HasCaseSensitiveProperties with HasEngine

    DeBertaForTokenClassification can load DeBERTa v2 and v3 models with a token classification head on top (a linear layer on top of the hidden-states output), e.g. for Named-Entity-Recognition (NER) tasks.

    Pretrained models can be loaded with pretrained of the companion object:

    val tokenClassifier = DeBertaForTokenClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")

    The default model is "deberta_v3_xsmall_token_classifier_conll03", if no name is provided.

    For available pretrained models please see the Models Hub.

    For extended examples of usage, see the DeBertaForTokenClassificationTestSpec. Models from the HuggingFace 🤗 Transformers library are also compatible with Spark NLP 🚀. To see which models are compatible and how to import them, see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val tokenClassifier = DeBertaForTokenClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")
      .setCaseSensitive(true)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      tokenClassifier
    ))
    
    val data = Seq("John Lenon was born in London and lived in Paris. My name is Sarah and I live in London").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("label.result").show(false)
    +------------------------------------------------------------------------------------+
    |result                                                                              |
    +------------------------------------------------------------------------------------+
    |[B-PER, I-PER, O, O, O, B-LOC, O, O, O, B-LOC, O, O, O, O, B-PER, O, O, O, O, B-LOC]|
    +------------------------------------------------------------------------------------+
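
    The tags in the output above follow the BIO scheme: B- opens an entity, I- continues it, and O marks tokens outside any entity. As a minimal sketch of how such tags could be grouped into entity chunks, the plain-Scala helper below can be used. BioDecoder and Entity are hypothetical names for illustration and are not part of Spark NLP; inside a pipeline, the NerConverter annotator is the usual way to do this grouping.

```scala
// Hypothetical helper (not part of Spark NLP): groups BIO tags into entity chunks.
object BioDecoder {
  final case class Entity(label: String, text: String)

  def decode(tokens: Seq[String], tags: Seq[String]): List[Entity] = {
    require(tokens.length == tags.length, "tokens and tags must align")
    // Accumulator: finished entities (in reverse order) and the currently open entity, if any.
    val start = (List.empty[Entity], Option.empty[Entity])
    val (done, open) = tokens.zip(tags).foldLeft(start) {
      case ((acc, current), (token, tag)) =>
        if (tag.startsWith("B-")) {
          // A B- tag closes any open entity and opens a new one.
          (current.toList ::: acc, Some(Entity(tag.drop(2), token)))
        } else if (tag.startsWith("I-") && current.exists(_.label == tag.drop(2))) {
          // An I- tag with a matching label extends the open entity.
          (acc, current.map(e => e.copy(text = e.text + " " + token)))
        } else {
          // O tags (and I- tags without a matching open entity) close the open entity.
          (current.toList ::: acc, None)
        }
    }
    (open.toList ::: done).reverse
  }
}
```

    This sketch assumes one tag per token and simply drops an I- tag that has no matching open entity; a production decoder may choose to handle that case differently.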
    See also

    DeBertaForTokenClassification for token-level classification

    Annotators Main Page for a list of transformer based classifiers

  16. class DistilBertForQuestionAnswering extends AnnotatorModel[DistilBertForQuestionAnswering] with HasBatchedAnnotate[DistilBertForQuestionAnswering] with WriteTensorflowModel with HasCaseSensitiveProperties with HasEngine

    DistilBertForQuestionAnswering can load DistilBert Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits).

    Pretrained models can be loaded with pretrained of the companion object:

    val spanClassifier = DistilBertForQuestionAnswering.pretrained()
      .setInputCols(Array("document_question", "document_context"))
      .setOutputCol("answer")

    The default model is "distilbert_base_cased_qa_squad2", if no name is provided.

    For available pretrained models please see the Models Hub.

    To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see DistilBertForQuestionAnsweringTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val document = new MultiDocumentAssembler()
      .setInputCols("question", "context")
      .setOutputCols("document_question", "document_context")
    
    val questionAnswering = DistilBertForQuestionAnswering.pretrained()
      .setInputCols(Array("document_question", "document_context"))
      .setOutputCol("answer")
      .setCaseSensitive(true)
    
    val pipeline = new Pipeline().setStages(Array(
      document,
      questionAnswering
    ))
    
    val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
    val result = pipeline.fit(data).transform(data)
    
    result.select("answer.result").show(false)
    +---------------------+
    |result               |
    +---------------------+
    |[Clara]              |
    +---------------------+
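
    The span classification head described above produces one start logit and one end logit per token. As a rough illustration of how an answer span could be chosen from them, the sketch below picks the pair with the highest combined score. SpanSelector, bestSpan and maxLen are hypothetical names for this illustration; the decoding Spark NLP performs internally may differ.

```scala
// Hypothetical helper (not part of Spark NLP): picks the best answer span from logits.
object SpanSelector {
  // Returns (start, end) token indices maximizing startLogits(start) + endLogits(end),
  // subject to start <= end and a span length of at most maxLen tokens.
  def bestSpan(startLogits: Array[Float], endLogits: Array[Float], maxLen: Int): (Int, Int) = {
    require(startLogits.nonEmpty && startLogits.length == endLogits.length, "logits must align")
    require(maxLen >= 1, "maxLen must be positive")
    var best = (0, 0)
    var bestScore = Float.NegativeInfinity
    for {
      s <- startLogits.indices
      e <- s until math.min(s + maxLen, endLogits.length)
    } {
      val score = startLogits(s) + endLogits(e)
      if (score > bestScore) {
        bestScore = score
        best = (s, e)
      }
    }
    best
  }
}
```

    The returned indices refer to token positions in the context; mapping them back to text, as in the [Clara] result above, is left out of this sketch.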
    See also

    DistilBertForSequenceClassification for sequence-level classification

    Annotators Main Page for a list of transformer based classifiers

  17. class DistilBertForSequenceClassification extends AnnotatorModel[DistilBertForSequenceClassification] with HasBatchedAnnotate[DistilBertForSequenceClassification] with WriteTensorflowModel with HasCaseSensitiveProperties with HasClassifierActivationProperties with HasEngine

    DistilBertForSequenceClassification can load DistilBERT Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks.

    Pretrained models can be loaded with pretrained of the companion object:

    val sequenceClassifier = DistilBertForSequenceClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")

    The default model is "distilbert_base_sequence_classifier_imdb", if no name is provided.

    For available pretrained models please see the Models Hub.

    To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see DistilBertForSequenceClassificationTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val sequenceClassifier = DistilBertForSequenceClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")
      .setCaseSensitive(true)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      sequenceClassifier
    ))
    
    val data = Seq("I loved this movie when I was a child.", "It was pretty boring.").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("label.result").show(false)
    +------+
    |result|
    +------+
    |[pos] |
    |[neg] |
    +------+
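
    To make the classification head more concrete: the linear layer yields one logit per class, and an activation turns the logits into scores from which the label is picked. The sketch below assumes a softmax activation and two labels as in the example above. SequenceHead and predict are hypothetical names and are not part of Spark NLP.

```scala
// Hypothetical helper (not part of Spark NLP): turns class logits into a predicted label.
object SequenceHead {
  // Numerically stable softmax: subtract the largest logit before exponentiating.
  def softmax(logits: Array[Double]): Array[Double] = {
    val maxLogit = logits.foldLeft(Double.NegativeInfinity)((a, b) => math.max(a, b))
    val exps = logits.map(l => math.exp(l - maxLogit))
    val total = exps.sum
    exps.map(_ / total)
  }

  // Returns the most probable label together with its probability.
  def predict(logits: Array[Double], labels: Array[String]): (String, Double) = {
    require(logits.nonEmpty && logits.length == labels.length, "one logit per label expected")
    val probs = softmax(logits)
    val best = probs.indices.foldLeft(0)((b, i) => if (probs(i) > probs(b)) i else b)
    (labels(best), probs(best))
  }
}
```

    For example, SequenceHead.predict(Array(2.0, 0.0), Array("pos", "neg")) selects "pos", since the first logit is the largest.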
    See also

    DistilBertForSequenceClassification for sequence-level classification

    Annotators Main Page for a list of transformer based classifiers

  18. class DistilBertForTokenClassification extends AnnotatorModel[DistilBertForTokenClassification] with HasBatchedAnnotate[DistilBertForTokenClassification] with WriteTensorflowModel with HasCaseSensitiveProperties with HasEngine

    DistilBertForTokenClassification can load DistilBERT Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.

    Pretrained models can be loaded with pretrained of the companion object:

    val tokenClassifier = DistilBertForTokenClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")

    The default model is "distilbert_base_token_classifier_conll03", if no name is provided.

    For available pretrained models please see the Models Hub.

    To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see DistilBertForTokenClassificationTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val tokenClassifier = DistilBertForTokenClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")
      .setCaseSensitive(true)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      tokenClassifier
    ))
    
    val data = Seq("John Lenon was born in London and lived in Paris. My name is Sarah and I live in London").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("label.result").show(false)
    +------------------------------------------------------------------------------------+
    |result                                                                              |
    +------------------------------------------------------------------------------------+
    |[B-PER, I-PER, O, O, O, B-LOC, O, O, O, B-LOC, O, O, O, O, B-PER, O, O, O, O, B-LOC]|
    +------------------------------------------------------------------------------------+
    See also

    DistilBertForTokenClassification for token-level classification

    Annotators Main Page for a list of transformer based classifiers

  19. class LongformerForQuestionAnswering extends AnnotatorModel[LongformerForQuestionAnswering] with HasBatchedAnnotate[LongformerForQuestionAnswering] with WriteTensorflowModel with HasCaseSensitiveProperties with HasEngine

    LongformerForQuestionAnswering can load Longformer Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits).

    Pretrained models can be loaded with pretrained of the companion object:

    val spanClassifier = LongformerForQuestionAnswering.pretrained()
      .setInputCols(Array("document_question", "document_context"))
      .setOutputCol("answer")

    The default model is "longformer_base_base_qa_squad2", if no name is provided.

    For available pretrained models please see the Models Hub.

    To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see LongformerForQuestionAnsweringTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val document = new MultiDocumentAssembler()
      .setInputCols("question", "context")
      .setOutputCols("document_question", "document_context")
    
    val questionAnswering = LongformerForQuestionAnswering.pretrained()
      .setInputCols(Array("document_question", "document_context"))
      .setOutputCol("answer")
      .setCaseSensitive(true)
    
    val pipeline = new Pipeline().setStages(Array(
      document,
      questionAnswering
    ))
    
    val data = Seq(("What's my name?", "My name is Clara and I live in Berkeley.")).toDF("question", "context")
    val result = pipeline.fit(data).transform(data)
    
    result.select("answer.result").show(false)
    +---------------------+
    |result               |
    +---------------------+
    |[Clara]              |
    +---------------------+
    See also

    LongformerForSequenceClassification for sequence-level classification

    Annotators Main Page for a list of transformer based classifiers

  20. class LongformerForSequenceClassification extends AnnotatorModel[LongformerForSequenceClassification] with HasBatchedAnnotate[LongformerForSequenceClassification] with WriteTensorflowModel with HasCaseSensitiveProperties with HasClassifierActivationProperties with HasEngine

    LongformerForSequenceClassification can load Longformer Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks.

    Pretrained models can be loaded with pretrained of the companion object:

    val sequenceClassifier = LongformerForSequenceClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")

    The default model is "longformer_base_sequence_classifier_imdb", if no name is provided.

    For available pretrained models please see the Models Hub.

    To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see LongformerForSequenceClassificationTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val sequenceClassifier = LongformerForSequenceClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")
      .setCaseSensitive(true)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      sequenceClassifier
    ))
    
    val data = Seq("I loved this movie when I was a child.", "It was pretty boring.").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("label.result").show(false)
    +------+
    |result|
    +------+
    |[pos] |
    |[neg] |
    +------+
    See also

    LongformerForSequenceClassification for sequence-level classification

    Annotators Main Page for a list of transformer based classifiers

  21. class LongformerForTokenClassification extends AnnotatorModel[LongformerForTokenClassification] with HasBatchedAnnotate[LongformerForTokenClassification] with WriteTensorflowModel with HasCaseSensitiveProperties with HasEngine

    LongformerForTokenClassification can load Longformer Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.

    Pretrained models can be loaded with pretrained of the companion object:

    val tokenClassifier = LongformerForTokenClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")

    The default model is "longformer_base_token_classifier_conll03", if no name is provided.

    For available pretrained models please see the Models Hub.

    To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see LongformerForTokenClassificationTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val tokenClassifier = LongformerForTokenClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")
      .setCaseSensitive(true)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      tokenClassifier
    ))
    
    val data = Seq("John Lenon was born in London and lived in Paris. My name is Sarah and I live in London").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("label.result").show(false)
    +------------------------------------------------------------------------------------+
    |result                                                                              |
    +------------------------------------------------------------------------------------+
    |[B-PER, I-PER, O, O, O, B-LOC, O, O, O, B-LOC, O, O, O, O, B-PER, O, O, O, O, B-LOC]|
    +------------------------------------------------------------------------------------+
    See also

    LongformerForTokenClassification for token-level classification

    Annotators Main Page for a list of transformer based classifiers

  22. class MultiClassifierDLApproach extends AnnotatorApproach[MultiClassifierDLModel] with ParamsAndFeaturesWritable with ClassifierEncoder

    Trains a MultiClassifierDL for Multi-label Text Classification.

    MultiClassifierDL uses a Bidirectional GRU with a convolutional model that we have built inside TensorFlow and supports up to 100 classes.

    For instantiated/pretrained models, see MultiClassifierDLModel.

    The input to MultiClassifierDL is Sentence Embeddings such as the state-of-the-art UniversalSentenceEncoder, BertSentenceEmbeddings, or SentenceEmbeddings.

    In machine learning, multi-label classification and the strongly related problem of multi-output classification are variants of the classification problem where multiple labels may be assigned to each instance. Multi-label classification is a generalization of multiclass classification, which is the single-label problem of categorizing instances into precisely one of more than two classes; in the multi-label problem there is no constraint on how many of the classes the instance can be assigned to. Formally, multi-label classification is the problem of finding a model that maps inputs x to binary vectors y (assigning a value of 0 or 1 for each element (label) in y).
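
    The mapping from inputs to binary vectors can be illustrated with a small plain-Scala sketch. It assumes the model emits one independent score (logit) per label, squashes it with a sigmoid, and assigns every label whose probability reaches a threshold. MultiLabelDecoder and its methods are hypothetical names for illustration and are not part of Spark NLP.

```scala
// Hypothetical helper (not part of Spark NLP): turns per-label logits into a binary vector y.
object MultiLabelDecoder {
  def sigmoid(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

  // Each label is decided independently, so any number of labels (including none) may be 1.
  def toBinaryVector(logits: Seq[Double], threshold: Double = 0.5): Vector[Int] =
    logits.map(l => if (sigmoid(l) >= threshold) 1 else 0).toVector

  // Convenience view: the names of the labels that were assigned.
  def labelsFor(logits: Seq[Double], labels: Seq[String], threshold: Double = 0.5): Seq[String] = {
    require(logits.length == labels.length, "one logit per label expected")
    labels.zip(toBinaryVector(logits, threshold)).collect { case (label, 1) => label }
  }
}
```

    Unlike a multiclass decoder that picks a single best class, this one can return several labels or an empty result for the same input.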

    Notes:

    Setting a test dataset to monitor model metrics can be done with .setTestDataset. The method expects a path to a parquet file containing a dataframe that has the same required columns as the training dataframe. The pre-processing steps for the training dataframe should also be applied to the test dataframe. The following example will show how to create the test dataset:

    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val embeddings = UniversalSentenceEncoder.pretrained()
      .setInputCols("document")
      .setOutputCol("sentence_embeddings")
    
    val preProcessingPipeline = new Pipeline().setStages(Array(documentAssembler, embeddings))
    
    val Array(train, test) = data.randomSplit(Array(0.8, 0.2))
    preProcessingPipeline
      .fit(test)
      .transform(test)
      .write
      .mode("overwrite")
      .parquet("test_data")
    
    val multiClassifier = new MultiClassifierDLApproach()
      .setInputCols("sentence_embeddings")
      .setOutputCol("category")
      .setLabelColumn("label")
      .setTestDataset("test_data")

    For extended examples of usage, see the Spark NLP Workshop and the MultiClassifierDLTestSpec.

    Example

    In this example, the training data has the form (Note: labels can be arbitrary)

    mr,ref
    "name[Alimentum], area[city centre], familyFriendly[no], near[Burger King]",Alimentum is an adult establish found in the city centre area near Burger King.
    "name[Alimentum], area[city centre], familyFriendly[yes]",Alimentum is a family-friendly place in the city centre.
    ...

    It needs some pre-processing first, so the labels are of type Array[String]. This can be done like so:

    import spark.implicits._
    import com.johnsnowlabs.nlp.annotators.classifier.dl.MultiClassifierDLApproach
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.sql.functions.{col, udf}
    
    // Process training data to create text with associated array of labels
    def splitAndTrim = udf { labels: String =>
      labels.split(", ").map(x=>x.trim)
    }
    
    val smallCorpus = spark.read
      .option("header", true)
      .option("inferSchema", true)
      .option("mode", "DROPMALFORMED")
      .csv("src/test/resources/classifier/e2e.csv")
      .withColumn("labels", splitAndTrim(col("mr")))
      .withColumn("text", col("ref"))
      .drop("mr")
    
    smallCorpus.printSchema()
    // root
    // |-- ref: string (nullable = true)
    // |-- labels: array (nullable = true)
    // |    |-- element: string (containsNull = true)
    // |-- text: string (nullable = true)
    
    // Then create pipeline for training
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
      .setCleanupMode("shrink")
    
    val embeddings = UniversalSentenceEncoder.pretrained()
      .setInputCols("document")
      .setOutputCol("embeddings")
    
    val docClassifier = new MultiClassifierDLApproach()
      .setInputCols("embeddings")
      .setOutputCol("category")
      .setLabelColumn("labels")
      .setBatchSize(128)
      .setMaxEpochs(10)
      .setLr(1e-3f)
      .setThreshold(0.5f)
      .setValidationSplit(0.1f)
    
    val pipeline = new Pipeline()
      .setStages(
        Array(
          documentAssembler,
          embeddings,
          docClassifier
        )
      )
    
    val pipelineModel = pipeline.fit(smallCorpus)
    See also

    Multi-label classification on Wikipedia

    ClassifierDLApproach for single-class classification

    SentimentDLApproach for sentiment analysis

  23. class MultiClassifierDLModel extends AnnotatorModel[MultiClassifierDLModel] with HasSimpleAnnotate[MultiClassifierDLModel] with WriteTensorflowModel with HasStorageRef with ParamsAndFeaturesWritable with HasEngine

    MultiClassifierDL for Multi-label Text Classification.

    MultiClassifierDL uses a Bidirectional GRU with a convolutional model that we have built inside TensorFlow and supports up to 100 classes. The input to MultiClassifierDL is Sentence Embeddings such as the state-of-the-art UniversalSentenceEncoder, BertSentenceEmbeddings, or SentenceEmbeddings.

    This is the instantiated model of the MultiClassifierDLApproach. For training your own model, please see the documentation of that class.

    Pretrained models can be loaded with pretrained of the companion object:

    val multiClassifier = MultiClassifierDLModel.pretrained()
      .setInputCols("sentence_embeddings")
      .setOutputCol("categories")

    The default model is "multiclassifierdl_use_toxic", if no name is provided. It uses embeddings from the UniversalSentenceEncoder and classifies toxic comments. The data is based on the Jigsaw Toxic Comment Classification Challenge. For available pretrained models please see the Models Hub.

    In machine learning, multi-label classification and the strongly related problem of multi-output classification are variants of the classification problem where multiple labels may be assigned to each instance. Multi-label classification is a generalization of multiclass classification, which is the single-label problem of categorizing instances into precisely one of more than two classes; in the multi-label problem there is no constraint on how many of the classes the instance can be assigned to. Formally, multi-label classification is the problem of finding a model that maps inputs x to binary vectors y (assigning a value of 0 or 1 for each element (label) in y).

    For extended examples of usage, see the Spark NLP Workshop and the MultiClassifierDLTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotators.classifier.dl.MultiClassifierDLModel
    import com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val useEmbeddings = UniversalSentenceEncoder.pretrained()
      .setInputCols("document")
      .setOutputCol("sentence_embeddings")
    
    val multiClassifierDl = MultiClassifierDLModel.pretrained()
      .setInputCols("sentence_embeddings")
      .setOutputCol("classifications")
    
    val pipeline = new Pipeline()
      .setStages(Array(
        documentAssembler,
        useEmbeddings,
        multiClassifierDl
      ))
    
    val data = Seq(
      "This is pretty good stuff!",
      "Wtf kind of crap is this"
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("text", "classifications.result").show(false)
    +--------------------------+----------------+
    |text                      |result          |
    +--------------------------+----------------+
    |This is pretty good stuff!|[]              |
    |Wtf kind of crap is this  |[toxic, obscene]|
    +--------------------------+----------------+
    See also

    Multi-label classification on Wikipedia

    ClassifierDLModel for single-class classification

    SentimentDLModel for sentiment analysis

  24. trait ReadAlbertForQATensorflowModel extends ReadTensorflowModel with ReadSentencePieceModel
  25. trait ReadAlbertForSequenceTensorflowModel extends ReadTensorflowModel with ReadSentencePieceModel
  26. trait ReadAlbertForTokenTensorflowModel extends ReadTensorflowModel with ReadSentencePieceModel
  27. trait ReadBertForQATensorflowModel extends ReadTensorflowModel
  28. trait ReadBertForSequenceTensorflowModel extends ReadTensorflowModel
  29. trait ReadBertForTokenTensorflowModel extends ReadTensorflowModel
  30. trait ReadCamemBertForSequenceTensorflowModel extends ReadTensorflowModel with ReadSentencePieceModel
  31. trait ReadCamemBertForTokenTensorflowModel extends ReadTensorflowModel with ReadSentencePieceModel
  32. trait ReadClassifierDLTensorflowModel extends ReadTensorflowModel
  33. trait ReadDeBertaForQATensorflowModel extends ReadTensorflowModel with ReadSentencePieceModel
  34. trait ReadDeBertaForSequenceTensorflowModel extends ReadTensorflowModel with ReadSentencePieceModel
  35. trait ReadDeBertaForTokenTensorflowModel extends ReadTensorflowModel with ReadSentencePieceModel
  36. trait ReadDistilBertForQATensorflowModel extends ReadTensorflowModel
  37. trait ReadDistilBertForSequenceTensorflowModel extends ReadTensorflowModel
  38. trait ReadDistilBertForTokenTensorflowModel extends ReadTensorflowModel
  39. trait ReadLongformerForQATensorflowModel extends ReadTensorflowModel
  40. trait ReadLongformerForSequenceTensorflowModel extends ReadTensorflowModel
  41. trait ReadLongformerForTokenTensorflowModel extends ReadTensorflowModel
  42. trait ReadMultiClassifierDLTensorflowModel extends ReadTensorflowModel
  43. trait ReadRoBertaForQATensorflowModel extends ReadTensorflowModel
  44. trait ReadRoBertaForSequenceTensorflowModel extends ReadTensorflowModel
  45. trait ReadRoBertaForTokenTensorflowModel extends ReadTensorflowModel
  46. trait ReadSentimentDLTensorflowModel extends ReadTensorflowModel
  47. trait ReadTapasForQATensorflowModel extends ReadTensorflowModel
  48. trait ReadXlmRoBertaForQATensorflowModel extends ReadTensorflowModel with ReadSentencePieceModel
  49. trait ReadXlmRoBertaForSequenceTensorflowModel extends ReadTensorflowModel with ReadSentencePieceModel
  50. trait ReadXlmRoBertaForTokenTensorflowModel extends ReadTensorflowModel with ReadSentencePieceModel
  51. trait ReadXlnetForSequenceTensorflowModel extends ReadTensorflowModel with ReadSentencePieceModel
  52. trait ReadXlnetForTokenTensorflowModel extends ReadTensorflowModel with ReadSentencePieceModel
  53. trait ReadablePretrainedAlbertForQAModel extends ParamsAndFeaturesReadable[AlbertForQuestionAnswering] with HasPretrained[AlbertForQuestionAnswering]
  54. trait ReadablePretrainedAlbertForSequenceModel extends ParamsAndFeaturesReadable[AlbertForSequenceClassification] with HasPretrained[AlbertForSequenceClassification]
  55. trait ReadablePretrainedAlbertForTokenModel extends ParamsAndFeaturesReadable[AlbertForTokenClassification] with HasPretrained[AlbertForTokenClassification]
  56. trait ReadablePretrainedBertForQAModel extends ParamsAndFeaturesReadable[BertForQuestionAnswering] with HasPretrained[BertForQuestionAnswering]
  57. trait ReadablePretrainedBertForSequenceModel extends ParamsAndFeaturesReadable[BertForSequenceClassification] with HasPretrained[BertForSequenceClassification]
  58. trait ReadablePretrainedBertForTokenModel extends ParamsAndFeaturesReadable[BertForTokenClassification] with HasPretrained[BertForTokenClassification]
  59. trait ReadablePretrainedCamemBertForSequenceModel extends ParamsAndFeaturesReadable[CamemBertForSequenceClassification] with HasPretrained[CamemBertForSequenceClassification]
  60. trait ReadablePretrainedCamemBertForTokenModel extends ParamsAndFeaturesReadable[CamemBertForTokenClassification] with HasPretrained[CamemBertForTokenClassification]
  61. trait ReadablePretrainedClassifierDL extends ParamsAndFeaturesReadable[ClassifierDLModel] with HasPretrained[ClassifierDLModel]
  62. trait ReadablePretrainedDeBertaForQAModel extends ParamsAndFeaturesReadable[DeBertaForQuestionAnswering] with HasPretrained[DeBertaForQuestionAnswering]
  63. trait ReadablePretrainedDeBertaForSequenceModel extends ParamsAndFeaturesReadable[DeBertaForSequenceClassification] with HasPretrained[DeBertaForSequenceClassification]
  64. trait ReadablePretrainedDeBertaForTokenModel extends ParamsAndFeaturesReadable[DeBertaForTokenClassification] with HasPretrained[DeBertaForTokenClassification]
  65. trait ReadablePretrainedDistilBertForQAModel extends ParamsAndFeaturesReadable[DistilBertForQuestionAnswering] with HasPretrained[DistilBertForQuestionAnswering]
  66. trait ReadablePretrainedDistilBertForSequenceModel extends ParamsAndFeaturesReadable[DistilBertForSequenceClassification] with HasPretrained[DistilBertForSequenceClassification]
  67. trait ReadablePretrainedDistilBertForTokenModel extends ParamsAndFeaturesReadable[DistilBertForTokenClassification] with HasPretrained[DistilBertForTokenClassification]
  68. trait ReadablePretrainedLongformerForQAModel extends ParamsAndFeaturesReadable[LongformerForQuestionAnswering] with HasPretrained[LongformerForQuestionAnswering]
  69. trait ReadablePretrainedLongformerForSequenceModel extends ParamsAndFeaturesReadable[LongformerForSequenceClassification] with HasPretrained[LongformerForSequenceClassification]
  70. trait ReadablePretrainedLongformerForTokenModel extends ParamsAndFeaturesReadable[LongformerForTokenClassification] with HasPretrained[LongformerForTokenClassification]
  71. trait ReadablePretrainedMultiClassifierDL extends ParamsAndFeaturesReadable[MultiClassifierDLModel] with HasPretrained[MultiClassifierDLModel]
  72. trait ReadablePretrainedRoBertaForQAModel extends ParamsAndFeaturesReadable[RoBertaForQuestionAnswering] with HasPretrained[RoBertaForQuestionAnswering]
  73. trait ReadablePretrainedRoBertaForSequenceModel extends ParamsAndFeaturesReadable[RoBertaForSequenceClassification] with HasPretrained[RoBertaForSequenceClassification]
  74. trait ReadablePretrainedRoBertaForTokenModel extends ParamsAndFeaturesReadable[RoBertaForTokenClassification] with HasPretrained[RoBertaForTokenClassification]
  75. trait ReadablePretrainedSentimentDL extends ParamsAndFeaturesReadable[SentimentDLModel] with HasPretrained[SentimentDLModel]
  76. trait ReadablePretrainedTapasForQAModel extends ParamsAndFeaturesReadable[TapasForQuestionAnswering] with HasPretrained[TapasForQuestionAnswering]
  77. trait ReadablePretrainedXlmRoBertaForQAModel extends ParamsAndFeaturesReadable[XlmRoBertaForQuestionAnswering] with HasPretrained[XlmRoBertaForQuestionAnswering]
  78. trait ReadablePretrainedXlmRoBertaForSequenceModel extends ParamsAndFeaturesReadable[XlmRoBertaForSequenceClassification] with HasPretrained[XlmRoBertaForSequenceClassification]
  79. trait ReadablePretrainedXlmRoBertaForTokenModel extends ParamsAndFeaturesReadable[XlmRoBertaForTokenClassification] with HasPretrained[XlmRoBertaForTokenClassification]
  80. trait ReadablePretrainedXlnetForSequenceModel extends ParamsAndFeaturesReadable[XlnetForSequenceClassification] with HasPretrained[XlnetForSequenceClassification]
  81. trait ReadablePretrainedXlnetForTokenModel extends ParamsAndFeaturesReadable[XlnetForTokenClassification] with HasPretrained[XlnetForTokenClassification]
  82. class RoBertaForQuestionAnswering extends AnnotatorModel[RoBertaForQuestionAnswering] with HasBatchedAnnotate[RoBertaForQuestionAnswering] with WriteTensorflowModel with HasCaseSensitiveProperties with HasEngine

    RoBertaForQuestionAnswering can load RoBERTa Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits).

    Pretrained models can be loaded with pretrained of the companion object:

    val spanClassifier = RoBertaForQuestionAnswering.pretrained()
      .setInputCols(Array("document_question", "document_context"))
      .setOutputCol("answer")

    The default model is "roberta_base_qa_squad2", if no name is provided.

    For available pretrained models please see the Models Hub.

    To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see RoBertaForQuestionAnsweringTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val document = new MultiDocumentAssembler()
      .setInputCols("question", "context")
      .setOutputCols("document_question", "document_context")
    
    val questionAnswering = RoBertaForQuestionAnswering.pretrained()
      .setInputCols(Array("document_question", "document_context"))
      .setOutputCol("answer")
      .setCaseSensitive(true)
    
    val pipeline = new Pipeline().setStages(Array(
      document,
      questionAnswering
    ))
    
    val data = Seq("What's my name?", "My name is Clara and I live in Berkeley.").toDF("question", "context")
    val result = pipeline.fit(data).transform(data)
    
    result.select("answer.result").show(false)
    +---------------------+
    |result               |
    +---------------------+
    |[Clara]              |
    +---------------------+
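
    The span classification head described above yields one start logit and one end logit per token. The following is a minimal sketch in plain Scala (no Spark NLP dependency) of how an answer span could be chosen from such logits. The bestSpan helper, the maxAnswerLen parameter, and the logit values are hypothetical and only illustrate the idea; this is not the library's actual implementation.

```scala
object SpanSelectionSketch {
  // Hypothetical helper: returns the (start, end) token indices that maximize
  // startLogits(start) + endLogits(end), subject to start <= end and a span
  // of at most maxAnswerLen tokens.
  def bestSpan(
      startLogits: Array[Float],
      endLogits: Array[Float],
      maxAnswerLen: Int): (Int, Int) = {
    require(startLogits.nonEmpty && startLogits.length == endLogits.length)
    require(maxAnswerLen >= 1)
    val candidates = for {
      s <- startLogits.indices
      e <- s until math.min(s + maxAnswerLen, endLogits.length)
    } yield (s, e, startLogits(s) + endLogits(e))
    // Keep the highest-scoring candidate (the first one wins on ties).
    val best = candidates.reduceLeft((a, b) => if (b._3 > a._3) b else a)
    (best._1, best._2)
  }

  def main(args: Array[String]): Unit = {
    // Made-up logits for illustration only.
    val tokens = Array("My", "name", "is", "Clara", "and", "I", "live", "in", "Berkeley", ".")
    val startLogits = Array(0.1f, 0.2f, 0.1f, 5.0f, 0.0f, 0.1f, 0.2f, 0.1f, 1.5f, 0.0f)
    val endLogits = Array(0.0f, 0.1f, 0.2f, 4.5f, 0.1f, 0.0f, 0.1f, 0.2f, 2.0f, 0.1f)
    val (s, e) = bestSpan(startLogits, endLogits, maxAnswerLen = 4)
    println(tokens.slice(s, e + 1).mkString(" ")) // prints "Clara"
  }
}
```
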
    See also

    RoBertaForSequenceClassification for sequence-level classification

    Annotators Main Page for a list of transformer based classifiers

  83. class RoBertaForSequenceClassification extends AnnotatorModel[RoBertaForSequenceClassification] with HasBatchedAnnotate[RoBertaForSequenceClassification] with WriteTensorflowModel with HasCaseSensitiveProperties with HasClassifierActivationProperties with HasEngine

    RoBertaForSequenceClassification can load RoBERTa Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks.

    Pretrained models can be loaded with pretrained of the companion object:

    val sequenceClassifier = RoBertaForSequenceClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")

    The default model is "roberta_base_sequence_classifier_imdb", if no name is provided.

    For available pretrained models please see the Models Hub.

    To see which models are compatible and how to import them, see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669. For more extended examples, see RoBertaForSequenceClassificationTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val sequenceClassifier = RoBertaForSequenceClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")
      .setCaseSensitive(true)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      sequenceClassifier
    ))
    
    val data = Seq("I loved this movie when I was a child.", "It was pretty boring.").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("label.result").show(false)
    +------+
    |result|
    +------+
    |[pos] |
    |[neg] |
    +------+
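
    The classification head described above maps the pooled output to one logit per label. The following is a minimal sketch in plain Scala (no Spark NLP dependency) of turning such logits into a predicted label with a softmax. The softmax and predict helpers, the label names, and the logit values are hypothetical and only illustrate the idea; the annotator's own activation handling may differ.

```scala
object SoftmaxLabelSketch {
  // Numerically stable softmax: subtract the maximum logit before exponentiating.
  def softmax(logits: Array[Double]): Array[Double] = {
    val maxLogit = logits.reduceLeft((a, b) => math.max(a, b))
    val exps = logits.map(x => math.exp(x - maxLogit))
    val total = exps.sum
    exps.map(_ / total)
  }

  // Hypothetical helper: returns the most probable label and its probability.
  def predict(labels: Array[String], logits: Array[Double]): (String, Double) = {
    require(labels.nonEmpty && labels.length == logits.length)
    val probs = softmax(logits)
    val best = probs.indices.reduceLeft((a, b) => if (probs(b) > probs(a)) b else a)
    (labels(best), probs(best))
  }

  def main(args: Array[String]): Unit = {
    // Made-up logits for illustration only.
    val (label, _) = predict(Array("neg", "pos"), Array(-1.2, 2.3))
    println(label) // prints "pos"
  }
}
```
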
    See also

    RoBertaForSequenceClassification for sequence-level classification

    Annotators Main Page for a list of transformer based classifiers

  84. class RoBertaForTokenClassification extends AnnotatorModel[RoBertaForTokenClassification] with HasBatchedAnnotate[RoBertaForTokenClassification] with WriteTensorflowModel with HasCaseSensitiveProperties with HasEngine

    RoBertaForTokenClassification can load RoBERTa Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.

    Pretrained models can be loaded with pretrained of the companion object:

    val tokenClassifier = RoBertaForTokenClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")

    The default model is "roberta_base_token_classifier_conll03", if no name is provided.

    For available pretrained models please see the Models Hub.

    For extended examples of usage, see the RoBertaForTokenClassificationTestSpec. To see which models are compatible and how to import them, see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val tokenClassifier = RoBertaForTokenClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")
      .setCaseSensitive(true)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      tokenClassifier
    ))
    
    val data = Seq("John Lennon was born in London and lived in Paris. My name is Sarah and I live in London").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("label.result").show(false)
    +------------------------------------------------------------------------------------+
    |result                                                                              |
    +------------------------------------------------------------------------------------+
    |[B-PER, I-PER, O, O, O, B-LOC, O, O, O, B-LOC, O, O, O, O, B-PER, O, O, O, O, B-LOC]|
    +------------------------------------------------------------------------------------+
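
    The output above carries one B-/I-/O tag per token. The following is a minimal sketch in plain Scala (no Spark NLP dependency) of merging such tags into whole entities. The Entity case class and the groupEntities helper are hypothetical names introduced for illustration and are not part of the library.

```scala
object BioGroupingSketch {
  // Hypothetical container for a merged entity.
  final case class Entity(label: String, text: String)

  // Merges token-level BIO tags into entities. An "I-" tag continues the open
  // entity only when its label matches; a stray "I-" tag is treated leniently
  // as the start of a new entity.
  def groupEntities(tokens: Seq[String], tags: Seq[String]): List[Entity] = {
    require(tokens.length == tags.length, "tokens and tags must align")
    val done = scala.collection.mutable.ListBuffer.empty[Entity]
    var open: Option[(String, Vector[String])] = None

    // Closes the currently open entity, if any.
    def flush(): Unit = {
      open.foreach { case (label, words) => done += Entity(label, words.mkString(" ")) }
      open = None
    }

    tokens.zip(tags).foreach { case (token, tag) =>
      open match {
        case Some((label, words)) if tag == s"I-$label" =>
          open = Some((label, words :+ token))
        case _ =>
          flush()
          if (tag.startsWith("B-") || tag.startsWith("I-"))
            open = Some((tag.drop(2), Vector(token)))
      }
    }
    flush()
    done.toList
  }

  def main(args: Array[String]): Unit = {
    val tokens = Seq("John", "Lennon", "was", "born", "in", "London")
    val tags = Seq("B-PER", "I-PER", "O", "O", "O", "B-LOC")
    // Prints one entity per line: PER "John Lennon", then LOC "London".
    groupEntities(tokens, tags).foreach(e => println(s"${e.label}: ${e.text}"))
  }
}
```
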
    See also

    RoBertaForTokenClassification for token-level classification

    Annotators Main Page for a list of transformer based classifiers

  85. class SentimentDLApproach extends AnnotatorApproach[SentimentDLModel] with ParamsAndFeaturesWritable with ClassifierEncoder

    Trains a SentimentDL, an annotator for multi-class sentiment analysis.

    In natural language processing, sentiment analysis is the task of classifying the affective state or subjective view of a text. A common example is deciding whether a product review or a tweet should be interpreted as positive or negative.

    For the instantiated/pretrained models, see SentimentDLModel.

    Notes:

    Setting a test dataset to monitor model metrics can be done with .setTestDataset. The method expects a path to a parquet file containing a dataframe that has the same required columns as the training dataframe. The pre-processing steps for the training dataframe should also be applied to the test dataframe. The following example will show how to create the test dataset:

    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val embeddings = UniversalSentenceEncoder.pretrained()
      .setInputCols("document")
      .setOutputCol("sentence_embeddings")
    
    val preProcessingPipeline = new Pipeline().setStages(Array(documentAssembler, embeddings))
    
    val Array(train, test) = data.randomSplit(Array(0.8, 0.2))
    preProcessingPipeline
      .fit(test)
      .transform(test)
      .write
      .mode("overwrite")
      .parquet("test_data")
    
    val classifier = new SentimentDLApproach()
      .setInputCols("sentence_embeddings")
      .setOutputCol("sentiment")
      .setLabelColumn("label")
      .setTestDataset("test_data")

    For extended examples of usage, see the Spark NLP Workshop and the SentimentDLTestSpec.

    Example

    In this example, sentiment.csv is in the form

    text,label
    This movie is the best movie I have watched ever! In my opinion this movie can win an award.,0
    This was a terrible movie! The acting was bad really bad!,1

    The model can then be trained with

    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.UniversalSentenceEncoder
    import com.johnsnowlabs.nlp.annotators.classifier.dl.{SentimentDLApproach, SentimentDLModel}
    import org.apache.spark.ml.Pipeline
    
    val smallCorpus = spark.read.option("header", "true").csv("src/test/resources/classifier/sentiment.csv")
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val useEmbeddings = UniversalSentenceEncoder.pretrained()
      .setInputCols("document")
      .setOutputCol("sentence_embeddings")
    
    val docClassifier = new SentimentDLApproach()
      .setInputCols("sentence_embeddings")
      .setOutputCol("sentiment")
      .setLabelColumn("label")
      .setBatchSize(32)
      .setMaxEpochs(1)
      .setLr(5e-3f)
      .setDropout(0.5f)
    
    val pipeline = new Pipeline()
      .setStages(
        Array(
          documentAssembler,
          useEmbeddings,
          docClassifier
        )
      )
    
    val pipelineModel = pipeline.fit(smallCorpus)
    See also

    ClassifierDLApproach for general single-class classification

    MultiClassifierDLApproach for general multi-class classification

  86. class SentimentDLModel extends AnnotatorModel[SentimentDLModel] with HasSimpleAnnotate[SentimentDLModel] with WriteTensorflowModel with HasStorageRef with ParamsAndFeaturesWritable with HasEngine

    SentimentDL, an annotator for multi-class sentiment analysis.

    In natural language processing, sentiment analysis is the task of classifying the affective state or subjective view of a text. A common example is deciding whether a product review or a tweet should be interpreted as positive or negative.

    This is the instantiated model of the SentimentDLApproach. For training your own model, please see the documentation of that class.

    Pretrained models can be loaded with pretrained of the companion object:

    val sentiment = SentimentDLModel.pretrained()
      .setInputCols("sentence_embeddings")
      .setOutputCol("sentiment")

    The default model is "sentimentdl_use_imdb", if no name is provided. It is an English sentiment analysis model trained on the IMDB dataset. For available pretrained models please see the Models Hub.

    For extended examples of usage, see the Spark NLP Workshop and the SentimentDLTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base.DocumentAssembler
    import com.johnsnowlabs.nlp.annotator.UniversalSentenceEncoder
    import com.johnsnowlabs.nlp.annotators.classifier.dl.SentimentDLModel
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val useEmbeddings = UniversalSentenceEncoder.pretrained()
      .setInputCols("document")
      .setOutputCol("sentence_embeddings")
    
    val sentiment = SentimentDLModel.pretrained("sentimentdl_use_twitter")
      .setInputCols("sentence_embeddings")
      .setThreshold(0.7F)
      .setOutputCol("sentiment")
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      useEmbeddings,
      sentiment
    ))
    
    val data = Seq(
      "Wow, the new video is awesome!",
      "bruh what a damn waste of time"
    ).toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("text", "sentiment.result").show(false)
    +------------------------------+----------+
    |text                          |result    |
    +------------------------------+----------+
    |Wow, the new video is awesome!|[positive]|
    |bruh what a damn waste of time|[negative]|
    +------------------------------+----------+
    See also

    ClassifierDLModel for general single-class classification

    MultiClassifierDLModel for general multi-class classification

  87. class TapasForQuestionAnswering extends BertForQuestionAnswering

    TapasForQuestionAnswering is an implementation of TaPas, a BERT-based model specifically designed for answering questions about tabular data. It takes TABLE and DOCUMENT annotations as input and tries to answer the questions in the document by using the data from the table. The model is based on BertForQuestionAnswering and shares all its parameters with it.

    Pretrained models can be loaded with pretrained of the companion object:

    val tapas = TapasForQuestionAnswering.pretrained()
      .setInputCols(Array("document_question", "table"))
      .setOutputCol("answer")

    The default model is "table_qa_tapas_base_finetuned_wtq", if no name is provided.

    For available pretrained models please see the Models Hub.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val questions =
      """
        |Who earns 100,000,000?
        |Who has more money?
        |How much they all earn?
        |How old are they?
        |""".stripMargin.trim
    
    val jsonData =
      """
        |{
        | "header": ["name", "money", "age"],
        | "rows": [
        |   ["Donald Trump", "$100,000,000", "75"],
        |   ["Elon Musk", "$20,000,000,000,000", "55"]
        | ]
        |}
        |""".stripMargin.trim
    
    val data = Seq((jsonData, questions))
      .toDF("json_table", "questions")
      .repartition(1)
    
    val docAssembler = new MultiDocumentAssembler()
      .setInputCols("json_table", "questions")
      .setOutputCols("document_table", "document_questions")
    
    val sentenceDetector = SentenceDetectorDLModel
      .pretrained()
      .setInputCols(Array("document_questions"))
      .setOutputCol("question")
    
    val tableAssembler = new TableAssembler()
      .setInputFormat("json")
      .setInputCols(Array("document_table"))
      .setOutputCol("table")
    
    val tapas = TapasForQuestionAnswering
      .pretrained()
      .setInputCols(Array("question", "table"))
      .setOutputCol("answer")
    
    val pipeline = new Pipeline()
      .setStages(
        Array(
          docAssembler,
          sentenceDetector,
          tableAssembler,
          tapas))
    
    val pipelineModel = pipeline.fit(data)
    val result = pipelineModel.transform(data)
    
    result
      .selectExpr("explode(answer) as answer")
      .selectExpr(
        "answer.metadata.question",
        "answer.result")
      .show(false)
    
    +-----------------------+----------------------------------------+
    |question               |result                                  |
    +-----------------------+----------------------------------------+
    |Who earns 100,000,000? |Donald Trump                            |
    |Who has more money?    |Elon Musk                               |
    |How much they all earn?|COUNT($100,000,000, $20,000,000,000,000)|
    |How old are they?      |AVERAGE(75, 55)                         |
    +-----------------------+----------------------------------------+
    See also

    https://aclanthology.org/2020.acl-main.398/ for more details about the TaPas model

    TableAssembler for loading tabular data

    Annotators Main Page for a list of transformer based classifiers

  88. class XlmRoBertaForQuestionAnswering extends AnnotatorModel[XlmRoBertaForQuestionAnswering] with HasBatchedAnnotate[XlmRoBertaForQuestionAnswering] with WriteTensorflowModel with WriteSentencePieceModel with HasCaseSensitiveProperties with HasEngine

    XlmRoBertaForQuestionAnswering can load XLM-RoBERTa Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits).

    Pretrained models can be loaded with pretrained of the companion object:

    val spanClassifier = XlmRoBertaForQuestionAnswering.pretrained()
      .setInputCols(Array("document_question", "document_context"))
      .setOutputCol("answer")

    The default model is "xlm_roberta_base_qa_squad2", if no name is provided.

    For available pretrained models please see the Models Hub.

    To see which models are compatible and how to import them see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669 and to see more extended examples, see XlmRoBertaForQuestionAnsweringTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val document = new MultiDocumentAssembler()
      .setInputCols("question", "context")
      .setOutputCols("document_question", "document_context")
    
    val questionAnswering = XlmRoBertaForQuestionAnswering.pretrained()
      .setInputCols(Array("document_question", "document_context"))
      .setOutputCol("answer")
      .setCaseSensitive(true)
    
    val pipeline = new Pipeline().setStages(Array(
      document,
      questionAnswering
    ))
    
    val data = Seq("What's my name?", "My name is Clara and I live in Berkeley.").toDF("question", "context")
    val result = pipeline.fit(data).transform(data)
    
    result.select("answer.result").show(false)
    +---------------------+
    |result               |
    +---------------------+
    |[Clara]              |
    +---------------------+
    See also

    XlmRoBertaForSequenceClassification for sequence-level classification

    Annotators Main Page for a list of transformer based classifiers

  89. class XlmRoBertaForSequenceClassification extends AnnotatorModel[XlmRoBertaForSequenceClassification] with HasBatchedAnnotate[XlmRoBertaForSequenceClassification] with WriteTensorflowModel with WriteSentencePieceModel with HasCaseSensitiveProperties with HasClassifierActivationProperties with HasEngine

    XlmRoBertaForSequenceClassification can load XLM-RoBERTa Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks.

    Pretrained models can be loaded with pretrained of the companion object:

    val sequenceClassifier = XlmRoBertaForSequenceClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")

    The default model is "xlm_roberta_base_sequence_classifier_imdb", if no name is provided.

    For available pretrained models please see the Models Hub.

    To see which models are compatible and how to import them, see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669. For more extended examples, see XlmRoBertaForSequenceClassificationTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val sequenceClassifier = XlmRoBertaForSequenceClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")
      .setCaseSensitive(true)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      sequenceClassifier
    ))
    
    val data = Seq("I loved this movie when I was a child.", "It was pretty boring.").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("label.result").show(false)
    +------+
    |result|
    +------+
    |[pos] |
    |[neg] |
    +------+
    See also

    XlmRoBertaForSequenceClassification for sequence-level classification

    Annotators Main Page for a list of transformer based classifiers

  90. class XlmRoBertaForTokenClassification extends AnnotatorModel[XlmRoBertaForTokenClassification] with HasBatchedAnnotate[XlmRoBertaForTokenClassification] with WriteTensorflowModel with WriteSentencePieceModel with HasCaseSensitiveProperties with HasEngine

    XlmRoBertaForTokenClassification can load XLM-RoBERTa Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.

    Pretrained models can be loaded with pretrained of the companion object:

    val tokenClassifier = XlmRoBertaForTokenClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")

    The default model is "xlm_roberta_base_token_classifier_conll03", if no name is provided.

    For available pretrained models please see the Models Hub.

    For extended examples of usage, see the XlmRoBertaForTokenClassificationTestSpec. To see which models are compatible and how to import them, see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val tokenClassifier = XlmRoBertaForTokenClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")
      .setCaseSensitive(true)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      tokenClassifier
    ))
    
    val data = Seq("John Lennon was born in London and lived in Paris. My name is Sarah and I live in London").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("label.result").show(false)
    +------------------------------------------------------------------------------------+
    |result                                                                              |
    +------------------------------------------------------------------------------------+
    |[B-PER, I-PER, O, O, O, B-LOC, O, O, O, B-LOC, O, O, O, O, B-PER, O, O, O, O, B-LOC]|
    +------------------------------------------------------------------------------------+
    See also

    XlmRoBertaForTokenClassification for token-level classification

    Annotators Main Page for a list of transformer based classifiers

  91. class XlnetForSequenceClassification extends AnnotatorModel[XlnetForSequenceClassification] with HasBatchedAnnotate[XlnetForSequenceClassification] with WriteTensorflowModel with WriteSentencePieceModel with HasCaseSensitiveProperties with HasClassifierActivationProperties with HasEngine

    XlnetForSequenceClassification can load XLNet Models with sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks.

    Pretrained models can be loaded with pretrained of the companion object:

    val sequenceClassifier = XlnetForSequenceClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")

    The default model is "xlnet_base_sequence_classifier_imdb", if no name is provided.

    For available pretrained models please see the Models Hub.

    To see which models are compatible and how to import them, see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669. For more extended examples, see XlnetForSequenceClassificationTestSpec.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val sequenceClassifier = XlnetForSequenceClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")
      .setCaseSensitive(true)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      sequenceClassifier
    ))
    
    val data = Seq("I loved this movie when I was a child.", "It was pretty boring.").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("label.result").show(false)
    +------+
    |result|
    +------+
    |[pos] |
    |[neg] |
    +------+
    See also

    XlnetForSequenceClassification for sequence-level classification

    Annotators Main Page for a list of transformer based classifiers

  92. class XlnetForTokenClassification extends AnnotatorModel[XlnetForTokenClassification] with HasBatchedAnnotate[XlnetForTokenClassification] with WriteTensorflowModel with WriteSentencePieceModel with HasCaseSensitiveProperties with HasEngine

    XlnetForTokenClassification can load XLNet Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks.

    Pretrained models can be loaded with pretrained of the companion object:

    val tokenClassifier = XlnetForTokenClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")

    The default model is "xlnet_base_token_classifier_conll03", if no name is provided.

    For available pretrained models please see the Models Hub.

    For extended examples of usage, see the XlnetForTokenClassificationTestSpec. To see which models are compatible and how to import them, see https://github.com/JohnSnowLabs/spark-nlp/discussions/5669.

    Example

    import spark.implicits._
    import com.johnsnowlabs.nlp.base._
    import com.johnsnowlabs.nlp.annotator._
    import org.apache.spark.ml.Pipeline
    
    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")
    
    val tokenizer = new Tokenizer()
      .setInputCols("document")
      .setOutputCol("token")
    
    val tokenClassifier = XlnetForTokenClassification.pretrained()
      .setInputCols("token", "document")
      .setOutputCol("label")
      .setCaseSensitive(true)
    
    val pipeline = new Pipeline().setStages(Array(
      documentAssembler,
      tokenizer,
      tokenClassifier
    ))
    
    val data = Seq("John Lennon was born in London and lived in Paris. My name is Sarah and I live in London").toDF("text")
    val result = pipeline.fit(data).transform(data)
    
    result.select("label.result").show(false)
    +------------------------------------------------------------------------------------+
    |result                                                                              |
    +------------------------------------------------------------------------------------+
    |[B-PER, I-PER, O, O, O, B-LOC, O, O, O, B-LOC, O, O, O, O, B-PER, O, O, O, O, B-LOC]|
    +------------------------------------------------------------------------------------+
    See also

    XlnetForTokenClassification for token-level classification

    Annotators Main Page for a list of transformer based classifiers

Value Members

  1. object AlbertForQuestionAnswering extends ReadablePretrainedAlbertForQAModel with ReadAlbertForQATensorflowModel with Serializable

    This is the companion object of AlbertForQuestionAnswering. Please refer to that class for the documentation.

  2. object AlbertForSequenceClassification extends ReadablePretrainedAlbertForSequenceModel with ReadAlbertForSequenceTensorflowModel with Serializable

    This is the companion object of AlbertForSequenceClassification. Please refer to that class for the documentation.

  3. object AlbertForTokenClassification extends ReadablePretrainedAlbertForTokenModel with ReadAlbertForTokenTensorflowModel with Serializable

    This is the companion object of AlbertForTokenClassification. Please refer to that class for the documentation.

  4. object BertForQuestionAnswering extends ReadablePretrainedBertForQAModel with ReadBertForQATensorflowModel with Serializable

    This is the companion object of BertForQuestionAnswering. Please refer to that class for the documentation.

  5. object BertForSequenceClassification extends ReadablePretrainedBertForSequenceModel with ReadBertForSequenceTensorflowModel with Serializable

    This is the companion object of BertForSequenceClassification. Please refer to that class for the documentation.

  6. object BertForTokenClassification extends ReadablePretrainedBertForTokenModel with ReadBertForTokenTensorflowModel with Serializable

    This is the companion object of BertForTokenClassification. Please refer to that class for the documentation.

  7. object CamemBertForSequenceClassification extends ReadablePretrainedCamemBertForSequenceModel with ReadCamemBertForSequenceTensorflowModel with Serializable

    This is the companion object of CamemBertForSequenceClassification. Please refer to that class for the documentation.

  8. object CamemBertForTokenClassification extends ReadablePretrainedCamemBertForTokenModel with ReadCamemBertForTokenTensorflowModel with Serializable

    This is the companion object of CamemBertForTokenClassification. Please refer to that class for the documentation.

  9. object ClassifierDLApproach extends DefaultParamsReadable[ClassifierDLApproach] with Serializable

    This is the companion object of ClassifierDLApproach. Please refer to that class for the documentation.

  10. object ClassifierDLModel extends ReadablePretrainedClassifierDL with ReadClassifierDLTensorflowModel with Serializable

    This is the companion object of ClassifierDLModel. Please refer to that class for the documentation.

  11. object DeBertaForQuestionAnswering extends ReadablePretrainedDeBertaForQAModel with ReadDeBertaForQATensorflowModel with Serializable

    This is the companion object of DeBertaForQuestionAnswering. Please refer to that class for the documentation.

  12. object DeBertaForSequenceClassification extends ReadablePretrainedDeBertaForSequenceModel with ReadDeBertaForSequenceTensorflowModel with Serializable

    This is the companion object of DeBertaForSequenceClassification. Please refer to that class for the documentation.

  13. object DeBertaForTokenClassification extends ReadablePretrainedDeBertaForTokenModel with ReadDeBertaForTokenTensorflowModel with Serializable

    This is the companion object of DeBertaForTokenClassification. Please refer to that class for the documentation.

  14. object DistilBertForQuestionAnswering extends ReadablePretrainedDistilBertForQAModel with ReadDistilBertForQATensorflowModel with Serializable

    This is the companion object of DistilBertForQuestionAnswering. Please refer to that class for the documentation.

  15. object DistilBertForSequenceClassification extends ReadablePretrainedDistilBertForSequenceModel with ReadDistilBertForSequenceTensorflowModel with Serializable

    This is the companion object of DistilBertForSequenceClassification. Please refer to that class for the documentation.

  16. object DistilBertForTokenClassification extends ReadablePretrainedDistilBertForTokenModel with ReadDistilBertForTokenTensorflowModel with Serializable

    This is the companion object of DistilBertForTokenClassification. Please refer to that class for the documentation.

  17. object LongformerForQuestionAnswering extends ReadablePretrainedLongformerForQAModel with ReadLongformerForQATensorflowModel with Serializable

    This is the companion object of LongformerForQuestionAnswering. Please refer to that class for the documentation.

  18. object LongformerForSequenceClassification extends ReadablePretrainedLongformerForSequenceModel with ReadLongformerForSequenceTensorflowModel with Serializable

    This is the companion object of LongformerForSequenceClassification. Please refer to that class for the documentation.

  19. object LongformerForTokenClassification extends ReadablePretrainedLongformerForTokenModel with ReadLongformerForTokenTensorflowModel with Serializable

    This is the companion object of LongformerForTokenClassification. Please refer to that class for the documentation.

  20. object MultiClassifierDLModel extends ReadablePretrainedMultiClassifierDL with ReadMultiClassifierDLTensorflowModel with Serializable

    This is the companion object of MultiClassifierDLModel. Please refer to that class for the documentation.

  21. object RoBertaForQuestionAnswering extends ReadablePretrainedRoBertaForQAModel with ReadRoBertaForQATensorflowModel with Serializable

    This is the companion object of RoBertaForQuestionAnswering. Please refer to that class for the documentation.

  22. object RoBertaForSequenceClassification extends ReadablePretrainedRoBertaForSequenceModel with ReadRoBertaForSequenceTensorflowModel with Serializable

    This is the companion object of RoBertaForSequenceClassification. Please refer to that class for the documentation.

  23. object RoBertaForTokenClassification extends ReadablePretrainedRoBertaForTokenModel with ReadRoBertaForTokenTensorflowModel with Serializable

    This is the companion object of RoBertaForTokenClassification. Please refer to that class for the documentation.

  24. object SentimentApproach extends DefaultParamsReadable[SentimentDLApproach]

    This is the companion object of SentimentApproach. Please refer to that class for the documentation.

  25. object SentimentDLModel extends ReadablePretrainedSentimentDL with ReadSentimentDLTensorflowModel with Serializable

    This is the companion object of SentimentDLModel. Please refer to that class for the documentation.

  26. object TapasForQuestionAnswering extends ReadablePretrainedTapasForQAModel with ReadTapasForQATensorflowModel with Serializable

    This is the companion object of TapasForQuestionAnswering. Please refer to that class for the documentation.

  27. object XlmRoBertaForQuestionAnswering extends ReadablePretrainedXlmRoBertaForQAModel with ReadXlmRoBertaForQATensorflowModel with Serializable

    This is the companion object of XlmRoBertaForQuestionAnswering. Please refer to that class for the documentation.

  28. object XlmRoBertaForSequenceClassification extends ReadablePretrainedXlmRoBertaForSequenceModel with ReadXlmRoBertaForSequenceTensorflowModel with Serializable

    This is the companion object of XlmRoBertaForSequenceClassification. Please refer to that class for the documentation.

  29. object XlmRoBertaForTokenClassification extends ReadablePretrainedXlmRoBertaForTokenModel with ReadXlmRoBertaForTokenTensorflowModel with Serializable

    This is the companion object of XlmRoBertaForTokenClassification. Please refer to that class for the documentation.

  30. object XlnetForSequenceClassification extends ReadablePretrainedXlnetForSequenceModel with ReadXlnetForSequenceTensorflowModel with Serializable

    This is the companion object of XlnetForSequenceClassification. Please refer to that class for the documentation.

  31. object XlnetForTokenClassification extends ReadablePretrainedXlnetForTokenModel with ReadXlnetForTokenTensorflowModel with Serializable

    This is the companion object of XlnetForTokenClassification. Please refer to that class for the documentation.
