sparknlp.annotator.seq2seq.marian_transformer

Contains classes for the MarianTransformer.

Module Contents

Classes

MarianTransformer

MarianTransformer: Fast Neural Machine Translation

class MarianTransformer(classname='com.johnsnowlabs.nlp.annotators.seq2seq.MarianTransformer', java_model=None)

MarianTransformer: Fast Neural Machine Translation

Marian is an efficient, free Neural Machine Translation framework written in pure C++ with minimal dependencies. It is mainly being developed by the Microsoft Translator team. Many academic (most notably the University of Edinburgh and in the past the Adam Mickiewicz University in Poznań) and commercial contributors help with its development. MarianTransformer uses the models trained by MarianNMT.

It is currently the engine behind the Microsoft Translator Neural Machine Translation services and is deployed by many companies, organizations and research projects.

Pretrained models can be loaded with the static method pretrained():

>>> marian = MarianTransformer.pretrained() \
...     .setInputCols(["sentence"]) \
...     .setOutputCol("translation")

If no values are provided, the default model is "opus_mt_en_fr" and the default language is "xx" (multilingual).

For available pretrained models please see the Models Hub.

For extended examples of usage, see the Spark NLP Workshop.

Input Annotation types: DOCUMENT

Output Annotation type: DOCUMENT

Parameters:
batchSize

Size of every batch, by default 1

configProtoBytes

TensorFlow ConfigProto, serialized into a byte array.

langId

Target-language token for multilingual models, in the form “>>id<<” where id is a valid target language ID, by default “”

maxInputLength

Controls the maximum length for encoder inputs (source language texts), by default 40

maxOutputLength

Controls the maximum length for decoder outputs (target language texts), by default 40
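To make batchSize concrete, the sketch below groups sentences into batches of a fixed size. It is a plain-Python illustration only: make_batches is a hypothetical helper, not part of Spark NLP, and the annotator's internal batching may differ.

```python
from typing import Iterator, List


def make_batches(sentences: List[str], batch_size: int = 1) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `batch_size` sentences."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(sentences), batch_size):
        yield sentences[start:start + batch_size]


sentences = [
    "What is the capital of France?",
    "We should know this in french.",
    "Thank you.",
]
batches = list(make_batches(sentences, batch_size=2))
# Three sentences with batch_size=2 give one full batch and one partial batch.
print([len(batch) for batch in batches])  # [2, 1]
```

Larger batches typically trade memory for throughput, which is one reason the default of 1 is conservative.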

Notes

This module is computationally expensive, especially on longer sequences. Using an accelerator such as a GPU is recommended.

References

MarianNMT at GitHub

Marian: Fast Neural Machine Translation in C++

Paper Abstract:

We present Marian, an efficient and self-contained Neural Machine Translation framework with an integrated automatic differentiation engine based on dynamic computation graphs. Marian is written entirely in C++. We describe the design of the encoder-decoder framework and demonstrate that a research-friendly toolkit can achieve high training and translation speed.

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> spark = sparknlp.start()
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> sentence = SentenceDetectorDLModel.pretrained("sentence_detector_dl", "xx") \
...     .setInputCols("document") \
...     .setOutputCol("sentence")
>>> marian = MarianTransformer.pretrained() \
...     .setInputCols("sentence") \
...     .setOutputCol("translation") \
...     .setMaxInputLength(30)
>>> pipeline = Pipeline() \
...     .setStages([
...       documentAssembler,
...       sentence,
...       marian
...     ])
>>> data = spark.createDataFrame([["What is the capital of France? We should know this in french."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(translation.result) as result").show(truncate=False)
+-------------------------------------+
|result                               |
+-------------------------------------+
|Quelle est la capitale de la France ?|
|On devrait le savoir en français.    |
+-------------------------------------+
name = MarianTransformer
configProtoBytes
langId
maxInputLength
maxOutputLength
ignoreTokenIds
setIgnoreTokenIds(self, value)

Sets the list of token ids to ignore in the decoder’s output.

Parameters:
value : List[int]

The token ids to be filtered out
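As an illustration of what ignoring token ids means, the sketch below removes a set of ids from a decoded sequence. The helper drop_ignored_ids and the example ids are hypothetical, and where exactly the annotator applies this filter is an assumption.

```python
from typing import Iterable, List


def drop_ignored_ids(token_ids: List[int], ignore_ids: Iterable[int]) -> List[int]:
    """Return `token_ids` without any id listed in `ignore_ids`, preserving order."""
    ignored = set(ignore_ids)
    return [token_id for token_id in token_ids if token_id not in ignored]


# Hypothetical ids: 0 stands for an end-of-sequence token and 59513 for padding.
decoded = [8, 131, 0, 59513, 59513]
print(drop_ignored_ids(decoded, ignore_ids=[0, 59513]))  # [8, 131]
```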

setConfigProtoBytes(self, b)

Sets the TensorFlow ConfigProto, serialized into a byte array.

Parameters:
b : List[int]

TensorFlow ConfigProto, serialized into a byte array

setLangId(self, value)

Sets the target-language token for multilingual models, in the form “>>id<<” where id is a valid target language ID, by default “”.

Parameters:
value : str

Target-language token, e.g. “>>fr<<”
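Multilingual OPUS-MT models usually select the target language through a token of the form >>id<< placed at the start of the source text. The sketch below shows that convention in plain Python, assuming langId plays this role. The helper with_target_language and its validation pattern are hypothetical, not part of Spark NLP.

```python
import re

# Assumed token shape: two '>' characters, a language id, two '<' characters.
LANG_TOKEN = re.compile(r"^>>[A-Za-z_]+<<$")


def with_target_language(text: str, lang_id: str = "") -> str:
    """Prefix `text` with a target-language token; return it unchanged if `lang_id` is empty."""
    if not lang_id:
        return text
    if not LANG_TOKEN.match(lang_id):
        raise ValueError(f"expected a token like '>>fr<<', got {lang_id!r}")
    return f"{lang_id} {text}"


print(with_target_language("Good morning.", ">>fr<<"))  # >>fr<< Good morning.
print(with_target_language("Good morning."))            # Good morning.
```

Bilingual models such as the default "opus_mt_en_fr" have a single target language, which is consistent with the empty default.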

setMaxInputLength(self, value)

Sets the maximum length for encoder inputs (source language texts), by default 40.

Parameters:
value : int

The maximum length for encoder inputs (source language texts)
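The effect of an input length limit can be pictured as truncating the encoded source sequence. The sketch below assumes that the limit counts tokens and that the leading tokens are kept. Both points are assumptions about the behavior, and truncate_ids is a hypothetical helper.

```python
from typing import List


def truncate_ids(token_ids: List[int], max_input_length: int = 40) -> List[int]:
    """Keep at most `max_input_length` ids from the start of the sequence."""
    if max_input_length < 1:
        raise ValueError("max_input_length must be a positive integer")
    return token_ids[:max_input_length]


encoded = list(range(50))  # stand-in for 50 encoded source tokens
kept = truncate_ids(encoded, 30)
print(len(kept), kept[-1])  # 30 29
```

Under these assumptions, anything past the limit never reaches the encoder, so long sentences may be translated only in part.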

setMaxOutputLength(self, value)

Sets the maximum length for decoder outputs (target language texts), by default 40.

Parameters:
value : int

The maximum length for decoder outputs (target language texts)
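An output length limit acts as a stop condition for the decoder. The sketch below is a toy greedy decoding loop that stops at an end-of-sequence id or at the cap, whichever comes first. It illustrates the general technique only: greedy_decode and the toy model are hypothetical, and the annotator's actual decoding strategy is not described here.

```python
from typing import Callable, List


def greedy_decode(
    next_token: Callable[[List[int]], int],
    eos_id: int,
    max_output_length: int = 40,
) -> List[int]:
    """Generate ids until `eos_id` is produced or `max_output_length` ids exist."""
    output: List[int] = []
    while len(output) < max_output_length:
        token_id = next_token(output)
        if token_id == eos_id:
            break
        output.append(token_id)
    return output


def never_stops(generated: List[int]) -> int:
    """Toy model that never emits end-of-sequence, so only the cap stops it."""
    return len(generated) + 1


print(greedy_decode(never_stops, eos_id=0, max_output_length=5))  # [1, 2, 3, 4, 5]
```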

static loadSavedModel(folder, spark_session)

Loads a locally saved model.

Parameters:
folder : str

Folder of the saved model

spark_session : pyspark.sql.SparkSession

The current SparkSession

Returns:
MarianTransformer

The restored model
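A common pattern is to import a locally saved model once and then persist it in Spark NLP's own format for later reuse. The sketch below assumes that the folder contains a compatible exported Marian model and that Spark NLP is installed. The wrapper import_and_save_marian and both paths are hypothetical.

```python
def import_and_save_marian(folder: str, save_path: str):
    """Load an exported Marian model from `folder`, configure it, and persist it to `save_path`."""
    # Imports are deferred so that defining this function does not require Spark NLP.
    import sparknlp
    from sparknlp.annotator import MarianTransformer

    spark = sparknlp.start()  # creates or reuses a SparkSession

    marian = MarianTransformer.loadSavedModel(folder, spark) \
        .setInputCols(["sentence"]) \
        .setOutputCol("translation")

    # Persist the annotator so later runs can skip the import step.
    marian.write().overwrite().save(save_path)
    return MarianTransformer.load(save_path)


# Example call with hypothetical paths (requires Spark NLP and an exported model):
# marian = import_and_save_marian("/path/to/exported_marian", "/path/to/marian_spark_nlp")
```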

static pretrained(name='opus_mt_en_fr', lang='xx', remote_loc=None)

Downloads and loads a pretrained model.

Parameters:
name : str, optional

Name of the pretrained model, by default “opus_mt_en_fr”

lang : str, optional

Language of the pretrained model, by default “xx”

remote_loc : str, optional

Optional remote address of the resource, by default None, in which case Spark NLP’s repositories are used.

Returns:
MarianTransformer

The restored model