sparknlp.annotator.embeddings.doc2vec#

Contains classes for Doc2Vec.

Module Contents#

Classes#

Doc2VecApproach

Trains a Word2Vec model that creates vector representations of words in a text corpus.

Doc2VecModel

Word2Vec model that creates vector representations of words in a text corpus.

class Doc2VecApproach[source]#

Trains a Word2Vec model that creates vector representations of words in a text corpus.

The algorithm first constructs a vocabulary from the corpus and then learns a vector representation for each word in the vocabulary. These vectors can be used as features in natural language processing and machine learning algorithms.
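The vocabulary step can be illustrated with a minimal pure-Python sketch. It is not the Spark ML code: `build_vocabulary` is a hypothetical helper, and the ordering of indices (by descending frequency, then alphabetically) is an assumption made only so the result is deterministic. The `min_count` argument plays the role of the minCount parameter.

```python
from collections import Counter


def build_vocabulary(tokenized_sentences, min_count=1):
    """Map every token seen at least min_count times to an integer index."""
    counts = Counter(
        token for sentence in tokenized_sentences for token in sentence
    )
    # Assumption: order by descending frequency, then alphabetically,
    # so that the mapping is deterministic.
    kept = sorted(
        (token for token, count in counts.items() if count >= min_count),
        key=lambda token: (-counts[token], token),
    )
    return {token: index for index, token in enumerate(kept)}


corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]
vocab = build_vocabulary(corpus, min_count=2)
print(vocab)  # {'the': 0, 'sat': 1}
```

Tokens below the threshold ("cat", "dog", "end") never receive a vector, which is why a higher minCount shrinks the model.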

We use the Word2Vec implementation from Spark ML. It uses the skip-gram model and a hierarchical softmax method to train the model. The variable names in the implementation match those of the original C implementation.
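To make the skip-gram idea concrete, the following sketch lists the (center, context) pairs that such a model learns from. It is a simplified illustration, not the Spark ML training loop: `skipgram_pairs` is a hypothetical helper, and it keeps the window fixed, whereas the original C implementation samples a smaller window at random for each position.

```python
def skipgram_pairs(tokens, window=5):
    """Return (center, context) pairs for each token and its neighbours
    at most `window` positions to the left or right."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs(["a", "b", "c", "d"], window=1)
print(pairs)
# [('a', 'b'), ('b', 'a'), ('b', 'c'), ('c', 'b'), ('c', 'd'), ('d', 'c')]
```

A larger window yields more pairs per token, so training sees broader context at a higher cost.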

For instantiated/pretrained models, see Doc2VecModel.

For available pretrained models please see the Models Hub.

Input Annotation types: TOKEN

Output Annotation type: SENTENCE_EMBEDDINGS

Parameters:
vectorSize

The dimension of codes after transforming from words (> 0), by default 100

windowSize

The window size (context words from [-window, window]) (> 0), by default 5

numPartitions

Number of partitions for sentences of words (> 0), by default 1

minCount

The minimum number of times a token must appear to be included in the word2vec model’s vocabulary (>= 0), by default 1

maxSentenceLength

Maximum length (in words) of each sentence in the input data. Any sentence longer than this threshold will be divided into chunks of up to this size (> 0), by default 1000

stepSize

Step size (learning rate) to be used for each iteration of optimization (> 0), by default 0.025

maxIter

Maximum number of iterations (>= 0), by default 1

seed

Random seed, by default 44
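The parameters above map onto the setters documented for this class (for example vectorSize onto setVectorSize). The sketch below shows one way to apply a dictionary of overrides through those setters. `apply_params` is a hypothetical helper, and `RecordingAnnotator` is a stand-in used only so the snippet runs without a Spark session; with Spark NLP installed, a `Doc2VecApproach()` instance would be passed instead.

```python
# Defaults as listed in the Parameters section above.
DOC2VEC_DEFAULTS = {
    "vectorSize": 100,
    "windowSize": 5,
    "numPartitions": 1,
    "minCount": 1,
    "maxSentenceLength": 1000,
    "stepSize": 0.025,
    "maxIter": 1,
    "seed": 44,
}


def apply_params(annotator, overrides):
    """Call set<Param>(value) on the annotator for each override."""
    for name, value in overrides.items():
        if name not in DOC2VEC_DEFAULTS:
            raise KeyError(f"unknown Doc2VecApproach parameter: {name}")
        setter = getattr(annotator, "set" + name[0].upper() + name[1:])
        setter(value)
    return annotator


class RecordingAnnotator:
    """Stand-in for Doc2VecApproach that only records setter calls."""

    def __init__(self):
        self.values = dict(DOC2VEC_DEFAULTS)

    def __getattr__(self, attr):
        # Only reached for attributes that are not defined normally.
        if attr.startswith("set") and len(attr) > 3:
            name = attr[3].lower() + attr[4:]

            def setter(value):
                self.values[name] = value
                return self

            return setter
        raise AttributeError(attr)


approach = apply_params(RecordingAnnotator(), {"vectorSize": 300, "minCount": 2})
print(approach.values["vectorSize"], approach.values["windowSize"])  # 300 5
```

Rejecting unknown names early avoids silently ignoring a misspelled parameter.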

References

For the original C implementation, see https://code.google.com/p/word2vec/

For the research paper, see Efficient Estimation of Word Representations in Vector Space and Distributed Representations of Words and Phrases and their Compositionality.

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> embeddings = Doc2VecApproach() \
...     .setInputCols(["token"]) \
...     .setOutputCol("embeddings")
>>> pipeline = Pipeline() \
...     .setStages([
...       documentAssembler,
...       tokenizer,
...       embeddings
...     ])
>>> path = "sherlockholmes.txt"
>>> dataset = spark.read.text(path).toDF("text")
>>> pipelineModel = pipeline.fit(dataset)
vectorSize[source]#
windowSize[source]#
numPartitions[source]#
minCount[source]#
maxSentenceLength[source]#
stepSize[source]#
maxIter[source]#
seed[source]#
setVectorSize(self, vectorSize)[source]#

Sets vector size (default: 100).

setWindowSize(self, windowSize)[source]#

Sets window size (default: 5).

setStepSize(self, stepSize)[source]#

Sets initial learning rate (default: 0.025).

setNumPartitions(self, numPartitions)[source]#

Sets number of partitions (default: 1). Use a small number for accuracy.

setMaxIter(self, numIterations)[source]#

Sets the number of iterations (default: 1), which should be smaller than or equal to the number of partitions.

setSeed(self, seed)[source]#

Sets random seed.

setMinCount(self, minCount)[source]#

Sets minCount, the minimum number of times a token must appear to be included in the word2vec model’s vocabulary (default: 1).

setMaxSentenceLength(self, maxSentenceLength)[source]#

Sets the maximum length (in words) of each sentence in the input data (default: 1000). Any sentence longer than this threshold will be divided into chunks of up to this size (> 0).
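The effect of maxSentenceLength can be sketched in a few lines of plain Python. This is an illustration under the assumption that long sentences are cut into consecutive, non-overlapping chunks; `chunk_sentence` is a hypothetical helper, not part of the Spark NLP API.

```python
def chunk_sentence(tokens, max_sentence_length=1000):
    """Split a token list into consecutive chunks of at most
    max_sentence_length tokens each."""
    if max_sentence_length <= 0:
        raise ValueError("max_sentence_length must be > 0")
    return [
        tokens[start:start + max_sentence_length]
        for start in range(0, len(tokens), max_sentence_length)
    ]


chunks = chunk_sentence(list("abcdefg"), max_sentence_length=3)
print(chunks)  # [['a', 'b', 'c'], ['d', 'e', 'f'], ['g']]
```

Only the last chunk can be shorter than the threshold. Context windows do not cross chunk boundaries in this sketch.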

class Doc2VecModel(classname='com.johnsnowlabs.nlp.embeddings.Doc2VecModel', java_model=None)[source]#

Word2Vec model that creates vector representations of words in a text corpus.

The algorithm first constructs a vocabulary from the corpus and then learns a vector representation for each word in the vocabulary. These vectors can be used as features in natural language processing and machine learning algorithms.

We use the Word2Vec implementation from Spark ML. It uses the skip-gram model and a hierarchical softmax method to train the model. The variable names in the implementation match those of the original C implementation.

This is the instantiated model of the Doc2VecApproach. For training your own model, please see the documentation of that class.

Pretrained models can be loaded with the static method pretrained():

>>> embeddings = Doc2VecModel.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("embeddings")

If no name is provided, the default model is “doc2vec_gigaword_300”.

Input Annotation types: TOKEN

Output Annotation type: SENTENCE_EMBEDDINGS
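One common way to turn per-token word vectors into a single sentence-level vector is to average them. The sketch below illustrates that idea only. Whether Doc2VecModel aggregates in exactly this way, and how it treats out-of-vocabulary tokens, is an assumption here, not something this page states. `average_vectors` and the toy `word_vectors` table are hypothetical.

```python
def average_vectors(tokens, word_vectors, vector_size):
    """Average the vectors of all tokens found in word_vectors.

    Assumption: unknown tokens are skipped, and a sentence with no
    known tokens yields a zero vector.
    """
    total = [0.0] * vector_size
    known = 0
    for token in tokens:
        vector = word_vectors.get(token)
        if vector is None:
            continue  # out-of-vocabulary token
        known += 1
        for k, value in enumerate(vector):
            total[k] += value
    if known == 0:
        return total
    return [component / known for component in total]


word_vectors = {"this": [1.0, 0.0], "is": [0.0, 1.0]}
sentence_vector = average_vectors(["this", "is", "unknown"], word_vectors, 2)
print(sentence_vector)  # [0.5, 0.5]
```

The result always has vectorSize components, regardless of how many tokens the sentence contains.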

Parameters:
vectorSize

The dimension of codes after transforming from words (> 0), by default 100

References

For the original C implementation, see https://code.google.com/p/word2vec/

For the research paper, see Efficient Estimation of Word Representations in Vector Space and Distributed Representations of Words and Phrases and their Compositionality.

Examples

>>> import sparknlp
>>> from sparknlp.base import *
>>> from sparknlp.annotator import *
>>> from pyspark.ml import Pipeline
>>> documentAssembler = DocumentAssembler() \
...     .setInputCol("text") \
...     .setOutputCol("document")
>>> tokenizer = Tokenizer() \
...     .setInputCols(["document"]) \
...     .setOutputCol("token")
>>> embeddings = Doc2VecModel.pretrained() \
...     .setInputCols(["token"]) \
...     .setOutputCol("embeddings")
>>> embeddingsFinisher = EmbeddingsFinisher() \
...     .setInputCols(["embeddings"]) \
...     .setOutputCols("finished_embeddings") \
...     .setOutputAsVector(True)
>>> pipeline = Pipeline().setStages([
...     documentAssembler,
...     tokenizer,
...     embeddings,
...     embeddingsFinisher
... ])
>>> data = spark.createDataFrame([["This is a sentence."]]).toDF("text")
>>> result = pipeline.fit(data).transform(data)
>>> result.selectExpr("explode(finished_embeddings) as result").show(1, 80)
+--------------------------------------------------------------------------------+
|                                                                          result|
+--------------------------------------------------------------------------------+
|[0.06222493574023247,0.011579325422644615,0.009919632226228714,0.109361454844...|
+--------------------------------------------------------------------------------+
name = Doc2VecModel[source]#
vectorSize[source]#
setVectorSize(self, vectorSize)[source]#

Sets vector size (default: 100).

static pretrained(name='doc2vec_gigaword_300', lang='en', remote_loc=None)[source]#

Downloads and loads a pretrained model.

Parameters:
name : str, optional

Name of the pretrained model, by default “doc2vec_gigaword_300”

lang : str, optional

Language of the pretrained model, by default “en”

remote_loc : str, optional

Optional remote address of the resource, by default None. If not set, Spark NLP's repositories are used.

Returns:
Doc2VecModel

The restored model