Universal Sentence Encoder


The Universal Sentence Encoder encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering and other natural language tasks.

The model is trained and optimized for text longer than a single word, such as sentences, phrases, or short paragraphs. It is trained on a variety of data sources and tasks so that a single encoder can serve a wide range of natural language understanding tasks. The input is variable-length English text and the output is a 512-dimensional vector. We apply this model to the STS benchmark for semantic similarity, and the results can be seen in the accompanying example notebook. This universal-sentence-encoder model is trained with a deep averaging network (DAN) encoder.

The details are described in the paper “Universal Sentence Encoder”.
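
Because each text maps to a fixed-length vector, semantic similarity between two texts can be scored by comparing their vectors. The sketch below uses cosine similarity, which is one common choice and an assumption here, not a scoring method stated by this model card. The function name `cosine_similarity` and the 4-dimensional toy vectors are made up for illustration; real outputs of this model have 512 dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_u * norm_v)


# Toy stand-ins for sentence embeddings (hypothetical values, not model output).
emb_a = [0.2, 0.1, 0.4, 0.3]
emb_b = [0.2, 0.1, 0.4, 0.3]
emb_c = [-0.3, 0.4, -0.1, 0.2]

same = cosine_similarity(emb_a, emb_b)    # identical vectors score (near) 1
other = cosine_similarity(emb_a, emb_c)   # a different direction scores lower
```

Scores close to 1 indicate vectors pointing in nearly the same direction. Comparing direction rather than magnitude is the reason for normalizing by the vector lengths.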


How to use

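from sparknlp.annotator import UniversalSentenceEncoder
embeddings = UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
      .setInputCols("document") \
      .setOutputCol("sentence_embeddings")

import com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder
val embeddings = UniversalSentenceEncoder.pretrained("tfhub_use", "en")
      .setInputCols("document").setOutputCol("sentence_embeddings")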
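
The annotator needs a document-type input column, so in practice it usually runs as one stage of a Spark ML pipeline behind a `DocumentAssembler`. The following is a minimal sketch of such a pipeline, not an official recipe. The helper name `build_use_pipeline` and the raw input column name `text` are assumptions made for illustration, and the output column name `sentence_embeddings` follows the Output Labels listed under Model Information.

```python
def build_use_pipeline():
    """Return an unfitted Spark ML Pipeline: raw text -> document -> USE embeddings."""
    # Imports are local so the function can be defined without Spark NLP
    # installed; calling it requires the pyspark and spark-nlp packages.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import UniversalSentenceEncoder

    # Wrap the raw "text" column (assumed name) in a "document" annotation column.
    document_assembler = DocumentAssembler() \
        .setInputCol("text") \
        .setOutputCol("document")

    # Fetches the pretrained model the first time it is requested.
    embeddings = UniversalSentenceEncoder.pretrained("tfhub_use", "en") \
        .setInputCols("document") \
        .setOutputCol("sentence_embeddings")

    return Pipeline(stages=[document_assembler, embeddings])
```

Given a Spark session created with `sparknlp.start()` and a DataFrame `df` with a `text` column, `build_use_pipeline().fit(df).transform(df)` should add a `sentence_embeddings` column, and selecting `sentence_embeddings.embeddings` should expose the 512-dimensional vectors.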

Model Information

Model Name: tfhub_use
Type: embeddings
Compatibility: Spark NLP 2.4.0
License: Open Source
Edition: Official
Input Labels: [sentence]
Output Labels: [sentence_embeddings]
Language: [en]
Dimension: 512
Case sensitive: true

Data Source

The model is imported from https://tfhub.dev/google/universal-sentence-encoder/2