Universal Sentence Encoder Multilingual (tfhub_use_multi)

Description

The Universal Sentence Encoder encodes text into high-dimensional vectors that can be used for text classification, semantic similarity, clustering, and other natural language tasks.

The model is trained and optimized for text longer than a single word, such as sentences, phrases, or short paragraphs. It is trained on a variety of data sources and tasks so that it can accommodate a wide range of natural language understanding tasks. The input is variable-length text and the output is a 512-dimensional vector. The universal-sentence-encoder model was trained with a deep averaging network (DAN) encoder.
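Because every input maps to a fixed-length vector, semantic similarity between two texts is commonly scored with cosine similarity. The following is a minimal sketch in plain Python, not part of the Spark NLP API: the function name `cosine_similarity` and the toy 4-dimensional vectors are illustrative assumptions standing in for real 512-dimensional outputs.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors standing in for the model's 512-dimensional outputs.
emb_a = [0.1, 0.3, -0.2, 0.4]
emb_b = [0.2, 0.6, -0.4, 0.8]   # same direction as emb_a (scaled by 2)
emb_c = [-0.3, 0.1, 0.4, 0.2]   # orthogonal to emb_a

print(cosine_similarity(emb_a, emb_b))  # parallel vectors score close to 1
print(cosine_similarity(emb_a, emb_c))  # orthogonal vectors score close to 0
```

In practice the two arguments would be the vectors found in the `sentence_embeddings` output column for the texts being compared.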

This text encoder supports 16 languages: Arabic, Chinese-simplified, Chinese-traditional, English, French, German, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Spanish, Thai, Turkish, and Russian.

The details are described in the paper “Multilingual Universal Sentence Encoder for Semantic Retrieval”.

Note: This model works only on Linux and macOS. It is not compatible with Windows because of an incompatibility in the SentencePiece library.


How to use

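# Python: load the pretrained encoder ("xx" is the language code listed under Model Information)
from sparknlp.annotator import UniversalSentenceEncoder
embeddings = UniversalSentenceEncoder.pretrained("tfhub_use_multi", "xx") \
    .setInputCols("document") \
    .setOutputCol("sentence_embeddings")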

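// Scala: load the pretrained encoder and wire its input and output columns
import com.johnsnowlabs.nlp.embeddings.UniversalSentenceEncoder
val embeddings = UniversalSentenceEncoder.pretrained("tfhub_use_multi", "xx")
  .setInputCols("document")
  .setOutputCol("sentence_embeddings")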

# NLU: load the model by its NLU reference and embed a list of texts
import nlu

text = ["I love NLP", "Me encanta usar SparkNLP"]
embeddings_df = nlu.load('xx.use.multi').predict(text, output_level='sentence')
embeddings_df

Results

The model returns one 512-dimensional vector for each input sentence.

Model Information

Model Name:	tfhub_use_multi
Compatibility:	Spark NLP 3.0.0+
License:	Open Source
Edition:	Official
Input Labels:	[sentence]
Output Labels:	[sentence_embeddings]
Language:	xx

Benchmarking

- We apply this model to the STS benchmark for semantic similarity. 

STSBenchmark                    | dev   | test  |
--------------------------------|-------|-------|
Pearson correlation coefficient | 0.829 | 0.809 |
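For readers unfamiliar with the metric, the Pearson correlation coefficient measures how linearly the model's similarity scores track human similarity ratings. The sketch below is a plain-Python illustration, not the evaluation script behind the table: the function name `pearson` and both score lists are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: human ratings on a 0-5 scale and model similarity scores
# for the same four sentence pairs.
gold_scores = [5.0, 3.0, 1.0, 4.0]
model_scores = [0.9, 0.6, 0.2, 0.7]

r = pearson(gold_scores, model_scores)
print(round(r, 3))
```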

- For semantic similarity retrieval, we evaluate the model on the [Quora and AskUbuntu retrieval tasks](https://arxiv.org/abs/1811.08008). Results are shown below:

Dataset                | Quora | AskUbuntu | Average |
-----------------------|-------|-----------|---------|
Mean Average Precision | 89.2  | 39.9      | 64.6    |
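Mean Average Precision (MAP) averages, over all queries, the precision observed at each rank where a relevant item appears. A minimal sketch follows, assuming each query's ranked results are given as a list of booleans and that every relevant item appears somewhere in that list. The function names are illustrative and this is not the evaluation code used for the table.

```python
def average_precision(ranked_relevance):
    """Average precision for one query.

    ranked_relevance: booleans in ranked order, True where the retrieved
    item is relevant. Assumes all relevant items appear in the list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries):
    """Mean of average_precision over a non-empty list of queries."""
    if not queries:
        raise ValueError("need at least one query")
    return sum(average_precision(q) for q in queries) / len(queries)


# Hypothetical rankings for two queries.
rankings = [
    [True, False, True],   # relevant items at ranks 1 and 3
    [False, True],         # relevant item at rank 2
]
print(mean_average_precision(rankings))
```

The table reports MAP as a percentage, so a value returned by this sketch would be multiplied by 100 for comparison.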

- For translation pair retrieval, we evaluate the model on the United Nations Parallel Corpus. Results are shown below:

Language Pair  | en-es  | en-fr | en-ru | en-zh |
---------------|--------|-------|-------|-------|
Precision@1    | 85.8   | 82.7  | 87.4  | 79.5  |
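Precision@1 for translation pair retrieval is the fraction of source sentences whose top-ranked candidate is the true translation. The sketch below assumes candidates are ranked by dot product between embeddings and that the source and target lists are aligned by index. Both assumptions, the function names, and the toy 2-dimensional vectors are illustrative only.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def precision_at_1(source_vecs, target_vecs):
    """Fraction of sources whose best-scoring target has the same index."""
    if not source_vecs or len(source_vecs) != len(target_vecs):
        raise ValueError("need two non-empty, index-aligned lists")
    correct = 0
    for i, src in enumerate(source_vecs):
        # Index of the target with the highest dot-product score.
        best = max(range(len(target_vecs)), key=lambda j: dot(src, target_vecs[j]))
        if best == i:
            correct += 1
    return correct / len(source_vecs)


# Hypothetical toy embeddings; target_vecs[i] is the translation of source_vecs[i].
source_vecs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
target_vecs = [[0.9, 0.1], [0.1, 0.9], [0.1, 1.0]]

print(precision_at_1(source_vecs, target_vecs))
```

In this toy data the second source sentence retrieves the third target instead of its own translation, so two of the three queries are answered correctly.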
