BERT LaBSE

Description

Language-agnostic BERT sentence embedding model supporting 109 languages:

The Language-agnostic BERT Sentence Embedding model encodes text into high-dimensional vectors. The model is trained and optimized to produce similar representations exclusively for bilingual sentence pairs that are translations of each other, so it can be used for mining translations of a sentence in a larger corpus.

The details are described in the paper "Language-agnostic BERT Sentence Embedding" (July 2020).

Download

How to use


Python:

from sparknlp.annotator import BertSentenceEmbeddings

embeddings = BertSentenceEmbeddings.pretrained("labse", "xx") \
      .setInputCols("sentence") \
      .setOutputCol("sentence_embeddings")

Scala:

import com.johnsnowlabs.nlp.embeddings.BertSentenceEmbeddings

val embeddings = BertSentenceEmbeddings.pretrained("labse", "xx")
      .setInputCols("sentence")
      .setOutputCol("sentence_embeddings")
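Once the sentence_embeddings column has been collected, the translation-mining use case described above reduces to nearest-neighbour search under cosine similarity. The following is a minimal, Spark-free sketch of that step: the toy 4-dimensional vectors and the helper names (cosine, mine_translation) are invented for illustration and stand in for the real 768-dimensional LaBSE embeddings.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def mine_translation(query_vec, corpus_vecs):
    # Return the index of the corpus sentence whose embedding is
    # closest to the query under cosine similarity.
    return max(range(len(corpus_vecs)),
               key=lambda i: cosine(query_vec, corpus_vecs[i]))

# Toy embeddings (real LaBSE vectors have 768 dimensions).
query = [0.9, 0.1, 0.0, 0.4]
corpus = [
    [0.1, 0.8, 0.3, 0.0],  # unrelated sentence
    [0.8, 0.2, 0.1, 0.5],  # translation candidate
    [0.0, 0.1, 0.9, 0.2],  # unrelated sentence
]
best = mine_translation(query, corpus)
print(best)  # prints 1, the index of the closest candidate
```

Because LaBSE maps translations in different languages to nearby points in the same vector space, the same similarity search works across any of the 109 supported languages.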

Model Information

Model Name: labse
Type: embeddings
Compatibility: Spark NLP 2.6.0+
License: Open Source
Edition: Official
Input Labels: [sentence]
Output Labels: [sentence_embeddings]
Language: [xx]
Dimension: 768
Case sensitive: true

Data Source

The model is imported from TensorFlow Hub: https://tfhub.dev/google/LaBSE/1