BERT LaBSE Sentence Embeddings

Description

Language-agnostic BERT sentence embedding model supporting 109 languages.

The language-agnostic BERT sentence embedding (LaBSE) model encodes text into high-dimensional vectors. It is trained and optimized to produce similar representations exclusively for bilingual sentence pairs that are translations of each other. It can therefore be used to mine translations of a sentence from a larger corpus.
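As a rough illustration of translation mining, the sketch below scores candidate sentences against a query by cosine similarity and keeps the best match. The function names `cosine_similarity` and `mine_translation` are hypothetical helpers, and the 3-dimensional vectors are invented stand-ins for real model output; it is a minimal sketch, not the procedure used in the paper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)

def mine_translation(query_vec, candidates):
    """Return (sentence, score) for the candidate closest to query_vec.

    candidates is a list of (sentence, vector) pairs.
    """
    scored = [(s, cosine_similarity(query_vec, v)) for s, v in candidates]
    return max(scored, key=lambda pair: pair[1])

# Toy vectors invented for illustration; real embeddings come from the model.
query = [0.9, 0.1, 0.0]  # stands in for the embedding of "I love NLP"
corpus = [
    ("Muchas gracias", [0.0, 0.2, 0.9]),
    ("Me encanta el PLN", [0.8, 0.2, 0.1]),
]
best_sentence, best_score = mine_translation(query, corpus)
print(best_sentence)  # Me encanta el PLN
```

For a large corpus, an exhaustive scan like this would likely be replaced by an approximate nearest-neighbor index, but the scoring idea stays the same.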

The details are described in the paper “Language-agnostic BERT Sentence Embedding” (July 2020).

How to use

...
embeddings = BertSentenceEmbeddings.pretrained("labse", "xx") \
      .setInputCols("sentence") \
      .setOutputCol("sentence_embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings])
pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
# One row per input text: each inner list holds a single value for the "text" column
result = pipeline_model.transform(spark.createDataFrame([['I love NLP'], ['Many thanks']], ["text"]))

...
val embeddings = BertSentenceEmbeddings.pretrained("labse", "xx")
      .setInputCols("sentence")
      .setOutputCol("sentence_embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, embeddings))
val data = Seq("I love NLP", "Many thanks").toDF("text")
val result = pipeline.fit(data).transform(data)

import nlu

text = ["I love NLP", "Many thanks"]
embeddings_df = nlu.load('xx.embed_sentence.labse').predict(text, output_level='sentence')
embeddings_df

Results

	sentence	xx_embed_sentence_labse_embeddings
0	I love NLP	[-0.060951583087444305, -0.011645414866507053,...
1	Many thanks	[0.002173778833821416, -0.05513454228639603, 0...
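Before comparing vectors like the ones above, it can help to check their length and scale them to unit norm so that a plain dot product equals cosine similarity. The sketch below assumes each embedding has been extracted as a plain Python list; `to_unit_vector` is a hypothetical helper, and the placeholder vector is not real model output.

```python
import math

EXPECTED_DIM = 768  # taken from the Dimension field of this model card

def to_unit_vector(vec, expected_dim=EXPECTED_DIM):
    """Validate the length of one embedding and scale it to unit L2 norm."""
    if len(vec) != expected_dim:
        raise ValueError(f"expected {expected_dim} values, got {len(vec)}")
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero vector")
    return [x / norm for x in vec]

# Placeholder standing in for one embedding row; not real model output.
placeholder = [0.5] * EXPECTED_DIM
unit = to_unit_vector(placeholder)
```

Raising on a wrong length is a deliberate choice here: a dimension mismatch usually means the wrong column or model was used, which is better caught early than silently compared.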

Model Information

Model Name:	labse
Type:	embeddings
Compatibility:	Spark NLP 2.6.0+
License:	Open Source
Edition:	Official
Input Labels:	[sentence]
Output Labels:	[sentence_embeddings]
Language:	[xx]
Dimension:	768
Case sensitive:	true

Data Source

The model is imported from https://tfhub.dev/google/LaBSE/1
