Universal sentence encoder for 100+ languages trained with CMLM (sent_bert_use_cmlm_multi_base)


The universal sentence encoder family of models maps the text into high dimensional vectors that capture sentence-level semantics. Our Multilingual-base model is trained using a conditional masked language model described in [1]. The model is intended to be used for text classification, text clustering, semantic textual similarity, etc. The base model employs a 12 layer BERT transformer architecture.

The model extends the BERT transformer architecture that is why we use it with BertSentenceEmbeddings.

[1] Ziyi Yang, Yinfei Yang, Daniel Cer, Jax Law, Eric Darve. Universal Sentence Representations Learning with Conditional Masked Language Model. November 2020


How to use

embeddings = BertSentenceEmbeddings.pretrained("sent_bert_use_cmlm_multi_base", "xx") \
      .setInputCols("sentence") \
val embeddings = BertSentenceEmbeddings.pretrained("sent_bert_use_cmlm_multi_base", "xx")
import nlu

text = ["I hate cancer", "Antibiotics aren't painkiller"]
embeddings_df = nlu.load('xx.embed_sentence.sent_bert_use_cmlm_multi_base').predict(text, output_level='sentence')

Model Information

Model Name: sent_bert_use_cmlm_multi_base
Compatibility: Spark NLP 3.1.3+
License: Open Source
Edition: Official
Input Labels: [sentence]
Output Labels: [bert]
Language: xx
Case sensitive: true

Data Source



We evaluate this model on XEVAL, translated SentEval sentence representation benchmark. XEVAL will be publicly available soon.

XEVAL	ar	bg	de	....	zh	15 Languages Average
USE-CMLM-Multilingual-Base	80.6	81.2	82.6	....	81.7	81.2
USE-CMLM-Multilingual-Base + BR	82.6	83.0	84.0	....	83.0	82.8