Universal sentence encoder for English trained with CMLM (sent_bert_use_cmlm_en_base)

Description

Universal sentence encoder for English trained with a conditional masked language model (CMLM). The universal sentence encoder family of models maps text into high-dimensional vectors that capture sentence-level semantics. Our English-base (en-base) model is trained using the conditional masked language model described in [1]. The model is intended for text classification, text clustering, semantic textual similarity, and similar tasks, and it can also serve as modularized input for multimodal tasks with text as a feature. The base model employs a 12-layer BERT transformer architecture.

Because the model is built on the BERT transformer architecture, it is used in Spark NLP through the BertSentenceEmbeddings annotator.

[1] Ziyi Yang, Yinfei Yang, Daniel Cer, Jax Law, Eric Darve. Universal Sentence Representations Learning with Conditional Masked Language Model. November 2020.
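
Since the vectors capture sentence-level semantics, semantic textual similarity is usually scored as the cosine similarity between two embedding vectors. A minimal NumPy sketch; the 768-dimensional vectors here are placeholders for the model's actual output:

# Cosine similarity between two sentence embedding vectors.
# The vectors are illustrative placeholders; in practice they come from
# the model's "sentence_embeddings" output (768-dimensional for the base model).
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = np.random.rand(768)  # placeholder for the embedding of sentence 1
v2 = np.random.rand(768)  # placeholder for the embedding of sentence 2
print(cosine_similarity(v1, v2))  # closer to 1.0 means more similar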


How to use

Python
from sparknlp.annotator import BertSentenceEmbeddings

embeddings = BertSentenceEmbeddings.pretrained("sent_bert_use_cmlm_en_base", "en") \
    .setInputCols("sentence") \
    .setOutputCol("sentence_embeddings")

Scala
import com.johnsnowlabs.nlp.embeddings.BertSentenceEmbeddings

val embeddings = BertSentenceEmbeddings.pretrained("sent_bert_use_cmlm_en_base", "en")
  .setInputCols("sentence")
  .setOutputCol("sentence_embeddings")

NLU
import nlu

text = ["I hate cancer", "Antibiotics aren't painkillers"]
embeddings_df = nlu.load('en.embed_sentence.sent_bert_use_cmlm_en_base').predict(text, output_level='sentence')
embeddings_df
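
In a complete pipeline, the embeddings stage is typically preceded by a DocumentAssembler and a SentenceDetector. A minimal sketch; the input data and column names are illustrative, not part of the snippets above:

# A minimal end-to-end Spark NLP pipeline sketch (input text is illustrative).
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, BertSentenceEmbeddings
from pyspark.ml import Pipeline

spark = sparknlp.start()

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

embeddings = BertSentenceEmbeddings.pretrained("sent_bert_use_cmlm_en_base", "en") \
    .setInputCols(["sentence"]) \
    .setOutputCol("sentence_embeddings")

pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings])

data = spark.createDataFrame(
    [["I hate cancer."], ["Antibiotics aren't painkillers."]]
).toDF("text")

result = pipeline.fit(data).transform(data)
# Each row of "sentence_embeddings" holds one vector per detected sentence.
result.selectExpr("explode(sentence_embeddings.embeddings) AS embedding").show(truncate=50)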

Model Information

Model Name: sent_bert_use_cmlm_en_base
Compatibility: Spark NLP 3.1.3+
License: Open Source
Edition: Official
Input Labels: [sentence]
Output Labels: [bert]
Language: en
Case sensitive: false

Data Source

https://tfhub.dev/google/universal-sentence-encoder-cmlm/en-base/1

Benchmarking

Benchmarked by training a ClassifierDL text classifier on these sentence embeddings, using a news topic dataset with 120K training examples (a sketch of the setup follows the table):

              precision    recall  f1-score   support

    Business       0.84      0.90      0.87      1784
    Sci/Tech       0.92      0.85      0.89      2053
      Sports       0.98      0.96      0.97      1952
       World       0.89      0.93      0.91      1811

    accuracy                           0.91      7600
   macro avg       0.91      0.91      0.91      7600
weighted avg       0.91      0.91      0.91      7600
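
As a rough illustration of the setup behind these numbers, ClassifierDL can be trained directly on the sentence embeddings. This is a sketch under assumed inputs; the training DataFrame, its column names, and the epoch count are illustrative, not the exact benchmark configuration:

# Sketch: training ClassifierDL on top of the sentence embeddings.
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import BertSentenceEmbeddings, ClassifierDLApproach
from pyspark.ml import Pipeline

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# BertSentenceEmbeddings also accepts DOCUMENT annotations directly,
# producing one embedding per document.
embeddings = BertSentenceEmbeddings.pretrained("sent_bert_use_cmlm_en_base", "en") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence_embeddings")

classifier = ClassifierDLApproach() \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class") \
    .setLabelColumn("category") \
    .setMaxEpochs(5)  # illustrative value, not the benchmark's setting

pipeline = Pipeline(stages=[document_assembler, embeddings, classifier])

# training_df is assumed to have "text" and "category" columns
# (e.g. a 120K-example news training set as in the benchmark).
# model = pipeline.fit(training_df)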