DistilBERT base multilingual model (cased)

Description

This model is a distilled version of the BERT base multilingual model (mBERT). The code for the distillation process is available in the Hugging Face Transformers repository. This model is cased: it makes a difference between english and English. The model was trained on the concatenation of Wikipedia in 104 different languages.

The model has 6 layers, a hidden dimension of 768, and 12 attention heads, totaling 134M parameters (compared to 177M for mBERT-base). On average, DistilmBERT is twice as fast as mBERT-base.
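To sanity-check the size claim, here is a minimal sketch that loads the upstream Hugging Face checkpoint and counts its parameters. It assumes the transformers and torch packages are installed and is independent of the Spark NLP workflow shown below:

```python
# Minimal sketch: load the upstream checkpoint and count its parameters.
# Assumes `pip install transformers torch`; not part of the Spark NLP pipeline.
from transformers import DistilBertModel

model = DistilBertModel.from_pretrained("distilbert-base-multilingual-cased")
total = sum(p.numel() for p in model.parameters())
print(f"{total / 1e6:.0f}M parameters")  # ~134M
```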

How to use

Python:

```python
from sparknlp.annotator import DistilBertEmbeddings
from pyspark.ml import Pipeline

embeddings = DistilBertEmbeddings.pretrained("distilbert_base_multilingual_cased", "xx") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
```

Scala:

```scala
import com.johnsnowlabs.nlp.embeddings.DistilBertEmbeddings
import org.apache.spark.ml.Pipeline

val embeddings = DistilBertEmbeddings.pretrained("distilbert_base_multilingual_cased", "xx")
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
```
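
Both snippets assume the upstream stages (document_assembler, sentence_detector, tokenizer) are already defined. Below is a minimal end-to-end sketch in Python showing one way to define them and run the pipeline; the sample text and DataFrame column names other than those required by the annotators are illustrative:

```python
# End-to-end sketch: build the upstream stages the embeddings annotator expects.
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, DistilBertEmbeddings
from pyspark.ml import Pipeline

spark = sparknlp.start()

# "document" -> "sentence" -> "token" produce the [sentence, token] input labels.
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")
tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")
embeddings = DistilBertEmbeddings.pretrained("distilbert_base_multilingual_cased", "xx") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])

data = spark.createDataFrame([["Spark NLP supports many languages."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(embeddings.embeddings)").show(3)  # one 768-dim vector per token
```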

Model Information

Model Name: distilbert_base_multilingual_cased
Compatibility: Spark NLP 3.1.0+
License: Open Source
Edition: Official
Input Labels: [token, sentence]
Output Labels: [embeddings]
Language: xx
Case sensitive: true

Data Source

https://huggingface.co/distilbert-base-multilingual-cased

Benchmarking

Zero-shot cross-lingual accuracy (%) on the XNLI test set for six languages:

| Model                        | English | Spanish | Chinese | German | Arabic  | Urdu |
| :---:                        | :---:   | :---:   | :---:   | :---:  | :---:   | :---:|
| mBERT base cased (computed)  | 82.1    | 74.6    | 69.1    | 72.3   | 66.4    | 58.5 |
| mBERT base uncased (reported)| 81.4    | 74.3    | 63.8    | 70.5   | 62.1    | 58.3 |
| DistilmBERT                  | 78.2    | 69.1    | 64.0    | 66.3   | 59.1    | 54.7 |