DistilBERT base multilingual model (cased)

Description

This model is a distilled version of the BERT base multilingual model (mBERT). The code for the distillation process is available in the Hugging Face Transformers repository. This model is cased: it makes a difference between english and English. The model was trained on the concatenation of Wikipedia in 104 different languages.

The model has 6 layers, a hidden dimension of 768, and 12 attention heads, totaling 134M parameters (compared to 177M for mBERT-base). On average, DistilmBERT is twice as fast as mBERT-base.
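To sanity-check the size claim, here is a minimal sketch that loads the upstream Hugging Face checkpoint and counts its parameters. It assumes the transformers and torch packages are installed and is independent of the Spark NLP workflow shown below:

```python
# Minimal sketch: load the upstream checkpoint and count its parameters.
# Assumes `pip install transformers torch`; not part of the Spark NLP pipeline.
from transformers import DistilBertModel

model = DistilBertModel.from_pretrained("distilbert-base-multilingual-cased")
total = sum(p.numel() for p in model.parameters())
print(f"{total / 1e6:.0f}M parameters")  # ~134M
```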

How to use

Python:

```python
from sparknlp.annotator import DistilBertEmbeddings
from pyspark.ml import Pipeline

embeddings = DistilBertEmbeddings.pretrained("distilbert_base_multilingual_cased", "xx") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
```

Scala:

```scala
import com.johnsnowlabs.nlp.embeddings.DistilBertEmbeddings
import org.apache.spark.ml.Pipeline

val embeddings = DistilBertEmbeddings.pretrained("distilbert_base_multilingual_cased", "xx")
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
```
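
Both snippets assume the upstream stages (document_assembler, sentence_detector, tokenizer) are already defined. Below is a minimal end-to-end sketch in Python showing one way to define them and run the pipeline; the sample text and DataFrame column names other than those required by the annotators are illustrative:

```python
# End-to-end sketch: build the upstream stages the embeddings annotator expects.
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, DistilBertEmbeddings
from pyspark.ml import Pipeline

spark = sparknlp.start()

# "document" -> "sentence" -> "token" produce the [sentence, token] input labels.
document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")
sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")
tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")
embeddings = DistilBertEmbeddings.pretrained("distilbert_base_multilingual_cased", "xx") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])

data = spark.createDataFrame([["Spark NLP supports many languages."]]).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(embeddings.embeddings)").show(3)  # one 768-dim vector per token
```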

Model Information

Model Name: distilbert_base_multilingual_cased
Compatibility: Spark NLP 3.1.0+
License: Open Source
Edition: Official
Input Labels: [token, sentence]
Output Labels: [embeddings]
Language: xx
Case sensitive: true

Data Source

https://huggingface.co/distilbert-base-multilingual-cased

Benchmarking

Zero-shot cross-lingual accuracy (%) on the XNLI test set for six languages:

| Model                        | English | Spanish | Chinese | German | Arabic  | Urdu |
| :---:                        | :---:   | :---:   | :---:   | :---:  | :---:   | :---:|
| mBERT base cased (computed)  | 82.1    | 74.6    | 69.1    | 72.3   | 66.4    | 58.5 |
| mBERT base uncased (reported)| 81.4    | 74.3    | 63.8    | 70.5   | 62.1    | 58.3 |
| DistilmBERT                  | 78.2    | 69.1    | 64.0    | 66.3   | 59.1    | 54.7 |