CamemBERT Base CCNet

Description

CamemBERT is a state-of-the-art language model for French based on the RoBERTa architecture. This base-size variant (110M parameters) was trained on 135 GB of CCNet text. For further information or requests, see the CamemBERT website.

How to use

# Python
embeddings = CamemBertEmbeddings.pretrained("camembert_base_ccnet", "fr") \
      .setInputCols("sentence", "token") \
      .setOutputCol("embeddings")

// Scala
val embeddings = CamemBertEmbeddings.pretrained("camembert_base_ccnet", "fr")
      .setInputCols("sentence", "token")
      .setOutputCol("embeddings")
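
The annotator above expects `sentence` and `token` columns, so it has to sit behind upstream stages that produce them. The following is a minimal sketch of one way to wire that up in Python, assuming Spark NLP 3.4.4+ and PySpark are installed and the model can be downloaded. The column names `text` and `document`, the helper names `build_pipeline` and `run_example`, and the sample sentence are illustrative choices, not part of the model card.

```python
MODEL_NAME = "camembert_base_ccnet"
LANGUAGE = "fr"


def build_pipeline():
    """Assemble document -> sentence -> token -> CamemBERT embeddings stages.

    Call this only after a Spark NLP session has been started.
    """
    # Imported lazily so this module loads even without Spark NLP installed.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import (
        CamemBertEmbeddings,
        SentenceDetector,
        Tokenizer,
    )

    # Wrap the raw text column into Spark NLP's document annotation.
    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    # Split documents into sentences, then sentences into tokens.
    sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
    token = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
    # Same configuration as the snippet in this section.
    embeddings = (
        CamemBertEmbeddings.pretrained(MODEL_NAME, LANGUAGE)
        .setInputCols(["sentence", "token"])
        .setOutputCol("embeddings")
    )
    return Pipeline(stages=[document, sentence, token, embeddings])


def run_example(text="J'aime le camembert."):
    """Start a session, embed one sentence, and show each token's vector size."""
    import sparknlp

    spark = sparknlp.start()
    data = spark.createDataFrame([[text]]).toDF("text")
    result = build_pipeline().fit(data).transform(data)
    # Each annotation carries the token text in `result` and its vector in `embeddings`.
    (
        result.selectExpr("explode(embeddings) AS e")
        .selectExpr("e.result AS token", "size(e.embeddings) AS dimension")
        .show(truncate=False)
    )
    return result
```

Calling `run_example()` downloads the model on first use and prints one row per token with the length of its embedding vector.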

Model Information

Model Name: camembert_base_ccnet
Compatibility: Spark NLP 3.4.4+
License: Open Source
Edition: Official
Input Labels: [token, sentence]
Output Labels: [embeddings]
Language: fr
Size: 266.1 MB
Case sensitive: true

References

https://huggingface.co/camembert/camembert-base-ccnet

Benchmarking

| Model                                    | #params | Arch. | Training data                     |
|------------------------------------------|---------|-------|-----------------------------------|
| `camembert-base`                         | 110M    | Base  | OSCAR (138 GB of text)            |
| `camembert/camembert-large`              | 335M    | Large | CCNet (135 GB of text)            |
| `camembert/camembert-base-ccnet`         | 110M    | Base  | CCNet (135 GB of text)            |
| `camembert/camembert-base-wikipedia-4gb` | 110M    | Base  | Wikipedia (4 GB of text)          |
| `camembert/camembert-base-oscar-4gb`     | 110M    | Base  | Subsample of OSCAR (4 GB of text) |
| `camembert/camembert-base-ccnet-4gb`     | 110M    | Base  | Subsample of CCNet (4 GB of text) |