Word Embeddings for Dutch (dutch_cc_300d)

Description

This model is trained on Common Crawl and Wikipedia dataset for Dutch language using fastText. It is trained using CBOW with position-weights, in dimension 300, with character n-grams of length 5, a window of size 5 and 10 negatives.

The model gives 300 dimensional vector outputs per token. The output vectors map words into a meaningful space where the distance between the vectors is related to semantic similarity of words.

These embeddings can be used in multiple tasks like semantic word similarity, named entity recognition, sentiment analysis, and classification.

Predicted Entities

Download Copy S3 URI

How to use

embeddings = WordEmbeddingsModel.pretrained("dutch_cc_300d", "nl") \
        .setInputCols(["document", "token"]) \
        .setOutputCol("embeddings")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])

data = spark.createDataFrame([["De Bijlmerramp is de benaming voor de vliegramp"]]).toDF("text")

pipeline_model = nlp_pipeline.fit(data)

result = pipeline_model.transform(data)

val embeddings = WordEmbeddingsModel.pretrained("dutch_cc_300d", "nl")
        .setInputCols(Array("document", "token")) 
        .setOutputCol("embeddings")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))

val data = Seq("De Bijlmerramp is de benaming voor de vliegramp").toDF("text")

val result = pipeline.fit(data).transform(data)

import nlu
nlu.load("nl.embed").predict("""De Bijlmerramp is de benaming voor de vliegramp""")

Results

| token        | embedding                                                                      |
|:-------------|:-------------------------------------------------------------------------------|
| De           | ['0.0249', '-0.0115', '-0.0748', '-0.0823', '0.0866', '-0.0219', '0.00' ...]   |
| Bijlmerramp  | ['0.0204', '0.0079', '0.0224', '0.0352', '-0.0409', '0.0053', '0.0175', ...]   |
| is           | ['-1.0E-4', '0.1419', '0.053', '-0.0921', '0.07', '0.004', '-0.1683',   ...]   |
| de           | ['0.0309', '0.0411', '-0.0077', '-0.0756', '0.0741', '-0.0402', '0.025' ...]   |
| benaming     | ['0.0197', '0.0167', '-0.0051', '0.0198', '0.034', '-0.0086', '-0.009', ...]   |
| voor         | ['0.0642', '-0.0171', '-0.0118', '0.0042', '0.0058', '0.0018', '0.0039' ...]   |
| de           | ['0.0309', '0.0411', '-0.0077', '-0.0756', '0.0741', '-0.0402', '0.025' ...]   |
| vliegramp    | ['0.083', '0.025', '0.0029', '0.0064', '-0.0698', '0.0344', '-0.0305',  ...]   |

Model Information

Model Name:	dutch_cc_300d
Type:	embeddings
Compatibility:	Spark NLP 3.3.0+
License:	Open Source
Edition:	Official
Input Labels:	[document, token]
Output Labels:	[word_embeddings]
Language:	nl
Case sensitive:	false
Dimension:	300

Data Source

This model is imported from https://fasttext.cc/docs/en/crawl-vectors.html

PREVIOUSXLM-RoBERTa Token Classification Base - NER OntoNotes (xlm_roberta_base_token_classifier_ontonotes)

NEXTLongformer Token Classification Base - NER CoNLL (longformer_base_token_classifier_conll03)