Word Embeddings for Bengali (bengali_cc_300d)

Description

This model was trained on Common Crawl and Wikipedia using fastText. Training used CBOW with position weights, a vector dimension of 300, character n-grams of length 5, a window size of 5, and 10 negative samples.
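To make "character n-grams of length 5" concrete, here is a minimal sketch of how a word can be split into 5-character subword units. It assumes the fastText convention of wrapping each word in `<` and `>` boundary markers. The helper name `char_ngrams` is hypothetical and is not part of fastText or Spark NLP.

```python
from typing import List


def char_ngrams(word: str, n: int = 5) -> List[str]:
    """Return the character n-grams of `word` (hypothetical helper).

    Assumption: the word is wrapped in '<' and '>' boundary markers,
    following the fastText subword convention.
    """
    marked = f"<{word}>"
    # Slide a window of n characters (Unicode code points) over the marked word.
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


print(char_ngrams("where"))  # ['<wher', 'where', 'here>']
```

Python slices strings by Unicode code point, so in Bengali text a vowel sign or other combining mark counts as its own character in this sketch.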

The model outputs one 300-dimensional vector per token. These vectors place words in a space where the distance between two vectors reflects the semantic similarity of the corresponding words.
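One common way to measure how close two embedding vectors are is cosine similarity. The sketch below is an illustration in plain Python and does not use any Spark NLP API. The function name `cosine_similarity` and the short toy vectors are assumptions for demonstration only; real vectors from this model have 300 components.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (illustrative helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat similarity as 0 by convention.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for 300-dimensional embeddings.
vec_a = [0.2, 0.7, 0.1]
vec_b = [0.25, 0.6, 0.05]
vec_c = [-0.7, 0.1, 0.9]

print(cosine_similarity(vec_a, vec_b) > cosine_similarity(vec_a, vec_c))  # True
```

Values near 1 indicate vectors pointing in nearly the same direction, while values near 0 or below indicate unrelated or opposed directions.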

These embeddings can be used in tasks such as semantic word similarity, named entity recognition, sentiment analysis, and text classification.

How to use

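from sparknlp.annotator import WordEmbeddingsModel

# Load the pretrained Bengali embeddings and attach them to the sentence and token columns
embeddings = WordEmbeddingsModel.pretrained("bengali_cc_300d", "bn") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")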


Results

The model outputs a 300-dimensional feature vector for each token.
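Because the output is one vector per token, downstream tasks such as classification often need a single fixed-size vector per sentence. A simple option is to average the token vectors. The sketch below assumes the token vectors have already been extracted into plain Python lists. The helper name `mean_pool` is hypothetical, and this is not the only way (or necessarily the intended way) to aggregate the model's output.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average per-token vectors into one sentence vector (hypothetical helper)."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    if any(len(v) != dim for v in token_vectors):
        raise ValueError("all token vectors must have the same dimension")
    count = len(token_vectors)
    # Average each component across all tokens.
    return [sum(v[i] for v in token_vectors) / count for i in range(dim)]


# Placeholder data: four tokens, each with a 300-dimensional vector.
sentence_tokens = [[0.5] * 300 for _ in range(4)]
sentence_vector = mean_pool(sentence_tokens)
print(len(sentence_vector))  # 300
```

The pooled vector keeps the 300-dimensional size regardless of sentence length, so it can be passed directly to a classifier as a feature vector.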


Model Information

Model Name: bengali_cc_300d
Type: embeddings
Compatibility: Spark NLP 2.7.3+
License: Open Source
Edition: Official
Input Labels: [document, token]
Output Labels: [word_embeddings]
Language: bn
Case sensitive: false
Dimension: 300

Data Source

This model is imported from https://fasttext.cc/docs/en/crawl-vectors.html