Word Embeddings for Urdu (urduvec_140M_300d)

Description

This model was trained with the Word2Vec approach on a corpus of 140 million tokens. It has a vocabulary of 100k unique tokens and outputs a 300-dimensional vector per token. The output vectors map words into a space where the distance between two vectors is related to the semantic similarity of the corresponding words.

These embeddings can be used in multiple tasks like semantic word similarity, named entity recognition, sentiment analysis, and classification.
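
To make the link between vector distance and semantic similarity concrete, here is a minimal sketch of cosine similarity, a measure commonly used to compare word vectors. It does not load the model: the vectors are hypothetical 4-dimensional stand-ins for the 300-dimensional outputs, and the names `word_a`, `word_b`, `word_c` and the helper `cosine_similarity` are made up for illustration.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors (NOT real model output); real vectors from
# this model would have 300 components each.
toy_vectors = {
    "word_a": [0.9, 0.1, 0.0, 0.2],
    "word_b": [0.8, 0.2, 0.1, 0.3],
    "word_c": [-0.2, 0.9, 0.7, 0.0],
}

sim_ab = cosine_similarity(toy_vectors["word_a"], toy_vectors["word_b"])
sim_ac = cosine_similarity(toy_vectors["word_a"], toy_vectors["word_c"])

# Vectors pointing in similar directions score closer to 1.
print(f"word_a vs word_b: {sim_ab:.3f}")
print(f"word_a vs word_c: {sim_ac:.3f}")
```

With real embeddings, the same function would be applied to the per-token vectors the model produces; a higher score suggests the two words are used in more similar contexts.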

How to use

# Python
from sparknlp.annotator import WordEmbeddingsModel

embeddings = WordEmbeddingsModel.pretrained("urduvec_140M_300d", "ur") \
        .setInputCols(["document", "token"]) \
        .setOutputCol("embeddings")

// Scala
import com.johnsnowlabs.nlp.embeddings.WordEmbeddingsModel

val embeddings = WordEmbeddingsModel.pretrained("urduvec_140M_300d", "ur")
        .setInputCols(Array("document", "token"))
        .setOutputCol("embeddings")

Results

The model outputs one 300-dimensional Word2Vec feature vector per token.

Model Information

Model Name: urduvec_140M_300d
Type: embeddings
Compatibility: Spark NLP 2.7.0+
License: Open Source
Edition: Official
Input Labels: [document, token]
Output Labels: [word_embeddings]
Language: ur
Case sensitive: false
Dimension: 300

Data Source

The model was imported from http://www.lrec-conf.org/proceedings/lrec2018/pdf/148.pdf