Multilingual Representations for Indian Languages (MuRIL)

Description

A BERT model pretrained on 17 Indian languages and their transliterated counterparts.

This model uses a BERT base architecture [1], pretrained from scratch on the Wikipedia [2], Common Crawl [3], PMINDIA [4], and Dakshina [5] corpora for the following 17 Indian languages:

Assamese, Bengali, English, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Sindhi, Tamil, Telugu, Urdu

How to use

Python:

# document_assembler, sentence_detector and tokenizer must be defined earlier in the pipeline
embeddings = BertEmbeddings.pretrained("bert_muril", "xx") \
    .setInputCols("sentence", "token") \
    .setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])

Scala:

// document_assembler, sentence_detector and tokenizer must be defined earlier in the pipeline
val embeddings = BertEmbeddings.pretrained("bert_muril", "xx")
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))
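
The snippets above reference upstream stages without defining them. The following is a minimal Python sketch of one way to fill them in, assuming the usual Spark NLP DocumentAssembler, SentenceDetector, and Tokenizer annotators with default settings. The function names build_muril_pipeline and embed, and the input column name "text", are illustrative choices and not part of the model card.

```python
def build_muril_pipeline():
    """Assemble a document -> sentence -> token -> MuRIL embeddings pipeline."""
    # Imports are kept local so this file can be loaded without Spark installed.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import SentenceDetector, Tokenizer, BertEmbeddings

    # Wrap the raw "text" column (assumed name) into Spark NLP documents.
    document_assembler = DocumentAssembler() \
        .setInputCol("text") \
        .setOutputCol("document")
    # Split documents into sentences, then sentences into tokens.
    sentence_detector = SentenceDetector() \
        .setInputCols(["document"]) \
        .setOutputCol("sentence")
    tokenizer = Tokenizer() \
        .setInputCols(["sentence"]) \
        .setOutputCol("token")
    # Downloads the pretrained model on first use.
    embeddings = BertEmbeddings.pretrained("bert_muril", "xx") \
        .setInputCols(["sentence", "token"]) \
        .setOutputCol("embeddings")
    return Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])


def embed(spark, texts):
    """Run the pipeline on a list of strings; return one row per token with its vector."""
    data = spark.createDataFrame([[t] for t in texts]).toDF("text")
    model = build_muril_pipeline().fit(data)
    result = model.transform(data)
    # Each annotation in "embeddings" carries the token text and its vector.
    return result.selectExpr("explode(embeddings) as e") \
        .selectExpr("e.result as token", "e.embeddings as vector")
```

With a session created by sparknlp.start(), calling embed(spark, ["some input text"]).show() should list each token next to its embedding vector. The first call needs network access to fetch bert_muril.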

Model Information

Model Name: bert_muril
Compatibility: Spark NLP 3.2.0+
License: Open Source
Edition: Official
Input Labels: [sentence, token]
Output Labels: [bert]
Language: xx
Case sensitive: false

Data Source

The model is imported from: https://tfhub.dev/google/MuRIL/1