Multilingual Representations for Indian Languages (MuRIL) - BERT Sentence Embedding pre-trained on 17 Indian languages


This model uses a BERT base architecture pretrained from scratch using the Wikipedia, Common Crawl, PMINDIA and Dakshina corpora for the following 17 Indian languages:

Assamese, Bengali, English, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Sindhi, Tamil, Telugu, Urdu

The MuRIL model is pretrained on both monolingual and parallel segments, as detailed below:

  • Monolingual Data: Publicly available corpora from Wikipedia and Common Crawl for 17 Indian languages.
  • Parallel Data: There are two types of parallel data:
    • Translated Data: Translations of the above monolingual corpora, obtained using the Google NMT pipeline and fed to the model as translated segment pairs. The publicly available PMINDIA corpus was also used.
    • Transliterated Data: Transliterations of Wikipedia, obtained using the IndicTrans library and fed to the model as transliterated segment pairs. The publicly available Dakshina dataset was also used.
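
To make "segment pairs fed as input" concrete, here is a minimal sketch of how two aligned segments could be joined into a single paired input. The exact preprocessing is not described on this page; the sketch assumes the standard BERT pair convention with `[CLS]` and `[SEP]` markers, and the helper name `make_segment_pair` and the example strings are hypothetical.

```python
def make_segment_pair(source, target, cls="[CLS]", sep="[SEP]"):
    """Join two aligned segments into one BERT-style paired input string.

    Assumption: the pair is laid out as "[CLS] A [SEP] B [SEP]", the usual
    BERT convention. The actual MuRIL preprocessing may differ.
    """
    return f"{cls} {source.strip()} {sep} {target.strip()} {sep}"


# Hypothetical translated pair (English segment with a Hindi counterpart).
translated = make_segment_pair("hello", "नमस्ते")

# Hypothetical transliterated pair (native script with its Latin-script form).
transliterated = make_segment_pair("नमस्ते", "namaste")

print(translated)
print(transliterated)
```

Pairing the two forms of a segment in one input lets the model attend across both halves during pretraining, which is the usual motivation for feeding parallel data this way.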

How to use

Python:

# document_assembler and sentence_detector are assumed to be defined upstream; the output column name follows the Output Labels field.
sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_muril", "xx") \
      .setInputCols("sentence") \
      .setOutputCol("bert_sentence")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, sent_embeddings])

Scala:

val sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_muril", "xx")
      .setInputCols("sentence")
      .setOutputCol("bert_sentence")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, sent_embeddings))

NLU:

import nlu

text = ["I love NLP"]
sent_embeddings_df = nlu.load('en.embed_sentence.bert.muril').predict(text, output_level='sentence')
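
A common next step with sentence embeddings is comparing two sentences by cosine similarity. The following is a minimal sketch in plain Python, assuming each embedding has already been extracted from the pipeline output as a list of floats. The short vectors are hypothetical stand-ins for real embeddings, and `cosine_similarity` is an illustrative helper, not part of Spark NLP or NLU.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Define similarity with a zero vector as 0.0 to avoid dividing by zero.
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical 4-dimensional stand-ins for real sentence embeddings.
emb_a = [0.1, 0.3, -0.2, 0.5]
emb_b = [0.1, 0.3, -0.2, 0.5]
emb_c = [-0.4, 0.1, 0.6, -0.1]

print(cosine_similarity(emb_a, emb_b))
print(cosine_similarity(emb_a, emb_c))
```

Scores closer to 1 indicate vectors pointing in similar directions. For large collections, a vectorized library would be a more practical choice than this pure-Python loop.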

Model Information

Model Name: sent_bert_muril
Compatibility: Spark NLP 3.2.0+
License: Open Source
Edition: Official
Input Labels: [sentence]
Output Labels: [bert_sentence]
Language: xx
Case sensitive: false

Data Source

The model is imported from: