Multilingual Representations for Indian Languages (MuRIL) - BERT Sentence Embedding pre-trained on 17 Indian languages

Description

This model uses a BERT base architecture pretrained from scratch using the Wikipedia, Common Crawl, PMINDIA and Dakshina corpora for the following 17 Indian languages:

Assamese, Bengali, English, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Sindhi, Tamil, Telugu, Urdu

The MuRIL model is pre-trained on monolingual segments as well as parallel segments, as detailed below:

  • Monolingual Data: Publicly available corpora from Wikipedia and Common Crawl for 17 Indian languages.
  • Parallel Data: There are two types of parallel data:
      • Translated Data: Translations of the above monolingual corpora obtained using the Google NMT pipeline. The translated segment pairs are fed as input. The publicly available PMINDIA corpus was also used.
      • Transliterated Data: Transliterations of Wikipedia obtained using the IndicTrans library. The transliterated segment pairs are fed as input. The publicly available Dakshina dataset was also used.
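
As a rough illustration of what "segment pairs fed as input" could look like, the sketch below packs two parallel segments into one sequence. It assumes the standard BERT two-segment layout ([CLS] A [SEP] B [SEP] with segment ids 0 and 1); the source does not spell out MuRIL's exact preprocessing. The helper name pack_segment_pair is hypothetical, whitespace splitting stands in for the real WordPiece tokenizer, and the Hindi sentence with its romanization is a made-up example rather than training data.

```python
def pack_segment_pair(segment_a, segment_b):
    """Pack two parallel segments into one BERT-style input (illustrative only)."""
    # Assumption: whitespace split stands in for the model's WordPiece tokenizer.
    tokens_a = segment_a.split()
    tokens_b = segment_b.split()

    # Assumed standard BERT pair layout: [CLS] A [SEP] B [SEP].
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]

    # Segment id 0 covers [CLS], segment A and the first [SEP];
    # segment id 1 covers segment B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


# A made-up transliterated pair: a Devanagari sentence and its romanized form.
tokens, segment_ids = pack_segment_pair("मैं घर जा रहा हूँ", "main ghar ja raha hoon")
for token, segment_id in zip(tokens, segment_ids):
    print(segment_id, token)
```

Feeding both halves of a pair in one sequence lets the model attend across the two segments, which is the usual motivation for training on parallel data in this form.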

How to use

Python:

sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_muril", "xx") \
    .setInputCols("sentence") \
    .setOutputCol("bert_sentence")

# document_assembler and sentence_detector are upstream stages assumed to be defined earlier.
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, sent_embeddings])

Scala:

val sent_embeddings = BertSentenceEmbeddings.pretrained("sent_bert_muril", "xx")
  .setInputCols("sentence")
  .setOutputCol("bert_sentence")

// document_assembler and sentence_detector are upstream stages assumed to be defined earlier.
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, sent_embeddings))

NLU:

import nlu

text = ["I love NLP"]
sent_embeddings_df = nlu.load('en.embed_sentence.bert.muril').predict(text, output_level='sentence')
sent_embeddings_df
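
The Python snippet above uses document_assembler and sentence_detector without defining them. A minimal end-to-end sketch follows, assuming a standard Spark NLP setup. The two upstream stages and their column names ("text", "document"), along with the helper names build_muril_pipeline and embed, are assumptions and do not come from the model card. Only the model name, language code and the "sentence" / "bert_sentence" labels are taken from it.

```python
MODEL_NAME = "sent_bert_muril"  # model name from the model card
MODEL_LANG = "xx"               # multilingual language code from the model card


def build_muril_pipeline():
    """Assemble document -> sentence -> MuRIL sentence-embedding stages."""
    # Imported lazily so this module can be loaded without Spark installed.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import SentenceDetector, BertSentenceEmbeddings

    # Assumed upstream stage: wraps the raw "text" column as a document annotation.
    document_assembler = (
        DocumentAssembler().setInputCol("text").setOutputCol("document")
    )
    # Assumed upstream stage: splits each document into sentences.
    sentence_detector = (
        SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
    )
    # The pretrained model from this card; downloaded on first use.
    sent_embeddings = (
        BertSentenceEmbeddings.pretrained(MODEL_NAME, MODEL_LANG)
        .setInputCols(["sentence"])
        .setOutputCol("bert_sentence")
    )
    return Pipeline(stages=[document_assembler, sentence_detector, sent_embeddings])


def embed(texts):
    """Run the pipeline on a list of strings; needs Spark NLP and network access."""
    import sparknlp

    spark = sparknlp.start()  # start a Spark session before creating annotators
    data = spark.createDataFrame([[t] for t in texts]).toDF("text")
    return build_muril_pipeline().fit(data).transform(data)
```

Calling embed(["I love NLP"]) should return a Spark DataFrame whose bert_sentence column holds one embedding annotation per detected sentence.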

Model Information

Model Name: sent_bert_muril
Compatibility: Spark NLP 3.2.0+
License: Open Source
Edition: Official
Input Labels: [sentence]
Output Labels: [bert_sentence]
Language: xx
Case sensitive: false

Data Source

The model is imported from: https://tfhub.dev/google/MuRIL/1