Multilingual Representations for Indian Languages (MuRIL)

Description

A BERT model pre-trained on 17 Indian languages and their transliterated counterparts.

This model uses a BERT base architecture pre-trained from scratch on the Wikipedia, Common Crawl, PMINDIA, and Dakshina corpora for the following 17 Indian languages:

Assamese, Bengali, English, Gujarati, Hindi, Kannada, Kashmiri, Malayalam, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Sindhi, Tamil, Telugu, Urdu


How to use

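# Python: document_assembler, sentence_detector, and tokenizer must already be defined
from pyspark.ml import Pipeline
from sparknlp.annotator import BertEmbeddings

embeddings = BertEmbeddings.pretrained("bert_muril", "xx") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")
nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])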

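// Scala: document_assembler, sentence_detector, and tokenizer must already be defined
import com.johnsnowlabs.nlp.embeddings.BertEmbeddings
import org.apache.spark.ml.Pipeline

val embeddings = BertEmbeddings.pretrained("bert_muril", "xx")
  .setInputCols("sentence", "token")
  .setOutputCol("embeddings")
val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))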

import nlu
nlu.load("xx.embed.bert.muril").predict("""Put your text here.""")
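
The Spark NLP snippets above reference document_assembler, sentence_detector, and tokenizer without defining them. The following is a minimal Python sketch of one way to fill in those stages and read back the token vectors. The column names "text" and "document", the helper names build_muril_pipeline and embed_texts, and the sample usage are assumptions made for illustration; they are not part of the model card.

```python
def build_muril_pipeline():
    """Assemble the upstream stages the snippet assumes, then add MuRIL embeddings."""
    # Imports live inside the function so this file loads even without Spark NLP installed.
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import SentenceDetector, Tokenizer, BertEmbeddings

    # Wrap the raw "text" column (assumed name) into Spark NLP documents.
    document_assembler = DocumentAssembler() \
        .setInputCol("text") \
        .setOutputCol("document")

    # Split documents into sentences; this output feeds the tokenizer and the embeddings.
    sentence_detector = SentenceDetector() \
        .setInputCols(["document"]) \
        .setOutputCol("sentence")

    # Split sentences into tokens.
    tokenizer = Tokenizer() \
        .setInputCols(["sentence"]) \
        .setOutputCol("token")

    # Download the pretrained model named in this card.
    embeddings = BertEmbeddings.pretrained("bert_muril", "xx") \
        .setInputCols(["sentence", "token"]) \
        .setOutputCol("embeddings")

    return Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])


def embed_texts(spark, texts):
    """Return a DataFrame with one row per token: the token text and its vector."""
    data = spark.createDataFrame([[t] for t in texts]).toDF("text")
    model = build_muril_pipeline().fit(data)
    return model.transform(data) \
        .selectExpr("explode(embeddings) AS annotation") \
        .selectExpr("annotation.result AS token", "annotation.embeddings AS vector")
```

With Spark NLP installed, a session can be started with `spark = sparknlp.start()` (after `import sparknlp`), and `embed_texts(spark, ["यह एक उदाहरण वाक्य है।"]).show()` should then list each token next to its embedding vector. The first call downloads the model, so it needs network access.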

Model Information

Model Name:	bert_muril
Compatibility:	Spark NLP 3.2.0+
License:	Open Source
Edition:	Official
Input Labels:	[sentence, token]
Output Labels:	[bert]
Language:	xx
Case sensitive:	false

Data Source

The model is imported from: https://tfhub.dev/google/MuRIL/1
