Legal BERT Base Uncased Embedding


LEGAL-BERT is a family of BERT models for the legal domain, intended to assist legal NLP research, computational law, and legal technology applications. To pre-train the different variations of LEGAL-BERT, we collected 12 GB of diverse English legal text from several fields (e.g., legislation, court cases, contracts) scraped from publicly available resources. The sub-domain variants (CONTRACTS-, EURLEX-, ECHR-) and/or the general LEGAL-BERT perform better on domain-specific tasks than BERT used out of the box. A lightweight model (33% of the size of BERT-BASE), pre-trained from scratch on legal data and with competitive performance, is also available.

How to use

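# Python
from pyspark.ml import Pipeline
from sparknlp.annotator import BertEmbeddings

# document_assembler, sentence_detector, and tokenizer are assumed to be defined earlier.
embeddings = BertEmbeddings.pretrained("bert_base_uncased_legal", "en") \
      .setInputCols("sentence", "token") \
      .setOutputCol("bert")

nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])

// Scala
import com.johnsnowlabs.nlp.embeddings.BertEmbeddings
import org.apache.spark.ml.Pipeline

// document_assembler, sentence_detector, and tokenizer are assumed to be defined earlier.
val embeddings = BertEmbeddings.pretrained("bert_base_uncased_legal", "en")
      .setInputCols("sentence", "token")
      .setOutputCol("bert")

val pipeline = new Pipeline().setStages(Array(document_assembler, sentence_detector, tokenizer, embeddings))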
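
The snippets above reference document_assembler, sentence_detector, and tokenizer without defining them. A minimal Python sketch of the complete pipeline might look like the following. The column names "text" and "document", the helper name build_legal_bert_pipeline, and the use of SentenceDetector and Tokenizer with their default settings are assumptions made for illustration; they are not taken from the original model card.

```python
MODEL_NAME = "bert_base_uncased_legal"  # model name given on this card
OUTPUT_COL = "bert"                     # matches "Output Labels" on this card


def build_legal_bert_pipeline():
    """Return an unfitted Spark ML Pipeline that ends in LEGAL-BERT embeddings.

    Call this only after a Spark NLP session exists (e.g. sparknlp.start()),
    because pretrained() downloads and loads the model through the JVM.
    Imports are local so the module itself loads without Spark installed.
    """
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import BertEmbeddings, SentenceDetector, Tokenizer

    # Assumption: the input DataFrame has a string column named "text".
    document_assembler = DocumentAssembler() \
        .setInputCol("text") \
        .setOutputCol("document")

    # Split each document into sentences (default rule-based detector).
    sentence_detector = SentenceDetector() \
        .setInputCols(["document"]) \
        .setOutputCol("sentence")

    # Split each sentence into tokens.
    tokenizer = Tokenizer() \
        .setInputCols(["sentence"]) \
        .setOutputCol("token")

    # Token-level embeddings from the pretrained legal model.
    embeddings = BertEmbeddings.pretrained(MODEL_NAME, "en") \
        .setInputCols(["sentence", "token"]) \
        .setOutputCol(OUTPUT_COL)

    return Pipeline(stages=[document_assembler, sentence_detector, tokenizer, embeddings])
```

With a session started via spark = sparknlp.start() and a DataFrame df that has a "text" column, build_legal_bert_pipeline().fit(df).transform(df) should add a "bert" column holding one annotation per token.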

Model Information

Model Name: bert_base_uncased_legal
Compatibility: Spark NLP 3.2.2+
License: Open Source
Edition: Official
Input Labels: [sentence, token]
Output Labels: [bert]
Language: en
Case sensitive: true
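
As the labels above indicate, the annotator reads the sentence and token columns and writes its annotations to bert. The sketch below shows one way to pull token strings and their vectors out of that column in Python. It assumes each collected row has been converted with Row.asDict(recursive=True), so that every annotation is a plain dict with "result" and "embeddings" keys. The helper name token_vectors and the toy two-dimensional vectors are illustrative only; vectors from a BERT-base model have 768 dimensions.

```python
from typing import Dict, List, Sequence, Tuple


def token_vectors(annotations: Sequence[Dict]) -> List[Tuple[str, List[float]]]:
    """Pair each token string with its embedding vector.

    `annotations` is the content of the "bert" column for one row, with each
    annotation already converted to a dict (see the assumption stated above).
    """
    return [(a["result"], list(a["embeddings"])) for a in annotations]


# Toy stand-in for the "bert" column of a single row.
sample = [
    {"result": "the", "embeddings": [0.0, 2.0]},
    {"result": "lessee", "embeddings": [1.0, 4.0]},
]

pairs = token_vectors(sample)
for token, vector in pairs:
    print(token, len(vector))
```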

Data Source

The model is imported from: