README.md · law-ai/InLegalBERT at 4b9c41c0c734c0fbcfcb9d6fd92137064c471b0a

metadata

language: en
pipeline_tag: fill-mask
tags:
  - legal
license: mit

InLegalBERT

Model and tokenizer files for the InLegalBERT model.

Training Data

For building the pre-training corpus of Indian legal text, we collected a large corpus of case documents from the Indian Supreme Court and many High Courts of India. These documents were collected from diverse publicly available sources on the Web, such as official websites of these courts (e.g., the website of the Indian Supreme Court), the erstwhile website of the Legal Information Institute of India, the popular legal repository IndianKanoon, and so on. The court cases in our dataset range from 1950 to 2019, and belong to all legal domains, such as Civil, Criminal, Constitutional, and so on. Additionally, we collected 1,113 Central Government Acts, which are the documents codifying the laws of the country. Each Act is a collection of related laws, called Sections. These 1,113 Acts contain a total of 32,021 Sections. In total, our dataset contains around 5.4 million Indian legal documents (all in the English language). The raw text corpus size is around 27 GB.

Training Objective

This model is initialized with the LEGAL-BERT-SC model from the paper "LEGAL-BERT: The Muppets straight out of Law School".

Usage

Using the tokenizer (same as LegalBERT)

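from transformers import AutoTokenizer, AutoModel
# The tokenizer is the same as LegalBERT's, so it is loaded from the LegalBERT checkpoint.
tokenizer = AutoTokenizer.from_pretrained("nlpaueb/legal-bert-base-uncased")
# Load the InLegalBERT weights from this repository.
model = AutoModel.from_pretrained("law-ai/InLegalBERT")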
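Once a tokenizer and model are loaded, a common next step is to turn a piece of legal text into a fixed-size vector. The sketch below is one possible way to do that, not the authors' documented procedure: it averages the token vectors of the last hidden layer while ignoring padding. The helper names `mean_pool`, `embed`, and `demo` are hypothetical, mean pooling is an assumed design choice, and the repository id `law-ai/InLegalBERT` is assumed from this page's title.

```python
import torch
from transformers import AutoModel, AutoTokenizer


def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sequence, ignoring padding positions."""
    # Expand the mask to (batch, seq_len, 1) so it broadcasts over the hidden size.
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    # Guard against division by zero for an all-padding row.
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts


def embed(texts, tokenizer, model, max_length=512):
    """Return one vector per input text (hypothetical helper)."""
    encoded = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    model.eval()
    with torch.no_grad():
        output = model(**encoded)
    return mean_pool(output.last_hidden_state, encoded["attention_mask"])


def demo():
    """Download the checkpoints and embed one sentence (requires network access)."""
    # Tokenizer from the LegalBERT checkpoint, as in the snippet above.
    tokenizer = AutoTokenizer.from_pretrained("nlpaueb/legal-bert-base-uncased")
    # Assumed repository id for the InLegalBERT weights.
    model = AutoModel.from_pretrained("law-ai/InLegalBERT")
    text = "The appellant filed a writ petition before the High Court."
    return embed([text], tokenizer, model)
```

Calling `demo()` downloads both checkpoints on first use and returns a tensor with one row per input text. Mean pooling is only one option; taking the vector at the `[CLS]` position is another common choice, and which works better likely depends on the downstream task.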
