law-ai committed
Commit 0d71062
1 Parent(s): f72722a

Update README.md

Files changed (1)
  1. README.md +19 -0
README.md CHANGED

tags:
- legal
license: mit
---

### InLegalBERT
Model and tokenizer files for the InLegalBERT model.

### Training Data
To build the pre-training corpus of Indian legal text, we collected a large corpus of case documents from the Indian Supreme Court and many High Courts of India.
These documents were collected from diverse publicly available sources on the Web, such as the official websites of these courts (e.g., [the website of the Indian Supreme Court](https://main.sci.gov.in/)), the erstwhile website of the Legal Information Institute of India,
the popular legal repository [IndianKanoon](https://www.indiankanoon.org), and so on.
The court cases in our dataset range from 1950 to 2019 and belong to all legal domains, such as Civil, Criminal, Constitutional, and so on.
Additionally, we collected 1,113 Central Government Acts, which are the documents codifying the laws of the country. Each Act is a collection of related laws, called Sections. These 1,113 Acts contain a total of 32,021 Sections.
In total, our dataset contains around 5.4 million Indian legal documents (all in the English language).
The raw text corpus size is around 27 GB.

### Training Objective
This model is initialized with the [LEGAL-BERT-SC model](https://huggingface.co/nlpaueb/legal-bert-base-uncased) from the paper [LEGAL-BERT: The Muppets straight out of Law School](https://aclanthology.org/2020.findings-emnlp.261/), and trained for an additional 300K steps on our data with the MLM and NSP objectives.
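
For illustration only (this is not the authors' actual training pipeline), a minimal sketch of the combined MLM + NSP objective using `transformers`' `BertForPreTraining`, starting from the same LEGAL-BERT-SC checkpoint; the sentence pair and labels below are toy examples:

```python
import torch
from transformers import AutoTokenizer, BertForPreTraining

# Load the LEGAL-BERT-SC checkpoint that InLegalBERT is initialized from.
tokenizer = AutoTokenizer.from_pretrained("nlpaueb/legal-bert-base-uncased")
model = BertForPreTraining.from_pretrained("nlpaueb/legal-bert-base-uncased")

# A toy sentence pair for the NSP objective (sentence B follows sentence A).
encoding = tokenizer(
    "The appellant filed a writ petition.",
    "The High Court dismissed the petition.",
    return_tensors="pt",
)

# MLM labels: here every token is predicted; a real pipeline masks ~15% of tokens
# and sets the labels of unmasked positions to -100 so they are ignored.
mlm_labels = encoding["input_ids"].clone()
nsp_label = torch.tensor([0])  # 0 = sentence B is the true next sentence

outputs = model(**encoding, labels=mlm_labels, next_sentence_label=nsp_label)
print(outputs.loss)  # combined masked-LM + next-sentence-prediction loss
```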

### Usage
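A minimal usage sketch, assuming the model and tokenizer are hosted on the Hugging Face Hub under the `law-ai/InLegalBERT` identifier and load with the standard `transformers` Auto classes:

```python
from transformers import AutoTokenizer, AutoModel

# Assumed Hub identifier for this repository; adjust if the name differs.
tokenizer = AutoTokenizer.from_pretrained("law-ai/InLegalBERT")
model = AutoModel.from_pretrained("law-ai/InLegalBERT")

# Encode an example legal sentence and obtain contextual embeddings.
text = "The appellant filed a writ petition before the High Court."
encoded = tokenizer(text, return_tensors="pt")
outputs = model(**encoded)
embeddings = outputs.last_hidden_state  # shape: (1, sequence_length, hidden_size)
```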

### Citation