law-ai committed
Commit 0d71062
1 Parent(s): f72722a

Update README.md

Files changed (1)
  1. README.md +19 -0
README.md CHANGED

tags:
- legal
license: mit
---

### InLegalBERT
Model and tokenizer files for the InLegalBERT model.

### Training Data
To build the pre-training corpus of Indian legal text, we collected a large corpus of case documents from the Indian Supreme Court and many High Courts of India.
These documents were collected from diverse publicly available sources on the Web, such as the official websites of these courts (e.g., [the website of the Indian Supreme Court](https://main.sci.gov.in/)), the erstwhile website of the Legal Information Institute of India,
the popular legal repository [IndianKanoon](https://www.indiankanoon.org), and so on.
The court cases in our dataset range from 1950 to 2019 and belong to all legal domains, such as Civil, Criminal, Constitutional, and so on.
Additionally, we collected 1,113 Central Government Acts, which are the documents codifying the laws of the country. Each Act is a collection of related laws, called Sections. These 1,113 Acts contain a total of 32,021 Sections.
In total, our dataset contains around 5.4 million Indian legal documents (all in the English language).
The raw text corpus size is around 27 GB.

### Training Objective
This model is initialized with the [LEGAL-BERT-SC model](https://huggingface.co/nlpaueb/legal-bert-base-uncased) from the paper [LEGAL-BERT: The Muppets straight out of Law School](https://aclanthology.org/2020.findings-emnlp.261/), and trained for an additional 300K steps on our data with the MLM and NSP objectives.
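
For illustration only (this is not the authors' actual training pipeline), a minimal sketch of the combined MLM + NSP objective using `transformers`' `BertForPreTraining`, starting from the same LEGAL-BERT-SC checkpoint; the sentence pair and labels below are toy examples:

```python
import torch
from transformers import AutoTokenizer, BertForPreTraining

# Load the LEGAL-BERT-SC checkpoint that InLegalBERT is initialized from.
tokenizer = AutoTokenizer.from_pretrained("nlpaueb/legal-bert-base-uncased")
model = BertForPreTraining.from_pretrained("nlpaueb/legal-bert-base-uncased")

# A toy sentence pair for the NSP objective (sentence B follows sentence A).
encoding = tokenizer(
    "The appellant filed a writ petition.",
    "The High Court dismissed the petition.",
    return_tensors="pt",
)

# MLM labels: here every token is predicted; a real pipeline masks ~15% of tokens
# and sets the labels of unmasked positions to -100 so they are ignored.
mlm_labels = encoding["input_ids"].clone()
nsp_label = torch.tensor([0])  # 0 = sentence B is the true next sentence

outputs = model(**encoding, labels=mlm_labels, next_sentence_label=nsp_label)
print(outputs.loss)  # combined masked-LM + next-sentence-prediction loss
```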

### Usage
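A minimal usage sketch, assuming the model and tokenizer are hosted on the Hugging Face Hub under the `law-ai/InLegalBERT` identifier and load with the standard `transformers` Auto classes:

```python
from transformers import AutoTokenizer, AutoModel

# Assumed Hub identifier for this repository; adjust if the name differs.
tokenizer = AutoTokenizer.from_pretrained("law-ai/InLegalBERT")
model = AutoModel.from_pretrained("law-ai/InLegalBERT")

# Encode an example legal sentence and obtain contextual embeddings.
text = "The appellant filed a writ petition before the High Court."
encoded = tokenizer(text, return_tensors="pt")
outputs = model(**encoded)
embeddings = outputs.last_hidden_state  # shape: (1, sequence_length, hidden_size)
```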

### Citation