tags:
- legal
license: mit
---

### InLegalBERT

Model and tokenizer files for the InLegalBERT model.

### Training Data

For building the pre-training corpus of Indian legal text, we collected a large corpus of case documents from the Indian Supreme Court and many High Courts of India.
These documents were collected from diverse publicly available sources on the Web, such as official websites of these courts (e.g., [the website of the Indian Supreme Court](https://main.sci.gov.in/)), the erstwhile website of the Legal Information Institute of India, the popular legal repository [IndianKanoon](https://www.indiankanoon.org), and so on.
The court cases in our dataset range from 1950 to 2019 and belong to all legal domains, such as Civil, Criminal, and Constitutional.
Additionally, we collected 1,113 Central Government Acts, which are the documents codifying the laws of the country. Each Act is a collection of related laws, called Sections. These 1,113 Acts contain a total of 32,021 Sections.
In total, our dataset contains around 5.4 million Indian legal documents (all in the English language).
The raw text corpus size is around 27 GB.

### Training Objective

This model is initialized with the [LEGAL-BERT-SC model](https://huggingface.co/nlpaueb/legal-bert-base-uncased) from the paper [LEGAL-BERT: The Muppets straight out of Law School](https://aclanthology.org/2020.findings-emnlp.261/), and trained for an additional 300K steps on our data with the MLM and NSP objectives.
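The MLM objective follows BERT's standard token-corruption recipe. As an illustration only (not the authors' training code), here is a minimal sketch of the usual 80/10/10 masking rule; the `mask_for_mlm` helper and the toy vocabulary are hypothetical:

```python
import random

MASK = "[MASK]"
# Hypothetical toy vocabulary, used only for the 10% random-replacement case.
VOCAB = ["law", "court", "act", "section", "petition"]

def mask_for_mlm(tokens, mask_prob=0.15, seed=0):
    """BERT-style MLM masking: each token is selected with probability
    mask_prob; of the selected tokens, 80% become [MASK], 10% become a
    random vocabulary token, and 10% are left unchanged. Returns the
    corrupted sequence and per-position labels (the original token at
    selected positions, None elsewhere)."""
    rng = random.Random(seed)
    out, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok  # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                out[i] = MASK
            elif r < 0.9:
                out[i] = rng.choice(VOCAB)
            # else: keep the original token (but still predict it)
    return out, labels

corrupted, labels = mask_for_mlm("the court held that the act applies".split())
```

The NSP objective additionally pairs each input with either the true next sentence or a random one, and the model classifies which case it is.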

### Usage
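A minimal sketch of loading the model with the Hugging Face `transformers` library. The repository id `law-ai/InLegalBERT` is an assumption; adjust it to the actual model id if it differs:

```python
from transformers import AutoModel, AutoTokenizer

# Hypothetical repository id; replace with the actual model id if it differs.
tokenizer = AutoTokenizer.from_pretrained("law-ai/InLegalBERT")
model = AutoModel.from_pretrained("law-ai/InLegalBERT")

text = "The appellant filed a writ petition before the High Court."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
# last_hidden_state has shape (batch, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
```

The pooled or token-level hidden states can then be fed to a downstream classifier for legal NLP tasks.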
### Citation