Update README.md

### InLegalBERT
Model and tokenizer files for the InLegalBERT model from the paper [Pre-training Transformers on Indian Legal Text](https://arxiv.org/abs/2209.06049).

### Training Data
For building the pre-training corpus of Indian legal text, we collected a large corpus of case documents from the Indian Supreme Court and many High Courts of India.

This model is initialized with the [LEGAL-BERT-SC model](https://huggingface.co/nlpaueb/legal-bert-base-uncased). We further train this model on our data for 300K steps on the Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) tasks.
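
For readers unfamiliar with these objectives: MLM hides some input tokens behind a `[MASK]` placeholder and trains the model to recover them, while NSP trains the model to judge whether one sentence follows another. As a tiny illustration of what an MLM input looks like with this tokenizer (the example sentence is our own, not from the corpus):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("law-ai/InLegalBERT")

# An MLM training example hides tokens behind [MASK] for the model to recover
sentence = f"The court {tokenizer.mask_token} the petition."
print(tokenizer.tokenize(sentence))
# e.g. ['the', 'court', '[MASK]', 'the', 'petition', '.']
```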

### Model Overview
This model uses the same tokenizer as [LegalBERT](https://huggingface.co/nlpaueb/legal-bert-base-uncased).
This model has the same configuration as the [bert-base-uncased model](https://huggingface.co/bert-base-uncased): 12 hidden layers, 768 hidden dimensionality, 12 attention heads, ~110M parameters.
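
To sanity-check these numbers yourself, you can inspect the hosted configuration without downloading the weights; a minimal snippet (the printed fields are standard `transformers` `BertConfig` attributes):

```python
from transformers import AutoConfig

# Fetch only the model configuration (no weights are downloaded)
config = AutoConfig.from_pretrained("law-ai/InLegalBERT")
print(config.num_hidden_layers)    # 12
print(config.hidden_size)          # 768
print(config.num_attention_heads)  # 12
```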

### Usage
Using the model to get embeddings/representations for a piece of text:
```python
from transformers import AutoTokenizer, AutoModel

# Load the InLegalBERT tokenizer and encode the input text
tokenizer = AutoTokenizer.from_pretrained("law-ai/InLegalBERT")
text = "Replace this string with yours"
encoded_input = tokenizer(text, return_tensors="pt")

# Forward pass: one 768-dim contextual vector per input token
model = AutoModel.from_pretrained("law-ai/InLegalBERT")
output = model(**encoded_input)
last_hidden_state = output.last_hidden_state
```
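
`last_hidden_state` has shape `(batch_size, sequence_length, 768)`, i.e. one vector per token. If you need a single fixed-size vector for a whole document, one common recipe (our illustration, not something this README prescribes) is masked mean pooling over the token vectors, continuing from the snippet above:

```python
# Mean-pool token vectors into one 768-dim text embedding,
# using the attention mask to ignore padding positions
mask = encoded_input["attention_mask"].unsqueeze(-1)  # (1, seq_len, 1)
text_embedding = (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(text_embedding.shape)  # torch.Size([1, 768])
```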

### Fine-tuning Results
We have fine-tuned all pre-trained models on 3 legal tasks with Indian datasets:
* Legal Statute Identification ([ILSI Dataset](https://arxiv.org/abs/2112.14731)) [Multi-label Text Classification]: Identifying relevant statutes (law articles) based on the facts of a court case
* Semantic Segmentation ([ISS Dataset](https://arxiv.org/abs/1911.05405)) [Sentence Tagging]: Segmenting the document into 7 functional parts (semantic segments) such as Facts, Arguments, etc.
* Court Judgment Prediction ([ILDC Dataset](https://arxiv.org/abs/2105.13562)) [Binary Text Classification]: Predicting whether the claims/petitions of a court case will be accepted/rejected

InLegalBERT beats LegalBERT as well as all other baselines/variants we have used. For details, see our [paper](https://arxiv.org/abs/2209.06049).
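
As a rough guide to what such fine-tuning looks like in code, here is a minimal sketch for a binary task in the style of court judgment prediction, using the generic `transformers` sequence-classification head; the toy texts, labels, and hyperparameters are our own illustrative assumptions, not the exact setup from the paper:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# InLegalBERT encoder with a freshly initialized binary classification head
tokenizer = AutoTokenizer.from_pretrained("law-ai/InLegalBERT")
model = AutoModelForSequenceClassification.from_pretrained(
    "law-ai/InLegalBERT", num_labels=2
)

# Hypothetical toy batch; replace with real case facts and accept/reject labels
texts = [
    "The petition is devoid of merit and deserves to be dismissed.",
    "The appeal raises a substantial question of law.",
]
labels = torch.tensor([0, 1])
batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")

# One optimization step; the cross-entropy loss is computed inside the model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
```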

### Citation
```
@article{paul-2022-pretraining,
  doi = {10.48550/ARXIV.2209.06049},
  url = {https://arxiv.org/abs/2209.06049},
  author = {Paul, Shounak and Mandal, Arpan and Goyal, Pawan and Ghosh, Saptarshi},
  title = {Pre-training Transformers on Indian Legal Text},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}
```

### About Us