---
library_name: transformers
tags:
- Legal
- court
- prediction
- Arabic
- NLP
datasets:
- mbayan/Arabic-LJP
language:
- ar
base_model:
- meta-llama/Llama-3.1-8B-Instruct
pipeline_tag: text-generation
---
# Arabic Legal Judgment Prediction Dataset
## Overview
This dataset is designed for **Arabic Legal Judgment Prediction (LJP)**, collected and preprocessed from **Saudi commercial court judgments**. It serves as a benchmark for evaluating Large Language Models (LLMs) in the legal domain, particularly in low-resource settings.
The dataset is released as part of our research:
> **Can Large Language Models Predict the Outcome of Judicial Decisions?**
> *Mohamed Bayan Kmainasi, Ali Ezzat Shahroor, and Amani Al-Ghraibah*
> [arXiv:2501.09768](https://arxiv.org/abs/2501.09768)
## Model Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("mbayan/Llama-3.1-8b-ArLJP")
model = AutoModelForCausalLM.from_pretrained("mbayan/Llama-3.1-8b-ArLJP")

# Generate a judgment prediction for an Arabic case description
inputs = tokenizer("وصف القضية هنا", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Dataset Details
- **Size:** 3752 training samples, 538 test samples.
- **Annotations:** 75 diverse Arabic instructions generated using GPT-4o, varying in length and complexity.
- **Tasks Supported:**
- Zero-shot, One-shot, and Fine-tuning evaluation of Arabic legal text understanding.
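To illustrate the zero-shot and one-shot settings, the sketch below assembles prompts from an instruction, an optional demonstration pair, and a case description. The field layout and instruction text here are hypothetical placeholders, not taken from the dataset:

```python
def build_prompt(instruction, case_text, example=None):
    """Assemble a zero-shot (example=None) or one-shot prompt.

    `example` is an optional (case, judgment) pair prepended as a
    demonstration; all strings below are hypothetical placeholders.
    """
    parts = [instruction]
    if example is not None:
        demo_case, demo_judgment = example
        parts.append(f"القضية: {demo_case}\nالحكم: {demo_judgment}")
    parts.append(f"القضية: {case_text}\nالحكم:")
    return "\n\n".join(parts)

# Zero-shot: instruction + target case only
zero_shot = build_prompt("تنبأ بالحكم القضائي للقضية التالية.", "وصف القضية")
# One-shot: one demonstration pair precedes the target case
one_shot = build_prompt("تنبأ بالحكم القضائي للقضية التالية.", "وصف القضية",
                        example=("قضية سابقة", "حكم سابق"))
```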
## Data Structure
The dataset is provided in a structured format and can be loaded with the `datasets` library:
```python
from datasets import load_dataset
dataset = load_dataset("mbayan/Arabic-LJP")
print(dataset)
```
The dataset contains:
- **train**: Training set with 3752 samples
- **test**: Test set with 538 samples
Each sample includes:
- **Input text:** Legal case description
- **Target text:** Judicial decision
## Benchmark Results
We evaluated **LLaMA-based models** on the dataset under different configurations. Below is a summary of our findings:
| **Metric** | **LLaMA-3.2-3B** | **LLaMA-3.1-8B** | **LLaMA-3.2-3B-1S** | **LLaMA-3.2-3B-FT** | **LLaMA-3.1-8B-FT** |
|--------------------------|------------------|------------------|---------------------|---------------------|---------------------|
| **Coherence** | 2.69 | 5.49 | 4.52 | *6.60* | **6.94** |
| **Brevity** | 1.99 | 4.30 | 3.76 | *5.87* | **6.27** |
| **Legal Language** | 3.66 | 6.69 | 5.18 | *7.48* | **7.73** |
| **Faithfulness** | 3.00 | 5.99 | 4.00 | *6.08* | **6.42** |
| **Clarity** | 2.90 | 5.79 | 4.99 | *7.90* | **8.17** |
| **Consistency** | 3.04 | 5.93 | 5.14 | *8.47* | **8.65** |
| **Avg. Qualitative Score**| 3.01 | 5.89 | 4.66 | *7.13* | **7.44** |
| **ROUGE-1** | 0.08 | 0.12 | 0.29 | *0.50* | **0.53** |
| **ROUGE-2** | 0.02 | 0.04 | 0.19 | *0.39* | **0.41** |
| **BLEU** | 0.01 | 0.02 | 0.11 | *0.24* | **0.26** |
| **BERTScore**             | 0.54             | 0.58             | 0.64                | *0.74*              | **0.76**            |
**Caption**: A comparative analysis of performance across different LLaMA models. The model names have been abbreviated for simplicity: **LLaMA-3.2-3B-Instruct** is represented as LLaMA-3.2-3B, **LLaMA-3.1-8B-Instruct** as LLaMA-3.1-8B, **LLaMA-3.2-3B-Instruct-1-Shot** as LLaMA-3.2-3B-1S, **LLaMA-3.2-3B-Instruct-Finetuned** as LLaMA-3.2-3B-FT, and **LLaMA-3.1-8B-Finetuned** as LLaMA-3.1-8B-FT.
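For reference, ROUGE-1 in the table above is a unigram-overlap F1 score between the generated and reference judgments. A minimal self-contained sketch (a simplification, not the exact scorer used in the paper) looks like this:

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between whitespace-tokenized strings.

    Simplified sketch; standard ROUGE implementations add
    normalization (e.g. stemming) on top of this core computation.
    """
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("حكمت المحكمة بالرفض", "حكمت المحكمة بقبول الدعوى"))
```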
### **Key Findings**
- Fine-tuned smaller models (**LLaMA-3.2-3B-FT**) achieve performance **comparable to larger models** (LLaMA-3.1-8B).
- Instruction-tuned models with one-shot prompting (LLaMA-3.2-3B-1S) significantly improve over zero-shot settings.
- Fine-tuning leads to a noticeable boost in **coherence, clarity, and faithfulness** of predictions.
## Usage
To use the dataset in your research, load it as follows:
```python
from datasets import load_dataset
dataset = load_dataset("mbayan/Arabic-LJP")
# Access train and test splits
train_data = dataset["train"]
test_data = dataset["test"]
```
## Repository & Implementation
The full implementation, including preprocessing scripts and model training code, is available in our GitHub repository:
🔗 **[GitHub](https://github.com/MohamedBayan/Arabic-Legal-Judgment-Prediction)**
## Citation
If you use this dataset, please cite our work:
```bibtex
@misc{kmainasi2025largelanguagemodelspredict,
title={Can Large Language Models Predict the Outcome of Judicial Decisions?},
author={Mohamed Bayan Kmainasi and Ali Ezzat Shahroor and Amani Al-Ghraibah},
year={2025},
eprint={2501.09768},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.09768},
}
```