---
library_name: transformers
tags:
- Legal
- court
- prediction
- Arabic
- NLP
datasets:
- mbayan/Arabic-LJP
language:
- ar
base_model:
- meta-llama/Llama-3.1-8B-Instruct
pipeline_tag: text-generation
---
# Arabic Legal Judgment Prediction Dataset
## Overview
This dataset is designed for **Arabic Legal Judgment Prediction (LJP)**, collected and preprocessed from **Saudi commercial court judgments**. It serves as a benchmark for evaluating Large Language Models (LLMs) in the legal domain, particularly in low-resource settings.
The dataset is released as part of our research:
> **Can Large Language Models Predict the Outcome of Judicial Decisions?**
> *Mohamed Bayan Kmainasi, Ali Ezzat Shahroor, and Amani Al-Ghraibah*
> [arXiv:2501.09768](https://arxiv.org/abs/2501.09768)
## Model Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("mbayan/Llama-3.1-8b-ArLJP")
model = AutoModelForCausalLM.from_pretrained("mbayan/Llama-3.1-8b-ArLJP")

# Generate a judgment prediction for an Arabic case description
inputs = tokenizer("وصف القضية هنا", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Dataset Details
- **Size:** 3752 training samples, 538 test samples.
- **Annotations:** 75 diverse Arabic instructions generated using GPT-4o, varying in length and complexity.
- **Tasks Supported:**
- Zero-shot, One-shot, and Fine-tuning evaluation of Arabic legal text understanding.
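To illustrate the zero-shot and one-shot settings, the sketch below assembles prompts from an instruction, an optional demonstration pair, and a case description. The field layout and instruction text here are hypothetical placeholders, not taken from the dataset:

```python
def build_prompt(instruction, case_text, example=None):
    """Assemble a zero-shot (example=None) or one-shot prompt.

    `example` is an optional (case, judgment) pair prepended as a
    demonstration; all strings below are hypothetical placeholders.
    """
    parts = [instruction]
    if example is not None:
        demo_case, demo_judgment = example
        parts.append(f"القضية: {demo_case}\nالحكم: {demo_judgment}")
    parts.append(f"القضية: {case_text}\nالحكم:")
    return "\n\n".join(parts)

# Zero-shot: instruction + target case only
zero_shot = build_prompt("تنبأ بالحكم القضائي للقضية التالية.", "وصف القضية")
# One-shot: one demonstration pair precedes the target case
one_shot = build_prompt("تنبأ بالحكم القضائي للقضية التالية.", "وصف القضية",
                        example=("قضية سابقة", "حكم سابق"))
```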
## Data Structure
The dataset is provided in a structured format and can be loaded with the `datasets` library:
```python
from datasets import load_dataset
dataset = load_dataset("mbayan/Arabic-LJP")
print(dataset)
```
The dataset contains:
- **train**: Training set with 3752 samples
- **test**: Test set with 538 samples
Each sample includes:
- **Input text:** Legal case description
- **Target text:** Judicial decision
## Benchmark Results
We evaluated **LLaMA-based models** on the dataset under different configurations. Below is a summary of our findings:
| **Metric** | **LLaMA-3.2-3B** | **LLaMA-3.1-8B** | **LLaMA-3.2-3B-1S** | **LLaMA-3.2-3B-FT** | **LLaMA-3.1-8B-FT** |
|--------------------------|------------------|------------------|---------------------|---------------------|---------------------|
| **Coherence** | 2.69 | 5.49 | 4.52 | *6.60* | **6.94** |
| **Brevity** | 1.99 | 4.30 | 3.76 | *5.87* | **6.27** |
| **Legal Language** | 3.66 | 6.69 | 5.18 | *7.48* | **7.73** |
| **Faithfulness** | 3.00 | 5.99 | 4.00 | *6.08* | **6.42** |
| **Clarity** | 2.90 | 5.79 | 4.99 | *7.90* | **8.17** |
| **Consistency** | 3.04 | 5.93 | 5.14 | *8.47* | **8.65** |
| **Avg. Qualitative Score**| 3.01 | 5.89 | 4.66 | *7.13* | **7.44** |
| **ROUGE-1** | 0.08 | 0.12 | 0.29 | *0.50* | **0.53** |
| **ROUGE-2** | 0.02 | 0.04 | 0.19 | *0.39* | **0.41** |
| **BLEU** | 0.01 | 0.02 | 0.11 | *0.24* | **0.26** |
| **BERTScore**             | 0.54             | 0.58             | 0.64                | *0.74*              | **0.76**            |
**Caption**: A comparative analysis of performance across different LLaMA models. The model names have been abbreviated for simplicity: **LLaMA-3.2-3B-Instruct** is represented as LLaMA-3.2-3B, **LLaMA-3.1-8B-Instruct** as LLaMA-3.1-8B, **LLaMA-3.2-3B-Instruct-1-Shot** as LLaMA-3.2-3B-1S, **LLaMA-3.2-3B-Instruct-Finetuned** as LLaMA-3.2-3B-FT, and **LLaMA-3.1-8B-Finetuned** as LLaMA-3.1-8B-FT.
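For reference, ROUGE-1 in the table above is a unigram-overlap F1 score between the generated and reference judgments. A minimal self-contained sketch (a simplification, not the exact scorer used in the paper) looks like this:

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between whitespace-tokenized strings.

    Simplified sketch; standard ROUGE implementations add
    normalization (e.g. stemming) on top of this core computation.
    """
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("حكمت المحكمة بالرفض", "حكمت المحكمة بقبول الدعوى"))
```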
### **Key Findings**
- Fine-tuned smaller models (**LLaMA-3.2-3B-FT**) achieve performance **comparable to larger models** (LLaMA-3.1-8B).
- Instruction-tuned models with one-shot prompting (LLaMA-3.2-3B-1S) significantly improve over zero-shot settings.
- Fine-tuning leads to a noticeable boost in **coherence, clarity, and faithfulness** of predictions.
## Usage
To use the dataset in your research, load it as follows:
```python
from datasets import load_dataset
dataset = load_dataset("mbayan/Arabic-LJP")
# Access train and test splits
train_data = dataset["train"]
test_data = dataset["test"]
```
## Repository & Implementation
The full implementation, including preprocessing scripts and model training code, is available in our GitHub repository:
🔗 **[GitHub](https://github.com/MohamedBayan/Arabic-Legal-Judgment-Prediction)**
## Citation
If you use this dataset, please cite our work:
```bibtex
@misc{kmainasi2025largelanguagemodelspredict,
title={Can Large Language Models Predict the Outcome of Judicial Decisions?},
author={Mohamed Bayan Kmainasi and Ali Ezzat Shahroor and Amani Al-Ghraibah},
year={2025},
eprint={2501.09768},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.09768},
}
```