Arabic Legal Judgment Prediction Dataset
Overview
This dataset is designed for Arabic Legal Judgment Prediction (LJP), collected and preprocessed from Saudi commercial court judgments. It serves as a benchmark for evaluating Large Language Models (LLMs) in the legal domain, particularly in low-resource settings.
The dataset is released as part of our research:
Can Large Language Models Predict the Outcome of Judicial Decisions?
Mohamed Bayan Kmainasi, Ali Ezzat Shahroor, and Amani Al-Ghraibah
arXiv:2501.09768
Model Usage
To use the model, load it with the `transformers` library:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("mbayan/Llama-3.1-8b-ArLJP")
model = AutoModelForCausalLM.from_pretrained("mbayan/Llama-3.1-8b-ArLJP")
```
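Once the model and tokenizer are loaded, generation follows the standard `transformers` workflow. This is a minimal sketch; the case text and decoding settings below are placeholders, not the exact prompt format used for the reported results:

```python
# Illustrative only: the case text and decoding settings are placeholders,
# not the exact prompt format used in the paper.
case_description = "..."  # an Arabic legal case description

inputs = tokenizer(case_description, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)

prediction = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(prediction)
```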
Dataset Details
- Size: 3752 training samples, 538 test samples.
- Annotations: 75 diverse Arabic instructions generated using GPT-4o, varying in length and complexity.
- Tasks Supported:
- Zero-shot, one-shot, and fine-tuning evaluation of Arabic legal text understanding (see the prompt-assembly sketch below).
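As a rough illustration of how an instruction and a case description might be combined into zero-shot and one-shot prompts (the instruction wording, field labels, and template below are hypothetical, not the exact format shipped with the dataset):

```python
# Hypothetical prompt assembly; the real instructions were generated with GPT-4o
# and are part of the dataset, so the wording here is illustrative only.
instruction = "بصفتك قاضياً في محكمة تجارية، توقّع الحكم النهائي للقضية التالية."
case_description = "..."  # input text of one sample

# Zero-shot: instruction followed directly by the new case.
zero_shot_prompt = f"{instruction}\n\n{case_description}"

# One-shot: prepend one solved example (case + judgment) before the new case.
example_case, example_judgment = "...", "..."
one_shot_prompt = (
    f"{instruction}\n\n"
    f"القضية: {example_case}\nالحكم: {example_judgment}\n\n"
    f"القضية: {case_description}\nالحكم:"
)
```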
Data Structure
The dataset is provided in a structured format:
```python
from datasets import load_dataset

dataset = load_dataset("mbayan/Arabic-LJP")
print(dataset)
```
The dataset contains:
- train: Training set with 3752 samples
- test: Test set with 538 samples
Each sample includes (see the inspection sketch after this list):
- Input text: Legal case description
- Target text: Judicial decision
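A quick way to inspect one sample and confirm the actual column names (which may differ from the field descriptions above):

```python
from datasets import load_dataset

dataset = load_dataset("mbayan/Arabic-LJP")
print(dataset["train"].column_names)  # check the actual field names

sample = dataset["train"][0]
for key, value in sample.items():
    # Print each field, truncated for readability.
    print(f"{key}: {str(value)[:200]}")
```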
Benchmark Results
We evaluated LLaMA-based models on this dataset under several configurations. A summary of the results is given below:
Metric | LLaMA-3.2-3B | LLaMA-3.1-8B | LLaMA-3.2-3B-1S | LLaMA-3.2-3B-FT | LLaMA-3.1-8B-FT |
---|---|---|---|---|---|
Coherence | 2.69 | 5.49 | 4.52 | 6.60 | 6.94 |
Brevity | 1.99 | 4.30 | 3.76 | 5.87 | 6.27 |
Legal Language | 3.66 | 6.69 | 5.18 | 7.48 | 7.73 |
Faithfulness | 3.00 | 5.99 | 4.00 | 6.08 | 6.42 |
Clarity | 2.90 | 5.79 | 4.99 | 7.90 | 8.17 |
Consistency | 3.04 | 5.93 | 5.14 | 8.47 | 8.65 |
Avg. Qualitative Score | 3.01 | 5.89 | 4.66 | 7.13 | 7.44 |
ROUGE-1 | 0.08 | 0.12 | 0.29 | 0.50 | 0.53 |
ROUGE-2 | 0.02 | 0.04 | 0.19 | 0.39 | 0.41 |
BLEU | 0.01 | 0.02 | 0.11 | 0.24 | 0.26 |
BERTScore | 0.54 | 0.58 | 0.64 | 0.74 | 0.76 |
Caption: A comparative analysis of performance across different LLaMA models. The model names have been abbreviated for simplicity: LLaMA-3.2-3B-Instruct is represented as LLaMA-3.2-3B, LLaMA-3.1-8B-Instruct as LLaMA-3.1-8B, LLaMA-3.2-3B-Instruct-1-Shot as LLaMA-3.2-3B-1S, LLaMA-3.2-3B-Instruct-Finetuned as LLaMA-3.2-3B-FT, and LLaMA-3.1-8B-Finetuned as LLaMA-3.1-8B-FT.
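The ROUGE, BLEU, and BERTScore rows can be reproduced in spirit with the Hugging Face `evaluate` library. The sketch below assumes `predictions` and `references` are parallel lists of generated and gold judgment texts; it is not the paper's exact evaluation script:

```python
import evaluate

# predictions / references: parallel lists of model outputs and gold judgments.
predictions = ["..."]
references = ["..."]

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")
bertscore = evaluate.load("bertscore")

rouge_scores = rouge.compute(predictions=predictions, references=references)
bleu_scores = bleu.compute(predictions=predictions, references=references)
bert_scores = bertscore.compute(predictions=predictions, references=references, lang="ar")

print("ROUGE-1:", rouge_scores["rouge1"])
print("ROUGE-2:", rouge_scores["rouge2"])
print("BLEU:", bleu_scores["bleu"])
print("BERTScore F1:", sum(bert_scores["f1"]) / len(bert_scores["f1"]))
```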
Key Findings
- Fine-tuned smaller models (LLaMA-3.2-3B-FT) achieve performance comparable to larger models (LLaMA-3.1-8B).
- Instruction-tuned models with one-shot prompting (LLaMA-3.2-3B-1S) significantly improve over zero-shot settings.
- Fine-tuning leads to a noticeable boost in coherence, clarity, and faithfulness of predictions.
Usage
To use the dataset in your research, load it as follows:
```python
from datasets import load_dataset

dataset = load_dataset("mbayan/Arabic-LJP")

# Access train and test splits
train_data = dataset["train"]
test_data = dataset["test"]
```
Repository & Implementation
The full implementation, including preprocessing scripts and model training code, is available in our GitHub repository:
🔗 GitHub
Citation
If you use this dataset, please cite our work:
@misc{kmainasi2025largelanguagemodelspredict,
title={Can Large Language Models Predict the Outcome of Judicial Decisions?},
author={Mohamed Bayan Kmainasi and Ali Ezzat Shahroor and Amani Al-Ghraibah},
year={2025},
eprint={2501.09768},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.09768},
}