Arabic Legal Judgment Prediction Dataset

Overview

This dataset is designed for Arabic Legal Judgment Prediction (LJP), collected and preprocessed from Saudi commercial court judgments. It serves as a benchmark for evaluating Large Language Models (LLMs) in the legal domain, particularly in low-resource settings.

The dataset is released as part of our research:

Can Large Language Models Predict the Outcome of Judicial Decisions?
Mohamed Bayan Kmainasi, Ali Ezzat Shahroor, and Amani Al-Ghraibah
arXiv:2501.09768

Model Usage


To load the fine-tuned model released alongside this dataset:
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and the fine-tuned LLaMA-3.1-8B checkpoint from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("mbayan/Llama-3.1-8b-ArLJP")
model = AutoModelForCausalLM.from_pretrained("mbayan/Llama-3.1-8b-ArLJP")
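
A minimal generation sketch follows. The prompt template used during fine-tuning is defined in the accompanying repository, so the placeholder prompt below is an assumption for illustration:

# Placeholder prompt (assumption): in practice, pair one of the dataset's
# instructions with an Arabic case description.
prompt = "<instruction + Arabic legal case description>"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))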

Dataset Details

  • Size: 3752 training samples, 538 test samples.
  • Annotations: 75 diverse Arabic instructions generated using GPT-4o, varying in length and complexity.
  • Tasks Supported:
    • Zero-shot, one-shot, and fine-tuning evaluation of Arabic legal text understanding (see the prompt sketch below).
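
As a rough illustration of the zero-shot and one-shot settings, here is a minimal prompt-assembly sketch. The function names and the template are assumptions for illustration, not the exact format used in the paper:

def build_zero_shot_prompt(instruction, case_text):
    # Instruction followed by the case whose judgment is to be predicted.
    return f"{instruction}\n\n{case_text}"

def build_one_shot_prompt(instruction, demo_case, demo_judgment, case_text):
    # One solved example (case + its judgment) precedes the query case.
    return f"{instruction}\n\n{demo_case}\n{demo_judgment}\n\n{case_text}"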

Data Structure

The dataset is hosted on the Hugging Face Hub and can be loaded directly with the datasets library:

from datasets import load_dataset

dataset = load_dataset("mbayan/Arabic-LJP")
print(dataset)
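# Expected output (schematic; the feature names may differ):
# DatasetDict({
#     train: Dataset({num_rows: 3752})
#     test: Dataset({num_rows: 538})
# })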

The dataset contains:

  • train: Training set with 3752 samples
  • test: Test set with 538 samples

Each sample includes:

  • Input text: Legal case description
  • Target text: Judicial decision (the outcome to be predicted; see the inspection snippet below)
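
Since the exact column names are not documented above, a quick inspection of one sample is a good first step (any field names used elsewhere on this card are assumptions):

sample = dataset["train"][0]

# List the actual column names before relying on any of them.
print(sample.keys())
print(sample)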

Benchmark Results

We evaluated several LLaMA-based models on the dataset under different configurations. The table below summarizes our findings:

Metric                 | LLaMA-3.2-3B | LLaMA-3.1-8B | LLaMA-3.2-3B-1S | LLaMA-3.2-3B-FT | LLaMA-3.1-8B-FT
-----------------------|--------------|--------------|-----------------|-----------------|----------------
Coherence              | 2.69         | 5.49         | 4.52            | 6.60            | 6.94
Brevity                | 1.99         | 4.30         | 3.76            | 5.87            | 6.27
Legal Language         | 3.66         | 6.69         | 5.18            | 7.48            | 7.73
Faithfulness           | 3.00         | 5.99         | 4.00            | 6.08            | 6.42
Clarity                | 2.90         | 5.79         | 4.99            | 7.90            | 8.17
Consistency            | 3.04         | 5.93         | 5.14            | 8.47            | 8.65
Avg. Qualitative Score | 3.01         | 5.89         | 4.66            | 7.13            | 7.44
ROUGE-1                | 0.08         | 0.12         | 0.29            | 0.50            | 0.53
ROUGE-2                | 0.02         | 0.04         | 0.19            | 0.39            | 0.41
BLEU                   | 0.01         | 0.02         | 0.11            | 0.24            | 0.26
BERTScore              | 0.54         | 0.58         | 0.64            | 0.74            | 0.76

Caption: A comparative analysis of performance across different LLaMA models. The model names have been abbreviated for simplicity: LLaMA-3.2-3B-Instruct is represented as LLaMA-3.2-3B, LLaMA-3.1-8B-Instruct as LLaMA-3.1-8B, LLaMA-3.2-3B-Instruct-1-Shot as LLaMA-3.2-3B-1S, LLaMA-3.2-3B-Instruct-Finetuned as LLaMA-3.2-3B-FT, and LLaMA-3.1-8B-Finetuned as LLaMA-3.1-8B-FT.
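
The automatic metrics above (ROUGE, BLEU, BERTScore) can be computed with the evaluate library. A minimal sketch, assuming predictions and references are parallel lists of Arabic strings; the metric settings (e.g., the BERTScore backbone selected by lang="ar") are assumptions, so results may not match the table exactly:

import evaluate

predictions = ["..."]  # model outputs (placeholders)
references = ["..."]   # gold judgments (placeholders)

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")
bertscore = evaluate.load("bertscore")

print(rouge.compute(predictions=predictions, references=references))
print(bleu.compute(predictions=predictions, references=[[r] for r in references]))
print(bertscore.compute(predictions=predictions, references=references, lang="ar"))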

Key Findings

  • Fine-tuned smaller models (LLaMA-3.2-3B-FT) achieve performance comparable to the larger fine-tuned model (LLaMA-3.1-8B-FT).
  • Instruction-tuned models with one-shot prompting (LLaMA-3.2-3B-1S) significantly improve over zero-shot settings.
  • Fine-tuning leads to a noticeable boost in coherence, clarity, and faithfulness of predictions.

Usage

To use the dataset in your research, load it as follows:

from datasets import load_dataset

dataset = load_dataset("mbayan/Arabic-LJP")

# Access train and test splits
train_data = dataset["train"]
test_data = dataset["test"]
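
# Quick sanity check of the split sizes (expected: 3752 and 538)
print(len(train_data), len(test_data))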

Repository & Implementation

The full implementation, including preprocessing scripts and model training code, is available in our GitHub repository:

🔗 GitHub

Citation

If you use this dataset, please cite our work:

@misc{kmainasi2025largelanguagemodelspredict,
  title={Can Large Language Models Predict the Outcome of Judicial Decisions?}, 
  author={Mohamed Bayan Kmainasi and Ali Ezzat Shahroor and Amani Al-Ghraibah},
  year={2025},
  eprint={2501.09768},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2501.09768}, 
}