---
library_name: transformers
tags:
- Legal
- court
- prediction
- Arabic
- NLP
datasets:
- mbayan/Arabic-LJP
language:
- ar
base_model:
- meta-llama/Llama-3.1-8B-Instruct
pipeline_tag: text-generation
---

# Arabic Legal Judgment Prediction Dataset

## Overview

This dataset is designed for **Arabic Legal Judgment Prediction (LJP)** and was collected and preprocessed from **Saudi commercial court judgments**. It serves as a benchmark for evaluating Large Language Models (LLMs) in the legal domain, particularly in low-resource settings.

The dataset is released as part of our research:

> **Can Large Language Models Predict the Outcome of Judicial Decisions?**
> *Mohamed Bayan Kmainasi, Ali Ezzat Shahroor, and Amani Al-Ghraibah*
> [arXiv:2501.09768](https://arxiv.org/abs/2501.09768)

## Model Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the fine-tuned model and its tokenizer
tokenizer = AutoTokenizer.from_pretrained("mbayan/Llama-3.1-8b-ArLJP")
model = AutoModelForCausalLM.from_pretrained("mbayan/Llama-3.1-8b-ArLJP")
```

## Dataset Details

- **Size:** 3,752 training samples and 538 test samples.
- **Annotations:** 75 diverse Arabic instructions generated with GPT-4o, varying in length and complexity.
- **Tasks supported:** zero-shot, one-shot, and fine-tuning evaluation of Arabic legal text understanding.

## Data Structure

The dataset is provided in a structured format:

```python
from datasets import load_dataset

dataset = load_dataset("mbayan/Arabic-LJP")
print(dataset)
```

The dataset contains two splits:

- **train**: training set with 3,752 samples
- **test**: test set with 538 samples

Each sample includes:

- **Input text:** the legal case description
- **Target text:** the judicial decision

## Benchmark Results

We evaluated the dataset using **LLaMA-based models** in different configurations.
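As a sketch of how such an evaluation can be set up, the snippet below pairs an instruction with a case description to build a prediction prompt for one sample. The dictionary keys and the Arabic instruction wording here are illustrative only; they are not the exact field names or one of the 75 GPT-4o-generated instructions from the dataset.

```python
def build_prompt(instruction: str, case_description: str) -> str:
    """Combine an evaluation instruction with a legal case description."""
    return f"{instruction}\n\n{case_description}"

# Illustrative sample structure: an input (case description) and a target (ruling).
sample = {
    "input": "وصف القضية التجارية ...",   # legal case description
    "target": "نص الحكم القضائي ...",     # judicial decision
}

# Illustrative instruction: "Predict the court's ruling based on the following case description:"
instruction = "تنبأ بحكم المحكمة بناءً على وصف القضية التالي:"

prompt = build_prompt(instruction, sample["input"])
print(prompt)
```

The resulting string can then be passed to the tokenizer and model shown in the Model Usage section.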
Below is a summary of our findings (best results in **bold**, second-best in *italics*):

| **Metric**                 | **LLaMA-3.2-3B** | **LLaMA-3.1-8B** | **LLaMA-3.2-3B-1S** | **LLaMA-3.2-3B-FT** | **LLaMA-3.1-8B-FT** |
|----------------------------|------------------|------------------|---------------------|---------------------|---------------------|
| **Coherence**              | 2.69             | 5.49             | 4.52                | *6.60*              | **6.94**            |
| **Brevity**                | 1.99             | 4.30             | 3.76                | *5.87*              | **6.27**            |
| **Legal Language**         | 3.66             | 6.69             | 5.18                | *7.48*              | **7.73**            |
| **Faithfulness**           | 3.00             | 5.99             | 4.00                | *6.08*              | **6.42**            |
| **Clarity**                | 2.90             | 5.79             | 4.99                | *7.90*              | **8.17**            |
| **Consistency**            | 3.04             | 5.93             | 5.14                | *8.47*              | **8.65**            |
| **Avg. Qualitative Score** | 3.01             | 5.89             | 4.66                | *7.13*              | **7.44**            |
| **ROUGE-1**                | 0.08             | 0.12             | 0.29                | *0.50*              | **0.53**            |
| **ROUGE-2**                | 0.02             | 0.04             | 0.19                | *0.39*              | **0.41**            |
| **BLEU**                   | 0.01             | 0.02             | 0.11                | *0.24*              | **0.26**            |
| **BERT**                   | 0.54             | 0.58             | 0.64                | *0.74*              | **0.76**            |

**Caption**: A comparative analysis of performance across different LLaMA models. The model names have been abbreviated for simplicity: **LLaMA-3.2-3B-Instruct** is represented as LLaMA-3.2-3B, **LLaMA-3.1-8B-Instruct** as LLaMA-3.1-8B, **LLaMA-3.2-3B-Instruct-1-Shot** as LLaMA-3.2-3B-1S, **LLaMA-3.2-3B-Instruct-Finetuned** as LLaMA-3.2-3B-FT, and **LLaMA-3.1-8B-Finetuned** as LLaMA-3.1-8B-FT.

### **Key Findings**

- Fine-tuned smaller models (**LLaMA-3.2-3B-FT**) achieve performance **comparable to larger models** (LLaMA-3.1-8B).
- Instruction-tuned models with one-shot prompting (LLaMA-3.2-3B-1S) improve significantly over zero-shot settings.
- Fine-tuning yields a noticeable boost in the **coherence, clarity, and faithfulness** of predictions.
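To make the lexical-overlap scores above concrete, here is a minimal ROUGE-1 F1 computation in pure Python. This is a simplified sketch for illustration only: the reported benchmark numbers come from a full evaluation pipeline, and production evaluations typically use a dedicated library (e.g. `rouge-score`) with stemming and normalization that this version omits.

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap between prediction and reference."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Count unigrams that appear in both texts (clipped by multiplicity).
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# 3 shared unigrams out of 5 predicted and 4 reference tokens -> F1 = 2/3
print(rouge1_f1("the court ruled in favor", "the court ruled against"))
```

The same token-overlap idea extends to ROUGE-2 by counting bigrams instead of unigrams.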
## Usage

To use the dataset in your research, load it as follows:

```python
from datasets import load_dataset

dataset = load_dataset("mbayan/Arabic-LJP")

# Access the train and test splits
train_data = dataset["train"]
test_data = dataset["test"]
```

## Repository & Implementation

The full implementation, including preprocessing scripts and model training code, is available in our GitHub repository:

🔗 **[GitHub](https://github.com/MohamedBayan/Arabic-Legal-Judgment-Prediction)**

## Citation

If you use this dataset, please cite our work:

```bibtex
@misc{kmainasi2025largelanguagemodelspredict,
      title={Can Large Language Models Predict the Outcome of Judicial Decisions?},
      author={Mohamed Bayan Kmainasi and Ali Ezzat Shahroor and Amani Al-Ghraibah},
      year={2025},
      eprint={2501.09768},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.09768},
}
```