---
library_name: transformers
datasets:
  - mlabonne/orpo-dpo-mix-40k
language:
  - en
base_model:
  - meta-llama/Llama-3.2-1B-Instruct
---

Model Card for ORPO-Tuned Llama-3.2-1B-Instruct

Model Details

  • This model is a fine-tuned version of the meta-llama/Llama-3.2-1B-Instruct base model, adapted using the ORPO (Odds Ratio Preference Optimization) technique.
  • Base Model: It builds upon Llama-3.2-1B-Instruct, a 1-billion-parameter instruction-following language model.
  • Fine-Tuning Technique: The model was fine-tuned using ORPO, which combines supervised fine-tuning with preference optimization in a single objective.
  • Training Data: It was trained on the mlabonne/orpo-dpo-mix-40k dataset, which contains 44,245 examples of prompts, chosen answers, and rejected answers.
  • Purpose: The model is designed to generate responses that are better aligned with human preferences while maintaining the general knowledge and capabilities of the base Llama 3.2 model.
  • Efficient Fine-Tuning: LoRA (Low-Rank Adaptation) was used for efficient adaptation, allowing faster training and smaller storage requirements.
  • Capabilities: The model should follow instructions and generate responses that are more in line with human preferences than the base model (see the loading sketch below).
  • Evaluation: The model's performance was evaluated on the HellaSwag benchmark.
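
The model can be loaded like any other 🤗 Transformers causal LM. The snippet below is a minimal sketch, not the card author's exact code: the repo id is hypothetical (substitute this repository's actual id), and if only LoRA adapters were uploaded rather than merged weights, load them with peft's PeftModel instead.

```python
# Minimal inference sketch; the repo id below is hypothetical -- replace it
# with the actual Hub id of this fine-tuned model.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "rshacter/llama-3.2-1b-orpo"  # hypothetical
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Format the prompt with the Llama 3.2 chat template.
messages = [{"role": "user", "content": "Explain LoRA in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```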

Model Description

This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.

Model Sources [optional]

  • Course assignment: https://uplimit.com/course/open-source-llms/session/session_clu1q3j6f016d128r2zxe3uyj/assignment/assignment_clyvnyyjh019h199337oef4ur
  • Training notebook: https://uplimit.com/ugc-assets/course/course_clmz6fh2a00aa12bqdtjv6ygs/assets/1728565337395-85hdx93s03d0v9bd8j1nnxfjylyty2/uplimitopensourcellmsoctoberweekone.ipynb

Uses

Hands-on learning: Finetuning LLMs

Direct Use

Learning exercise for an introductory LLM fine-tuning course (see Model Sources above).

Downstream Use [optional]

This model is designed for tasks requiring improved alignment with human preferences, such as:

  • Chatbots
  • Question-answering systems
  • General text generation with enhanced preference alignment

Out-of-Scope Use

This model should not yet be deployed in real-world applications; further fine-tuning and evaluation are required.

Bias, Risks, and Limitations

  • Performance may vary on tasks outside the training distribution
  • May inherit biases present in the base model and training data
  • Limited to 1B parameters, which may impact performance on complex tasks

Recommendations

  • Users should be aware of potential biases in model outputs
  • Not suitable for critical decision-making without human oversight
  • May generate plausible-sounding but incorrect information

Training Details

Training Data

The model was trained on the mlabonne/orpo-dpo-mix-40k dataset.

This dataset is designed for ORPO (Odds Ratio Preference Optimization) or DPO (Direct Preference Optimization) training of language models.

  • It contains 44,245 examples in the training split.
  • Each example includes a prompt, a chosen answer, and a rejected answer.
  • It combines several high-quality preference (DPO) datasets (see the loading sketch below).
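
A minimal sketch for inspecting the dataset with the 🤗 Datasets library; the size printed should match the figure above, and the exact column names are best read from the dataset itself rather than assumed.

```python
# Minimal sketch: load the preference dataset and look at one example.
from datasets import load_dataset

ds = load_dataset("mlabonne/orpo-dpo-mix-40k", split="train")
print(len(ds))          # expected to match the 44,245 examples cited above
print(ds.column_names)  # prompt / chosen / rejected fields
print(ds[0])            # one preference pair
```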

Training Procedure

This model was fine-tuned with the ORPO (Odds Ratio Preference Optimization) technique on the meta-llama/Llama-3.2-1B-Instruct base model; a training sketch using the listed settings follows the hyperparameters below.

  • Base Model: meta-llama/Llama-3.2-1B-Instruct
  • Training Technique: ORPO (Odds Ratio Preference Optimization)
  • Efficient Fine-tuning Method: LoRA (Low-Rank Adaptation)
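
For reference, this is the objective the ORPO paper (Hong et al., 2024) defines, matching the description above: the usual supervised cross-entropy loss on the chosen answer plus a weighted odds-ratio term that pushes the odds of the chosen answer above the odds of the rejected one. The weighting coefficient λ used for this particular run is not stated in this card.

```latex
% ORPO objective (sketch). y_w = chosen answer, y_l = rejected answer,
% P_theta(y|x) = length-normalized likelihood of y given prompt x.
\begin{aligned}
\operatorname{odds}_\theta(y \mid x) &= \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)} \\
\mathcal{L}_{\mathrm{OR}} &= -\log \sigma\!\left(\log \frac{\operatorname{odds}_\theta(y_w \mid x)}{\operatorname{odds}_\theta(y_l \mid x)}\right) \\
\mathcal{L}_{\mathrm{ORPO}} &= \mathbb{E}_{(x,\, y_w,\, y_l)}\big[\, \mathcal{L}_{\mathrm{SFT}} + \lambda \cdot \mathcal{L}_{\mathrm{OR}} \,\big]
\end{aligned}
```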

Preprocessing [optional]

[More Information Needed]

Training Hyperparameters

  • Learning Rate: 2e-5
  • Batch Size: 4
  • Gradient Accumulation Steps: 4
  • Training Steps: 500
  • Warmup Steps: 20
  • LoRA Rank: 16
  • LoRA Alpha: 32
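
These settings map naturally onto TRL's ORPOTrainer. The sketch below is a reconstruction from the listed hyperparameters, not the actual training script (the notebook under Model Sources is authoritative). It assumes a TRL release in which ORPOTrainer accepts a `tokenizer` argument (roughly v0.9–0.11); the LoRA target modules and the ORPO weighting coefficient (`beta` in ORPOConfig) are left at library defaults because the card does not state them.

```python
# Hedged training sketch reconstructed from the hyperparameters above.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

base_id = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

dataset = load_dataset("mlabonne/orpo-dpo-mix-40k", split="train")

# LoRA rank/alpha taken from the list above; target modules left to peft defaults.
peft_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

args = ORPOConfig(
    output_dir="llama-3.2-1b-orpo",   # hypothetical output path
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    max_steps=500,
    warmup_steps=20,
)

trainer = ORPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```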

Speeds, Sizes, Times [optional]

[More Information Needed]

Evaluation

The model was evaluated on the HellaSwag benchmark in a zero-shot setting (a reproduction sketch follows the interpretation below). Results:

| Tasks     | Version | Filter | n-shot | Metric   | Value  | Stderr   |
|-----------|---------|--------|--------|----------|--------|----------|
| hellaswag | 1       | none   | 0      | acc      | 0.4516 | ± 0.0050 |
|           |         | none   | 0      | acc_norm | 0.6139 | ± 0.0049 |

Interpretation:

  • Performance Level: The model achieves a raw accuracy of 45.16% and a normalized accuracy of 61.39% on the HellaSwag task.
  • Confidence: The small standard errors (about 0.5% for both metrics) indicate that these results are fairly precise.
  • Improvement over Random: Given that HellaSwag typically has 4 choices per question, a random baseline would achieve 25% accuracy. This model performs significantly better than random.
  • Normalized vs. Raw Accuracy: The higher normalized accuracy (61.39% vs. 45.16%) comes from acc_norm, which normalizes each candidate ending's log-likelihood by its length, removing the bias toward shorter answers.
  • Room for Improvement: While the performance is well above random, there's still significant room for improvement to reach human-level performance (which is typically above 95% on HellaSwag).
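
The results table has the format of lm-evaluation-harness output. A minimal reproduction sketch using its Python API (assuming v0.4+ and, again, a hypothetical repo id):

```python
# Re-run the zero-shot HellaSwag evaluation with lm-evaluation-harness (v0.4+).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=rshacter/llama-3.2-1b-orpo",  # hypothetical repo id
    tasks=["hellaswag"],
    num_fewshot=0,
)
print(results["results"]["hellaswag"])  # acc, acc_norm and their standard errors
```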

Summary

  • Base Model: meta-llama/Llama-3.2-1B-Instruct
  • Model Type: Causal Language Model
  • Language: English

Intended Use

This model is designed for tasks requiring improved alignment with human preferences, such as:

  • Chatbots
  • Question-answering systems
  • General text generation with enhanced preference alignment

Training Data

  • Dataset: mlabonne/orpo-dpo-mix-40k
  • Size: 44,245 examples
  • Content: Prompts, chosen answers, and rejected answers

Task: HellaSwag

  • This is a benchmark task designed to evaluate a model's commonsense reasoning and ability to complete scenarios logically.
  • No specific filtering was applied to the test set.
  • The evaluation was done in a zero-shot setting, where the model didn't receive any examples before making predictions.


Environmental Impact

  • Hardware Type: A100
  • Hours used: [More Information Needed]
  • Cloud Provider: Google Colab
  • Compute Region: Sacramento, CA, US
  • Framework: PyTorch

Technical Specifications [optional]

Hardware: A100 GPU

Model Card Author

Ruth Shacterman

Model Card Contact

[More Information Needed]