---
library_name: transformers
datasets:
- mlabonne/orpo-dpo-mix-40k
language:
- en
base_model:
- meta-llama/Llama-3.2-1B-Instruct
---

# Model Card for ORPO-Tuned Llama-3.2-1B-Instruct

<!-- Provide a quick summary of what the model is/does. -->
An ORPO-tuned version of meta-llama/Llama-3.2-1B-Instruct, fine-tuned with LoRA on the mlabonne/orpo-dpo-mix-40k preference dataset.


## Model Details
- This model is a fine-tuned version of the meta-llama/Llama-3.2-1B-Instruct base model, adapted using the ORPO (Odds Ratio Preference Optimization) technique.
- Base Model: meta-llama/Llama-3.2-1B-Instruct, a 1-billion-parameter instruction-following language model.
- Fine-Tuning Technique: ORPO, which combines supervised fine-tuning with preference optimization in a single objective.
- Training Data: The mlabonne/orpo-dpo-mix-40k dataset, containing 44,245 examples, each with a prompt, a chosen answer, and a rejected answer.
- Purpose: Generate responses that are better aligned with human preferences while retaining the general knowledge and capabilities of the base Llama 3.2 model.
- Efficient Fine-Tuning: LoRA (Low-Rank Adaptation) was used for adaptation, allowing faster training and smaller storage requirements.
- Capabilities: The model should follow instructions and generate responses more in line with human preferences than the base model.
- Evaluation: Performance was evaluated on the HellaSwag benchmark (see Evaluation below).


### Model Description

<!-- Provide a longer summary of what this model is. -->

This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.


### Model Sources [optional]

<!-- Provide the basic links for the model. -->
- Course assignment: https://uplimit.com/course/open-source-llms/session/session_clu1q3j6f016d128r2zxe3uyj/assignment/assignment_clyvnyyjh019h199337oef4ur
- Notebook: https://uplimit.com/ugc-assets/course/course_clmz6fh2a00aa12bqdtjv6ygs/assets/1728565337395-85hdx93s03d0v9bd8j1nnxfjylyty2/uplimitopensourcellmsoctoberweekone.ipynb


## Uses


<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
Hands-on learning: Finetuning LLMs

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
Learning exercise for an introductory course on fine-tuning LLMs.

### Downstream Use [optional]

This model is designed for tasks requiring improved alignment with human preferences, such as:
- Chatbots
- Question-answering systems
- General text generation with enhanced preference alignment

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
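A minimal inference sketch with 🤗 Transformers is shown below. The repository id is an illustrative placeholder (this card does not state the final Hub id), and the generation settings are assumptions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Illustrative repository id; replace with the actual Hub id of this fine-tune.
model_id = "your-username/llama-3.2-1b-instruct-orpo"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Build a chat-formatted prompt using the Llama 3.2 chat template.
messages = [{"role": "user", "content": "Explain LoRA fine-tuning in two sentences."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

output = generator(prompt, max_new_tokens=128, do_sample=True, temperature=0.7)
print(output[0]["generated_text"])
```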

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
This model should not yet be deployed in real-world applications; further fine-tuning and evaluation are required.


## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->
- Performance may vary on tasks outside the training distribution
- May inherit biases present in the base model and training data
- Limited to 1B parameters, which may impact performance on complex tasks

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

- Users should be aware of potential biases in model outputs
- Not suitable for critical decision-making without human oversight
- May generate plausible-sounding but incorrect information


## Training Details


### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
The model was trained on the [mlabonne/orpo-dpo-mix-40k](https://huggingface.co/datasets/mlabonne/orpo-dpo-mix-40k) dataset.

This dataset is designed for ORPO (Odds Ratio Preference Optimization) or DPO (Direct Preference Optimization) training of language models.
* It contains 44,245 examples in the training split.
* Includes prompts, chosen answers, and rejected answers for each sample.
* Combines various high-quality DPO datasets.
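A short sketch for loading and inspecting the dataset with 🤗 Datasets (the exact column names are an assumption; see the dataset card for the authoritative schema):

```python
from datasets import load_dataset

# Load the preference dataset used for ORPO training.
dataset = load_dataset("mlabonne/orpo-dpo-mix-40k", split="train")

print(len(dataset))          # 44,245 examples according to this card
print(dataset.column_names)  # expected to include prompt / chosen / rejected fields
```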

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
This model was fine-tuned using the ORPO (Odds Ratio Preference Optimization) technique on the meta-llama/Llama-3.2-1B-Instruct base model.

- Base Model: meta-llama/Llama-3.2-1B-Instruct
- Training Technique: ORPO (Odds Ratio Preference Optimization)
- Efficient Fine-tuning Method: LoRA (Low-Rank Adaptation)
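As a rough sketch of the idea (paraphrasing the ORPO paper; the notation below is not taken from this card), ORPO adds an odds-ratio term to the supervised fine-tuning loss so that the chosen answer \(y_w\) is preferred over the rejected answer \(y_l\) for a prompt \(x\):

$$
\mathcal{L}_{\text{ORPO}} = \mathcal{L}_{\text{SFT}} + \lambda \, \mathcal{L}_{\text{OR}}, \qquad
\mathcal{L}_{\text{OR}} = -\log \sigma\!\left(\log \frac{\text{odds}_\theta(y_w \mid x)}{\text{odds}_\theta(y_l \mid x)}\right), \qquad
\text{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}
$$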

#### Preprocessing [optional]

[More Information Needed]


#### Training Hyperparameters

- Learning Rate: 2e-5
- Batch Size: 4
- Gradient Accumulation Steps: 4
- Training Steps: 500
- Warmup Steps: 20
- LoRA Rank: 16
- LoRA Alpha: 32
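
The following is a minimal sketch of how this procedure and the hyperparameters above might be expressed with TRL's `ORPOTrainer` and PEFT's LoRA support. The base model, dataset, and listed hyperparameters come from this card; the output path, LoRA dropout, `beta`, logging settings, and the exact TRL API version are illustrative assumptions.

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

base_model = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Preference dataset with prompt / chosen / rejected pairs (per this card).
dataset = load_dataset("mlabonne/orpo-dpo-mix-40k", split="train")

# LoRA settings: rank and alpha are from this card; dropout is an assumption.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Hyperparameters from this card; output_dir, beta, and logging_steps are illustrative.
training_args = ORPOConfig(
    output_dir="llama-3.2-1b-instruct-orpo",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    max_steps=500,
    warmup_steps=20,
    beta=0.1,
    logging_steps=10,
)

trainer = ORPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,  # older TRL releases use tokenizer=tokenizer
    peft_config=peft_config,
)
trainer.train()
```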


#### Speeds, Sizes, Times [optional]

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

[More Information Needed]

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->
The model was evaluated on the HellaSwag benchmark in a zero-shot setting.

Results:


|  Tasks  |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|---------|------:|------|-----:|--------|---|-----:|---|-----:|
|hellaswag|      1|none  |     0|acc     |↑  |0.4516|±  |0.0050|
|         |       |none  |     0|acc_norm|↑  |0.6139|±  |0.0049|

Interpretation:
- Performance Level: The model achieves a raw accuracy of 45.16% and a normalized accuracy of 61.39% on the HellaSwag task.
- Confidence: The small standard errors (about 0.5% for both metrics) indicate that these results are fairly precise.
- Improvement over Random: Given that HellaSwag typically has 4 choices per question, a random baseline would achieve 25% accuracy. This model performs significantly better than random.
- Normalized vs. Raw Accuracy: The higher normalized accuracy (61.39% vs. 45.16%) reflects the length-normalized scoring used for acc_norm, which reduces the penalty on longer answer continuations.
- Room for Improvement: While the performance is well above random, there's still significant room for improvement to reach human-level performance (which is typically above 95% on HellaSwag).
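
A sketch of reproducing this zero-shot evaluation with EleutherAI's lm-evaluation-harness (the model id is an illustrative placeholder):

```python
import lm_eval

# Zero-shot HellaSwag evaluation, matching the table above.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your-username/llama-3.2-1b-instruct-orpo",  # illustrative id
    tasks=["hellaswag"],
    num_fewshot=0,
)
print(results["results"]["hellaswag"])  # reports the acc / acc_norm metrics
```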



#### Summary

- Base Model: meta-llama/Llama-3.2-1B-Instruct
- Model Type: Causal Language Model
- Language: English

Intended Use
- This model is designed for tasks requiring improved alignment with human preferences, such as:
- Chatbots
- Question-answering systems
- General text generation with enhanced preference alignment

Training Data
- Dataset: mlabonne/orpo-dpo-mix-40k
- Size: 44,245 examples
- Content: Prompts, chosen answers, and rejected answers


Task: HellaSwag
- This is a benchmark task designed to evaluate a model's commonsense reasoning and ability to complete scenarios logically.
- No specific filtering was applied to the test set.
- The evaluation was done in a zero-shot setting, where the model didn't receive any examples before making predictions.



## Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

- **Hardware Type:** A100
- **Hours used:** Not recorded
- **Cloud Provider:** Google Colab
- **Compute Region:** Sacramento, CA, US
- **Framework:** PyTorch

## Technical Specifications [optional]

Hardware: A100 GPU


## Model Card Author

Ruth Shacterman

## Model Card Contact

[More Information Needed]