|
---
library_name: transformers
datasets:
- mlabonne/orpo-dpo-mix-40k
language:
- en
base_model:
- meta-llama/Llama-3.2-1B-Instruct
---
|
|
|
# Model Card for ORPO-Tuned Llama-3.2-1B-Instruct

<!-- Provide a quick summary of what the model is/does. -->

ORPO-Tuned Llama-3.2-1B-Instruct
|
|
|
|
|
## Model Details

- This model is a fine-tuned version of the meta-llama/Llama-3.2-1B-Instruct base model, adapted using the ORPO (Odds Ratio Preference Optimization) technique.

- Base Model: It builds upon meta-llama/Llama-3.2-1B-Instruct, a 1-billion-parameter instruction-following language model.

- Fine-Tuning Technique: The model was fine-tuned using ORPO, which combines supervised fine-tuning and preference optimization in a single objective.

- Training Data: It was trained on the mlabonne/orpo-dpo-mix-40k dataset, containing 44,245 examples of prompts, chosen answers, and rejected answers.

- Purpose: The model is designed to generate responses that are better aligned with human preferences while maintaining the general knowledge and capabilities of the base Llama 3.2 model.

- Efficient Fine-Tuning: LoRA (Low-Rank Adaptation) was used for efficient adaptation, allowing for faster training and smaller storage requirements.

- Capabilities: The model should follow instructions and generate responses that are more in line with human preferences than those of the base model.

- Evaluation: The model's performance was evaluated on the HellaSwag benchmark (see Evaluation below).
|
|
|
|
|
### Model Description

<!-- Provide a longer summary of what this model is. -->

This is the model card of a 🤗 transformers model that has been pushed to the Hub. This model card has been automatically generated.
|
|
|
|
|
### Model Sources [optional]

<!-- Provide the basic links for the model. -->

- Course assignment: https://uplimit.com/course/open-source-llms/session/session_clu1q3j6f016d128r2zxe3uyj/assignment/assignment_clyvnyyjh019h199337oef4ur

- Training notebook: https://uplimit.com/ugc-assets/course/course_clmz6fh2a00aa12bqdtjv6ygs/assets/1728565337395-85hdx93s03d0v9bd8j1nnxfjylyty2/uplimitopensourcellmsoctoberweekone.ipynb
|
|
|
|
|
## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

Hands-on learning: fine-tuning LLMs.

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

Educational use as part of an introductory course on fine-tuning LLMs.
|
|
|
### Downstream Use [optional]

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

This model is designed for tasks requiring improved alignment with human preferences, such as:

- Chatbots

- Question-answering systems

- General text generation with enhanced preference alignment
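
The card does not state the published Hub path for this fine-tune, so the repository ID below is a placeholder; a minimal inference sketch with 🤗 Transformers might look like this:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo ID -- substitute the actual Hub path of this fine-tuned model.
model_id = "your-username/Llama-3.2-1B-Instruct-ORPO"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Build a chat-formatted prompt and generate a response.
messages = [{"role": "user", "content": "Explain preference optimization in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```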
|
|
|
### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

This model should not yet be deployed in real-world applications; further fine-tuning and evaluation are required.
|
|
|
|
|
## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

- Performance may vary on tasks outside the training distribution

- May inherit biases present in the base model and training data

- Limited to 1B parameters, which may impact performance on complex tasks
|
|
|
### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

- Users should be aware of potential biases in model outputs

- Not suitable for critical decision-making without human oversight

- May generate plausible-sounding but incorrect information
|
|
|
|
|
## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

For training data, the model used mlabonne/orpo-dpo-mix-40k (a loading sketch follows the bullets below).

This dataset is designed for ORPO (Odds Ratio Preference Optimization) or DPO (Direct Preference Optimization) training of language models.

* It contains 44,245 examples in the training split.

* Includes prompts, chosen answers, and rejected answers for each sample.

* Combines various high-quality DPO datasets.
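
A minimal sketch for loading and inspecting the dataset with the 🤗 datasets library (column names are read from the dataset itself rather than assumed here):

```python
from datasets import load_dataset

# Load the preference dataset used for fine-tuning (44,245 examples in the train split).
dataset = load_dataset("mlabonne/orpo-dpo-mix-40k", split="train")

print(dataset)            # column names and number of rows
print(dataset[0].keys())  # each example carries a prompt plus chosen and rejected answers
```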
|
|
|
|
### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

This model was fine-tuned using the ORPO (Odds Ratio Preference Optimization) technique on the meta-llama/Llama-3.2-1B-Instruct base model.

- Base Model: meta-llama/Llama-3.2-1B-Instruct

- Training Technique: ORPO (Odds Ratio Preference Optimization)

- Efficient Fine-Tuning Method: LoRA (Low-Rank Adaptation), as sketched below
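
A hedged sketch of the base-model and LoRA setup described above, using 🤗 Transformers and PEFT; the dropout value and target module names are assumptions for Llama-style models and may differ from the actual training notebook:

```python
import torch
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model_id = "meta-llama/Llama-3.2-1B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype=torch.bfloat16)

# LoRA adapter matching the rank/alpha reported in the hyperparameters below.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,  # assumption: dropout is not stated in this card
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption for Llama attention layers
)
```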
|
|
|
#### Preprocessing [optional]

[More Information Needed]
|
|
|
|
|
#### Training Hyperparameters

- Learning Rate: 2e-5

- Batch Size: 4

- Gradient Accumulation Steps: 4

- Training Steps: 500

- Warmup Steps: 20

- LoRA Rank: 16

- LoRA Alpha: 32
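
Continuing the sketch above, the hyperparameters listed here map onto TRL's ORPOConfig and ORPOTrainer roughly as follows; argument names vary slightly across TRL versions, so treat this as an illustration rather than the exact training script:

```python
from trl import ORPOConfig, ORPOTrainer

orpo_args = ORPOConfig(
    output_dir="./llama-3.2-1b-instruct-orpo",  # assumption: output path not stated in the card
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    max_steps=500,
    warmup_steps=20,
    logging_steps=10,  # assumption: logging cadence not stated in the card
)

trainer = ORPOTrainer(
    model=model,                 # base model from the Training Procedure sketch
    args=orpo_args,
    train_dataset=dataset,       # train split loaded in the Training Data sketch
    processing_class=tokenizer,  # called `tokenizer=` in older TRL releases
    peft_config=peft_config,     # LoRA rank 16 / alpha 32 from the earlier sketch
)
trainer.train()
```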
|
|
|
|
|
#### Speeds, Sizes, Times [optional]

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

[More Information Needed]
|
|
|
## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

The model was evaluated on the HellaSwag benchmark in a zero-shot setting.

Results:

| Task      | Version | Filter | n-shot | Metric   | Value  | Stderr  |
|-----------|---------|--------|--------|----------|--------|---------|
| hellaswag | 1       | none   | 0      | acc      | 0.4516 | ±0.0050 |
| hellaswag | 1       | none   | 0      | acc_norm | 0.6139 | ±0.0049 |
|
|
|
Interpretation:

- Performance Level: The model achieves a raw accuracy of 45.16% and a normalized accuracy of 61.39% on the HellaSwag task.

- Confidence: The small standard errors (about 0.5% for both metrics) indicate that these results are fairly precise.

- Improvement over Random: HellaSwag presents 4 candidate endings per question, so a random baseline would achieve 25% accuracy; this model performs well above that.

- Normalized vs. Raw Accuracy: The higher normalized accuracy (61.39% vs. 45.16%) comes from acc_norm scoring each candidate ending by length-normalized log-likelihood, which gives a fairer comparison between answers of different lengths.

- Room for Improvement: While the performance is well above random, there is still significant room for improvement to reach human-level performance (which is typically above 95% on HellaSwag).
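
The card does not record the exact evaluation command, but zero-shot HellaSwag numbers in this format typically come from EleutherAI's lm-evaluation-harness; a hedged sketch of reproducing the run via its Python API, with a placeholder model path:

```python
import lm_eval

# Zero-shot HellaSwag evaluation; the pretrained path below is a placeholder.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your-username/Llama-3.2-1B-Instruct-ORPO",
    tasks=["hellaswag"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"]["hellaswag"])  # reports acc and acc_norm with standard errors
```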
|
|
|
|
|
|
|
#### Summary

- Base Model: meta-llama/Llama-3.2-1B-Instruct

- Model Type: Causal Language Model

- Language: English
|
|
|
**Intended Use**

This model is designed for tasks requiring improved alignment with human preferences, such as:

- Chatbots

- Question-answering systems

- General text generation with enhanced preference alignment
|
|
|
**Training Data**

- Dataset: mlabonne/orpo-dpo-mix-40k

- Size: 44,245 examples

- Content: Prompts, chosen answers, and rejected answers
|
|
|
|
|
**Task: HellaSwag**

- This is a benchmark task designed to evaluate a model's commonsense reasoning and ability to complete scenarios logically.

- No specific filtering was applied to the test set.

- The evaluation was done in a zero-shot setting, where the model didn't receive any examples before making predictions.
|
|
|
|
- Metrics: acc (accuracy) = 0.4516 (45.16%), stderr ±0.0050; acc_norm (normalized accuracy) = 0.6139 (61.39%), stderr ±0.0049
|
|
|
|
|
## Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

- **Hardware Type:** A100

- **Hours used:** Not recorded

- **Cloud Provider:** Google Colab

- **Compute Region:** Sacramento, CA, US

- **Framework:** PyTorch
|
|
|
## Technical Specifications [optional]

Hardware: A100 GPU
|
|
|
|
|
## Model Card Author

Ruth Shacterman

## Model Card Contact

[More Information Needed]