---
library_name: transformers
license: apache-2.0
language:
- en
tags:
- causal-lm
- Large Language Model
- LLM
- detoxification
- unbias
- bias
- instruction
- finetuned
- llama2
- DPO
---
# Model Card for SungJoo/llama2-7b-sft-detox
This model is an instruction-tuned version of meta-llama/Llama-2-7b-hf, fine-tuned to reduce toxicity in Large Language Models (LLMs).
## Model Details
### Model Description
This model is built on the LLaMA-2-7b architecture and has been refined with instruction tuning and Direct Preference Optimization (DPO).
- **Developed by:** Sungjoo Byun (Grace Byun)
- **Model type:** Auto-regressive language model
- **Language(s) (NLP):** English
- **License:** Apache License 2.0
- **Finetuned from:** meta-llama/Llama-2-7b-hf
### Model Sources
- **Repository:** TBD
- **Paper:** TBD
## Uses
This model is intended to be used for generating less toxic language in various applications, including chatbots and other NLP systems.
## Bias, Risks, and Limitations
While this model aims to reduce toxicity, it may still generate biased or harmful content. Use it with caution and review its outputs, especially in sensitive applications.
## How to Get Started with the Model
Use the code below to get started with the model:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
# Download the tokenizer and model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("SungJoo/llama2-7b-sft-dpo-detox")
model = AutoModelForCausalLM.from_pretrained("SungJoo/llama2-7b-sft-dpo-detox")
```
## Training Details
- Parameter-Efficient Fine-Tuning (PEFT)
- BitsAndBytes Configuration (`bnb_config`): the model was trained with 4-bit quantization via the bitsandbytes library to reduce memory use and improve training efficiency.
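As a rough illustration of why 4-bit quantization helps, the back-of-the-envelope sketch below estimates how much memory the base weights alone occupy at different precisions. It assumes a rounded count of 7 billion parameters (LLaMA-2-7b has roughly this many) and ignores activations, optimizer state, trainable adapter weights, and quantization constants, so it is an approximation rather than a measurement of this model's training setup. The helper name `weight_memory_gb` is hypothetical.

```python
# Back-of-the-envelope weight-memory estimate for a ~7B-parameter model.
# Assumption: 7e9 parameters (rounded). Ignores activations, optimizer state,
# adapter weights and quantization constants.
NUM_PARAMS = 7_000_000_000


def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Return the memory needed to store the weights, in GB (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for bits in (32, 16, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

This prints 28.0, 14.0 and 3.5 GB, so storing the frozen base weights in 4 bits takes about a quarter of the memory of 16-bit weights.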
### Training Data
The model was trained on a dataset created specifically to detoxify LLMs. The DPO dataset will be made publicly available soon.
### Training Procedure
The model was trained with parameter-efficient fine-tuning using the following hyperparameters:
| **Hyperparameter** | **Value** |
|--------------------|-----------|
| Batch size | 4 |
| Learning rate | 2e-4 |
| Epochs | 10 |
| Max length (tokens) | 2,048 |
| Max prompt length (tokens) | 1,024 |
| Beta (DPO β) | 0.1 |
*Hyperparameters used when applying DPO to LLaMA-2.*
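To show where the Beta value of 0.1 enters training, here is a minimal, framework-free sketch of the standard DPO loss for a single preference pair. It is not the training code used for this model: the function name `dpo_loss` is hypothetical, and the sketch assumes each input is the summed log-probability of a whole response (the preferred "chosen" one or the dispreferred "rejected" one) under the trainable policy or the frozen reference model. Libraries such as Hugging Face TRL implement a batched version of this objective.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss, -log(sigmoid(beta * margin)), for one preference pair."""
    # How much more the policy (relative to the reference model) favours
    # the chosen response over the rejected one.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)), computed in a numerically stable way
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


# Policy identical to the reference: loss is log(2)
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy shifts 10 nats of preference toward the chosen response: loss drops
print(dpo_loss(-5.0, -15.0, -10.0, -10.0))
```

With β = 0.1, a margin of 10 nats gives a loss of about 0.313, down from log 2 ≈ 0.693 when the policy matches the reference; a smaller β means the policy must move further from the reference to achieve the same reduction in loss.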
## Objective
The main objective of this research is to reduce toxicity in LLMs by applying instruction tuning and Direct Preference Optimization (DPO).
For this purpose, a comprehensive instruction dataset and DPO dataset were constructed; both will be released in the future.
| **Category** | **LLaMA-2-base ≥0.5 (%)** | **LLaMA-2-base Count** | **Finetuned LLaMA-2 ≥0.5 (%)** | **Finetuned LLaMA-2 Count** | **DPO LLaMA-2 ≥0.5 (%)** | **DPO LLaMA-2 Count** |
|---------------------|------|-------|--------------|----------------|--------------|------------------|
| **TOXICITY** | 4.46 | 4,438 | 3.61 (-0.85) | 3,593 (-845) | 2.39 (-1.22) | 2,377 (-1,216) |
| **SEVERE_TOXICITY** | 0.08 | 77 | 0.07 (-0.01) | 70 (-7) | 0.03 (-0.04) | 31 (-39) |
| **IDENTITY_ATTACK** | 0.79 | 788 | 0.42 (-0.37) | 413 (-375) | 0.28 (-0.14) | 274 (-139) |
| **INSULT** | 1.97 | 1,961 | 1.60 (-0.37) | 1,588 (-373) | 0.90 (-0.70) | 892 (-696) |
| **PROFANITY** | 2.10 | 2,086 | 1.76 (-0.34) | 1,753 (-333) | 1.04 (-0.72) | 1,030 (-723) |
| **THREAT** | 1.43 | 1,424 | 0.92 (-0.51) | 919 (-505) | 0.76 (-0.16) | 754 (-165) |
*Comparison of LLaMA-2-base, Finetuned LLaMA-2, and DPO LLaMA-2 across Perspective API categories. Values in parentheses in the Finetuned LLaMA-2 columns are changes relative to LLaMA-2-base; values in parentheses in the DPO LLaMA-2 columns are changes relative to Finetuned LLaMA-2.*
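The table above shows how effectively this model reduces toxicity, measured on the RealToxicityPrompts dataset with the Perspective API. For each category, "≥0.5 (%)" and "Count" give the percentage and number of generations whose Perspective API score is at least 0.5.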
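The parenthesized values in the table are differences between adjacent models. The short script below recomputes them from the Count columns and also derives the overall relative reduction from LLaMA-2-base to DPO LLaMA-2, a figure that is computed here rather than reported in the table. The counts are copied from the table; the helper name `reductions` is hypothetical.

```python
# Counts of generations scoring >= 0.5, copied from the results table.
# category: (LLaMA-2-base, Finetuned LLaMA-2, DPO LLaMA-2)
counts = {
    "TOXICITY": (4438, 3593, 2377),
    "SEVERE_TOXICITY": (77, 70, 31),
    "IDENTITY_ATTACK": (788, 413, 274),
    "INSULT": (1961, 1588, 892),
    "PROFANITY": (2086, 1753, 1030),
    "THREAT": (1424, 919, 754),
}


def reductions(base: int, sft: int, dpo: int) -> dict:
    """Absolute reductions between adjacent models and overall relative reduction (%)."""
    return {
        "sft_vs_base": base - sft,
        "dpo_vs_sft": sft - dpo,
        "dpo_vs_base_pct": 100 * (base - dpo) / base,
    }


for category, (base, sft, dpo) in counts.items():
    r = reductions(base, sft, dpo)
    print(f"{category:<16} -{r['sft_vs_base']:>4} -{r['dpo_vs_sft']:>5} "
          f"({r['dpo_vs_base_pct']:.1f}% fewer than base)")
```

For example, DPO LLaMA-2 produces 2,061 fewer TOXICITY generations than LLaMA-2-base, roughly a 46% relative reduction.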
## Contact
For any questions or issues, please contact byunsj@snu.ac.kr.