|
--- |
|
library_name: transformers |
|
license: apache-2.0 |
|
language: |
|
- en |
|
tags: |
|
- causal-lm |
|
- Large Language Model |
|
- LLM |
|
- detoxification |
|
- unbias |
|
- bias |
|
- instruction |
|
- finetuned |
|
- llama2 |
|
- DPO |
|
--- |
|
|
|
# Model Card for SungJoo/llama2-7b-sft-detox |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
This model is built on the LLaMA-2-7b architecture and has been refined with instruction tuning and Direct Preference Optimization (DPO). |
|
|
|
- **Developed by:** Sungjoo Byun (Grace Byun) |
|
- **Model type:** Auto-regressive language model |
|
- **Language(s) (NLP):** English |
|
- **License:** Apache License 2.0 |
|
- **Finetuned from:** meta-llama/Llama-2-7b-hf |
|
|
|
### Model Sources |
|
|
|
- **Repository:** TBD |
|
- **Paper:** TBD |
|
|
|
## Uses |
|
|
|
This model is intended to be used for generating less toxic language in various applications, including chatbots and other NLP systems. |
|
|
|
## Bias, Risks, and Limitations |
|
|
|
While this model aims to reduce toxicity, it may still generate biased or harmful content. Users should apply this model with caution and review outputs for sensitive applications. |
|
|
|
## How to Get Started with the Model |
|
|
|
Use the code below to get started with the model: |
|
|
|
```python |
|
import torch |
|
from transformers import AutoTokenizer, AutoPeftModelForCausalLM |
|
|
|
DEV = torch.device('cuda' if torch.cuda.is_available() else 'cpu') |
|
|
|
adapter_path = "SungJoo/llama2-7b-sft-detox-DPO" |
|
|
|
# Load model |
|
model = AutoPeftModelForCausalLM.from_pretrained( |
|
adapter_path, |
|
torch_dtype=torch.bfloat16 |
|
).to(DEV) |
|
|
|
# Load tokenizer |
|
tokenizer = AutoTokenizer.from_pretrained(adapter_path) |
|
``` |
|
```python |
|
# Example usage |
|
input_text = "Your input text here" |
|
inputs = tokenizer(input_text, return_tensors="pt").to(DEV) |
|
outputs = model.generate(**inputs) |
|
print(tokenizer.decode(outputs[0], skip_special_tokens=True)) |
|
``` |
|
|
|
## Training Details |
|
|
|
- Parameter-Efficient Fine-Tuning (PEFT) |
|
|
|
- BitsAndBytes Configuration (bnb_config): This model employs a 4-bit quantization technique using the BitsAndBytes library to further enhance training efficiency. |
|
|
|
### Training Data |
|
The model was trained using a dataset specifically created to detoxify LLMs. DPO dataset will be publicly available soon. |
|
|
|
### Training Procedure |
|
|
|
DPO was applied on "SungJoo/llama2-7b-sft-detox" with the following hyperparameters: |
|
|
|
| **Hyperparameter** | **Value** | |
|
|--------------------|-----------| |
|
| Batch size | 4 | |
|
| Learning rate | 2e-4 | |
|
| Epochs | 10 | |
|
| Max length | 2,048 | |
|
| Max prompt length | 1,024 | |
|
| Beta | 0.1 | |
|
|
|
|
|
|
|
## Objective |
|
The main objective of this research is to reduce toxicity in LLMs by applying instruction tuning and Direct Preference Optimization (DPO). |
|
A comprehensive instruction and DPO dataset was constructed for this purpose, which will be released in the future. |
|
The table below shows the effectiveness of this model in reducing bias, measured using the RealToxicityPrompt dataset and the Perspective API. |
|
|
|
| **Model** | **LLaMA-2-base** | | **Finetuned LLaMA-2** | | **DPO LLaMA-2** | | |
|
|--------------------|-------------------|-----------------------|-----------------------|-------------------------|-----------------------|-------------------------| |
|
| **Category** | **\>=0.5 (%)** | **Count** | **\>=0.5 (%)** | **Count** | **\>=0.5 (%)** | **Count** | |
|
| **TOXICITY** | 4.46 | 4,438 | 3.61 | 3,593 | 2.39 | 2,377 | |
|
| | | | <span style="color:blue;">(-0.85)</span> | <span style="color:blue;">(-845)</span> | <span style="color:green;">(-1.22)</span> | <span style="color:green;">(-1,216)</span> | |
|
| **SEVERE_TOXICITY**| 0.08 | 77 | 0.07 | 70 | 0.03 | 31 | |
|
| | | | <span style="color:blue;">(-0.01)</span> | <span style="color:blue;">(-7)</span> | <span style="color:green;">(-0.04)</span> | <span style="color:green;">(-39)</span> | |
|
| **IDENTITY_ATTACK**| 0.79 | 788 | 0.42 | 413 | 0.28 | 274 | |
|
| | | | <span style="color:blue;">(-0.37)</span> | <span style="color:blue;">(-375)</span> | <span style="color:green;">(-0.14)</span> | <span style="color:green;">(-139)</span> | |
|
| **INSULT** | 1.97 | 1,961 | 1.60 | 1,588 | 0.90 | 892 | |
|
| | | | <span style="color:blue;">(-0.37)</span> | <span style="color:blue;">(-373)</span> | <span style="color:green;">(-0.70)</span> | <span style="color:green;">(-696)</span> | |
|
| **PROFANITY** | 2.10 | 2,086 | 1.76 | 1,753 | 1.04 | 1,030 | |
|
| | | | <span style="color:blue;">(-0.34)</span> | <span style="color:blue;">(-333)</span> | <span style="color:green;">(-0.72)</span> | <span style="color:green;">(-723)</span> | |
|
| **THREAT** | 1.43 | 1,424 | 0.92 | 919 | 0.76 | 754 | |
|
| | | | <span style="color:blue;">(-0.51)</span> | <span style="color:blue;">(-505)</span> | <span style="color:green;">(-0.16)</span> | <span style="color:green;">(-165)</span> | |
|
*Comparison of LLaMA-2-base, Finetuned LLaMA-2, and DPO LLaMA-2 across various categories. Reductions in blue indicate comparisons between the base model and the fine-tuned model, while text in green represents comparisons between the fine-tuned model and the DPO model.* |
|
|
|
|
|
## Contact |
|
For any questions or issues, please contact [email protected]. |