---
library_name: transformers
license: apache-2.0
language:
- en
tags:
- causal-lm
- Large Language Model
- LLM
- detoxification
- unbias
- bias
- instruction
- finetuned
- llama2
---

# Model Card for SungJoo/llama2-7b-sft-detox

This model is an instruction-tuned version of meta-llama/Llama-2-7b-hf, fine-tuned to reduce the toxicity of the text it generates.

## Model Details

### Model Description

This is an instruction-tuned model based on the LLaMA-2-7b architecture. It has been fine-tuned using a comprehensive instruction dataset specifically designed for detoxification of LLMs.

- **Developed by:** Sungjoo Byun (Grace Byun)
- **Model type:** Auto-regressive language model
- **Language(s) (NLP):** English
- **License:** Apache License 2.0
- **Finetuned from:** meta-llama/Llama-2-7b-hf

### Model Sources

- **Repository:** TBD
- **Paper:** TBD

## Uses

This model is intended to be used for generating less toxic language in various applications, including chatbots and other NLP systems.

## Bias, Risks, and Limitations

While this model aims to reduce toxicity, it may still generate biased or harmful content. Users should apply this model with caution and review outputs for sensitive applications.

## How to Get Started with the Model

Use the code below to get started with the model:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("SungJoo/llama2-7b-sft-detox")
model = AutoModelForCausalLM.from_pretrained("SungJoo/llama2-7b-sft-detox")
```
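The snippet above only loads the weights. A minimal sketch of generating text with them follows. It assumes plain-text prompts (this card does not specify a prompt template), half-precision weights, and that the `accelerate` package is installed so `device_map="auto"` can place the model. `generate_text` and its default arguments are hypothetical names chosen for illustration.

```python
def generate_text(prompt, model_id="SungJoo/llama2-7b-sft-detox", max_new_tokens=64):
    """Generate a greedy continuation of `prompt` (illustrative helper, not part of the release)."""
    # Imported lazily so the helper can be defined without loading the heavy libraries.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,  # assumption: fp16 to fit a 7B model on one GPU
        device_map="auto",          # assumption: requires the `accelerate` package
    )

    # Tokenize the prompt and move the tensors to the model's device.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding for reproducible output
        )

    # Drop the prompt tokens so only the newly generated text is returned.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate_text("Tell me about your weekend.")` downloads the checkpoint on first use and returns the continuation as a string. Sampling parameters such as `temperature` or `top_p` can be passed to `model.generate` instead of greedy decoding if more varied output is wanted.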

## Training Details

### Training Data
The model was fine-tuned using a dataset specifically created to detoxify LLMs. This dataset will be publicly available soon.

### Training Procedure
The model was trained using full fine-tuning with the following hyperparameters:

| **Hyperparameter** | **Value** |
|--------------------|-----------|
| Batch size         | 128       |
| Learning rate      | 2e-5      |
| Epochs             | 3         |
| Max length         | 512       |
| Weight decay       | 0         |


The training was conducted on 4 A100 80GB GPUs.
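
As a rough illustration of how these values fit together, the sketch below records the hyperparameters and derives the gradient accumulation steps needed to reach the batch size on 4 GPUs. It assumes the batch size of 128 is the global batch size across all GPUs, which this card does not state, and the per-device batch size of 8 is a hypothetical value. `DetoxSFTHyperparams` and `gradient_accumulation_steps` are illustrative names, not part of the training code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetoxSFTHyperparams:
    """Values copied from the hyperparameter table; the class itself is hypothetical."""
    global_batch_size: int = 128  # assumption: 128 is the global (all-GPU) batch size
    learning_rate: float = 2e-5
    num_epochs: int = 3
    max_length: int = 512
    weight_decay: float = 0.0
    num_gpus: int = 4


def gradient_accumulation_steps(cfg, per_device_batch_size):
    """Return the accumulation steps so that
    num_gpus * per_device_batch_size * steps == global_batch_size."""
    samples_per_step = cfg.num_gpus * per_device_batch_size
    if cfg.global_batch_size % samples_per_step != 0:
        raise ValueError(
            f"global batch size {cfg.global_batch_size} is not divisible by "
            f"{samples_per_step} samples per optimizer micro-step"
        )
    return cfg.global_batch_size // samples_per_step


cfg = DetoxSFTHyperparams()
# Hypothetical per-device batch size of 8: 128 / (4 * 8) = 4 accumulation steps.
print(gradient_accumulation_steps(cfg, per_device_batch_size=8))
```

A per-device batch size of 32 would need no accumulation at all; which split was actually used depends on memory limits and is not documented here.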

## Objective
The main objective of this research is to reduce toxicity in LLMs by applying instruction tuning and Direct Preference Optimization (DPO).
This version has been trained with instruction tuning only; DPO has not been applied to it.
A comprehensive instruction and DPO dataset was constructed for this purpose and will be released in the future.

| **Model**          | **LLaMA-2-base**  |                       | **Finetuned LLaMA-2** |                         | **DPO LLaMA-2**      |                         |
|--------------------|-------------------|-----------------------|-----------------------|-------------------------|-----------------------|-------------------------|
| **Category**       | **\>=0.5 (%)**    | **Count**             | **\>=0.5 (%)**        | **Count**               | **\>=0.5 (%)**        | **Count**               |
| **TOXICITY**       | 4.46              | 4,438                 | 3.61                  | 3,593                   | 2.39                  | 2,377                   |
|                    |                   |                       | <span style="color:blue;">(-0.85)</span> | <span style="color:blue;">(-845)</span>  | <span style="color:green;">(-1.22)</span> | <span style="color:green;">(-1,216)</span> |
| **SEVERE_TOXICITY**| 0.08              | 77                    | 0.07                  | 70                      | 0.03                  | 31                      |
|                    |                   |                       | <span style="color:blue;">(-0.01)</span> | <span style="color:blue;">(-7)</span>    | <span style="color:green;">(-0.04)</span>  | <span style="color:green;">(-39)</span>   |
| **IDENTITY_ATTACK**| 0.79              | 788                   | 0.42                  | 413                     | 0.28                  | 274                     |
|                    |                   |                       | <span style="color:blue;">(-0.37)</span> | <span style="color:blue;">(-375)</span>  | <span style="color:green;">(-0.14)</span>  | <span style="color:green;">(-139)</span>  |
| **INSULT**         | 1.97              | 1,961                 | 1.60                  | 1,588                   | 0.90                  | 892                     |
|                    |                   |                       | <span style="color:blue;">(-0.37)</span> | <span style="color:blue;">(-373)</span>  | <span style="color:green;">(-0.70)</span>  | <span style="color:green;">(-696)</span>  |
| **PROFANITY**      | 2.10              | 2,086                 | 1.76                  | 1,753                   | 1.04                  | 1,030                   |
|                    |                   |                       | <span style="color:blue;">(-0.34)</span> | <span style="color:blue;">(-333)</span>  | <span style="color:green;">(-0.72)</span>  | <span style="color:green;">(-723)</span>  |
| **THREAT**         | 1.43              | 1,424                 | 0.92                  | 919                     | 0.76                  | 754                     |
|                    |                   |                       | <span style="color:blue;">(-0.51)</span> | <span style="color:blue;">(-505)</span>  | <span style="color:green;">(-0.16)</span>  | <span style="color:green;">(-165)</span>  |

*Comparison of LLaMA-2-base, Finetuned LLaMA-2, and DPO LLaMA-2 across toxicity categories. Reductions in blue compare the fine-tuned model with the base model; reductions in green compare the DPO model with the fine-tuned model.*

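The table above shows how effectively instruction tuning and DPO reduce toxicity, measured on the RealToxicityPrompts dataset with the Perspective API. The "\>=0.5 (%)" columns give the share of samples whose Perspective API score for that attribute is at least 0.5, and the "Count" columns give the corresponding number of samples.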
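
To make the table's columns concrete, here is a minimal sketch of how a count and percentage at the 0.5 threshold could be computed from a list of per-sample scores, and how the parenthesized deltas follow from the counts. The toy scores are made up for illustration, `share_at_or_above` is a hypothetical helper, and the evaluation script actually used is not part of this card.

```python
def share_at_or_above(scores, threshold=0.5):
    """Return (count, percent) of scores that are >= threshold."""
    count = sum(1 for s in scores if s >= threshold)
    percent = 100.0 * count / len(scores) if scores else 0.0
    return count, percent


# Toy attribute scores in [0, 1]; real scores would come from the Perspective API.
toy_scores = [0.10, 0.50, 0.72, 0.49]
count, percent = share_at_or_above(toy_scores)
print(count, percent)  # 2 50.0

# Counts copied from the TOXICITY row of the table above.
toxicity_counts = {"base": 4438, "sft": 3593, "dpo": 2377}

# Blue delta: fine-tuned model versus base model.
sft_delta = toxicity_counts["sft"] - toxicity_counts["base"]
# Green delta: DPO model versus fine-tuned model.
dpo_delta = toxicity_counts["dpo"] - toxicity_counts["sft"]
print(sft_delta, dpo_delta)  # -845 -1216
```

The same subtraction reproduces every parenthesized value in the table, so the deltas are always relative to the column immediately to their left, not to the base model throughout.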

## Contact
For any questions or issues, please contact [email protected].