---
license: apache-2.0
datasets:
- lmsys/toxic-chat
metrics:
- perplexity
---

# Model Card for tci_minus

This model is `facebook/bart-large` fine-tuned on toxic inputs from the `lmsys/toxic-chat` dataset.

## Model Details

This model is not intended for plain inference, as it is very likely to produce toxic content.
It is instead intended as a "utility model" for detecting and fixing toxic content: its token probability distributions will likely differ from those of comparable models not trained or fine-tuned on toxic data.

Its name, `tci_minus`, refers to the _G-_ (anti-expert) model in [Detoxifying Text with MaRCo: Controllable Revision with Experts and Anti-Experts](https://aclanthology.org/2023.acl-short.21.pdf).

It can be used within `TrustyAI`'s `TMaRCo` tool for detoxifying text; see https://github.com/trustyai-explainability/trustyai-detoxify/.

### Model Description


- **Developed by:** tteofili
- **Shared by:** tteofili
- **License:** Apache 2.0
- **Finetuned from model:** [`facebook/bart-large`](https://huggingface.co/facebook/bart-large)

## Uses

This model is intended to be used as a "utility model" for detecting and fixing toxic content, as its token probability distributions will likely differ from those of comparable models not trained or fine-tuned on toxic data.
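
As an illustration of that difference in token distributions, the sketch below compares the teacher-forced per-token distributions of the base `facebook/bart-large` model and this anti-expert, and scores each position by their disagreement. This is a minimal illustration of the underlying idea, not `TMaRCo`'s actual implementation; the Jensen-Shannon scoring and variable names are assumptions.

```python
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

base_name = "facebook/bart-large"   # reference model
anti_name = "trustyai/tci_minus"    # this anti-expert

tok = BartTokenizer.from_pretrained(base_name)
base = BartForConditionalGeneration.from_pretrained(base_name).eval()
anti = BartForConditionalGeneration.from_pretrained(anti_name).eval()

def token_distributions(model, text):
    # Teacher-forced per-token distributions over the vocabulary.
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(input_ids=enc.input_ids, labels=enc.input_ids)
    return torch.softmax(out.logits[0], dim=-1)  # (seq_len, vocab_size)

text = "white men can't jump"
p_base = token_distributions(base, text)
p_anti = token_distributions(anti, text)

# Jensen-Shannon divergence per position: positions where the two
# models disagree most are candidates for toxic content.
eps = 1e-12
m = 0.5 * (p_base + p_anti)
js = 0.5 * ((p_base * ((p_base + eps) / (m + eps)).log()).sum(-1) +
            (p_anti * ((p_anti + eps) / (m + eps)).log()).sum(-1))
for tok_id, score in zip(tok(text).input_ids, js.tolist()):
    print(f"{tok.decode([tok_id]):>10}  disagreement={score:.4f}")
```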

## Bias, Risks, and Limitations

This model is fine-tuned on toxic inputs from the [`lmsys/toxic-chat`](https://huggingface.co/datasets/lmsys/toxic-chat) dataset and is very likely to produce toxic content. For this reason, it should only be used in combination with other models for the purpose of detecting and fixing toxic content.

## How to Get Started with the Model

Use the code below to get started with text detoxification.

```python
from trustyai.detoxify import TMaRCo

# Weights for the two models loaded below: the negative weight counteracts
# this anti-expert's preferences, the positive weight amplifies the
# non-toxic expert's.
tmarco = TMaRCo(expert_weights=[-1, 3])
tmarco.load_models(["trustyai/tci_minus", "trustyai/gplus"])
tmarco.rephrase(["white men can't jump"])
```

## Training Details

This model has been trained on toxic inputs from the `lmsys/toxic-chat` dataset.

### Training Data

Training data comes from the [`lmsys/toxic-chat`](https://huggingface.co/datasets/lmsys/toxic-chat) dataset.


### Training Procedure 

This model has been fine-tuned with the following code:

```python
from trustyai.detoxify import TMaRCo

dataset_name = 'lmsys/toxic-chat'
data_dir = ''
perc = 100  # percentage of the dataset to train on
# columns of the training dataset
td_columns = ['model_output', 'user_input', 'human_annotation', 'conv_id', 'jailbreaking', 'openai_moderation',
              'toxicity']

target_feature = 'toxicity'      # column marking toxic samples
content_feature = 'user_input'   # column holding the text to train on
model_prefix = 'toxic_chat_input_'

tmarco = TMaRCo()
tmarco.train_models(perc=perc, dataset_name=dataset_name, expert_feature=target_feature, model_prefix=model_prefix,
                    data_dir=data_dir, content_feature=content_feature, td_columns=td_columns)
```

#### Training Hyperparameters

This model has been trained with the following hyperparameters:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01
)
```
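
`TMaRCo.train_models` encapsulates the actual training loop. As a rough sketch of where these arguments plug in, a standard `transformers` masked-infilling fine-tune might look like the following; the dataset config name, the column handling, and the collator choice are assumptions, not the exact script used.

```python
from datasets import load_dataset
from transformers import (BartForConditionalGeneration, BartTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tok = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

# Keep only the toxic user inputs (the config name is an assumption).
ds = load_dataset("lmsys/toxic-chat", "toxicchat0124")["train"]
toxic = ds.filter(lambda r: r["toxicity"] == 1)
toxic = toxic.map(lambda r: tok(r["user_input"], truncation=True),
                  batched=True, remove_columns=toxic.column_names)
splits = toxic.train_test_split(test_size=0.1)

# Randomly mask tokens; the model learns to reconstruct them.
collator = DataCollatorForLanguageModeling(tok, mlm=True, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="tci_minus",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
)
trainer = Trainer(model=model, args=training_args,
                  train_dataset=splits["train"], eval_dataset=splits["test"],
                  data_collator=collator)
trainer.train()
```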

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

Test data comes from the [`lmsys/toxic-chat`](https://huggingface.co/datasets/lmsys/toxic-chat) dataset.

#### Metrics

The model was evaluated using the perplexity metric.
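
For reference, a minimal sketch of computing perplexity as the exponential of the mean cross-entropy loss under teacher forcing; the held-out texts below are placeholders, and this is not necessarily the exact evaluation script used.

```python
import math
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tok = BartTokenizer.from_pretrained("trustyai/tci_minus")
model = BartForConditionalGeneration.from_pretrained("trustyai/tci_minus").eval()

def perplexity(texts):
    # exp(mean cross-entropy) over the evaluation texts
    losses = []
    for text in texts:
        enc = tok(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            out = model(input_ids=enc.input_ids, labels=enc.input_ids)
        losses.append(out.loss.item())
    return math.exp(sum(losses) / len(losses))

print(perplexity(["a held-out user input", "another held-out user input"]))
```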

### Results

Perplexity: 1.08