Tawkat committed on
Commit f5ffee3 · verified · 1 Parent(s): ac3ee09

Update README.md

Files changed (1)
  1. README.md +16 -16
README.md CHANGED
@@ -13,7 +13,7 @@ language:
  ---
 
 
- # GreenLLaMA-7B
 
  <p align="center">
  <br>
@@ -23,9 +23,9 @@ language:
 
  </p>
 
- This model card corresponds to the GreenLLaMA-7B detoxification model based on [LLaMA-2](https://huggingface.co/meta-llama/Llama-2-7b). The model is finetuned with Chain-of-Thought (CoT) explanation.
 
- **Paper**: [GreenLLaMA: A Framework for Detoxification with Explanations](https://arxiv.org/abs/2402.15951)
 
  **Authors**: Md Tawkat Islam Khondaker, Muhammad Abdul-Mageed, Laks V.S. Lakshmanan
 
@@ -35,7 +35,7 @@ Summary description and brief definition of inputs and outputs.
 
  ### Description
 
- GreenLLaMA is the first comprehensive end-to-end detoxification framework trained on cross-platform pseudo-parallel corpus. GreenLLaMA further introduces explanation to promote transparency and trustworthiness. The framework also demonstrates robustness against adversarial toxicity.
 
  ### Usage
 
@@ -48,7 +48,7 @@ Below we share some code snippets on how to get quickly started with running the
  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM
 
- model_name = "UBC-NLP/GreenLLaMA-7b"
 
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(model_name)
@@ -71,7 +71,7 @@ print(tokenizer.decode(outputs[0]))
  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM
 
- model_name = "UBC-NLP/GreenLLaMA-7b"
 
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
@@ -97,7 +97,7 @@ print(tokenizer.decode(outputs[0]))
  # pip install accelerate
  from transformers import AutoTokenizer, AutoModelForCausalLM
 
- model_name = "UBC-NLP/GreenLLaMA-7b"
 
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.float16)
@@ -118,7 +118,7 @@ print(tokenizer.decode(outputs[0]))
  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM
 
- model_name = "UBC-NLP/GreenLLaMA-7b"
 
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.bfloat16)
@@ -143,7 +143,7 @@ from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
 
  quantization_config = BitsAndBytesConfig(load_in_8bit=True)
 
- model_name = "UBC-NLP/GreenLLaMA-7b"
 
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=quantization_config)
@@ -166,7 +166,7 @@ from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
 
  quantization_config = BitsAndBytesConfig(load_in_4bit=True)
 
- model_name = "UBC-NLP/GreenLLaMA-7b"
 
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=quantization_config)
@@ -194,7 +194,7 @@ These models have certain limitations that users should be aware of.
 
  ### Intended Usage
 
- The intended use of GreenLLaMA is for the detoxification tasks. We aim to help researchers to build an end-to-end complete detoxification framework. GreenLLaMA can also be regarded as a promising baseline to develop more robust and effective detoxification frameworks.
 
 
  ### Limitations
@@ -202,9 +202,9 @@ The intended use of GreenLLaMA is for the detoxification tasks. We aim to help r
  * **Data Generation Process:**
  This work uses ChatGPT, a gpt-3.5-turbo version from June, 2023. Since the model can be updated on a regular interval, the data generation process should be treated accordingly.
  * **Data Quality:**
- GreenLLaMA proposes an automated data generation pipeline to create a pseudo-parallel cross-platform corpus. The synthetic data generation process involves multi-stage data processing without the necessity of direct human inspection. Although this automated pipeline makes the overall data generation process scalable, it comes at the risk of allowing low-quality data in our cross-platform corpus. Hence, human inspection is recommended to remove any sort of potential vulnerability and maintain a standard quality of the corpus.
  * **Model Responses:**
- Although GreenLLaMA exhibits impressive ability in generating detoxified responses, we believe there is still room for improvement for the model in terms of producing meaning-preserved detoxified outcomes. Moreover, the models can sometimes be vulnerable to implicit, adversarial tokens and continue to produce toxic content. Therefore, we recommend that GreenLLaMA should be couched with caution before deployment.
 
  ### Ethical Considerations and Risks
 
@@ -214,13 +214,13 @@ In creating an open model, we have carefully considered the following:
  * **Data Collection and Release:**
  We compile datasets from a wide range of platforms. To ensure proper credit assignment, we refer users to the original publications in our paper. We create the cross-platform detoxification corpus for academic research purposes. We intend to share the corpus in the future. We would also like to mention that some content are generated using GPT-4 for illustration purposes.
  * **Potential Misuse and Bias:**
- GreenLLaMA can potentially be misused to generate toxic and biased content. For these reasons, we recommend that GreenLLaMA not be used in applications without careful prior consideration of potential misuse and bias.
 
  ## Citation
  If you use GreenLLaMA for your scientific publication, or if you find the resources in this repository useful, please cite our paper as follows:
  ```
- @inproceedings{Khondaker2024GreenLLaMA,
- title={GreenLLaMA: A Framework for Detoxification with Explanations},
  author={Md. Tawkat Islam Khondaker and Muhammad Abdul-Mageed and Laks V. S. Lakshmanan},
  year={2024},
  url={https://arxiv.org/abs/2402.15951}
 
  ---
 
 
+ # DetoxLLM-7B
 
  <p align="center">
  <br>
 
 
  </p>
 
+ This model card corresponds to the DetoxLLM-7B detoxification model based on [LLaMA-2](https://huggingface.co/meta-llama/Llama-2-7b). The model is fine-tuned with Chain-of-Thought (CoT) explanations.
 
+ **Paper**: [DetoxLLM: A Framework for Detoxification with Explanations](https://arxiv.org/abs/2402.15951) **(EMNLP 2024 Main)**
 
  **Authors**: Md Tawkat Islam Khondaker, Muhammad Abdul-Mageed, Laks V.S. Lakshmanan
 
 
 
  ### Description
 
+ DetoxLLM is the first comprehensive end-to-end detoxification framework trained on a cross-platform pseudo-parallel corpus. DetoxLLM further introduces explanations to promote transparency and trustworthiness. The framework also demonstrates robustness against adversarial toxicity.
 
  ### Usage
 
 
  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM
 
+ model_name = "UBC-NLP/DetoxLLM-7B"
 
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(model_name)
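The hunk above shows only the renamed checkpoint plus two context lines of the CPU snippet. Below is a minimal end-to-end sketch of how such a snippet plausibly continues; the example input text and generation settings are illustrative assumptions, not taken from the card:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "UBC-NLP/DetoxLLM-7B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Illustrative input; the card's actual prompt format is not visible in this diff.
input_text = "Text to be detoxified."
inputs = tokenizer(input_text, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```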
 
  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM
 
+ model_name = "UBC-NLP/DetoxLLM-7B"
 
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
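As in the previous hunk, only the checkpoint rename is part of this commit. Note that `device_map="auto"` relies on the `accelerate` package being installed; a hedged sketch of loading this way and moving inputs to the dispatched device (prompt and generation settings are assumptions, as above):

```python
# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "UBC-NLP/DetoxLLM-7B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# Move the illustrative input onto the device chosen by accelerate.
inputs = tokenizer("Text to be detoxified.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```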
 
  # pip install accelerate
  from transformers import AutoTokenizer, AutoModelForCausalLM
 
+ model_name = "UBC-NLP/DetoxLLM-7B"
 
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.float16)
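The float16 context line references `torch.float16`, so the full snippet must also import `torch`, which the visible diff context does not show. A minimal sketch under that assumption (prompt and generation settings remain illustrative):

```python
# pip install accelerate
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "UBC-NLP/DetoxLLM-7B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", torch_dtype=torch.float16
)

inputs = tokenizer("Text to be detoxified.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```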
 
  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM
 
+ model_name = "UBC-NLP/DetoxLLM-7B"
 
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype=torch.bfloat16)
 
 
  quantization_config = BitsAndBytesConfig(load_in_8bit=True)
 
+ model_name = "UBC-NLP/DetoxLLM-7B"
 
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=quantization_config)
 
 
  quantization_config = BitsAndBytesConfig(load_in_4bit=True)
 
+ model_name = "UBC-NLP/DetoxLLM-7B"
 
  tokenizer = AutoTokenizer.from_pretrained(model_name)
  model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=quantization_config)
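The 8-bit and 4-bit hunks again change only the checkpoint name. Loading with `BitsAndBytesConfig` additionally requires the `bitsandbytes` package (and typically a CUDA GPU); a hedged 4-bit sketch, reusing the same illustrative prompt as in the earlier sketches:

```python
# pip install bitsandbytes accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

model_name = "UBC-NLP/DetoxLLM-7B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=quantization_config)

inputs = tokenizer("Text to be detoxified.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```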
 
 
  ### Intended Usage
 
+ The intended use of DetoxLLM is detoxification tasks. We aim to help researchers build a complete end-to-end detoxification framework. DetoxLLM can also serve as a promising baseline for developing more robust and effective detoxification frameworks.
 
 
  ### Limitations
 
  * **Data Generation Process:**
  This work uses ChatGPT, a gpt-3.5-turbo version from June, 2023. Since the model can be updated on a regular interval, the data generation process should be treated accordingly.
  * **Data Quality:**
+ DetoxLLM proposes an automated data generation pipeline to create a pseudo-parallel cross-platform corpus. The synthetic data generation process involves multi-stage processing without direct human inspection. Although this automated pipeline makes data generation scalable, it comes with the risk of admitting low-quality data into our cross-platform corpus. Hence, human inspection is recommended to remove potential vulnerabilities and maintain the quality of the corpus.
  * **Model Responses:**
+ Although DetoxLLM exhibits an impressive ability to generate detoxified responses, we believe there is still room for improvement in producing meaning-preserving detoxified outputs. Moreover, the models can sometimes be vulnerable to implicit, adversarial tokens and continue to produce toxic content. Therefore, we recommend that DetoxLLM be used with caution before deployment.
 
  ### Ethical Considerations and Risks
 
 
  * **Data Collection and Release:**
  We compile datasets from a wide range of platforms. To ensure proper credit assignment, we refer users to the original publications in our paper. We create the cross-platform detoxification corpus for academic research purposes. We intend to share the corpus in the future. We would also like to mention that some content are generated using GPT-4 for illustration purposes.
  * **Potential Misuse and Bias:**
+ DetoxLLM can potentially be misused to generate toxic and biased content. For these reasons, we recommend that DetoxLLM not be used in applications without careful prior consideration of potential misuse and bias.
 
  ## Citation
  If you use GreenLLaMA for your scientific publication, or if you find the resources in this repository useful, please cite our paper as follows:
  ```
+ @inproceedings{Khondaker2024DetoxLLM,
+ title={DetoxLLM: A Framework for Detoxification with Explanations},
  author={Md. Tawkat Islam Khondaker and Muhammad Abdul-Mageed and Laks V. S. Lakshmanan},
  year={2024},
  url={https://arxiv.org/abs/2402.15951}