Cyrile committed on
Commit 92b2f25
1 Parent(s): ca38345

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -40,7 +40,7 @@ essential.
 
 The bloomz-3b-dpo-chat model was trained using the [Anthropic/hh-rlhf](https://huggingface.co/datasets/Anthropic/hh-rlhf) dataset, which includes:
 
-1. **Human Preference Data:**
+**Human Preference Data:**
   - **Description:** Annotations of helpfulness and harmlessness, with each entry containing "chosen" and "rejected" text pairs.
   - **Purpose:** To train preference models for Reinforcement Learning from Human Feedback (RLHF), not for supervised training of dialogue agents.
   - **Source:** Data from context-distilled language models, rejection sampling, and an iterated online process.
@@ -92,7 +92,7 @@ result
 ```
 
 
-## Citation
+### Citation
 
 ```bibtex
 @online{DeBloomzChat,
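
For context on the "chosen"/"rejected" pair structure the README describes, here is a minimal sketch of how those preference pairs can be inspected. It is illustrative only, not part of this commit, and assumes the Hugging Face `datasets` library is installed:

```python
# Illustrative sketch (not from the commit): load Anthropic/hh-rlhf and
# look at one human-preference pair. Each record holds a "chosen" and a
# "rejected" dialogue continuation for the same prompt.
from datasets import load_dataset

ds = load_dataset("Anthropic/hh-rlhf", split="train")

pair = ds[0]
print(pair["chosen"][:300])    # annotator-preferred continuation
print(pair["rejected"][:300])  # dispreferred continuation
```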