Update README.md
README.md CHANGED

@@ -23,4 +23,6 @@ model-index:
 # Reward model based `deberta-v3-large-tasksource-nli` fine-tuned on Anthropic/hh-rlhf
 For 1 epoch with a 1e-5 learning rate.
 
+The data are described in the paper: [Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback](https://arxiv.org/abs/2204.05862).
+
 Validation accuracy is currently the best publicly reported: 75.16% (vs 69.25% for `OpenAssistant/reward-model-deberta-v3-large-v2`).
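Validation accuracy on hh-rlhf is pairwise: for each prompt, the model should assign the `chosen` response a higher reward than the `rejected` one. Below is a minimal sketch of that evaluation, assuming the checkpoint is published as a standard `transformers` sequence-classification model with a single reward logit (as the OpenAssistant deberta-v3 reward models are); the repo id is a placeholder, not the model's actual Hub id.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "your-org/deberta-v3-large-hh-rlhf-reward"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()


@torch.no_grad()
def reward(dialogue: str) -> float:
    """Scalar reward for a full Human/Assistant dialogue string."""
    inputs = tokenizer(dialogue, return_tensors="pt", truncation=True)
    return model(**inputs).logits[0, 0].item()


# Pairwise accuracy: fraction of test pairs where the chosen response
# outscores the rejected one. A 100-pair subset keeps the sketch fast.
test = load_dataset("Anthropic/hh-rlhf", split="test").select(range(100))
correct = sum(reward(ex["chosen"]) > reward(ex["rejected"]) for ex in test)
print(f"pairwise accuracy on {len(test)} test pairs: {correct / len(test):.2%}")
```

Only the comparison within each pair matters, so the absolute scale of the reward logit is irrelevant to this metric.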