|
ABOUT_TEXT = """ |
|
We compute the win percentage for a reward model on hand-curated chosen-rejected response pairs, one pair per prompt.
|
A win is counted when the reward model scores the chosen response higher than the rejected response.
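
A minimal sketch of this computation, assuming the per-prompt reward scores are already available as two parallel lists (the function and variable names below are illustrative, not part of the benchmark code):

```python
# Illustrative sketch only: a "win" means the reward model scored the
# chosen response strictly higher than the rejected response.
def win_percentage(scores_chosen, scores_rejected):
    wins = sum(c > r for c, r in zip(scores_chosen, scores_rejected))
    return 100.0 * wins / len(scores_chosen)

# e.g. win_percentage([1.2, 0.4, 0.9], [0.3, 0.8, 0.1]) -> 66.67 (2 of 3 wins)
```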
|
|
|
### Subset summary |
|
|
|
| Subset                 | Num. Samples (pre-filtering, post-filtering) | Description                                                        |
| :--------------------- | :------------------------------------------: | :----------------------------------------------------------------- |
| alpacaeval-easy        | 805 | Great model vs poor model |
| alpacaeval-length      | 805 | Good model vs poor model, equal length |
| alpacaeval-hard        | 805 | Great model vs baseline model |
| mt-bench-easy          | 28, 28 | MT Bench 10s vs 1s |
| mt-bench-medium        | 45, 40 | MT Bench 9s vs 2-5s |
| mt-bench-hard          | 45, 37 | MT Bench 7-8s vs 5-6s |
| refusals-dangerous     | 505 | Dangerous response vs no response |
| refusals-offensive     | 704 | Offensive response vs no response |
| llmbar-natural         | 100 | (See [paper](https://arxiv.org/abs/2310.07641)) Manually curated instruction pairs |
| llmbar-adver-neighbor  | 134 | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. off-topic prompt response |
| llmbar-adver-GPTInst   | 92 | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. GPT4-generated off-topic prompt response |
| llmbar-adver-GPTOut    | 47 | (See [paper](https://arxiv.org/abs/2310.07641)) Instruction response vs. unhelpful-prompted GPT4 response |
| llmbar-adver-manual    | 46 | (See [paper](https://arxiv.org/abs/2310.07641)) Challenge set, chosen vs. rejected |
| XSTest                 | 450 | TODO curate |
| (?) repetitiveness     |     | |
| (?) grammar            |     | |
|
|
|
|
|
For more details, see the [dataset](https://huggingface.co/datasets/ai2-rlhf-collab/rm-benchmark-dev). |
|
""" |
|
|