---
license: apache-2.0
---

# Better Implementation for [*PairRM*](https://huggingface.co/llm-blender/PairRM)

## Introduction

This version of PairRM applies several fixes to the training process, which improve the model's performance significantly.

### Minor Fixes

- Longer Context Length (2048 -> 3370)

Thanks to DeBERTa's tokenizer, the original PairRM model already had enough context length.

But, the longer the better :>

---

### Major Fixes

- Change Prompt Format

Why use something like

```
...
```

So, I changed to a format based on Vicuna 1.1.
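For concreteness, here is a minimal sketch of a Vicuna 1.1-style template. The system prompt below is the standard Vicuna 1.1 wording, and the helper name and separator handling are illustrative assumptions, not code from this repo:

```python
# Sketch of a Vicuna 1.1-style prompt builder (illustrative, not this repo's code).
# Vicuna 1.1 uses a plain-text system prompt followed by "USER:"/"ASSISTANT:" turns.

VICUNA_SYSTEM = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)

def build_vicuna_prompt(turns: list[tuple[str, str]]) -> str:
    """Format (user, assistant) turns; EOS handling between turns is omitted here."""
    parts = [VICUNA_SYSTEM]
    for user_msg, assistant_msg in turns:
        parts.append(f"USER: {user_msg}")
        parts.append(f"ASSISTANT: {assistant_msg}")
    return " ".join(parts)

print(build_vicuna_prompt([("What is PairRM?", "A pairwise reward model.")]))
```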

---

- Change Truncation Side

The original process used right-side truncation even on the input. This can cause serious problems when the input exceeds the model's context length.
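A minimal sketch of how left-side truncation is selected with a Hugging Face tokenizer; the DeBERTa checkpoint name (PairRM's backbone) and the 2030-token budget from the table below are assumptions, since the actual training code is not part of this diff:

```python
from transformers import AutoTokenizer

# Left-side truncation drops the oldest tokens, so the tail of a long input
# (typically the most recent turns) is what survives.
# "microsoft/deberta-v3-large" is assumed here as PairRM's backbone.
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")
tokenizer.truncation_side = "left"  # the default is "right"

long_input = "USER: some very long conversation ... " * 500
encoded = tokenizer(long_input, truncation=True, max_length=2030)
print(len(encoded["input_ids"]))  # capped at 2030
```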

---

- Dataset Filter

There was a decent amount of empty assistant responses in the original dataset, so I dropped them.
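A minimal sketch of that kind of filter with the `datasets` library; the dataset name and the `response` field are hypothetical stand-ins, as the real schema is not shown in this diff:

```python
from datasets import load_dataset

# Hypothetical dataset/field names: the real schema is not shown in this diff.
dataset = load_dataset("some-org/preference-data", split="train")

def has_nonempty_response(example) -> bool:
    # Drop rows whose assistant response is missing or whitespace-only.
    response = example.get("response")
    return response is not None and response.strip() != ""

filtered = dataset.filter(has_nonempty_response)
print(f"{len(dataset)} -> {len(filtered)} examples after filtering")
```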

---

## Statistics

### Context length

| PairRanker type | Source max length | Candidate max length | Total max length |
|:---------------:|:-----------------:|:--------------------:|:----------------:|
| [pair-ranker](https://huggingface.co/llm-blender/pair-ranker) | 128 | 128 | 384 |
| [PairRM](https://huggingface.co/llm-blender/pair-reward-model/) | 1224 | 412 | 2048 |
| [Better-PairRM](https://huggingface.co/maywell/Better-PairRM/) (This model) | 2030 | 670 | 3370 |
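
Note that each row satisfies total = source + 2 × candidate (e.g. 2030 + 2 × 670 = 3370), which is consistent with the ranker encoding the source together with both candidates in a single pass.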

### Performance

#### Reward-Bench by AllenAI

| Metric | llm-blender/PairRM-hf | maywell/Better-PairRM |
|----------------------------|------------------------|------------------------|
| model_type | Custom Classifier | Custom Classifier |
| alpacaeval-length | 0.758 | **0.863** |
| alpacaeval-hard | 0.979 | **1.000** |
| alpacaeval-easy | 0.970 | **0.990** |
| donotanswer | 0.360 | **0.522** |
| hep-cpp | 0.628 | **0.646** |
| hep-go | 0.689 | **0.713** |
| hep-java | 0.628 | **0.713** |
| hep-js | 0.604 | **0.707** |
| hep-python | 0.646 | **0.713** |
| hep-rust | 0.652 | **0.726** |
| llmbar-adver-GPTInst | **0.304** | 0.141 |
| llmbar-adver-GPTOut | **0.596** | 0.447 |
| llmbar-adver-manual | **0.500** | 0.261 |
| llmbar-adver-neighbor | **0.433** | 0.276 |
| llmbar-natural | **0.800** | 0.720 |
| math-prm | **0.333** | 0.295 |
| mt-bench-hard | 0.649 | **0.703** |
| mt-bench-med | 0.900 | **1.000** |
| mt-bench-easy | **0.964** | 0.929 |
| refusals-dangerous | 0.080 | **0.730** |
| refusals-offensive | 0.010 | **0.940** |
| xstest-should-refuse | 0.370 | **0.968** |
| xstest-should-respond | **0.952** | 0.876 |
| average | 0.600 | **0.690** |

> *Note: the llmbar test scores look a bit odd across all models on [Reward-Bench](https://huggingface.co/spaces/allenai/reward-bench).*

## Thanks to

- [Sionic AI](https://sionic.ai/) for providing the A100 cluster.

## Contact

- [Discord Server Link](https://discord.gg/MrBt3PXdXc)