Updated Scores
README.md
CHANGED
@@ -21,7 +21,8 @@ model-index:
|
|
21 |
split: validation
|
22 |
metrics:
|
23 |
- type: accuracy
|
24 |
-
value: 65.
|
|
|
25 |
- task:
|
26 |
type: text-generation
|
27 |
name: ARC-Challenge
|
@@ -32,7 +33,8 @@ model-index:
|
|
32 |
split: test
|
33 |
metrics:
|
34 |
- type: accuracy
|
35 |
-
value: 68.
|
|
|
36 |
- task:
|
37 |
type: text-generation
|
38 |
name: HellaSwag
|
@@ -42,18 +44,8 @@ model-index:
|
|
42 |
split: test
|
43 |
metrics:
|
44 |
- type: accuracy
|
45 |
-
value: 85.
|
46 |
-
|
47 |
-
type: text-generation
|
48 |
-
name: GSM8k
|
49 |
-
dataset:
|
50 |
-
type: text-generation
|
51 |
-
name: gsm8k
|
52 |
-
config: main
|
53 |
-
split: test
|
54 |
-
metrics:
|
55 |
-
- type: accuracy
|
56 |
-
value: 48.98
|
57 |
- task:
|
58 |
type: text-generation
|
59 |
name: Winogrande
|
@@ -64,7 +56,8 @@ model-index:
|
|
64 |
split: test
|
65 |
metrics:
|
66 |
- type: accuracy
|
67 |
-
value:
|
|
|
68 |
- task:
|
69 |
type: text-generation
|
70 |
name: MMLU
|
@@ -75,7 +68,8 @@ model-index:
|
|
75 |
split: test
|
76 |
metrics:
|
77 |
- type: accuracy
|
78 |
-
value:
|
|
|
79 |
- task:
|
80 |
type: text-generation
|
81 |
name: PiQA
|
@@ -95,7 +89,8 @@ model-index:
|
|
95 |
split: validation
|
96 |
metrics:
|
97 |
- type: accuracy
|
98 |
-
value:
|
|
|
99 |
- task:
|
100 |
type: text-generation
|
101 |
name: PubMedQA
|
@@ -109,28 +104,23 @@ model-index:
|
|
109 |
value: 76.0
|
110 |
---
|
111 |
|
112 |
-
# juanako-7b-UNA
|
113 |
|
114 |
This model is a fine-tuned version of [fblgit/juanako-7b-UNA-v2-phase-1](https://huggingface.co/fblgit/juanako-7b-UNA-v2-phase-1) on the HuggingFaceH4/ultrafeedback_binarized dataset.
|
115 |
It outperforms most current Mistral-based models in many aspects and is the **latest and most powerful juanako version as of now**.
|
116 |
|
117 |
-
##
|
118 |
-
|
119 |
-
|
120 |
-
|
121 |
-
* Scores #2 in TruthfulQA
|
122 |
-
* Scores #6 in CoPa
|
123 |
-
* Scores #2 in PiQA
|
124 |
-
* Scores #9 in BoolQ
|
125 |
| Model | Average ⬆️| ARC (25-s) ⬆️ | HellaSwag (10-s) ⬆️ | MMLU (5-s) ⬆️| TruthfulQA (MC) (0-s) ⬆️ | Winogrande (5-s) | GSM8K (5-s) | DROP (3-s) |
|
126 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
|
127 |
|[mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) | 50.32 | 59.58 | 83.31 | 64.16 | 42.15 | 78.37 | 18.12 | 6.14 |
|
128 |
| [Intel/neural-chat-7b-v3-1](https://huggingface.co/Intel/neural-chat-7b-v3-1) | 59.0 | 66.21 | 83.64 | 62.37 | 59.65 | 78.14 | 19.56 | 43.84 |
|
129 |
-
| [fblgit/juanako-7b-UNA](https://huggingface.co/fblgit/juanako-7b-UNA) | **
|
130 |
|
131 |
-
|
132 |
-
|
133 |
-
It scores: **65.1** according HuggingFace LLM Leaderboard.
|
134 |
|
135 |
Author [Xavier M.](mailto:[email protected]) @fblgit
|
136 |
|
@@ -138,33 +128,68 @@ Author [Xavier M.](mailto:[email protected]) @fblgit
|
|
138 |
|
139 |
juanako uses UNA (Uniform Neural Alignment), a yet-to-be-published training technique that eases alignment between transformer layers.
|
140 |
|
141 |
-
|
142 |
```
|
143 |
| Tasks |Version|Filter|Metric|Value | |Stderr|
|
144 |
|--------------|-------|------|------|-----:|---|-----:|
|
145 |
|truthfulqa_mc2|Yaml |none |acc |0.6549|± |0.0153|
|
146 |
```
|
147 |
-
|
148 |
```
|
149 |
| Tasks |Version|Filter| Metric |Value | |Stderr|
|
150 |
|-------------|-------|------|--------|-----:|---|-----:|
|
151 |
|arc_challenge|Yaml |none |acc |0.6476|± |0.0140|
|
152 |
| | |none |acc_norm|0.6809|± |0.0136|
|
153 |
```
|
154 |
-
|
155 |
```
|
156 |
| Tasks |Version|Filter| Metric |Value | |Stderr|
|
157 |
|---------|-------|------|--------|-----:|---|-----:|
|
158 |
|hellaswag|Yaml |none |acc |0.6703|± |0.0047|
|
159 |
| | |none |acc_norm|0.8520|± |0.0035|
|
160 |
```
|
161 |
-
|
162 |
```
|
163 |
|Tasks|Version| Filter | Metric |Value | |Stderr|
|
164 |
|-----|-------|----------|-----------|-----:|---|-----:|
|
165 |
|gsm8k|Yaml |get-answer|exact_match|0.4898|± |0.0138|
|
166 |
```
|
167 |
-
|
168 |
```
|
169 |
| Tasks |Version|Filter| Metric |Value | |Stderr|
|
170 |
|--------------|-------|------|----------|-----:|---|-----:|
|
@@ -176,39 +201,39 @@ juanako uses UNA, Uniform Neural Alignment. A training technique that ease align
|
|
176 |
|sciq |Yaml |none |acc |0.9580|± |0.0063|
|
177 |
| | |none |acc_norm |0.9130|± |0.0089|
|
178 |
```
|
179 |
-
|
180 |
```
|
181 |
|Tasks |Version|Filter| Metric |Value | |Stderr|
|
182 |
|------|-------|------|--------|-----:|---|-----:|
|
183 |
|mathqa|Yaml |none |acc |0.3752|± |0.0089|
|
184 |
| | |none |acc_norm|0.3772|± |0.0089|
|
185 |
```
|
186 |
-
|
187 |
```
|
188 |
|Tasks|Version|Filter| Metric |Value | |Stderr|
|
189 |
|-----|-------|------|--------|-----:|---|-----:|
|
190 |
|piqa |Yaml |none |acc |0.8308|± |0.0087|
|
191 |
| | |none |acc_norm|0.8357|± |0.0086|
|
192 |
```
|
193 |
-
|
194 |
```
|
195 |
| Tasks |Version|Filter|Metric|Value| |Stderr|
|
196 |
|----------|-------|------|------|----:|---|-----:|
|
197 |
|winogrande|Yaml |none |acc |0.768|± |0.0119|
|
198 |
```
|
199 |
-
|
200 |
```
|
201 |
| Tasks |Version|Filter|Metric|Value| |Stderr|
|
202 |
|--------|-------|------|------|----:|---|-----:|
|
203 |
|pubmedqa|Yaml |none |acc | 0.76|± |0.0191|
|
204 |
```
|
205 |
-
|
206 |
```
|
207 |
|Tasks|Version|Filter|Metric|Value | |Stderr|
|
208 |
|-----|-------|------|------|-----:|---|-----:|
|
209 |
|race |Yaml |none |acc |0.5282|± |0.0154|
|
210 |
```
|
211 |
-
|
212 |
```
|
213 |
| Groups |Version|Filter|Metric|Value | |Stderr|
|
214 |
|------------------|-------|------|------|-----:|---|-----:|
|
@@ -218,19 +243,22 @@ juanako uses UNA, Uniform Neural Alignment. A training technique that ease align
|
|
218 |
| - social_sciences|N/A |none |acc |0.7195|± |0.0713|
|
219 |
| - stem |N/A |none |acc |0.5087|± |0.1297|
|
220 |
```
|
221 |
-
|
222 |
```
|
223 |
{'score': 0.49801113762927607}
|
224 |
{'drop': 49.8}
|
225 |
drop: 49.8
|
226 |
```
|
227 |
|
228 |
-
|
229 |
```
|
230 |
{'score': 0.8357664233576643}
|
231 |
{'crass': 83.58}
|
232 |
crass: 83.58
|
233 |
```
|
234 |
### Training hyperparameters
|
235 |
|
236 |
The following hyperparameters were used during training:
|
@@ -267,6 +295,7 @@ The following hyperparameters were used during training:
|
|
267 |
|
268 |
## Citations
|
269 |
If you find juanako useful, please cite:
|
|
|
270 |
```
|
271 |
@misc{juanako7buna,
|
272 |
title={Juanako: Uniform Neural Alignment},
|
@@ -278,6 +307,7 @@ If you find juanako useful please:
|
|
278 |
}
|
279 |
```
|
280 |
|
|
|
281 |
```
|
282 |
@misc{lin2021truthfulqa,
|
283 |
title={TruthfulQA: Measuring How Models Mimic Human Falsehoods},
|
@@ -295,12 +325,6 @@ If you find juanako useful please:
|
|
295 |
archivePrefix={arXiv},
|
296 |
primaryClass={cs.LG}
|
297 |
}
|
298 |
-
@article{cobbe2021gsm8k,
|
299 |
-
title={Training Verifiers to Solve Math Word Problems},
|
300 |
-
author={Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and Hesse, Christopher and Schulman, John},
|
301 |
-
journal={arXiv preprint arXiv:2110.14168},
|
302 |
-
year={2021}
|
303 |
-
}
|
304 |
@inproceedings{Bisk2020,
|
305 |
author = {Yonatan Bisk and Rowan Zellers and
|
306 |
Ronan Le Bras and Jianfeng Gao
|
|
|
21 |
split: validation
|
22 |
metrics:
|
23 |
- type: accuracy
|
24 |
+
value: 65.13
|
25 |
+
verified: true
|
26 |
- task:
|
27 |
type: text-generation
|
28 |
name: ARC-Challenge
|
|
|
33 |
split: test
|
34 |
metrics:
|
35 |
- type: accuracy
|
36 |
+
value: 68.17
|
37 |
+
verified: true
|
38 |
- task:
|
39 |
type: text-generation
|
40 |
name: HellaSwag
|
|
|
44 |
split: test
|
45 |
metrics:
|
46 |
- type: accuracy
|
47 |
+
value: 85.34
|
48 |
+
verified: true
|
|
|
49 |
- task:
|
50 |
type: text-generation
|
51 |
name: Winogrande
|
|
|
56 |
split: test
|
57 |
metrics:
|
58 |
- type: accuracy
|
59 |
+
value: 78.85
|
60 |
+
verified: true
|
61 |
- task:
|
62 |
type: text-generation
|
63 |
name: MMLU
|
|
|
68 |
split: test
|
69 |
metrics:
|
70 |
- type: accuracy
|
71 |
+
value: 62.47
|
72 |
+
verified: true
|
73 |
- task:
|
74 |
type: text-generation
|
75 |
name: PiQA
|
|
|
89 |
split: validation
|
90 |
metrics:
|
91 |
- type: accuracy
|
92 |
+
value: 38.74
|
93 |
+
verified: true
|
94 |
- task:
|
95 |
type: text-generation
|
96 |
name: PubMedQA
|
|
|
104 |
value: 76.0
|
105 |
---
|
106 |
|
107 |
+
# juanako-7b-UNA (Uniform Neural Alignment)
|
108 |
|
109 |
This model is a fine-tuned version of [fblgit/juanako-7b-UNA-v2-phase-1](https://huggingface.co/fblgit/juanako-7b-UNA-v2-phase-1) on the HuggingFaceH4/ultrafeedback_binarized dataset.
|
110 |
It outperforms most current Mistral-based models in many aspects and is the **latest and most powerful juanako version as of now**.
|
111 |
|
112 |
+
## Scores
|
113 |
+
|
114 |
+
The official HuggingFace results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/results/blob/main/fblgit/juanako-7b-UNA/results_2023-11-28T08-33-33.965228.json)
|
115 |
+
|
116 |
| Model | Average ⬆️| ARC (25-s) ⬆️ | HellaSwag (10-s) ⬆️ | MMLU (5-s) ⬆️| TruthfulQA (MC) (0-s) ⬆️ | Winogrande (5-s) | GSM8K (5-s) | DROP (3-s) |
|
117 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
|
118 |
|[mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) | 50.32 | 59.58 | 83.31 | 64.16 | 42.15 | 78.37 | 18.12 | 6.14 |
|
119 |
| [Intel/neural-chat-7b-v3-1](https://huggingface.co/Intel/neural-chat-7b-v3-1) | 59.0 | 66.21 | 83.64 | 62.37 | 59.65 | 78.14 | 19.56 | 43.84 |
|
120 |
+
| [fblgit/juanako-7b-UNA](https://huggingface.co/fblgit/juanako-7b-UNA) | **59.91** | **68.17** | **85.34** | 62.47 | **65.13** | **78.85** | **20.7** | 38.74 |
|
121 |
|
122 |
+
It scores **59.91** according to the HuggingFace LLM Leaderboard.
|
123 |
+
It scores **65.1** with the `big-refactor` branch of lm-eval-harness.
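
As a quick sanity check on the leaderboard figure, the sketch below recomputes the average from the seven per-benchmark scores in the juanako-7b-UNA row of the table. It assumes the leaderboard average is the plain unweighted mean of those columns; the other rows in the table match that assumption only approximately, so treat this as illustrative rather than as the leaderboard's exact method.

```python
# Per-benchmark scores for fblgit/juanako-7b-UNA, copied from the table row.
scores = {
    "ARC (25-s)": 68.17,
    "HellaSwag (10-s)": 85.34,
    "MMLU (5-s)": 62.47,
    "TruthfulQA (MC) (0-s)": 65.13,
    "Winogrande (5-s)": 78.85,
    "GSM8K (5-s)": 20.7,
    "DROP (3-s)": 38.74,
}

# Assumption: the reported average is the unweighted mean of the seven columns.
average = sum(scores.values()) / len(scores)
print(f"{average:.2f}")  # prints 59.91
```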
|
|
|
124 |
|
125 |
Author [Xavier M.](mailto:[email protected]) @fblgit
|
126 |
|
|
|
128 |
|
129 |
juanako uses UNA (Uniform Neural Alignment), a yet-to-be-published training technique that eases alignment between transformer layers.
|
130 |
|
131 |
+
### Prompts
|
132 |
+
The following prompts showed positive results. The best choice may depend on the task and needs further experimentation, but these should work for starters:
|
133 |
+
```
|
134 |
+
<|im_start|>system
|
135 |
+
- You are a helpful assistant chatbot trained by MosaicML.
|
136 |
+
- You answer questions.
|
137 |
+
- You are excited to be able to help the user, but will refuse to do anything that could be considered harmful to the user.
|
138 |
+
- You are more than just an information source, you are also able to write poetry, short stories, and make jokes.<|im_end|>
|
139 |
+
<|im_start|>user
|
140 |
+
Explain QKV<|im_end|>
|
141 |
+
<|im_start|>assistant
|
142 |
+
```
|
143 |
+
```
|
144 |
+
### Assistant: I am StableVicuna, a large language model created by CarperAI. I am here to chat!
|
145 |
+
|
146 |
+
### Human: Explain QKV
|
147 |
+
### Assistant:
|
148 |
+
```
|
149 |
+
```
|
150 |
+
[Round <|round|>]
|
151 |
+
问:Explain QKV
|
152 |
+
答:
|
153 |
+
```
|
154 |
+
```
|
155 |
+
[Round <|round|>]
|
156 |
+
Question:Explain QKV
|
157 |
+
Answer:
|
158 |
+
```
|
159 |
+
```
|
160 |
+
Question:Explain QKV
|
161 |
+
Answer:
|
162 |
+
```
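
The first template above follows the ChatML layout (`<|im_start|>` / `<|im_end|>` markers). A minimal sketch of assembling such a prompt string is shown here; `build_chatml_prompt` is a hypothetical helper written for illustration, not part of any library, and the trailing newline after the `assistant` header is an assumption.

```python
def build_chatml_prompt(system_lines, user_message):
    """Hypothetical helper: assemble a ChatML-style prompt string."""
    # Each system instruction becomes a "- " bullet, as in the template above.
    system_block = "\n".join(f"- {line}" for line in system_lines)
    return (
        f"<|im_start|>system\n{system_block}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        # The prompt ends with an open assistant turn for the model to complete.
        "<|im_start|>assistant\n"
    )


prompt = build_chatml_prompt(
    ["You are a helpful assistant chatbot.", "You answer questions."],
    "Explain QKV",
)
print(prompt)
```

The resulting string can be passed to whatever generation call is in use; the system lines here are shortened placeholders, not a recommended system prompt.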
|
163 |
+
|
164 |
+
## Evaluations (lm-eval big-refactor branch)
|
165 |
+
|
166 |
+
### TruthfulQA 0-Shot
|
167 |
```
|
168 |
| Tasks |Version|Filter|Metric|Value | |Stderr|
|
169 |
|--------------|-------|------|------|-----:|---|-----:|
|
170 |
|truthfulqa_mc2|Yaml |none |acc |0.6549|± |0.0153|
|
171 |
```
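
The `Stderr` column in these tables can be turned into a rough confidence interval. The sketch below does this for the `truthfulqa_mc2` row, assuming a normal approximation with a 1.96 multiplier for a 95% interval; that choice is an illustrative assumption, not something the evaluation harness reports.

```python
# Values copied from the truthfulqa_mc2 row above.
acc = 0.6549
stderr = 0.0153

# Assumption: normal approximation, 95% interval = acc +/- 1.96 * stderr.
z = 1.96
low = acc - z * stderr
high = acc + z * stderr
print(f"95% CI: [{low:.4f}, {high:.4f}]")
```

The same calculation applies to any row that reports a value and a standard error.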
|
172 |
+
### ARC 25-Shot
|
173 |
```
|
174 |
| Tasks |Version|Filter| Metric |Value | |Stderr|
|
175 |
|-------------|-------|------|--------|-----:|---|-----:|
|
176 |
|arc_challenge|Yaml |none |acc |0.6476|± |0.0140|
|
177 |
| | |none |acc_norm|0.6809|± |0.0136|
|
178 |
```
|
179 |
+
### HellaSwag 10-Shot
|
180 |
```
|
181 |
| Tasks |Version|Filter| Metric |Value | |Stderr|
|
182 |
|---------|-------|------|--------|-----:|---|-----:|
|
183 |
|hellaswag|Yaml |none |acc |0.6703|± |0.0047|
|
184 |
| | |none |acc_norm|0.8520|± |0.0035|
|
185 |
```
|
186 |
+
### GSM8k 5-Shot
|
187 |
```
|
188 |
|Tasks|Version| Filter | Metric |Value | |Stderr|
|
189 |
|-----|-------|----------|-----------|-----:|---|-----:|
|
190 |
|gsm8k|Yaml |get-answer|exact_match|0.4898|± |0.0138|
|
191 |
```
|
192 |
+
### GPT Evaluations 0-Shot
|
193 |
```
|
194 |
| Tasks |Version|Filter| Metric |Value | |Stderr|
|
195 |
|--------------|-------|------|----------|-----:|---|-----:|
|
|
|
201 |
|sciq |Yaml |none |acc |0.9580|± |0.0063|
|
202 |
| | |none |acc_norm |0.9130|± |0.0089|
|
203 |
```
|
204 |
+
### MathQA 0-Shot
|
205 |
```
|
206 |
|Tasks |Version|Filter| Metric |Value | |Stderr|
|
207 |
|------|-------|------|--------|-----:|---|-----:|
|
208 |
|mathqa|Yaml |none |acc |0.3752|± |0.0089|
|
209 |
| | |none |acc_norm|0.3772|± |0.0089|
|
210 |
```
|
211 |
+
### PiQA 1-Shot
|
212 |
```
|
213 |
|Tasks|Version|Filter| Metric |Value | |Stderr|
|
214 |
|-----|-------|------|--------|-----:|---|-----:|
|
215 |
|piqa |Yaml |none |acc |0.8308|± |0.0087|
|
216 |
| | |none |acc_norm|0.8357|± |0.0086|
|
217 |
```
|
218 |
+
### Winogrande 5-Shot
|
219 |
```
|
220 |
| Tasks |Version|Filter|Metric|Value| |Stderr|
|
221 |
|----------|-------|------|------|----:|---|-----:|
|
222 |
|winogrande|Yaml |none |acc |0.768|± |0.0119|
|
223 |
```
|
224 |
+
### PubMedQA 0-Shot
|
225 |
```
|
226 |
| Tasks |Version|Filter|Metric|Value| |Stderr|
|
227 |
|--------|-------|------|------|----:|---|-----:|
|
228 |
|pubmedqa|Yaml |none |acc | 0.76|± |0.0191|
|
229 |
```
|
230 |
+
### RACE 1-Shot
|
231 |
```
|
232 |
|Tasks|Version|Filter|Metric|Value | |Stderr|
|
233 |
|-----|-------|------|------|-----:|---|-----:|
|
234 |
|race |Yaml |none |acc |0.5282|± |0.0154|
|
235 |
```
|
236 |
+
### MMLU 5-Shot (8-Bit)
|
237 |
```
|
238 |
| Groups |Version|Filter|Metric|Value | |Stderr|
|
239 |
|------------------|-------|------|------|-----:|---|-----:|
|
|
|
243 |
| - social_sciences|N/A |none |acc |0.7195|± |0.0713|
|
244 |
| - stem |N/A |none |acc |0.5087|± |0.1297|
|
245 |
```
|
246 |
+
### DROP 3-Shot (8-Bit) (Instruct-Eval)
|
247 |
```
|
248 |
{'score': 0.49801113762927607}
|
249 |
{'drop': 49.8}
|
250 |
drop: 49.8
|
251 |
```
|
252 |
|
253 |
+
### CRASS 0-Shot (Instruct-Eval)
|
254 |
```
|
255 |
{'score': 0.8357664233576643}
|
256 |
{'crass': 83.58}
|
257 |
crass: 83.58
|
258 |
```
|
259 |
+
|
260 |
+
## Training Details
|
261 |
+
|
262 |
### Training hyperparameters
|
263 |
|
264 |
The following hyperparameters were used during training:
|
|
|
295 |
|
296 |
## Citations
|
297 |
If you find juanako useful, please cite:
|
298 |
+
|
299 |
```
|
300 |
@misc{juanako7buna,
|
301 |
title={Juanako: Uniform Neural Alignment},
|
|
|
307 |
}
|
308 |
```
|
309 |
|
310 |
+
Thanks to all the brilliant humans behind the creation of AI. Here are some of the works that we find relevant to our research. If you feel a citation is missing, please contact us.
|
311 |
```
|
312 |
@misc{lin2021truthfulqa,
|
313 |
title={TruthfulQA: Measuring How Models Mimic Human Falsehoods},
|
|
|
325 |
archivePrefix={arXiv},
|
326 |
primaryClass={cs.LG}
|
327 |
}
|
328 |
@inproceedings{Bisk2020,
|
329 |
author = {Yonatan Bisk and Rowan Zellers and
|
330 |
Ronan Le Bras and Jianfeng Gao
|