prince-canuma committed: Add open-llm-leaderboard results

README.md CHANGED
### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

I used the [SlimOrca](https://huggingface.co/datasets/Open-Orca/SlimOrca) dataset, a curated subset of the broader [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) dataset. This release reaches performance on par with training on larger slices of OpenOrca while including only ~500k GPT-4 completions.
Subsequently, I created two subsets of 102,000 and 1,000 samples:

- [prince-canuma/SmallOrca](https://huggingface.co/datasets/prince-canuma/SmallOrca)
- [prince-canuma/TinyOrca](https://huggingface.co/datasets/prince-canuma/TinyOrca)

I experimented with both subsets, but the best results came from fine-tuning on a modest set of 200 samples. Training on more data beyond this threshold mainly improved the model's proficiency at generating Chain-of-Thought responses. Chain-of-Thought output is not universally preferable, though: in scenarios like a RAG setup, succinct answers are often favored, especially for straightforward queries.
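As a minimal sketch (assuming the 🤗 `datasets` library; the shuffle seed is illustrative, not the exact procedure used to build SmallOrca and TinyOrca), fixed-size subsets like these can be drawn as follows:

```python
from datasets import load_dataset

# Load the full SlimOrca training split (~500k GPT-4 completions).
slim_orca = load_dataset("Open-Orca/SlimOrca", split="train")

# Shuffle once, then take fixed-size slices. The seed is an
# illustrative assumption, not the one actually used.
shuffled = slim_orca.shuffle(seed=42)
small_orca = shuffled.select(range(102_000))  # SmallOrca-sized subset
tiny_orca = shuffled.select(range(1_000))     # TinyOrca-sized subset
sft_set = shuffled.select(range(200))         # the 200-sample fine-tuning set
```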
### Training Procedure

3. Mask instructions (System and User) at training time (see the sketch below).
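A minimal sketch of what this masking step can look like, assuming a chat-template tokenizer and the common convention of setting prompt-token labels to `-100` so that only assistant tokens contribute to the loss; the helper name is hypothetical:

```python
def mask_instructions(tokenizer, messages):
    """Build input_ids/labels where only the assistant reply is supervised.

    `messages` follows the chat format, e.g.
    [{"role": "system", ...}, {"role": "user", ...}, {"role": "assistant", ...}].
    """
    # Tokenize the prompt (System + User turns) and the full conversation.
    prompt_ids = tokenizer.apply_chat_template(
        messages[:-1], add_generation_prompt=True, return_tensors="pt"
    )[0]
    full_ids = tokenizer.apply_chat_template(messages, return_tensors="pt")[0]

    labels = full_ids.clone()
    # -100 is ignored by PyTorch's cross-entropy loss, so the System and
    # User tokens are masked out of the training objective.
    labels[: prompt_ids.shape[0]] = -100
    return {"input_ids": full_ids, "labels": labels}
```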
#### Training Hyperparameters

- **Training regime:** bf16 mixed precision <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

[TODO]
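To illustrate the bf16 mixed-precision regime in `transformers` (every value other than `bf16=True` is a placeholder assumption, since the remaining hyperparameters are still [TODO]):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./checkpoints",
    bf16=True,                      # bf16 mixed precision, as stated above
    per_device_train_batch_size=4,  # placeholder
    gradient_accumulation_steps=4,  # placeholder
    learning_rate=2e-5,             # placeholder
    num_train_epochs=3,             # placeholder
)
```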
## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

We evaluate models on 6 key benchmarks using the [Eleuther AI Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness), a unified framework to test generative language models on a large number of different evaluation tasks:
- AI2 Reasoning Challenge (25-shot) - a set of grade-school science questions.
- HellaSwag (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
- MMLU (5-shot) - a test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
- TruthfulQA (0-shot) - a test to measure a model's propensity to reproduce falsehoods commonly found online. Note: TruthfulQA is technically a 6-shot task in the Harness because each example is prepended with 6 Q/A pairs, even in the 0-shot setting.
- Winogrande (5-shot) - an adversarial and difficult Winograd benchmark at scale, for commonsense reasoning.
- GSM8k (5-shot) - diverse grade school math word problems to measure a model's ability to solve multi-step mathematical reasoning problems.

For all these evaluations, a higher score is a better score. We chose these benchmarks as they test a variety of reasoning and general knowledge across a wide variety of fields in 0-shot and few-shot settings.

Read more [here](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
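For reference, a sketch of running this suite through the harness's Python API (task names follow lm-eval v0.4 conventions; the repo id is a placeholder, and exact arguments may differ across harness versions):

```python
import lm_eval  # EleutherAI lm-evaluation-harness (v0.4+)

model_id = "your-org/your-model"  # placeholder repo id

# Each leaderboard benchmark uses its own few-shot setting.
tasks = {
    "arc_challenge": 25,
    "hellaswag": 10,
    "mmlu": 5,
    "truthfulqa_mc2": 0,  # scored 0-shot on the leaderboard
    "winogrande": 5,
    "gsm8k": 5,
}

for task, shots in tasks.items():
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_id}",
        tasks=[task],
        num_fewshot=shots,
    )
    print(task, results["results"][task])
```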
[TODO]

### Results

```json
{
  "AVG": {
    "acc": 60.49
  },
  "ARC": {
    "acc": 59.81
  },
  "HellaSwag": {
    "acc": 74.52
  },
  "MMLU": {
    "acc": 56.33
  },
  "truthfulqa": {
    "acc": 46.74
  },
  "winogrande": {
    "acc": 75.00
  },
  "gsm8k": {
    "acc": 50.64
  }
}
```
## Technical Specifications

- Bitsandbytes
- Plotly
## Future Work

I plan to explore the following tuning setups:

- Function calling
- DPO
## Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

```bibtex
  year={2024},
}
```
```bibtex
@misc{SlimOrca,
  title = {SlimOrca: An Open Dataset of GPT-4 Augmented FLAN Reasoning Traces, with Verification},
  author = {Wing Lian and Guan Wang and Bleys Goodson and Eugene Pentland and Austin Cook and Chanvichet Vong and "Teknium"},
  year = {2023},
  publisher = {HuggingFace},
  url = {https://huggingface.co/datasets/Open-Orca/SlimOrca}
}
```
```bibtex
@misc{open-llm-leaderboard,
  author = {Edward Beeching and Clémentine Fourrier and Nathan Habib and Sheon Han and Nathan Lambert and Nazneen Rajani and Omar Sanseviero and Lewis Tunstall and Thomas Wolf},
  title = {Open LLM Leaderboard},
  year = {2023},
  publisher = {Hugging Face},
  howpublished = "\url{https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard}"
}
```