prince-canuma committed
Commit 7a19e2b · verified · 1 Parent(s): d805640

Add open-llm-leaderboard results

Files changed (1)
  1. README.md +73 -19
README.md CHANGED
@@ -109,7 +109,19 @@ I highly recommend it to anyone who enjoys a well-crafted and emotionally engagi
  ### Training Data

  <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
- I used [SlimOrca](https://huggingface.co/datasets/Open-Orca/SlimOrca) dataset, a new curated subset of our OpenOrca data. This release provides an efficient means of reaching performance on-par with using larger slices of the [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca), while only including ~500k GPT-4 completions.

  ### Training Procedure

@@ -123,39 +135,57 @@ I used [SlimOrca](https://huggingface.co/datasets/Open-Orca/SlimOrca) dataset, a
  3. Mask instructions (System and User) at training time.

-
  #### Training Hyperparameters

- - **Training regime:** bf16 mixed precision <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

  ## Evaluation

  <!-- This section describes the evaluation protocols and provides the results. -->

- ### Testing Data, Factors & Metrics
-
- #### Testing Data
-
- <!-- This should link to a Dataset Card if possible. -->
-
- [TODO]
-
- #### Factors
-
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
- [TODO]

- #### Metrics

- <!-- These are the evaluation metrics being used, ideally with a description of why. -->

  [TODO]

  ### Results

- [TODO]

  ## Technical Specifications

@@ -180,6 +210,12 @@ I used [SlimOrca](https://huggingface.co/datasets/Open-Orca/SlimOrca) dataset, a
  - Bitsandbytes
  - Plotly

  ## Citation

  <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
@@ -192,3 +228,21 @@ I used [SlimOrca](https://huggingface.co/datasets/Open-Orca/SlimOrca) dataset, a
  year={2024},
  }
  ```

  ### Training Data

  <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+ I used the [SlimOrca](https://huggingface.co/datasets/Open-Orca/SlimOrca) dataset, a carefully curated subset of the broader [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) dataset. SlimOrca reaches performance on par with much larger slices of OpenOrca while containing only ~500k GPT-4 completions.
+
+ From it, I created two smaller subsets of 102,000 and 1,000 samples:
+ - [prince-canuma/SmallOrca](https://huggingface.co/datasets/prince-canuma/SmallOrca)
+ - [prince-canuma/TinyOrca](https://huggingface.co/datasets/prince-canuma/TinyOrca)
+
+ I experimented with both subsets, but the best results came from fine-tuning on a modest set of 200 samples. Training on more data than that mainly made the model more prone to long Chain-of-Thought responses, which is not always desirable: in setups such as RAG, succinct answers are often preferred, especially for straightforward queries.

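The exact sampling procedure for these subsets is not documented here, so the following is only a minimal sketch, assuming a simple random sample drawn from SlimOrca with the `datasets` library; the subset size, seed, and target repo id are placeholders.

```python
# Minimal sketch (assumptions: random sampling, seed 42, 200 examples).
# This is NOT the exact procedure used to build SmallOrca/TinyOrca.
from datasets import load_dataset

slim_orca = load_dataset("Open-Orca/SlimOrca", split="train")

# Draw a small fine-tuning subset, e.g. the ~200 samples mentioned above.
subset = slim_orca.shuffle(seed=42).select(range(200))
print(subset)

# Optionally publish it under your own namespace (placeholder repo id).
# subset.push_to_hub("your-username/my-orca-subset")
```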
  ### Training Procedure

  3. Mask instructions (System and User) at training time.

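As an illustration of step 3 only (not necessarily the exact implementation used for this model), a common way to mask the System and User turns is to set their label ids to -100, the ignore index of PyTorch's cross-entropy loss, so the loss is computed only on the assistant completion. The tokenizer and prompt format below are placeholders.

```python
# Sketch of instruction masking: labels for prompt tokens are set to -100
# so they do not contribute to the loss; only the completion is learned.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer

prompt = "System: You are a helpful assistant.\nUser: What is 2 + 2?\nAssistant:"
completion = " 4"

prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
completion_ids = tokenizer(completion, add_special_tokens=False)["input_ids"]

input_ids = prompt_ids + completion_ids
labels = [-100] * len(prompt_ids) + completion_ids  # mask System + User tokens

assert len(input_ids) == len(labels)
```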
  #### Training Hyperparameters

+ - **Training regime:** bf16 mixed precision <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

+ [TODO]

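Since the full hyperparameter set is still marked [TODO], the snippet below is only an illustrative sketch of enabling bf16 mixed precision with the Hugging Face `transformers` trainer; every value is an assumption, not the configuration actually used.

```python
# Illustrative only: bf16 mixed-precision training arguments.
# All values are placeholders, not this model's actual hyperparameters.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="outputs",
    bf16=True,  # bf16 mixed precision (requires Ampere+ GPUs or compatible hardware)
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    num_train_epochs=1,
    logging_steps=10,
)
```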
  ## Evaluation

  <!-- This section describes the evaluation protocols and provides the results. -->

+ We evaluate the model on six key benchmarks using the [Eleuther AI Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness), a unified framework for testing generative language models on a large number of different evaluation tasks:
+ - AI2 Reasoning Challenge (25-shot) - a set of grade-school science questions.
+ - HellaSwag (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
+ - MMLU (5-shot) - a test measuring a text model's multitask accuracy across 57 tasks, including elementary mathematics, US history, computer science, law, and more.
+ - TruthfulQA (0-shot) - a test of a model's propensity to reproduce falsehoods commonly found online. Note: TruthfulQA is technically a 6-shot task in the Harness because each example is prepended with 6 Q/A pairs, even in the 0-shot setting.
+ - Winogrande (5-shot) - an adversarial and difficult Winograd benchmark at scale, for commonsense reasoning.
+ - GSM8k (5-shot) - diverse grade-school math word problems measuring a model's ability to solve multi-step mathematical reasoning problems.
+
+ For all of these benchmarks, a higher score is better. We chose them because they test reasoning and general knowledge across a wide variety of fields in 0-shot and few-shot settings. A minimal evaluation sketch follows below.
+
+ Read more [here](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

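The leaderboard pins a specific harness version and prompt setup, so scores will not match it exactly, but a comparable run with the `lm-eval` Python API (v0.4+) might look like the sketch below; the task names, few-shot counts, and model id are assumptions.

```python
# Minimal sketch: run leaderboard-style tasks with lm-eval (v0.4+).
# The model id below is a placeholder, not this model's actual repo.
import lm_eval

TASKS = [
    ("arc_challenge", 25),
    ("hellaswag", 10),
    ("mmlu", 5),
    ("truthfulqa_mc2", 0),
    ("winogrande", 5),
    ("gsm8k", 5),
]

for task, n_shot in TASKS:
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=your-username/your-model,dtype=bfloat16",  # placeholder
        tasks=[task],
        num_fewshot=n_shot,
        batch_size=8,
    )
    print(task, results["results"].get(task))
```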
  [TODO]

  ### Results

+ ```json
+ {
+   "AVG": { "acc": 60.49 },
+   "ARC": { "acc": 59.81 },
+   "HellaSwag": { "acc": 74.52 },
+   "MMLU": { "acc": 56.33 },
+   "TruthfulQA": { "acc": 46.74 },
+   "Winogrande": { "acc": 75.00 },
+   "GSM8k": { "acc": 50.64 }
+ }
+ ```

  ## Technical Specifications


  - Bitsandbytes
  - Plotly

+ ## Future Work
+
+ I plan to explore the following tuning setups:
+ - Function calling
+ - DPO
+
  ## Citation

  <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

  year={2024},
  }
  ```
+ ```bibtex
+ @misc{SlimOrca,
+   title = {SlimOrca: An Open Dataset of GPT-4 Augmented FLAN Reasoning Traces, with Verification},
+   author = {Wing Lian and Guan Wang and Bleys Goodson and Eugene Pentland and Austin Cook and Chanvichet Vong and "Teknium"},
+   year = {2023},
+   publisher = {HuggingFace},
+   url = {https://huggingface.co/datasets/Open-Orca/SlimOrca}
+ }
+ ```
+ ```bibtex
+ @misc{open-llm-leaderboard,
+   author = {Edward Beeching and Clémentine Fourrier and Nathan Habib and Sheon Han and Nathan Lambert and Nazneen Rajani and Omar Sanseviero and Lewis Tunstall and Thomas Wolf},
+   title = {Open LLM Leaderboard},
+   year = {2023},
+   publisher = {Hugging Face},
+   howpublished = "\url{https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard}"
+ }
+ ```