migtissera committed
Commit 4936c0b
Parent(s): e3350e4
Update README.md
README.md CHANGED
@@ -31,7 +31,6 @@ The model was trained mostly with Chain-of-Thought reasoning data, including the
 
 
 # Evaluations
-The below evaluations were performed using a fork of Glaive's `simple-evals` codebase. Many thanks to @winglian for performing the evals. The codebase for evaluations can be found here: https://github.com/winglian/simple-evals
 | | Tess-R1 Limerick | Claude 3.5 Haiku | GPT-4o mini |
 |--------------|------------------|------------------|-------------|
 | GPQA | 41.5% | 41.6% | 40.2% |
@@ -40,6 +39,13 @@ The below evaluations were performed using a fork of Glaive's `simple-evals` cod
 | MMLU-Pro | 65.6% | 65.0% | - |
 | HumanEval | | 88.1% | 87.2% |
 
+The evaluations were performed using a fork of Glaive's `simple-evals` codebase. Many thanks to @winglian for performing the evals. The codebase for evaluations can be found here: https://github.com/winglian/simple-evals
+
+Example to run evaluations:
+`python run_reflection_eval.py tess_r1_70b --evals gpqa mmlu math`
+
+The system message has been edited in the sampler to reflect Tess-R1's system prompt.
+
 # Prompt Format
 The model uses Llama3 prompt format.
 
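The added note about editing the sampler's system message could be sketched roughly as follows, assuming the fork keeps upstream simple-evals' `ChatCompletionSampler`; the import path, constructor arguments, and model id below are assumptions rather than the fork's confirmed API:

```python
# Hypothetical sketch: pin Tess-R1's system prompt on the sampler used for the evals.
# Assumes the fork keeps upstream simple-evals' ChatCompletionSampler (an
# OpenAI-compatible chat sampler); adjust names to match the fork.
from sampler.chat_completion_sampler import ChatCompletionSampler

TESS_R1_SYSTEM_PROMPT = "..."  # Tess-R1's actual system prompt goes here

sampler = ChatCompletionSampler(
    model="tess_r1_70b",                   # model id passed to run_reflection_eval.py
    system_message=TESS_R1_SYSTEM_PROMPT,  # replaces the default system message
    max_tokens=2048,
)
```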
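For the prompt format, a minimal single-turn example of the standard Llama 3 chat template is shown below; `{system_prompt}` and `{user_message}` are placeholders, and Tess-R1's actual system prompt is not reproduced here:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{user_message}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

```

The model's reply is generated after the final assistant header and terminates with `<|eot_id|>`.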