nicholasKluge committed
Commit 7022940 · Parent(s): d50ead0
Update README.md

README.md CHANGED
Evaluations on benchmarks were performed using the [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness).

|                  | **ARC**   | **HellaSwag** | **MMLU**  | **TruthfulQA** | **Average** |
|------------------|-----------|---------------|-----------|----------------|-------------|
| Pythia-410m      | 24.83*    | 41.29*        | 25.99*    | 40.95*         | 33.26       |
| **TTL-460m**     | 29.40     | 33.00         | 28.55     | 41.10          | 33.01       |
| Bloom-560m       | 24.74*    | 37.15*        | 24.22*    | 42.44*         | 32.13       |
| Xglm-564M        | 25.56     | 34.64*        | 25.18*    | 42.53          | 31.97       |
| OPT-350m         | 23.55*    | 36.73*        | 26.02*    | 40.83*         | 31.78       |
| **TTL-160m**     | 26.15     | 29.29         | 28.11     | 41.12          | 31.16       |
| Pythia-160m      | 24.06*    | 31.39*        | 24.86*    | 44.34*         | 31.16       |
| Gpt2-small       | 21.48*    | 31.60*        | 25.79*    | 40.65*         | 29.97       |
| Multilingual GPT | 23.81     | 26.37*        | 25.17*    | 39.62          | 28.73       |

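Runs along these lines can be approximated with the harness's Python API. The snippet below is only a hedged sketch: the harness version, task identifiers, few-shot settings, and the `nicholasKluge/TeenyTinyLlama-460m` repo id are assumptions, not the exact configuration behind the numbers above.

```python
# Hedged sketch: evaluating a TTL checkpoint with EleutherAI's lm-evaluation-harness.
# Assumes `pip install lm_eval` (v0.4+); task names and settings here are illustrative
# and may differ from the setup used to produce the table above.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                                  # Hugging Face causal-LM backend
    model_args="pretrained=nicholasKluge/TeenyTinyLlama-460m",   # assumed repo id
    tasks=["arc_challenge", "hellaswag", "mmlu", "truthfulqa_mc2"],
    batch_size=8,
)

# Print per-task metrics (accuracy, normalized accuracy, etc.).
for task, metrics in results["results"].items():
    print(task, metrics)
```
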
Evaluations on Brazilian Portuguese benchmarks were performed using a [Portuguese implementation of the EleutherAI LM Evaluation Harness](https://github.com/eduagarcia/lm-evaluation-harness-pt) (created by [Eduardo Garcia](https://github.com/eduagarcia/lm-evaluation-harness-pt)).

|                | **ASSIN2 RTE** | **ASSIN2 STS** | **BLUEX** | **ENEM** | **FAQUAD NLI** | **HateBR** | **OAB Exams** | **Average** |
|----------------|----------------|----------------|-----------|----------|----------------|------------|---------------|-------------|
| Qwen-1.8B      | 64.83          | 19.53          | 26.15     | 30.23    | 43.97          | 33.33      | 27.20         | 35.03       |
| TinyLlama-1.1B | 58.93          | 13.57          | 22.81     | 22.25    | 43.97          | 36.92      | 23.64         | 31.72       |
| **TTL-460m**   | 53.93          | 12.66          | 22.81     | 19.87    | 49.01          | 33.59      | 27.06         | 31.27       |
| XGLM-564m      | 49.61          | 22.91          | 19.61     | 19.38    | 43.97          | 33.99      | 23.42         | 30.41       |
| Bloom-1b7      | 53.60          | 4.81           | 21.42     | 18.96    | 43.97          | 34.89      | 23.05         | 28.67       |
| **TTL-160m**   | 53.36          | 2.58           | 21.84     | 18.75    | 43.97          | 36.88      | 22.60         | 28.56       |
| OPT-125m       | 39.77          | 2.00           | 21.84     | 17.42    | 43.97          | 47.04      | 22.78         | 27.83       |
| Pythia-160     | 33.33          | 12.81          | 16.13     | 16.66    | 50.36          | 41.09      | 22.82         | 27.60       |
| OLMo-1b        | 34.12          | 9.28           | 18.92     | 20.29    | 43.97          | 41.33      | 22.96         | 27.26       |
| Bloom-560m     | 33.33          | 8.48           | 18.92     | 19.03    | 43.97          | 37.07      | 23.05         | 26.26       |
| Pythia-410m    | 33.33          | 4.80           | 19.47     | 19.45    | 43.97          | 33.33      | 23.01         | 25.33       |
| OPT-350m       | 33.33          | 3.65           | 20.72     | 17.35    | 44.71          | 33.33      | 23.01         | 25.15       |
| GPT-2 small    | 33.26          | 0.00           | 10.43     | 11.20    | 43.52          | 33.68      | 13.12         | 20.74       |
| GPorTuguese    | 33.33          | 3.85           | 14.74     | 3.01     | 28.81          | 33.33      | 21.23         | 19.75       |
| Samba-1.1B     | 33.33          | 1.30           | 8.07      | 10.22    | 17.72          | 35.79      | 15.03         | 17.35       |

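Since the Portuguese fork is built on the same code base, the run can be sketched the same way, assuming it keeps the upstream Python API; the Portuguese task identifiers below are assumptions and may not match the fork's exact task registry.

```python
# Hedged sketch for the Portuguese fork (install from source, e.g.
# `pip install git+https://github.com/eduagarcia/lm-evaluation-harness-pt`).
# Task ids below are assumed and may differ from the fork's naming.
import lm_eval

pt_results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=nicholasKluge/TeenyTinyLlama-160m",   # assumed repo id
    tasks=["assin2_rte", "assin2_sts", "bluex", "enem_challenge",
           "faquad_nli", "hatebr_offensive", "oab_exams"],        # assumed task ids
    batch_size=8,
)
print(pt_results["results"])
```
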
## Fine-Tuning Comparisons

To further evaluate the downstream capabilities of our models, we employed a basic fine-tuning procedure for our TTL pair on a subset of tasks from the Poeta benchmark. For comparison, we applied the same procedure to both [BERTimbau](https://huggingface.co/neuralmind/bert-base-portuguese-cased) models, given that they are also LLMs trained from scratch in Brazilian Portuguese and fall within a size range similar to our models. We used these comparisons to assess whether our pre-training runs produced LLMs capable of producing good results ("good" here means "close to BERTimbau") when used for downstream applications.
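As a rough illustration of what such a "basic fine-tuning procedure" could look like with the Hugging Face stack, the sketch below fine-tunes a checkpoint for sequence classification with the `Trainer` API. The dataset id, column names, splits, and hyperparameters are placeholders, not the exact Poeta configuration used for the comparison with BERTimbau.

```python
# Hedged sketch: text-classification fine-tuning with the Hugging Face Trainer.
# `your-username/poeta-task` is a placeholder dataset with assumed "text"/"label"
# columns and train/test splits; hyperparameters are illustrative only.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "nicholasKluge/TeenyTinyLlama-460m"  # or "neuralmind/bert-base-portuguese-cased"
dataset = load_dataset("your-username/poeta-task")  # placeholder dataset id

tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:                 # decoder-only checkpoints may lack a pad token
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="ttl-finetuned",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        learning_rate=4e-5,
    ),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,                        # enables dynamic padding via the default collator
)
trainer.train()
print(trainer.evaluate())
```

The same script, pointed at `neuralmind/bert-base-portuguese-cased`, gives the BERTimbau side of the comparison.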