Taishi-N324
committed on
Upload README.md
README.md
CHANGED
@@ -34,8 +34,7 @@ We released the 7B and 70B models without vocabulary expansion on January 26th,
![logo](./logo.png)

This repository provides large language models developed by [TokyoTech-LLM](https://tokyotech-llm.github.io/).
-Read our [blog post](https://zenn.dev/tokyotech_lm/articles/d6cb3a8fdfc907) or our paper
-

## Model Details

@@ -47,7 +46,7 @@ Read our [blog post](https://zenn.dev/tokyotech_lm/articles/d6cb3a8fdfc907) or o

## Base Model Performance

-### Japanese

|Model|Size|JCommonsenseQA|JEMHopQA|NIILC|JSQuAD|XL-Sum|MGSM|WMT20-en-ja|WMT20-ja-en|
|---|---|---|---|---|---|---|---|---|---|
@@ -62,7 +61,7 @@ Read our [blog post](https://zenn.dev/tokyotech_lm/articles/d6cb3a8fdfc907) or o
| Llama 2 | 70B | 0.8686 | 0.4656 | 0.5256 | 0.9080 | 0.2361 | 0.3560 | 0.2643 | **0.2398** |
| Swallow | 70B | 0.9348 | **0.6290** | 0.6960 | 0.9176 | 0.2266 | **0.4840** | **0.3043** | 0.2298 |
| Swallow-NVE | 70B | **0.9410** | 0.5759 | **0.7024** | **0.9254** | **0.2758** | 0.4720 | 0.3042 | 0.2322 |
-### English

|Model|Size|OpenBookQA|TriviaQA|HellaSwag|SQuAD2.0|XWINO|GSM8K|
|---|---|---|---|---|---|---|---|
@@ -78,6 +77,33 @@ Read our [blog post](https://zenn.dev/tokyotech_lm/articles/d6cb3a8fdfc907) or o
| Swallow | 70B | 0.4220 | 0.7756 | 0.6458 | 0.3745 | 0.9204 | 0.4867 |
| Swallow-NVE | 70B | 0.4240 | 0.7817 | 0.6439 | 0.3451 | 0.9256 | 0.4943 |

## Usage

First install additional dependencies in [requirements.txt](./requirements.txt):

@@ -34,8 +34,7 @@ We released the 7B and 70B models without vocabulary expansion on January 26th,
![logo](./logo.png)

This repository provides large language models developed by [TokyoTech-LLM](https://tokyotech-llm.github.io/).
+Read our [blog post](https://zenn.dev/tokyotech_lm/articles/d6cb3a8fdfc907) or our [paper](https://www.anlp.jp/proceedings/annual_meeting/2024/pdf_dir/A8-5.pdf)

## Model Details

@@ -47,7 +46,7 @@ Read our [blog post](https://zenn.dev/tokyotech_lm/articles/d6cb3a8fdfc907) or o

## Base Model Performance

+### Japanese tasks

|Model|Size|JCommonsenseQA|JEMHopQA|NIILC|JSQuAD|XL-Sum|MGSM|WMT20-en-ja|WMT20-ja-en|
|---|---|---|---|---|---|---|---|---|---|
@@ -62,7 +61,7 @@ Read our [blog post](https://zenn.dev/tokyotech_lm/articles/d6cb3a8fdfc907) or o
| Llama 2 | 70B | 0.8686 | 0.4656 | 0.5256 | 0.9080 | 0.2361 | 0.3560 | 0.2643 | **0.2398** |
| Swallow | 70B | 0.9348 | **0.6290** | 0.6960 | 0.9176 | 0.2266 | **0.4840** | **0.3043** | 0.2298 |
| Swallow-NVE | 70B | **0.9410** | 0.5759 | **0.7024** | **0.9254** | **0.2758** | 0.4720 | 0.3042 | 0.2322 |
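
The three 70B rows above can be compared task by task. The following is a minimal sketch, assuming the scores are copied by hand from those rows; the names `TASKS`, `SCORES`, `score_deltas`, and `best_model_per_task` are illustrative and are not part of this repository.

```python
# Scores copied by hand from the 70B rows of the Japanese table above.
# Column order follows the table header.
TASKS = ["JCommonsenseQA", "JEMHopQA", "NIILC", "JSQuAD",
         "XL-Sum", "MGSM", "WMT20-en-ja", "WMT20-ja-en"]

SCORES = {
    "Llama 2":     [0.8686, 0.4656, 0.5256, 0.9080, 0.2361, 0.3560, 0.2643, 0.2398],
    "Swallow":     [0.9348, 0.6290, 0.6960, 0.9176, 0.2266, 0.4840, 0.3043, 0.2298],
    "Swallow-NVE": [0.9410, 0.5759, 0.7024, 0.9254, 0.2758, 0.4720, 0.3042, 0.2322],
}


def score_deltas(model, baseline="Llama 2"):
    """Return {task: model score minus baseline score}, rounded to 4 decimals."""
    return {
        task: round(m - b, 4)
        for task, m, b in zip(TASKS, SCORES[model], SCORES[baseline])
    }


def best_model_per_task():
    """Return {task: name of the model with the highest score on that task}."""
    return {
        task: max(SCORES, key=lambda name: SCORES[name][i])
        for i, task in enumerate(TASKS)
    }


if __name__ == "__main__":
    # Print the signed difference between Swallow 70B and Llama 2 70B per task.
    for task, delta in score_deltas("Swallow").items():
        print(f"{task:15s} {delta:+.4f}")
```

A positive delta means the model scores higher than the baseline on that task. The per-task winners returned by `best_model_per_task()` should coincide with the bold entries in these three rows.
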
+### English tasks

|Model|Size|OpenBookQA|TriviaQA|HellaSwag|SQuAD2.0|XWINO|GSM8K|
|---|---|---|---|---|---|---|---|

@@ -78,6 +77,33 @@ Read our [blog post](https://zenn.dev/tokyotech_lm/articles/d6cb3a8fdfc907) or o
| Swallow | 70B | 0.4220 | 0.7756 | 0.6458 | 0.3745 | 0.9204 | 0.4867 |
| Swallow-NVE | 70B | 0.4240 | 0.7817 | 0.6439 | 0.3451 | 0.9256 | 0.4943 |

+## Evaluation Benchmarks
+
+### Japanese evaluation benchmarks
+
+We used llm-jp-eval (v1.0.0) and JP Language Model Evaluation Harness (commit #9b42d41). The details are as follows:
+
+- Multiple-choice question answering (JCommonsenseQA [Kurihara+, 2022])
+- Open-ended question answering (JEMHopQA [Ishii+, 2023])
+- Open-ended question answering (NIILC [Sekine, 2003])
+- Machine reading comprehension (JSQuAD [Kurihara+, 2022])
+- Automatic summarization (XL-Sum [Hasan+, 2021])
+- Machine translation (WMT2020 ja-en [Barrault+, 2020])
+- Machine translation (WMT2020 en-ja [Barrault+, 2020])
+- Mathematical reasoning (MGSM [Shi+, 2023])
+
+### English evaluation benchmarks
+
+We used the Language Model Evaluation Harness (v.0.3.0). The details are as follows:
+
+- Multiple-choice question answering (OpenBookQA [Mihaylov+, 2018])
+- Open-ended question answering (TriviaQA [Joshi+, 2017])
+- Machine reading comprehension (SQuAD 2.0 [Rajpurkar+, 2018])
+- Commonsense reasoning (XWINO [Tikhonov & Ryabinin, 2021])
+- Natural language inference (HellaSwag [Zellers+, 2019])
+- Mathematical reasoning (GSM8K [Cobbe+, 2021])
+
+
## Usage

First install additional dependencies in [requirements.txt](./requirements.txt):