aiXcoder committed on
Commit 35a116e · verified · 1 Parent(s): c114adb

Upload folder using huggingface_hub

MODEL_LICENSE ADDED
@@ -0,0 +1,33 @@
+ The aiXcoder Model License
+
+ 1. Definitions
+
+ “Licensor” means the aiXcoder Model Team that distributes its Software.
+
+ “Software” means the aiXcoder model parameters made available under this license.
+
+ 2. License Grant
+
+ Subject to the terms and conditions of this License, the Licensor hereby grants to you a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable, royalty-free copyright license to use the Software solely for your non-commercial research purposes.
+
+ The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+
+ 3. Restrictions
+
+ You will not use, copy, modify, merge, publish, distribute, reproduce, or create derivative works of the Software, in whole or in part, for any commercial, military, or illegal purposes.
+
+ You will not use the Software for any act that may undermine China's national security and national unity, harm the public interest of society, or infringe upon the rights and interests of human beings.
+
+ 4. Disclaimer
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+
+ 5. Limitation of Liability
+
+ EXCEPT TO THE EXTENT PROHIBITED BY APPLICABLE LAW, IN NO EVENT AND UNDER NO LEGAL THEORY, WHETHER BASED IN TORT, NEGLIGENCE, CONTRACT, LIABILITY, OR OTHERWISE WILL ANY LICENSOR BE LIABLE TO YOU FOR ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES, OR ANY OTHER COMMERCIAL LOSSES, EVEN IF THE LICENSOR HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
+
+ 6. Dispute Resolution
+
+ This license shall be governed by and construed in accordance with the laws of the People's Republic of China. Any dispute arising from or in connection with this License shall be submitted to the Haidian District People's Court in Beijing.
+
+ Note that this license may be updated to a more comprehensive version. For any questions related to the license and copyright, please contact us at [email protected].
README.md ADDED
@@ -0,0 +1,68 @@
+ # aiXcoder-colt
+
+ This repository contains the implementation of CoLT from our paper: a novel fine-tuning approach for enhancing large language models' ability to utilize information within long contexts for code completion tasks.
+
+ ## Resources
+
+ ### Dataset
+ - **CoLT-132K dataset**: A large-scale repo-level code completion dataset comprising 132,000 samples across four programming languages.
+ - **Download**: [https://zenodo.org/records/15019938](https://zenodo.org/records/15019938)
+
+ ### Models
+ - **Trained Models**: This repository includes our models (aiXcoder-7B-v2, DS-Coder, and Code Llama) trained with the CoLT approach.
+ - **Download**: [https://zenodo.org/records/15021246](https://zenodo.org/records/15021246)
+
+ ## Experiment Reproduction
+
+ We use the latest [TRL framework](https://github.com/huggingface/trl) for our experiments. To reproduce our results:
+
+ ### 1. Prompt Construction Files
+ - `aiXcoder-colt/prompt/prompt_aixcoder_colt.py`
+ - `aiXcoder-colt/prompt/prompt_codellama.py`
+ - `aiXcoder-colt/prompt/prompt_codeqwen.py`
+ - `aiXcoder-colt/prompt/prompt_deepseekcoder.py`
+
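The exact prompt layouts live in the files above and differ per model. As a rough, hypothetical illustration of the general shape (the function and tag names below are illustrative, not the repository's actual format), a repo-level fill-in-the-middle prompt is assembled from retrieved cross-file context plus the in-file prefix and suffix:

```python
# Hypothetical sketch of repo-level FIM prompt assembly; the real formats
# are defined per model in the prompt_*.py files listed above.
def build_repo_prompt(cross_file_snippets: list[str], prefix: str, suffix: str) -> str:
    # Retrieved cross-file context is placed ahead of the in-file
    # fill-in-the-middle (FIM) region.
    context = "\n".join(cross_file_snippets)
    return f"{context}\n<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

prompt = build_repo_prompt(
    ["# utils.py\ndef add(a, b):\n    return a + b"],
    prefix="def total(xs):\n    return ",
    suffix="\n",
)
```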
+ ### 2. Training Scripts
+ - **SFT (Supervised Fine-Tuning)**: See scripts in `aiXcoder-colt/commands/sft/`
+ - **DPO (Direct Preference Optimization)**: See scripts in `aiXcoder-colt/commands/po/`
+
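For orientation, here is a minimal sketch of how a TRL-based SFT run is typically wired up. The dataset file and model path are placeholders and the settings are illustrative (though two epochs and a per-device batch size of 1 match what the `trainer_state.json` in this upload records); the authoritative configurations are the scripts in `commands/sft/`:

```python
# Minimal TRL SFT sketch; placeholder paths and illustrative settings only.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

train_set = load_dataset("json", data_files="colt_sft.jsonl", split="train")  # hypothetical file

trainer = SFTTrainer(
    model="path/to/base-model",         # placeholder, e.g. the aiXcoder-7B base weights
    args=SFTConfig(
        output_dir="sft-out",
        num_train_epochs=2,             # matches trainer_state.json in this upload
        per_device_train_batch_size=1,  # matches trainer_state.json in this upload
    ),
    train_dataset=train_set,
)
trainer.train()
```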
+ ### 3. Reject Sampling for DPO
+
+ The `aiXcoder-colt/Reject_Sample/` directory contains the implementation and evaluation scripts for our reject sampling approach used in Direct Preference Optimization:
+
+ - **Model-specific implementations**:
+   - `aixcoder/`: Reject sampling for the aiXcoder model
+   - `codellama/`: Reject sampling for the Code Llama model
+   - `deepseek/`: Reject sampling for the DeepSeek-Coder model
+
+ - **Evaluation scripts**:
+   - `eval_api.py`: API-based evaluation script
+   - `eval_line.py`: Line-level evaluation script
+   - `eval_span.py`: Span-level evaluation script
+   - `inference.py`: Model inference script for generating completions
+
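Conceptually, reject sampling here draws several candidate completions per context, scores them against the ground truth, and keeps a best/worst pair for DPO. The sketch below is a simplified illustration with hypothetical helper callables (`generate_candidates`, `score` standing in for `inference.py` and the `eval_*.py` scripts), not the repository's exact pipeline:

```python
# Simplified reject-sampling sketch for building DPO preference pairs.
# `generate_candidates` and `score` are hypothetical stand-ins for the
# repository's inference.py and eval_*.py scripts.
def build_preference_pair(prompt, reference, generate_candidates, score):
    candidates = generate_candidates(prompt, n=8)  # sample several completions
    ranked = sorted(candidates, key=lambda c: score(c, reference), reverse=True)
    chosen, rejected = ranked[0], ranked[-1]       # best vs. worst candidate
    if score(chosen, reference) == score(rejected, reference):
        return None                                # no usable preference contrast
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```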
+ ## Dependencies
+
+ In our experiments, we used two Docker environments: one for TRL training and one for vLLM inference (reject sampling). Below are the key dependencies for each environment, excluding redundant packages:
+
+ ### TRL Training Environment
+ - transformers==4.46.0.dev0
+ - torch==2.4.0a0+07cecf4168.nv24.5
+ - accelerate==1.0.0
+ - deepspeed==0.15.2
+ - peft==0.13.1
+ - flash-attn==2.4.2
+ - datasets==3.0.1
+ - wandb==0.15.0
+
+ ### vLLM Inference Environment
+ - vllm==0.6.0+cu124
+ - torch==2.4.0
+ - transformers==4.44.2
+ - vllm-flash-attn==2.6.1
+ - xformers==0.0.27.post2
+ - flashinfer==0.1.6+cu121torch2.4
+ - fastapi==0.114.1
+ - uvicorn==0.30.6
+
+ Complete dependency lists can be found in the `dependency` directory.
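As a pointer for the vLLM side, candidate generation for reject sampling can be driven with vLLM's offline API roughly as follows (the model path and sampling settings are placeholders; see `inference.py` for the actual code):

```python
# Sketch of batched candidate generation with vLLM; placeholder settings.
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/model", dtype="bfloat16")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128, n=8)

outputs = llm.generate(["<prompt built by the prompt_*.py scripts>"], params)
for candidate in outputs[0].outputs:
    print(candidate.text)
```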
config.json ADDED
@@ -0,0 +1,30 @@
+ {
+   "_name_or_path": "/nfs100/zhuhao/models/aixcoder/aiXcoder-7b-base-weights-hf",
+   "architectures": [
+     "LlamaForCausalLM"
+   ],
+   "attention_bias": false,
+   "attention_dropout": 0.0,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "head_dim": 128,
+   "hidden_act": "silu",
+   "hidden_size": 4096,
+   "initializer_range": 0.02,
+   "intermediate_size": 14464,
+   "max_position_embeddings": 65536,
+   "mlp_bias": false,
+   "model_type": "llama",
+   "num_attention_heads": 32,
+   "num_hidden_layers": 32,
+   "num_key_value_heads": 8,
+   "pretraining_tp": 1,
+   "rms_norm_eps": 1e-06,
+   "rope_scaling": null,
+   "rope_theta": 256000.0,
+   "tie_word_embeddings": false,
+   "torch_dtype": "bfloat16",
+   "transformers_version": "4.46.0.dev0",
+   "use_cache": false,
+   "vocab_size": 49152
+ }
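Per this config, the checkpoint is a Llama-architecture model (32 layers, hidden size 4096, grouped-query attention with 8 KV heads) with bfloat16 weights and a 65,536-token context window. Assuming this repository's files are in the current directory, it can be loaded with standard transformers calls:

```python
# Load the uploaded checkpoint with transformers (assumes this repo's
# files are in the current directory).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(".")
model = AutoModelForCausalLM.from_pretrained(
    ".",
    torch_dtype=torch.bfloat16,  # matches "torch_dtype": "bfloat16" above
    device_map="auto",
)
```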
generation_config.json ADDED
@@ -0,0 +1,6 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "transformers_version": "4.46.0.dev0"
+ }
model-00001-of-00003.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4843fea59a057fb913068322eae6e9f72cd3be06f37ca9c8689ff262bdb4eb64
+ size 4998736752
model-00002-of-00003.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:78ecc96ccffa42beffe3f17c674423f4c077feaff4084c56a5ad2c5734f9a638
+ size 4951567760
model-00003-of-00003.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:099d5f9ebd573c3e94208cb9161b633e3b356aa872c88645b384fec7b3e7c509
+ size 4914875176
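These are Git LFS pointer files; each records the SHA-256 and byte size of the actual weight shard. A downloaded shard can be checked against its pointer like this:

```python
# Verify a downloaded shard against the sha256 recorded in its LFS pointer.
import hashlib

def sha256_of(path: str) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

assert sha256_of("model-00001-of-00003.safetensors") == (
    "4843fea59a057fb913068322eae6e9f72cd3be06f37ca9c8689ff262bdb4eb64"
)
```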
model.safetensors.index.json ADDED
@@ -0,0 +1,298 @@
+ {
+   "metadata": {
+     "total_size": 14865145856
+   },
+   "weight_map": {
+     "lm_head.weight": "model-00003-of-00003.safetensors",
+     "model.embed_tokens.weight": "model-00001-of-00003.safetensors",
+     "model.layers.0.input_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.0.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.0.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.0.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.0.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.0.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.0.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.1.input_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.1.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.1.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.1.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.1.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.1.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.1.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.1.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.1.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.10.input_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.10.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.10.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.10.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.10.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.10.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.10.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.10.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.10.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.11.input_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.11.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.11.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.11.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.11.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.11.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.11.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.11.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.11.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.12.input_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.12.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.12.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.12.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.12.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.12.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.12.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.12.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.12.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.13.input_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.13.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.13.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.13.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.13.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.13.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.13.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.13.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.13.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.14.input_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.14.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.14.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.14.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.14.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.14.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.14.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.14.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.14.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.15.input_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.15.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.15.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.15.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.15.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.15.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.15.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.15.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.15.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.16.input_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.16.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.16.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.16.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.16.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.16.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.16.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.16.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.16.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.17.input_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.17.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.17.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.17.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.17.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.17.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.17.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.17.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.17.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.18.input_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.18.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.18.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.18.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.18.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.18.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.18.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.18.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.18.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.19.input_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.19.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.19.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.19.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.19.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.19.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.19.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.19.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.19.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.2.input_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.2.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.2.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.2.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.2.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.2.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.2.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.2.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.2.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.20.input_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.20.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.20.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.20.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.20.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
+     "model.layers.20.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.20.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.20.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.20.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.21.input_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.21.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.21.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.21.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.21.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.21.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.21.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.21.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.21.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
+     "model.layers.22.input_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.22.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.22.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.22.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.22.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.22.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.22.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.22.self_attn.q_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.22.self_attn.v_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.23.input_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.23.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.23.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.23.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.23.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.23.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.23.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.23.self_attn.q_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.23.self_attn.v_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.24.input_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.24.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.24.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.24.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.24.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.24.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.24.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.24.self_attn.q_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.24.self_attn.v_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.25.input_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.25.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.25.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.25.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.25.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.25.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.25.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.25.self_attn.q_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.25.self_attn.v_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.26.input_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.26.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.26.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.26.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.26.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.26.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.26.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.26.self_attn.q_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.26.self_attn.v_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.27.input_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.27.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.27.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.27.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.27.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.27.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.27.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.27.self_attn.q_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.27.self_attn.v_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.28.input_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.28.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.28.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.28.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.28.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.28.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.28.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.28.self_attn.q_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.28.self_attn.v_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.29.input_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.29.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.29.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.29.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.29.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.29.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.29.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.29.self_attn.q_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.29.self_attn.v_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.3.input_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.3.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.3.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.3.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.3.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.3.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.3.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.3.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.3.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.30.input_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.30.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.30.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.30.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.30.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.30.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.30.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.30.self_attn.q_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.30.self_attn.v_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.31.input_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.31.mlp.down_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.31.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.31.mlp.up_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.31.post_attention_layernorm.weight": "model-00003-of-00003.safetensors",
+     "model.layers.31.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.31.self_attn.o_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.31.self_attn.q_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.31.self_attn.v_proj.weight": "model-00003-of-00003.safetensors",
+     "model.layers.4.input_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.4.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.4.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.4.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.4.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.4.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.4.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.4.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.4.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.5.input_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.5.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.5.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.5.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.5.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.5.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.5.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.5.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.5.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.6.input_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.6.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.6.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.6.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.6.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.6.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.6.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.6.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.6.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.7.input_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.7.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.7.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.7.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.7.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.7.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.7.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.7.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.7.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.8.input_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.8.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.8.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.8.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.8.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.8.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.8.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.8.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.8.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.9.input_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.9.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.9.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.9.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.9.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
+     "model.layers.9.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.9.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.9.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
+     "model.layers.9.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
+     "model.norm.weight": "model-00003-of-00003.safetensors"
+   }
+ }
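The index maps every parameter name to the shard that stores it, so individual tensors can be located without loading the whole model:

```python
# Look up which shard holds a given tensor, using the index above.
import json

with open("model.safetensors.index.json") as f:
    index = json.load(f)

print(index["metadata"]["total_size"])                   # 14865145856 bytes (~14.9 GB)
print(index["weight_map"]["model.embed_tokens.weight"])  # model-00001-of-00003.safetensors
```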
special_tokens_map.json ADDED
@@ -0,0 +1,25 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": "▁<AIX-PAD>",
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
+
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,50 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "263": {
+       "content": "▁<AIX-END-TURN>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [],
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "</s>",
+   "legacy": true,
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": "▁<AIX-PAD>",
+   "sp_model_kwargs": {},
+   "spaces_between_special_tokens": false,
+   "tokenizer_class": "PreTrainedTokenizerFast",
+   "unk_token": "<unk>",
+   "use_default_system_prompt": false
+ }
+
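The two files above define the special tokens: `<s>`/`</s>` for BOS/EOS, `▁<AIX-PAD>` as the padding token, and `▁<AIX-END-TURN>` (id 263) as an added special token. Assuming this repository's files are in the current directory, they can be inspected with the loaded tokenizer:

```python
# Inspect the special tokens defined in the files above.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(".")
print(tok.bos_token, tok.eos_token, tok.pad_token)   # <s> </s> ▁<AIX-PAD>
print(tok.convert_tokens_to_ids("▁<AIX-END-TURN>"))  # 263, per added_tokens_decoder
```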
trainer_state.json ADDED
@@ -0,0 +1,792 @@
+ {
+   "best_metric": null,
+   "best_model_checkpoint": null,
+   "epoch": 1.9997197702115734,
+   "eval_steps": 50,
+   "global_step": 892,
+   "is_hyper_param_search": false,
+   "is_local_process_zero": true,
+   "is_world_process_zero": true,
+   "log_history": [
+     {
+       "epoch": 0.02241838307412078,
+       "grad_norm": 11.57065531644996,
+       "learning_rate": 1.9775784753363228e-06,
+       "loss": 0.2054,
+       "step": 10
+     },
+     {
+       "epoch": 0.04483676614824156,
+       "grad_norm": 10.25086765406488,
+       "learning_rate": 1.9551569506726456e-06,
+       "loss": 0.1492,
+       "step": 20
+     },
+     {
+       "epoch": 0.06725514922236234,
+       "grad_norm": 7.822767237983853,
+       "learning_rate": 1.9327354260089685e-06,
+       "loss": 0.152,
+       "step": 30
+     },
+     {
+       "epoch": 0.08967353229648312,
+       "grad_norm": 8.157708200242721,
+       "learning_rate": 1.9103139013452914e-06,
+       "loss": 0.1472,
+       "step": 40
+     },
+     {
+       "epoch": 0.1120919153706039,
+       "grad_norm": 12.8257632132892,
+       "learning_rate": 1.8878923766816142e-06,
+       "loss": 0.1356,
+       "step": 50
+     },
+     {
+       "epoch": 0.1120919153706039,
+       "eval_loss": 0.1209246814250946,
+       "eval_runtime": 477.8616,
+       "eval_samples_per_second": 6.288,
+       "eval_steps_per_second": 0.787,
+       "step": 50
+     },
+     {
+       "epoch": 0.13451029844472467,
+       "grad_norm": 8.90816614814061,
+       "learning_rate": 1.865470852017937e-06,
+       "loss": 0.1281,
+       "step": 60
+     },
+     {
+       "epoch": 0.15692868151884545,
+       "grad_norm": 9.628880600516572,
+       "learning_rate": 1.84304932735426e-06,
+       "loss": 0.1381,
+       "step": 70
+     },
+     {
+       "epoch": 0.17934706459296623,
+       "grad_norm": 8.899868902286775,
+       "learning_rate": 1.8206278026905828e-06,
+       "loss": 0.1324,
+       "step": 80
+     },
+     {
+       "epoch": 0.201765447667087,
+       "grad_norm": 130.29750942761115,
+       "learning_rate": 1.798206278026906e-06,
+       "loss": 0.1277,
+       "step": 90
+     },
+     {
+       "epoch": 0.2241838307412078,
+       "grad_norm": 6.4686996638845145,
+       "learning_rate": 1.7757847533632286e-06,
+       "loss": 0.1221,
+       "step": 100
+     },
+     {
+       "epoch": 0.2241838307412078,
+       "eval_loss": 0.11111436039209366,
+       "eval_runtime": 477.824,
+       "eval_samples_per_second": 6.289,
+       "eval_steps_per_second": 0.787,
+       "step": 100
+     },
+     {
+       "epoch": 0.24660221381532857,
+       "grad_norm": 8.356076121379255,
+       "learning_rate": 1.7533632286995514e-06,
+       "loss": 0.1198,
+       "step": 110
+     },
+     {
+       "epoch": 0.26902059688944935,
+       "grad_norm": 6.743428651631778,
+       "learning_rate": 1.7309417040358743e-06,
+       "loss": 0.1076,
+       "step": 120
+     },
+     {
+       "epoch": 0.29143897996357016,
+       "grad_norm": 6.8645809938665465,
+       "learning_rate": 1.7085201793721974e-06,
+       "loss": 0.1187,
+       "step": 130
+     },
+     {
+       "epoch": 0.3138573630376909,
+       "grad_norm": 5.999842993813071,
+       "learning_rate": 1.68609865470852e-06,
+       "loss": 0.1182,
+       "step": 140
+     },
+     {
+       "epoch": 0.3362757461118117,
+       "grad_norm": 7.7478230858078305,
+       "learning_rate": 1.6636771300448429e-06,
+       "loss": 0.1127,
+       "step": 150
+     },
+     {
+       "epoch": 0.3362757461118117,
+       "eval_loss": 0.10507839918136597,
+       "eval_runtime": 477.8167,
+       "eval_samples_per_second": 6.289,
+       "eval_steps_per_second": 0.787,
+       "step": 150
+     },
+     {
+       "epoch": 0.35869412918593246,
+       "grad_norm": 6.564504796683449,
+       "learning_rate": 1.641255605381166e-06,
+       "loss": 0.1144,
+       "step": 160
+     },
+     {
+       "epoch": 0.38111251226005327,
+       "grad_norm": 20.75971255482363,
+       "learning_rate": 1.6188340807174888e-06,
+       "loss": 0.1078,
+       "step": 170
+     },
+     {
+       "epoch": 0.403530895334174,
+       "grad_norm": 6.913685427209001,
+       "learning_rate": 1.5964125560538115e-06,
+       "loss": 0.1189,
+       "step": 180
+     },
+     {
+       "epoch": 0.42594927840829483,
+       "grad_norm": 11.143255383687194,
+       "learning_rate": 1.5739910313901345e-06,
+       "loss": 0.1189,
+       "step": 190
+     },
+     {
+       "epoch": 0.4483676614824156,
+       "grad_norm": 4.747347623410045,
+       "learning_rate": 1.5515695067264574e-06,
+       "loss": 0.1072,
+       "step": 200
+     },
+     {
+       "epoch": 0.4483676614824156,
+       "eval_loss": 0.09710206091403961,
+       "eval_runtime": 477.8045,
+       "eval_samples_per_second": 6.289,
+       "eval_steps_per_second": 0.787,
+       "step": 200
+     },
+     {
+       "epoch": 0.4707860445565364,
+       "grad_norm": 5.512284850848945,
+       "learning_rate": 1.5291479820627803e-06,
+       "loss": 0.1136,
+       "step": 210
+     },
+     {
+       "epoch": 0.49320442763065714,
+       "grad_norm": 17.278200474599828,
+       "learning_rate": 1.506726457399103e-06,
+       "loss": 0.1094,
+       "step": 220
+     },
+     {
+       "epoch": 0.5156228107047779,
+       "grad_norm": 46.879816629413476,
+       "learning_rate": 1.484304932735426e-06,
+       "loss": 0.1091,
+       "step": 230
+     },
+     {
+       "epoch": 0.5380411937788987,
+       "grad_norm": 8.493451669290872,
+       "learning_rate": 1.4618834080717489e-06,
+       "loss": 0.1019,
+       "step": 240
+     },
+     {
+       "epoch": 0.5604595768530195,
+       "grad_norm": 9.286929395356434,
+       "learning_rate": 1.4394618834080715e-06,
+       "loss": 0.1075,
+       "step": 250
+     },
+     {
+       "epoch": 0.5604595768530195,
+       "eval_loss": 0.09229769557714462,
+       "eval_runtime": 477.6411,
+       "eval_samples_per_second": 6.291,
+       "eval_steps_per_second": 0.787,
+       "step": 250
+     },
+     {
+       "epoch": 0.5828779599271403,
+       "grad_norm": 7.559299755990858,
+       "learning_rate": 1.4170403587443946e-06,
+       "loss": 0.0985,
+       "step": 260
+     },
+     {
+       "epoch": 0.605296343001261,
+       "grad_norm": 3.9026812483393387,
+       "learning_rate": 1.3946188340807175e-06,
+       "loss": 0.0896,
+       "step": 270
+     },
+     {
+       "epoch": 0.6277147260753818,
+       "grad_norm": 8.335653074155596,
+       "learning_rate": 1.3721973094170403e-06,
+       "loss": 0.0893,
+       "step": 280
+     },
+     {
+       "epoch": 0.6501331091495026,
+       "grad_norm": 6.897239653871222,
+       "learning_rate": 1.349775784753363e-06,
+       "loss": 0.0914,
+       "step": 290
+     },
+     {
+       "epoch": 0.6725514922236234,
+       "grad_norm": 29.648594106464095,
+       "learning_rate": 1.327354260089686e-06,
+       "loss": 0.0972,
+       "step": 300
+     },
+     {
+       "epoch": 0.6725514922236234,
+       "eval_loss": 0.08544214069843292,
+       "eval_runtime": 477.8106,
+       "eval_samples_per_second": 6.289,
+       "eval_steps_per_second": 0.787,
+       "step": 300
+     },
+     {
+       "epoch": 0.6949698752977441,
+       "grad_norm": 3.969809179016744,
+       "learning_rate": 1.304932735426009e-06,
+       "loss": 0.089,
+       "step": 310
+     },
+     {
+       "epoch": 0.7173882583718649,
+       "grad_norm": 5.34402617506164,
+       "learning_rate": 1.2825112107623318e-06,
+       "loss": 0.0867,
+       "step": 320
+     },
+     {
+       "epoch": 0.7398066414459857,
+       "grad_norm": 5.506697389437645,
+       "learning_rate": 1.2600896860986546e-06,
+       "loss": 0.0803,
+       "step": 330
+     },
+     {
+       "epoch": 0.7622250245201065,
+       "grad_norm": 8.056193416126446,
+       "learning_rate": 1.2376681614349775e-06,
+       "loss": 0.0917,
+       "step": 340
+     },
+     {
+       "epoch": 0.7846434075942272,
+       "grad_norm": 4.7612681555615,
+       "learning_rate": 1.2152466367713004e-06,
+       "loss": 0.089,
+       "step": 350
+     },
+     {
+       "epoch": 0.7846434075942272,
+       "eval_loss": 0.07997283339500427,
+       "eval_runtime": 477.8824,
+       "eval_samples_per_second": 6.288,
+       "eval_steps_per_second": 0.787,
+       "step": 350
+     },
+     {
+       "epoch": 0.807061790668348,
+       "grad_norm": 9.702063628974917,
+       "learning_rate": 1.1928251121076232e-06,
+       "loss": 0.0886,
+       "step": 360
+     },
+     {
+       "epoch": 0.8294801737424689,
+       "grad_norm": 4.937196401368392,
+       "learning_rate": 1.170403587443946e-06,
+       "loss": 0.0805,
+       "step": 370
+     },
+     {
+       "epoch": 0.8518985568165897,
+       "grad_norm": 6.8769697579900555,
+       "learning_rate": 1.147982062780269e-06,
+       "loss": 0.0754,
+       "step": 380
+     },
+     {
+       "epoch": 0.8743169398907104,
+       "grad_norm": 10.270492248069893,
+       "learning_rate": 1.1255605381165918e-06,
+       "loss": 0.0855,
+       "step": 390
+     },
+     {
+       "epoch": 0.8967353229648312,
+       "grad_norm": 8.129704483519474,
+       "learning_rate": 1.103139013452915e-06,
+       "loss": 0.0898,
+       "step": 400
+     },
+     {
+       "epoch": 0.8967353229648312,
+       "eval_loss": 0.0756540298461914,
+       "eval_runtime": 477.7284,
+       "eval_samples_per_second": 6.29,
+       "eval_steps_per_second": 0.787,
+       "step": 400
+     },
+     {
+       "epoch": 0.919153706038952,
+       "grad_norm": 7.785886269800572,
+       "learning_rate": 1.0807174887892376e-06,
+       "loss": 0.0827,
+       "step": 410
+     },
+     {
+       "epoch": 0.9415720891130728,
+       "grad_norm": 7.119394503978009,
+       "learning_rate": 1.0582959641255604e-06,
+       "loss": 0.0862,
+       "step": 420
+     },
+     {
+       "epoch": 0.9639904721871935,
+       "grad_norm": 5.06264160032142,
+       "learning_rate": 1.0358744394618835e-06,
+       "loss": 0.078,
+       "step": 430
+     },
+     {
+       "epoch": 0.9864088552613143,
+       "grad_norm": 6.034467278570911,
+       "learning_rate": 1.0134529147982064e-06,
+       "loss": 0.0745,
+       "step": 440
+     },
+     {
+       "epoch": 1.008827238335435,
+       "grad_norm": 6.107563557049032,
+       "learning_rate": 9.91031390134529e-07,
+       "loss": 0.0635,
+       "step": 450
+     },
+     {
+       "epoch": 1.008827238335435,
+       "eval_loss": 0.07114721089601517,
+       "eval_runtime": 477.6398,
+       "eval_samples_per_second": 6.291,
+       "eval_steps_per_second": 0.787,
+       "step": 450
+     },
+     {
+       "epoch": 1.0312456214095558,
+       "grad_norm": 8.472989007522791,
+       "learning_rate": 9.68609865470852e-07,
+       "loss": 0.0687,
+       "step": 460
+     },
+     {
+       "epoch": 1.0536640044836767,
+       "grad_norm": 5.080789399491274,
+       "learning_rate": 9.461883408071749e-07,
+       "loss": 0.0635,
+       "step": 470
+     },
+     {
+       "epoch": 1.0760823875577974,
+       "grad_norm": 8.06976721278842,
+       "learning_rate": 9.237668161434977e-07,
+       "loss": 0.0674,
+       "step": 480
+     },
+     {
+       "epoch": 1.098500770631918,
+       "grad_norm": 9.00400557879772,
+       "learning_rate": 9.013452914798207e-07,
+       "loss": 0.0623,
+       "step": 490
+     },
+     {
+       "epoch": 1.120919153706039,
+       "grad_norm": 10.419684143484417,
+       "learning_rate": 8.789237668161434e-07,
+       "loss": 0.0742,
+       "step": 500
+     },
+     {
+       "epoch": 1.120919153706039,
+       "eval_loss": 0.06726241111755371,
+       "eval_runtime": 478.2051,
+       "eval_samples_per_second": 6.284,
+       "eval_steps_per_second": 0.786,
+       "step": 500
+     },
+     {
+       "epoch": 1.1433375367801597,
+       "grad_norm": 5.834457752004053,
+       "learning_rate": 8.565022421524663e-07,
+       "loss": 0.0698,
+       "step": 510
+     },
+     {
+       "epoch": 1.1657559198542806,
+       "grad_norm": 5.920575213015667,
+       "learning_rate": 8.340807174887892e-07,
+       "loss": 0.061,
+       "step": 520
+     },
+     {
+       "epoch": 1.1881743029284013,
+       "grad_norm": 11.4664445988588,
+       "learning_rate": 8.11659192825112e-07,
+       "loss": 0.0612,
+       "step": 530
+     },
+     {
+       "epoch": 1.210592686002522,
+       "grad_norm": 5.074793088570693,
+       "learning_rate": 7.892376681614349e-07,
+       "loss": 0.0582,
+       "step": 540
+     },
+     {
+       "epoch": 1.233011069076643,
+       "grad_norm": 4.964609158362768,
+       "learning_rate": 7.668161434977578e-07,
+       "loss": 0.0627,
+       "step": 550
+     },
+     {
+       "epoch": 1.233011069076643,
+       "eval_loss": 0.06428983807563782,
+       "eval_runtime": 477.8813,
+       "eval_samples_per_second": 6.288,
+       "eval_steps_per_second": 0.787,
+       "step": 550
+     },
+     {
+       "epoch": 1.2554294521507636,
+       "grad_norm": 5.775433133659037,
+       "learning_rate": 7.443946188340807e-07,
+       "loss": 0.0613,
+       "step": 560
+     },
+     {
+       "epoch": 1.2778478352248843,
+       "grad_norm": 7.504088139984216,
+       "learning_rate": 7.219730941704035e-07,
+       "loss": 0.0625,
+       "step": 570
+     },
+     {
+       "epoch": 1.3002662182990052,
+       "grad_norm": 12.423874958836857,
+       "learning_rate": 6.995515695067265e-07,
+       "loss": 0.0541,
+       "step": 580
+     },
+     {
+       "epoch": 1.322684601373126,
+       "grad_norm": 13.037612748873672,
+       "learning_rate": 6.771300448430492e-07,
+       "loss": 0.0638,
+       "step": 590
+     },
+     {
+       "epoch": 1.3451029844472466,
+       "grad_norm": 7.643633731840427,
+       "learning_rate": 6.547085201793722e-07,
+       "loss": 0.0528,
+       "step": 600
+     },
+     {
+       "epoch": 1.3451029844472466,
+       "eval_loss": 0.06060384213924408,
+       "eval_runtime": 477.628,
+       "eval_samples_per_second": 6.292,
+       "eval_steps_per_second": 0.787,
+       "step": 600
+     },
+     {
+       "epoch": 1.3675213675213675,
+       "grad_norm": 9.627024940339941,
+       "learning_rate": 6.322869955156951e-07,
+       "loss": 0.0569,
+       "step": 610
+     },
+     {
+       "epoch": 1.3899397505954882,
+       "grad_norm": 14.780208722543843,
+       "learning_rate": 6.098654708520179e-07,
+       "loss": 0.0504,
+       "step": 620
+     },
+     {
+       "epoch": 1.4123581336696092,
+       "grad_norm": 7.999021557044027,
+       "learning_rate": 5.874439461883408e-07,
+       "loss": 0.0609,
+       "step": 630
+     },
+     {
+       "epoch": 1.4347765167437299,
+       "grad_norm": 8.443684543866313,
+       "learning_rate": 5.650224215246636e-07,
+       "loss": 0.0531,
+       "step": 640
+     },
+     {
+       "epoch": 1.4571948998178508,
+       "grad_norm": 3.5781417864085086,
+       "learning_rate": 5.426008968609865e-07,
+       "loss": 0.0542,
+       "step": 650
+     },
+     {
+       "epoch": 1.4571948998178508,
+       "eval_loss": 0.057715680450201035,
+       "eval_runtime": 477.9099,
+       "eval_samples_per_second": 6.288,
+       "eval_steps_per_second": 0.787,
+       "step": 650
+     },
+     {
+       "epoch": 1.4796132828919715,
+       "grad_norm": 6.394299603034078,
+       "learning_rate": 5.201793721973094e-07,
+       "loss": 0.0549,
+       "step": 660
+     },
+     {
+       "epoch": 1.5020316659660922,
+       "grad_norm": 6.251887881398542,
+       "learning_rate": 4.977578475336322e-07,
+       "loss": 0.0536,
+       "step": 670
+     },
+     {
+       "epoch": 1.524450049040213,
+       "grad_norm": 4.45363233623469,
+       "learning_rate": 4.753363228699551e-07,
+       "loss": 0.0519,
+       "step": 680
+     },
+     {
+       "epoch": 1.5468684321143338,
+       "grad_norm": 7.236100371063919,
+       "learning_rate": 4.5291479820627797e-07,
+       "loss": 0.0516,
+       "step": 690
+     },
+     {
+       "epoch": 1.5692868151884545,
+       "grad_norm": 11.475278894303377,
+       "learning_rate": 4.304932735426009e-07,
+       "loss": 0.0463,
+       "step": 700
+     },
+     {
+       "epoch": 1.5692868151884545,
+       "eval_loss": 0.05437139794230461,
+       "eval_runtime": 477.9054,
+       "eval_samples_per_second": 6.288,
+       "eval_steps_per_second": 0.787,
+       "step": 700
+     },
+     {
+       "epoch": 1.5917051982625754,
+       "grad_norm": 6.676408323109306,
+       "learning_rate": 4.0807174887892375e-07,
+       "loss": 0.0532,
+       "step": 710
+     },
+     {
+       "epoch": 1.614123581336696,
+       "grad_norm": 4.965760634983937,
+       "learning_rate": 3.856502242152466e-07,
+       "loss": 0.0545,
+       "step": 720
+     },
+     {
+       "epoch": 1.6365419644108168,
+       "grad_norm": 27.900762279082834,
+       "learning_rate": 3.632286995515695e-07,
+       "loss": 0.0544,
+       "step": 730
+     },
+     {
+       "epoch": 1.6589603474849377,
+       "grad_norm": 5.498723625213236,
+       "learning_rate": 3.4080717488789235e-07,
+       "loss": 0.0518,
+       "step": 740
+     },
+     {
+       "epoch": 1.6813787305590584,
+       "grad_norm": 5.5744858185141775,
+       "learning_rate": 3.183856502242152e-07,
+       "loss": 0.0463,
+       "step": 750
+     },
+     {
+       "epoch": 1.6813787305590584,
+       "eval_loss": 0.05197111889719963,
+       "eval_runtime": 477.8827,
+       "eval_samples_per_second": 6.288,
+       "eval_steps_per_second": 0.787,
+       "step": 750
+     },
+     {
+       "epoch": 1.703797113633179,
+       "grad_norm": 5.741468990313337,
+       "learning_rate": 2.9596412556053813e-07,
+       "loss": 0.0454,
+       "step": 760
+     },
+     {
+       "epoch": 1.7262154967073,
+       "grad_norm": 9.331526709026493,
+       "learning_rate": 2.73542600896861e-07,
+       "loss": 0.0448,
+       "step": 770
+     },
+     {
+       "epoch": 1.748633879781421,
+       "grad_norm": 8.801924259928679,
+       "learning_rate": 2.5112107623318386e-07,
+       "loss": 0.0491,
+       "step": 780
+     },
+     {
+       "epoch": 1.7710522628555414,
+       "grad_norm": 3.9216347240361435,
+       "learning_rate": 2.2869955156950672e-07,
+       "loss": 0.0456,
+       "step": 790
+     },
+     {
+       "epoch": 1.7934706459296623,
+       "grad_norm": 5.775711795911055,
+       "learning_rate": 2.062780269058296e-07,
+       "loss": 0.0434,
+       "step": 800
+     },
+     {
+       "epoch": 1.7934706459296623,
+       "eval_loss": 0.04976892098784447,
+       "eval_runtime": 477.8498,
+       "eval_samples_per_second": 6.289,
+       "eval_steps_per_second": 0.787,
+       "step": 800
+     },
+     {
+       "epoch": 1.8158890290037832,
+       "grad_norm": 15.832420424237657,
+       "learning_rate": 1.8385650224215245e-07,
+       "loss": 0.0549,
+       "step": 810
+     },
+     {
+       "epoch": 1.838307412077904,
+       "grad_norm": 11.331927257096979,
+       "learning_rate": 1.6143497757847531e-07,
+       "loss": 0.0479,
+       "step": 820
+     },
+     {
+       "epoch": 1.8607257951520246,
+       "grad_norm": 3.4779552694261846,
+       "learning_rate": 1.390134529147982e-07,
+       "loss": 0.045,
+       "step": 830
+     },
+     {
+       "epoch": 1.8831441782261455,
+       "grad_norm": 14.096117466941674,
+       "learning_rate": 1.1659192825112107e-07,
+       "loss": 0.0468,
+       "step": 840
+     },
+     {
+       "epoch": 1.9055625613002662,
+       "grad_norm": 4.258023342595773,
+       "learning_rate": 9.417040358744393e-08,
+       "loss": 0.043,
+       "step": 850
+     },
+     {
+       "epoch": 1.9055625613002662,
+       "eval_loss": 0.047411367297172546,
+       "eval_runtime": 478.0538,
+       "eval_samples_per_second": 6.286,
+       "eval_steps_per_second": 0.787,
+       "step": 850
+     },
+     {
+       "epoch": 1.927980944374387,
+       "grad_norm": 4.472323710821738,
+       "learning_rate": 7.174887892376681e-08,
+       "loss": 0.0485,
+       "step": 860
+     },
+     {
+       "epoch": 1.9503993274485079,
+       "grad_norm": 8.238550394871776,
+       "learning_rate": 4.932735426008968e-08,
+       "loss": 0.0461,
+       "step": 870
+     },
+     {
+       "epoch": 1.9728177105226286,
+       "grad_norm": 3.661996756017377,
+       "learning_rate": 2.6905829596412556e-08,
+       "loss": 0.0414,
+       "step": 880
+     },
+     {
+       "epoch": 1.9952360935967492,
+       "grad_norm": 4.240752961770031,
+       "learning_rate": 4.484304932735426e-09,
+       "loss": 0.0496,
+       "step": 890
+     }
+   ],
+   "logging_steps": 10,
+   "max_steps": 892,
+   "num_input_tokens_seen": 0,
+   "num_train_epochs": 2,
+   "save_steps": 360,
+   "stateful_callbacks": {
+     "TrainerControl": {
+       "args": {
+         "should_epoch_stop": false,
+         "should_evaluate": false,
+         "should_log": false,
+         "should_save": true,
+         "should_training_stop": true
+       },
+       "attributes": {}
+     }
+   },
+   "total_flos": 1.1364495346169283e+18,
+   "train_batch_size": 1,
+   "trial_name": null,
+   "trial_params": null
+ }
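The log above records training loss every 10 steps and an evaluation every 50 steps, with eval loss falling from about 0.121 at step 50 to about 0.047 at step 850. The curve can be pulled out of the file directly:

```python
# Extract the eval-loss curve recorded in trainer_state.json.
import json

with open("trainer_state.json") as f:
    state = json.load(f)

for entry in state["log_history"]:
    if "eval_loss" in entry:
        print(entry["step"], entry["eval_loss"])  # 50 0.1209... down to 850 0.0474...
```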
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7ce387afef464020256b03b489b6c74423b942a6a25930f553fe109bdb623979
+ size 6968