bigscience-bot committed on
Commit 1b1b903 · 1 Parent(s): ade1445
Files changed (1)
  1. logs/main_log.txt +176 -0
logs/main_log.txt CHANGED
@@ -57155,3 +57155,179 @@ time (ms)
57155
  time (ms)
57156
  iteration 48/ 292968 | consumed samples: 98304 | consumed tokens: 6291456 | elapsed time per iteration (ms): 95739.1 | learning rate: 2.727E-05 | global batch size: 2048 | lm loss: 8.545946E+00 | loss scale: 4096.0 | grad norm: 47054.317 | num zeros: 0.0 | curriculum seqlen: 64 | number of skipped iterations: 0 | number of nan iterations: 0 |
57157
  time (ms)
57158
+ iteration 49/ 292968 | consumed samples: 100352 | consumed tokens: 6422528 | elapsed time per iteration (ms): 94691.8 | learning rate: 2.783E-05 | global batch size: 2048 | lm loss: 8.737078E+00 | loss scale: 4096.0 | grad norm: 147984.860 | num zeros: 0.0 | curriculum seqlen: 64 | number of skipped iterations: 0 | number of nan iterations: 0 |
57159
+ time (ms)
57160
+ iteration 50/ 292968 | consumed samples: 102400 | consumed tokens: 6553600 | elapsed time per iteration (ms): 96272.3 | learning rate: 2.840E-05 | global batch size: 2048 | lm loss: 8.645372E+00 | loss scale: 4096.0 | grad norm: 100115.276 | num zeros: 0.0 | curriculum seqlen: 64 | number of skipped iterations: 0 | number of nan iterations: 0 |
57161
+ time (ms)
57162
+ iteration 51/ 292968 | consumed samples: 104448 | consumed tokens: 6684672 | elapsed time per iteration (ms): 96225.8 | learning rate: 2.897E-05 | global batch size: 2048 | lm loss: 8.786609E+00 | loss scale: 4096.0 | grad norm: 138446.949 | num zeros: 0.0 | curriculum seqlen: 64 | number of skipped iterations: 0 | number of nan iterations: 0 |
57163
+ time (ms)
57164
+ iteration 52/ 292968 | consumed samples: 106496 | consumed tokens: 6815744 | elapsed time per iteration (ms): 93767.5 | learning rate: 2.954E-05 | global batch size: 2048 | lm loss: 8.520951E+00 | loss scale: 4096.0 | grad norm: 72259.747 | num zeros: 0.0 | curriculum seqlen: 64 | number of skipped iterations: 0 | number of nan iterations: 0 |
57165
+ time (ms)
57166
+ iteration 53/ 292968 | consumed samples: 108544 | consumed tokens: 6946816 | elapsed time per iteration (ms): 95896.3 | learning rate: 3.011E-05 | global batch size: 2048 | lm loss: 8.274112E+00 | loss scale: 4096.0 | grad norm: 30192.728 | num zeros: 0.0 | curriculum seqlen: 64 | number of skipped iterations: 0 | number of nan iterations: 0 |
57167
+ time (ms)
57168
+ iteration 54/ 292968 | consumed samples: 110592 | consumed tokens: 7077888 | elapsed time per iteration (ms): 94348.2 | learning rate: 3.067E-05 | global batch size: 2048 | lm loss: 8.363799E+00 | loss scale: 4096.0 | grad norm: 70109.113 | num zeros: 0.0 | curriculum seqlen: 64 | number of skipped iterations: 0 | number of nan iterations: 0 |
57169
+ time (ms)
57170
+ iteration 55/ 292968 | consumed samples: 112640 | consumed tokens: 7208960 | elapsed time per iteration (ms): 96086.5 | learning rate: 3.124E-05 | global batch size: 2048 | lm loss: 8.283342E+00 | loss scale: 4096.0 | grad norm: 32869.639 | num zeros: 0.0 | curriculum seqlen: 64 | number of skipped iterations: 0 | number of nan iterations: 0 |
57171
+ time (ms)
57172
+ Killing subprocess 4027110
57173
+ slurmstepd: error: *** STEP 1655850.0 ON r6i4n5 CANCELLED AT 2021-10-22T19:05:02 ***
57174
+ Killing subprocess 4027111
57175
+ Killing subprocess 4027112
57176
+ Killing subprocess 1360613
57177
+ Killing subprocess 4027114
57178
+ Main process received SIGTERM, exiting
57179
+ Killing subprocess 512071
57180
+ Killing subprocess 1360614
57181
+ Killing subprocess 392892
57182
+ Killing subprocess 1360615
57183
+ Killing subprocess 2123183
57184
+ Killing subprocess 1360617
57185
+ Killing subprocess 392893
57186
+ Killing subprocess 512072
57187
+ Killing subprocess 512073
57188
+ Killing subprocess 2123184
57189
+ Killing subprocess 512074
57190
+ Killing subprocess 392894
57191
+ Main process received SIGTERM, exiting
57192
+ Killing subprocess 1339268
57193
+ Main process received SIGTERM, exiting
57194
+ Killing subprocess 4161749
57195
+ Killing subprocess 392895
57196
+ Killing subprocess 4117263
57197
+ Main process received SIGTERM, exiting
57198
+ Killing subprocess 1339269
57199
+ Killing subprocess 642447
57200
+ Killing subprocess 4161750
57201
+ Killing subprocess 4161751
57202
+ Killing subprocess 1311333
57203
+ Killing subprocess 1795309
57204
+ Killing subprocess 1084264
57205
+ Killing subprocess 2354914
57206
+ Killing subprocess 4161753
57207
+ Killing subprocess 1311334
57208
+ Killing subprocess 2123185
57209
+ Killing subprocess 1036398
57210
+ Killing subprocess 1795310
57211
+ Killing subprocess 105206
57212
+ Killing subprocess 2123186
57213
+ Killing subprocess 1084265
57214
+ Main process received SIGTERM, exiting
57215
+ Killing subprocess 1795311
57216
+ Killing subprocess 2354347
57217
+ Killing subprocess 2354915
57218
+ Killing subprocess 642448
57219
+ Killing subprocess 533215
57220
+ Killing subprocess 1084266
57221
+ Killing subprocess 1339270
57222
+ Killing subprocess 105207
57223
+ Killing subprocess 4117264
57224
+ Killing subprocess 23037
57225
+ Killing subprocess 19863
57226
+ Killing subprocess 1311335
57227
+ Killing subprocess 3570489
57228
+ Killing subprocess 837305
57229
+ Killing subprocess 743334
57230
+ Killing subprocess 1339271
57231
+ Killing subprocess 4117265
57232
+ Killing subprocess 2354348
57233
+ Killing subprocess 1036399
57234
+ Killing subprocess 3570490
57235
+ Killing subprocess 1484262
57236
+ Killing subprocess 533216
57237
+ Killing subprocess 743335
57238
+ Killing subprocess 1084268
57239
+ Main process received SIGTERM, exiting
57240
+ Killing subprocess 4117266
57241
+ Killing subprocess 23038
57242
+ Killing subprocess 19864
57243
+ Killing subprocess 2354349
57244
+ Killing subprocess 1036400
57245
+ Killing subprocess 2354916
57246
+ Killing subprocess 1795313
57247
+ Killing subprocess 837306
57248
+ Killing subprocess 105208
57249
+ Main process received SIGTERM, exiting
57250
+ Killing subprocess 23039
57251
+ Killing subprocess 19865
57252
+ Killing subprocess 2354350
57253
+ Killing subprocess 2354918
57254
+ Killing subprocess 3570491
57255
+ Killing subprocess 2376378
57256
+ Main process received SIGTERM, exiting
57257
+ Killing subprocess 1484263
57258
+ Killing subprocess 837307
57259
+ Killing subprocess 1380399
57260
+ Killing subprocess 642449
57261
+ Killing subprocess 533217
57262
+ Killing subprocess 743336
57263
+ Killing subprocess 105209
57264
+ Killing subprocess 19867
57265
+ Killing subprocess 1311336
57266
+ Killing subprocess 572722
57267
+ Killing subprocess 700598
57268
+ Main process received SIGTERM, exiting
57269
+ Killing subprocess 2376379
57270
+ Killing subprocess 2581538
57271
+ Killing subprocess 837309
57272
+ Killing subprocess 1380400
57273
+ Killing subprocess 642450
57274
+ Killing subprocess 533218
57275
+ Killing subprocess 743337
57276
+ Main process received SIGTERM, exiting
57277
+ Killing subprocess 23040
57278
+ Main process received SIGTERM, exiting
57279
+ Killing subprocess 1036401
57280
+ Main process received SIGTERM, exiting
57281
+ Killing subprocess 572723
57282
+ Killing subprocess 1729138
57283
+ Killing subprocess 2376380
57284
+ Killing subprocess 1484264
57285
+ Main process received SIGTERM, exiting
57286
+ Main process received SIGTERM, exiting
57287
+ Main process received SIGTERM, exiting
57288
+ Main process received SIGTERM, exiting
57289
+ Main process received SIGTERM, exiting
57290
+ Main process received SIGTERM, exiting
57291
+ Killing subprocess 572724
57292
+ Killing subprocess 700599
57293
+ Killing subprocess 1959733
57294
+ Killing subprocess 1618027
57295
+ Killing subprocess 1654112
57296
+ Killing subprocess 2376381
57297
+ Killing subprocess 1484266
57298
+ Killing subprocess 2581539
57299
+ Killing subprocess 1380401
57300
+ Main process received SIGTERM, exiting
57301
+ Killing subprocess 700600
57302
+ Killing subprocess 3570492
57303
+ Main process received SIGTERM, exiting
57304
+ Killing subprocess 1380403
57305
+ Main process received SIGTERM, exiting
57306
+ Killing subprocess 1959734
57307
+ Killing subprocess 1618028
57308
+ Main process received SIGTERM, exiting
57309
+ Main process received SIGTERM, exiting
57310
+ Killing subprocess 1729139
57311
+ Killing subprocess 1654113
57312
+ Killing subprocess 2581540
57313
+ Killing subprocess 572725
57314
+ Killing subprocess 700601
57315
+ Killing subprocess 1618029
57316
+ Killing subprocess 2581542
57317
+ Main process received SIGTERM, exiting
57318
+ Main process received SIGTERM, exiting
57319
+ Killing subprocess 1959735
57320
+ Main process received SIGTERM, exiting
57321
+ Killing subprocess 1618030
57322
+ Killing subprocess 1654114
57323
+ Killing subprocess 1729140
57324
+ Killing subprocess 1729141
57325
+ Killing subprocess 1959737
57326
+ Main process received SIGTERM, exiting
57327
+ Killing subprocess 1654116
57328
+ Main process received SIGTERM, exiting
57329
+ Main process received SIGTERM, exiting
57330
+ Main process received SIGTERM, exiting
57331
+ Main process received SIGTERM, exiting
57332
+ Main process received SIGTERM, exiting
57333
+ srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
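
The iteration lines in this log are pipe-separated `key: value` fields, so they can be checked mechanically. The following is a minimal sketch, not part of the original log or of the training code: the helper name `parse_iteration_line` and the variables `SAMPLE` and `rec` are hypothetical, and the sample line is iteration 49 copied from the diff above. It parses one line and confirms that the counters are mutually consistent (consumed samples times curriculum seqlen equals consumed tokens, and consumed samples divided by global batch size equals the iteration number).

```python
import re


def parse_iteration_line(line: str) -> dict:
    """Parse one 'iteration N/ M | key: value | ...' line into a dict.

    Hypothetical helper for illustration; it assumes every field after
    the first has the form 'key: numeric value', as in the log above.
    """
    # Drop the leading '+' that the diff view adds to new lines.
    line = line.strip().lstrip("+").strip()
    fields = [f.strip() for f in line.split("|") if f.strip()]

    match = re.match(r"iteration\s+(\d+)\s*/\s*(\d+)", fields[0])
    if match is None:
        raise ValueError("not an iteration line")

    record = {
        "iteration": int(match.group(1)),
        "total_iterations": int(match.group(2)),
    }
    for field in fields[1:]:
        # Split on the last colon so keys such as
        # 'elapsed time per iteration (ms)' stay intact.
        key, _, value = field.rpartition(":")
        record[key.strip()] = float(value)
    return record


# Iteration 49, copied verbatim from the log above.
SAMPLE = (
    "+ iteration 49/ 292968 | consumed samples: 100352 | "
    "consumed tokens: 6422528 | elapsed time per iteration (ms): 94691.8 | "
    "learning rate: 2.783E-05 | global batch size: 2048 | "
    "lm loss: 8.737078E+00 | loss scale: 4096.0 | grad norm: 147984.860 | "
    "num zeros: 0.0 | curriculum seqlen: 64 | "
    "number of skipped iterations: 0 | number of nan iterations: 0 |"
)

rec = parse_iteration_line(SAMPLE)

# Consistency checks between the counters on the same line.
tokens_ok = rec["consumed samples"] * rec["curriculum seqlen"] == rec["consumed tokens"]
iteration_ok = rec["consumed samples"] / rec["global batch size"] == rec["iteration"]
print(rec["iteration"], rec["lm loss"], tokens_ok, iteration_ok)
```

Applied to every iteration line, the same parser would make it easy to tabulate `lm loss` and `grad norm` up to the point where the job step was cancelled.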