bigscience-bot committed on
Commit 8457336 · 1 Parent(s): 44b1547
Files changed (1)
  1. logs/main_log.txt +68 -0
logs/main_log.txt CHANGED
@@ -86889,3 +86889,71 @@ time (ms)
86889
  time (ms)
86890
  iteration 1256/ 292968 | consumed samples: 2572288 | consumed tokens: 243695616 | elapsed time per iteration (ms): 91479.6 | learning rate: 6.859E-05 | global batch size: 2048 | lm loss: 4.337495E+00 | loss scale: 16384.0 | grad norm: 9382.482 | num zeros: 0.0 | curriculum seqlen: 128 | number of skipped iterations: 0 | number of nan iterations: 0 |
86891
  time (ms)
86892
+ iteration 1257/ 292968 | consumed samples: 2574336 | consumed tokens: 243957760 | elapsed time per iteration (ms): 89077.2 | learning rate: 6.865E-05 | global batch size: 2048 | lm loss: 4.360833E+00 | loss scale: 16384.0 | grad norm: 10931.909 | num zeros: 0.0 | curriculum seqlen: 128 | number of skipped iterations: 0 | number of nan iterations: 0 |
86893
+ time (ms)
86894
+ iteration 1258/ 292968 | consumed samples: 2576384 | consumed tokens: 244219904 | elapsed time per iteration (ms): 89543.6 | learning rate: 6.870E-05 | global batch size: 2048 | lm loss: 4.355038E+00 | loss scale: 16384.0 | grad norm: 12315.148 | num zeros: 0.0 | curriculum seqlen: 128 | number of skipped iterations: 0 | number of nan iterations: 0 |
86895
+ time (ms)
86896
+ iteration 1259/ 292968 | consumed samples: 2578432 | consumed tokens: 244482048 | elapsed time per iteration (ms): 86626.2 | learning rate: 6.876E-05 | global batch size: 2048 | lm loss: 4.332624E+00 | loss scale: 16384.0 | grad norm: 9028.785 | num zeros: 0.0 | curriculum seqlen: 128 | number of skipped iterations: 0 | number of nan iterations: 0 |
86897
+ time (ms)
86898
+ iteration 1260/ 292968 | consumed samples: 2580480 | consumed tokens: 244744192 | elapsed time per iteration (ms): 88403.0 | learning rate: 6.881E-05 | global batch size: 2048 | lm loss: 4.353878E+00 | loss scale: 16384.0 | grad norm: 8587.953 | num zeros: 0.0 | curriculum seqlen: 128 | number of skipped iterations: 0 | number of nan iterations: 0 |
86899
+ time (ms)
86900
+ iteration 1261/ 292968 | consumed samples: 2582528 | consumed tokens: 245006336 | elapsed time per iteration (ms): 90653.6 | learning rate: 6.887E-05 | global batch size: 2048 | lm loss: 4.406543E+00 | loss scale: 16384.0 | grad norm: 8519.735 | num zeros: 0.0 | curriculum seqlen: 128 | number of skipped iterations: 0 | number of nan iterations: 0 |
86901
+ time (ms)
86902
+ iteration 1262/ 292968 | consumed samples: 2584576 | consumed tokens: 245268480 | elapsed time per iteration (ms): 101721.7 | learning rate: 6.892E-05 | global batch size: 2048 | lm loss: 4.337947E+00 | loss scale: 16384.0 | grad norm: 10856.149 | num zeros: 0.0 | curriculum seqlen: 128 | number of skipped iterations: 0 | number of nan iterations: 0 |
86903
+ time (ms)
86904
+ iteration 1263/ 292968 | consumed samples: 2586624 | consumed tokens: 245530624 | elapsed time per iteration (ms): 98966.3 | learning rate: 6.898E-05 | global batch size: 2048 | lm loss: 4.345151E+00 | loss scale: 16384.0 | grad norm: 12642.575 | num zeros: 0.0 | curriculum seqlen: 128 | number of skipped iterations: 0 | number of nan iterations: 0 |
86905
+ time (ms)
86906
+ iteration 1264/ 292968 | consumed samples: 2588672 | consumed tokens: 245792768 | elapsed time per iteration (ms): 104276.2 | learning rate: 6.903E-05 | global batch size: 2048 | lm loss: 4.373935E+00 | loss scale: 16384.0 | grad norm: 13739.412 | num zeros: 0.0 | curriculum seqlen: 128 | number of skipped iterations: 0 | number of nan iterations: 0 |
86907
+ time (ms)
86908
+ iteration 1265/ 292968 | consumed samples: 2590720 | consumed tokens: 246054912 | elapsed time per iteration (ms): 106458.8 | learning rate: 6.909E-05 | global batch size: 2048 | lm loss: 4.336057E+00 | loss scale: 16384.0 | grad norm: 13718.934 | num zeros: 0.0 | curriculum seqlen: 128 | number of skipped iterations: 0 | number of nan iterations: 0 |
86909
+ time (ms)
86910
+ iteration 1266/ 292968 | consumed samples: 2592768 | consumed tokens: 246317056 | elapsed time per iteration (ms): 109558.3 | learning rate: 6.914E-05 | global batch size: 2048 | lm loss: 4.348790E+00 | loss scale: 16384.0 | grad norm: 15140.293 | num zeros: 0.0 | curriculum seqlen: 128 | number of skipped iterations: 0 | number of nan iterations: 0 |
86911
+ time (ms)
86912
+ iteration 1267/ 292968 | consumed samples: 2594816 | consumed tokens: 246579200 | elapsed time per iteration (ms): 101169.1 | learning rate: 6.920E-05 | global batch size: 2048 | lm loss: 4.336976E+00 | loss scale: 16384.0 | grad norm: 18580.935 | num zeros: 0.0 | curriculum seqlen: 128 | number of skipped iterations: 0 | number of nan iterations: 0 |
86913
+ time (ms)
86914
+ iteration 1268/ 292968 | consumed samples: 2596864 | consumed tokens: 246841344 | elapsed time per iteration (ms): 103186.3 | learning rate: 6.925E-05 | global batch size: 2048 | lm loss: 4.351308E+00 | loss scale: 16384.0 | grad norm: 9034.022 | num zeros: 0.0 | curriculum seqlen: 128 | number of skipped iterations: 0 | number of nan iterations: 0 |
86915
+ time (ms)
86916
+ iteration 1269/ 292968 | consumed samples: 2598912 | consumed tokens: 247103488 | elapsed time per iteration (ms): 103322.1 | learning rate: 6.930E-05 | global batch size: 2048 | lm loss: 4.338009E+00 | loss scale: 16384.0 | grad norm: 10030.218 | num zeros: 0.0 | curriculum seqlen: 128 | number of skipped iterations: 0 | number of nan iterations: 0 |
86917
+ time (ms)
86918
+ iteration 1270/ 292968 | consumed samples: 2600960 | consumed tokens: 247365632 | elapsed time per iteration (ms): 104430.5 | learning rate: 6.936E-05 | global batch size: 2048 | lm loss: 4.323060E+00 | loss scale: 16384.0 | grad norm: 10375.946 | num zeros: 0.0 | curriculum seqlen: 128 | number of skipped iterations: 0 | number of nan iterations: 0 |
86919
+ time (ms)
86920
+ iteration 1271/ 292968 | consumed samples: 2603008 | consumed tokens: 247627776 | elapsed time per iteration (ms): 101797.9 | learning rate: 6.941E-05 | global batch size: 2048 | lm loss: 4.337749E+00 | loss scale: 16384.0 | grad norm: 8465.022 | num zeros: 0.0 | curriculum seqlen: 128 | number of skipped iterations: 0 | number of nan iterations: 0 |
86921
+ time (ms)
86922
+ iteration 1272/ 292968 | consumed samples: 2605056 | consumed tokens: 247889920 | elapsed time per iteration (ms): 105815.4 | learning rate: 6.947E-05 | global batch size: 2048 | lm loss: 4.322408E+00 | loss scale: 16384.0 | grad norm: 8592.805 | num zeros: 0.0 | curriculum seqlen: 128 | number of skipped iterations: 0 | number of nan iterations: 0 |
86923
+ time (ms)
86924
+ iteration 1273/ 292968 | consumed samples: 2607104 | consumed tokens: 248152064 | elapsed time per iteration (ms): 108179.9 | learning rate: 6.952E-05 | global batch size: 2048 | lm loss: 4.321740E+00 | loss scale: 16384.0 | grad norm: 10722.339 | num zeros: 0.0 | curriculum seqlen: 128 | number of skipped iterations: 0 | number of nan iterations: 0 |
86925
+ time (ms)
86926
+ iteration 1274/ 292968 | consumed samples: 2609152 | consumed tokens: 248414208 | elapsed time per iteration (ms): 110063.2 | learning rate: 6.958E-05 | global batch size: 2048 | lm loss: 4.321163E+00 | loss scale: 16384.0 | grad norm: 12199.826 | num zeros: 0.0 | curriculum seqlen: 128 | number of skipped iterations: 0 | number of nan iterations: 0 |
86927
+ time (ms)
86928
+ iteration 1275/ 292968 | consumed samples: 2611200 | consumed tokens: 248676352 | elapsed time per iteration (ms): 112486.2 | learning rate: 6.963E-05 | global batch size: 2048 | lm loss: 4.359476E+00 | loss scale: 16384.0 | grad norm: 13015.753 | num zeros: 0.0 | curriculum seqlen: 128 | number of skipped iterations: 0 | number of nan iterations: 0 |
86929
+ time (ms)
86930
+ iteration 1276/ 292968 | consumed samples: 2613248 | consumed tokens: 248938496 | elapsed time per iteration (ms): 119132.6 | learning rate: 6.969E-05 | global batch size: 2048 | lm loss: 4.368865E+00 | loss scale: 16384.0 | grad norm: 12810.900 | num zeros: 0.0 | curriculum seqlen: 128 | number of skipped iterations: 0 | number of nan iterations: 0 |
86931
+ time (ms)
86932
+ iteration 1277/ 292968 | consumed samples: 2615296 | consumed tokens: 249200640 | elapsed time per iteration (ms): 124483.3 | learning rate: 6.974E-05 | global batch size: 2048 | lm loss: 4.319435E+00 | loss scale: 16384.0 | grad norm: 11086.670 | num zeros: 0.0 | curriculum seqlen: 128 | number of skipped iterations: 0 | number of nan iterations: 0 |
86933
+ time (ms)
86934
+ iteration 1278/ 292968 | consumed samples: 2617344 | consumed tokens: 249462784 | elapsed time per iteration (ms): 131501.7 | learning rate: 6.980E-05 | global batch size: 2048 | lm loss: 4.343135E+00 | loss scale: 16384.0 | grad norm: 10249.176 | num zeros: 0.0 | curriculum seqlen: 128 | number of skipped iterations: 0 | number of nan iterations: 0 |
86935
+ time (ms)
86936
+ iteration 1279/ 292968 | consumed samples: 2619392 | consumed tokens: 249724928 | elapsed time per iteration (ms): 122263.3 | learning rate: 6.985E-05 | global batch size: 2048 | lm loss: 4.333991E+00 | loss scale: 16384.0 | grad norm: 8418.978 | num zeros: 0.0 | curriculum seqlen: 128 | number of skipped iterations: 0 | number of nan iterations: 0 |
86937
+ time (ms)
86938
+ iteration 1280/ 292968 | consumed samples: 2621440 | consumed tokens: 249987072 | elapsed time per iteration (ms): 125027.7 | learning rate: 6.991E-05 | global batch size: 2048 | lm loss: 4.344658E+00 | loss scale: 16384.0 | grad norm: 9345.066 | num zeros: 0.0 | curriculum seqlen: 128 | number of skipped iterations: 0 | number of nan iterations: 0 |
86939
+ time (ms)
86940
+ iteration 1281/ 292968 | consumed samples: 2623488 | consumed tokens: 250249216 | elapsed time per iteration (ms): 119818.3 | learning rate: 6.996E-05 | global batch size: 2048 | lm loss: 4.340658E+00 | loss scale: 16384.0 | grad norm: 11343.930 | num zeros: 0.0 | curriculum seqlen: 128 | number of skipped iterations: 0 | number of nan iterations: 0 |
86941
+ time (ms)
86942
+ iteration 1282/ 292968 | consumed samples: 2625536 | consumed tokens: 250511360 | elapsed time per iteration (ms): 107960.9 | learning rate: 7.001E-05 | global batch size: 2048 | lm loss: 4.367644E+00 | loss scale: 16384.0 | grad norm: 11059.651 | num zeros: 0.0 | curriculum seqlen: 128 | number of skipped iterations: 0 | number of nan iterations: 0 |
86943
+ time (ms)
86944
+ iteration 1283/ 292968 | consumed samples: 2627584 | consumed tokens: 250773504 | elapsed time per iteration (ms): 103476.2 | learning rate: 7.007E-05 | global batch size: 2048 | lm loss: 4.343670E+00 | loss scale: 16384.0 | grad norm: 9443.485 | num zeros: 0.0 | curriculum seqlen: 128 | number of skipped iterations: 0 | number of nan iterations: 0 |
86945
+ time (ms)
86946
+ iteration 1284/ 292968 | consumed samples: 2629632 | consumed tokens: 251035648 | elapsed time per iteration (ms): 113204.7 | learning rate: 7.012E-05 | global batch size: 2048 | lm loss: 4.341036E+00 | loss scale: 16384.0 | grad norm: 10326.934 | num zeros: 0.0 | curriculum seqlen: 128 | number of skipped iterations: 0 | number of nan iterations: 0 |
86947
+ time (ms)
86948
+ iteration 1285/ 292968 | consumed samples: 2631680 | consumed tokens: 251297792 | elapsed time per iteration (ms): 101453.0 | learning rate: 7.018E-05 | global batch size: 2048 | lm loss: 4.335133E+00 | loss scale: 16384.0 | grad norm: 13935.373 | num zeros: 0.0 | curriculum seqlen: 128 | number of skipped iterations: 0 | number of nan iterations: 0 |
86949
+ time (ms)
86950
+ iteration 1286/ 292968 | consumed samples: 2633728 | consumed tokens: 251559936 | elapsed time per iteration (ms): 101126.4 | learning rate: 7.023E-05 | global batch size: 2048 | lm loss: 4.328067E+00 | loss scale: 16384.0 | grad norm: 13261.563 | num zeros: 0.0 | curriculum seqlen: 128 | number of skipped iterations: 0 | number of nan iterations: 0 |
86951
+ time (ms)
86952
+ iteration 1287/ 292968 | consumed samples: 2635776 | consumed tokens: 251822080 | elapsed time per iteration (ms): 101433.7 | learning rate: 7.029E-05 | global batch size: 2048 | lm loss: 4.332537E+00 | loss scale: 16384.0 | grad norm: 10151.353 | num zeros: 0.0 | curriculum seqlen: 128 | number of skipped iterations: 0 | number of nan iterations: 0 |
86953
+ time (ms)
86954
+ iteration 1288/ 292968 | consumed samples: 2637824 | consumed tokens: 252084224 | elapsed time per iteration (ms): 97179.0 | learning rate: 7.034E-05 | global batch size: 2048 | lm loss: 4.328178E+00 | loss scale: 16384.0 | grad norm: 12186.076 | num zeros: 0.0 | curriculum seqlen: 128 | number of skipped iterations: 0 | number of nan iterations: 0 |
86955
+ time (ms)
86956
+ iteration 1289/ 292968 | consumed samples: 2639872 | consumed tokens: 252346368 | elapsed time per iteration (ms): 97410.4 | learning rate: 7.040E-05 | global batch size: 2048 | lm loss: 4.303625E+00 | loss scale: 16384.0 | grad norm: 15999.316 | num zeros: 0.0 | curriculum seqlen: 128 | number of skipped iterations: 0 | number of nan iterations: 0 |
86957
+ time (ms)
86958
+ iteration 1290/ 292968 | consumed samples: 2641920 | consumed tokens: 252608512 | elapsed time per iteration (ms): 97712.4 | learning rate: 7.045E-05 | global batch size: 2048 | lm loss: 4.325552E+00 | loss scale: 16384.0 | grad norm: 17938.209 | num zeros: 0.0 | curriculum seqlen: 128 | number of skipped iterations: 0 | number of nan iterations: 0 |
86959
+ time (ms)
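The added lines all share one pipe-separated `key: value` layout, so they can be read back mechanically. The following is a minimal sketch, not part of the training code: `parse_log_line` is a hypothetical helper, and the sample line is copied from iteration 1257 above. It also checks a relationship visible in the log, namely that consumed tokens grow by global batch size times curriculum seqlen (2048 × 128 = 262144) per iteration.

```python
import re

# Sample line copied from the diff above (iteration 1257).
LINE = (
    "+ iteration 1257/ 292968 | consumed samples: 2574336 | "
    "consumed tokens: 243957760 | elapsed time per iteration (ms): 89077.2 | "
    "learning rate: 6.865E-05 | global batch size: 2048 | "
    "lm loss: 4.360833E+00 | loss scale: 16384.0 | grad norm: 10931.909 | "
    "num zeros: 0.0 | curriculum seqlen: 128 | "
    "number of skipped iterations: 0 | number of nan iterations: 0 |"
)


def parse_log_line(line):
    """Hypothetical helper: turn one 'iteration ...' line into a dict.

    Lines without any 'key: value' field (such as 'time (ms)') yield {}.
    """
    fields = {}
    # Drop the leading diff marker, then split on the pipe separators.
    for part in line.strip().lstrip("+").split("|"):
        part = part.strip()
        if not part:
            continue
        # The first field has the form 'iteration N/ TOTAL' with no colon.
        m = re.match(r"iteration\s+(\d+)/\s*(\d+)$", part)
        if m:
            fields["iteration"] = int(m.group(1))
            fields["total iterations"] = int(m.group(2))
            continue
        # Split on the last colon so keys like 'elapsed time ... (ms)' survive.
        key, sep, value = part.rpartition(":")
        if sep:
            fields[key.strip()] = float(value)
    return fields


record = parse_log_line(LINE)
tokens_per_iteration = record["global batch size"] * record["curriculum seqlen"]
# Iteration 1256 reports 243695616 consumed tokens (see the context line above).
assert record["consumed tokens"] - 243695616 == tokens_per_iteration
```

Every value is stored as a float for simplicity; a stricter parser could keep integer fields such as the sample and token counts as `int`.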