---
tags:
- rocm
- amd-gpus
- amd-ai
- rocm-ai
- rocm-rwkv
- 3B-rwkv
---

3B rocm-rwkv pth record.

This 3B is a little different from the usual 3B. This model has 48 layers, an embedding size of 2048, and a context length of 16384 (I think all the pth files have the same context size).

- rwkv-final-chnk5.pth: 3B rocm-rwkv model trained with SlimPajama chunk1-5, with a loss of 2.456.
- rwkv-final-chnk17.pth: 3B rocm-rwkv model trained with SlimPajama chunk1-10 for the first epoch, plus additional training with chunk1-7 after the first epoch, with a loss of 2.281.
- rwkv-code39-16012024.pth: 3B rocm-rwkv model trained with SlimPajama chunk1-10 for the first epoch, plus additional training with chunk1-8 after the first epoch, plus a little bit of code. This pth has a loss of 1.174 for code alone and 2.26 for text.
- rwkv-HHMIX-63x1-47-29012024.pth: 3B rocm-rwkv model starting with rwkv-code39-16012024.pth, plus a mix of multilingual data and code. This model has a loss of 2.065 for the code+multilingual dataset.
- rwkv-coder-63x1-104-29012024.pth: 3B rocm-rwkv model starting with rwkv-HHMIX-63x1-47-29012024.pth, plus more code (71.21 GTokens of code). This model has a loss of 1.090 for the code dataset.
- rwkv-final_HHMIX_chuk3.pth: 3B rocm-rwkv model starting with rwkv-coder-63x1-104-29012024.pth, plus a mix of multilingual data and code. This model has a loss of 1.836 for the code+multilingual dataset.
- rwkv-1epoch_N8_wrong_lr.pth: 3B rocm-rwkv model starting with the previous one (I think I may have added more code or random multilingual data, I don't remember), plus 3 additional chunks of my mix of (random) multilingual data and code, plus 3 chunks of my dataset soup: multilingual data (only languages whose characters differ from the English or Latin-Greek alphabets, e.g. Japanese, Cherokee, etc.), code, math, instruct, and chain of thought. This model has 1 epoch (step) on the N8 dataset, but with --lr_init 5e-7 --lr_final 5e-8. This pth has a loss of 1.978 for N8.
- rwkv-v5-stp2-N8.pth: 3B rocm-rwkv model starting with the previous one, plus two epochs of the N8 dataset with --lr_init 7e-6 --lr_final 7e-6. This pth has a loss of 1.94 for N8.
- rwkv-v5-stp5-N8.pth: 3B rocm-rwkv model starting with the previous one, now at 5 epochs of the N8 dataset with --lr_init 7e-6 --lr_final 7e-6. This pth has a loss of 1.90 for N8.
- rwkv-v5-stp18-N8.pth: 3B rocm-rwkv model starting with the previous one, now at 18 epochs of the N8 dataset with --lr_init 7e-6 --lr_final 7e-6. This pth has a loss of 1.827 for N8, at 13.377 GTokens.
- rwkv-v5-stp32-N8.pth: 3B rocm-rwkv model starting with the previous one, now at 32 epochs of the N8 dataset with --lr_init 7e-6 --lr_final 7e-6. This pth has a loss of 1.810 for N8, at 22.46 GTokens.
- rwkv-v5-stp46-N8.pth: 3B rocm-rwkv model starting with the previous one, now at 46 epochs of the N8 dataset with --lr_init 7e-6 --lr_final 7e-6. This pth has a loss of 1.800 for N8, at 31.874 GTokens.
- rwkv-v5-stp62-N8.pth: 3B rocm-rwkv model starting with the previous one, now at 62 epochs of the N8 dataset with --lr_init 7e-6 --lr_final 7e-6. This pth has a loss of 1.790 for N8, at 42.538 GTokens.
- rwkv-v5-stp76-N8.pth: 3B rocm-rwkv model starting with the previous one, now at 76 epochs of the N8 dataset with --lr_init 7e-6 --lr_final 7e-6. This pth has a loss of 1.780 for N8, at 51.763 GTokens.
- rwkv-v5-stp118-N8.pth: 3B rocm-rwkv model starting with the previous one, now at 118 epochs of the N8 dataset with --lr_init 7e-6 --lr_final 7e-6. This pth has a loss of 1.750 for N8, at 79.508 GTokens.
- rwkv-v5-stp146-N8.pth: 3B rocm-rwkv model starting with the previous one, now at 146 epochs of the N8 dataset with --lr_init 7e-6 --lr_final 7e-6. This pth has a loss of 1.758 for N8, at 97.982 GTokens.
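
The GTokens figures in the N8 entries above can be turned into a rough tokens-per-epoch estimate. The snippet below is only a sketch of that arithmetic, under two assumptions that the list does not state explicitly: the GTokens values are cumulative over the N8 run, and the epoch count of each checkpoint is the number after `stp` in its filename. The names `N8_CHECKPOINTS` and `gtokens_per_epoch` are made up for this example.

```python
# (epochs, loss on N8, GTokens) for the N8 checkpoints that report a token count.
# Assumption: the epoch count is the number after "stp" in the filename,
# and the GTokens values are cumulative over the N8 run.
N8_CHECKPOINTS = {
    "rwkv-v5-stp18-N8.pth": (18, 1.827, 13.377),
    "rwkv-v5-stp32-N8.pth": (32, 1.810, 22.46),
    "rwkv-v5-stp46-N8.pth": (46, 1.800, 31.874),
    "rwkv-v5-stp62-N8.pth": (62, 1.790, 42.538),
    "rwkv-v5-stp76-N8.pth": (76, 1.780, 51.763),
    "rwkv-v5-stp118-N8.pth": (118, 1.750, 79.508),
    "rwkv-v5-stp146-N8.pth": (146, 1.758, 97.982),
}


def gtokens_per_epoch(checkpoints):
    """Return {name: GTokens per epoch since the previous checkpoint}."""
    # Sort by epoch count so consecutive rows are consecutive checkpoints.
    rows = sorted(checkpoints.items(), key=lambda kv: kv[1][0])
    rates = {}
    for (_, (e0, _, g0)), (name, (e1, _, g1)) in zip(rows, rows[1:]):
        rates[name] = (g1 - g0) / (e1 - e0)
    return rates


if __name__ == "__main__":
    for name, rate in gtokens_per_epoch(N8_CHECKPOINTS).items():
        print(f"{name}: {rate:.3f} GTokens/epoch")
```

With the numbers above, every interval comes out at roughly 0.65 to 0.67 GTokens per epoch, which is at least consistent with the cumulative reading of the GTokens column.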