---
tags:
- rocm
- amd-gpus
- amd-ai
- rocm-ai
- rocm-rwkv
- 3B-rwkv
---
3B rocm-rwkv pth record. This 3B is a little different from the usual 3B: it has 48 layers, an embedding size of 2048, and a context length (ctx) of 16384 (I think all of the pth files share the same ctx size).
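For reference, below is a minimal, hedged sketch of how one of these .pth checkpoints might be loaded for inference with the community `rwkv` pip package. That package is not named in this record, and the tokenizer choice (the RWKV World vocab), the checkpoint name, and the strategy string are assumptions, so treat it as illustrative only.

```python
# Hypothetical loading sketch, assuming the `rwkv` pip package (pip install rwkv).
# The checkpoint path, tokenizer, and strategy string are placeholders; on AMD GPUs
# the ROCm build of PyTorch maps the "cuda" device string to the HIP backend.
import os
os.environ["RWKV_JIT_ON"] = "1"
os.environ["RWKV_CUDA_ON"] = "0"   # keep the custom CUDA kernel off on ROCm

from rwkv.model import RWKV
from rwkv.utils import PIPELINE, PIPELINE_ARGS

# 3B record above: 48 layers, n_embd=2048, ctx 16384 (the loader reads the shapes
# directly from the weights, so they do not need to be passed explicitly).
model = RWKV(model="rwkv-v5-final-N8.pth", strategy="cuda fp16")
pipeline = PIPELINE(model, "rwkv_vocab_v20230424")  # assuming the RWKV World tokenizer

out = pipeline.generate(
    "The quick brown fox",
    token_count=64,
    args=PIPELINE_ARGS(temperature=1.0, top_p=0.7),
)
print(out)
```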
- rwkv-final-chnk5.pth: 3B rocm-rwkv model trained on SlimPajama chunks 1-5, with a loss of 2.456.
- rwkv-final-chnk17.pth: 3B rocm-rwkv model trained on SlimPajama chunks 1-10 for the first epoch and additional training on chunks 1-7 after the first epoch, with a loss of 2.281.
- rwkv-code39-16012024.pth: 3B rocm-rwkv model trained on SlimPajama chunks 1-10 for the first epoch and additional training on chunks 1-8 after the first epoch, plus a little bit of code. This pth has a loss of 1.174 for code alone and 2.26 for text.
- rwkv-HHMIX-63x1-47-29012024.pth: 3B rocm-rwkv model starting from rwkv-code39-16012024.pth plus a mix of multi-language and code. This model has a loss value of 2.065 for the code+multilingual dataset.
- rwkv-coder-63x1-104-29012024.pth: 3B rocm-rwkv model starting from rwkv-HHMIX-63x1-47-29012024.pth plus more code (71.21 GTokens of code). This model has a loss value of 1.090 for the code dataset.
- rwkv-final_HHMIX_chuk3.pth: 3B rocm-rwkv model starting from rwkv-coder-63x1-104-29012024.pth plus a mix of multi-language and code. This model has a loss value of 1.836 for the code+multilingual dataset.
- rwkv-1epoch_N8_wrong_lr.pth: 3B rocm-rwkv model starting from the previous one (I think I may have added more code or random multilingual data, I don't remember), plus an additional 3 chunks of my mix of multi-language (random) and code, plus 3 chunks of my dataset soup: multilingual (only languages with characters different from the English or Latin/Greek alphabets, e.g. Japanese, Cherokee, etc.) + code + math + instruct + chain of thought. This model has 1 epoch (step) on the N8 dataset, but with --lr_init 5e-7 --lr_final 5e-8. This pth has a loss of 1.978 for N8.
- rwkv-v5-stp2-N8.pth: 3B rocm-rwkv model starting from the previous one, plus two epochs of the N8 dataset with --lr_init 7e-6 --lr_final 7e-6 (an example fine-tuning invocation with these flags is sketched after this list). This pth has a loss of 1.94 for N8.
- rwkv-v5-stp5-N8.pth: 3B rocm-rwkv model starting from the previous one, but now with 5 epochs of the N8 dataset with --lr_init 7e-6 --lr_final 7e-6. This pth has a loss of 1.90 for N8.
- rwkv-v5-stp18-N8.pth: 3B rocm-rwkv model starting from the previous one, but now with 18 epochs of the N8 dataset with --lr_init 7e-6 --lr_final 7e-6. This pth has a loss of 1.827 for N8 after 13.377 GTokens.
- rwkv-v5-stp32-N8.pth: 3B rocm-rwkv model starting from the previous one, but now with 32 epochs of the N8 dataset with --lr_init 7e-6 --lr_final 7e-6. This pth has a loss of 1.810 for N8 after 22.46 GTokens.
- rwkv-v5-stp46-N8.pth: 3B rocm-rwkv model starting from the previous one, but now with 46 epochs of the N8 dataset with --lr_init 7e-6 --lr_final 7e-6. This pth has a loss of 1.800 for N8 after 31.874 GTokens.
- rwkv-v5-stp62-N8.pth: 3B rocm-rwkv model starting from the previous one, but now with 62 epochs of the N8 dataset with --lr_init 7e-6 --lr_final 7e-6. This pth has a loss of 1.790 for N8 after 42.538 GTokens.
- rwkv-v5-stp76-N8.pth: 3B rocm-rwkv model starting from the previous one, but now with 76 epochs of the N8 dataset with --lr_init 7e-6 --lr_final 7e-6. This pth has a loss of 1.780 for N8 after 51.763 GTokens.
- rwkv-v5-stp118-N8.pth: 3B rocm-rwkv model starting from the previous one, but now with 118 epochs of the N8 dataset with --lr_init 7e-6 --lr_final 7e-6. This pth has a loss of 1.750 for N8 after 79.508 GTokens.
- rwkv-v5-stp146-N8.pth: 3B rocm-rwkv model starting from the previous one, but now with 146 epochs of the N8 dataset with --lr_init 7e-6 --lr_final 7e-6. This pth has a loss of 1.758 for N8 after 97.982 GTokens.
- rwkv-v5-final-N8.pth: 3B rocm-rwkv model starting from the previous one, but now with the full N8 dataset epoch with --lr_init 3e-8 --lr_final 1e-8. This pth has a loss of 1.73 for the full N8 dataset with 106.098327552 GTokens.
- rwkv-3B-stp634-N8-3.pth: 3B rocm-rwkv model starting from the previous one, but now with 104 GTokens of the N8-3 dataset at ctx=4k. This pth has a loss of 1.92 for the N8-3 dataset.
- rwkv-3B-4K-stp802-N8-3.pth: Using rwkv-3B-stp634-N8-3.pth, I added 7 more GTokens of N8-3.
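For context on the --lr_init/--lr_final values quoted above, here is a hedged sketch of what resuming one of these checkpoints might look like with a BlinkDL RWKV-LM style train.py (which these flags suggest). The data file, batch size, device count, and other values are illustrative assumptions, not a record of the exact commands used.

```python
# Hypothetical resume/fine-tune invocation, assuming an RWKV-LM style train.py;
# all paths and the data file below are placeholders.
import subprocess

subprocess.run([
    "python", "train.py",
    "--load_model", "rwkv-v5-stp2-N8.pth",      # checkpoint to resume from
    "--proj_dir", "out-3B-N8",                  # where new .pth files are saved
    "--data_file", "N8_text_document",          # placeholder dataset path
    "--data_type", "binidx",
    "--n_layer", "48", "--n_embd", "2048",      # 3B architecture described above
    "--ctx_len", "16384",
    "--lr_init", "7e-6", "--lr_final", "7e-6",  # constant LR used for the N8 steps
    "--micro_bsz", "8",                         # assumed; not stated in the record
    "--precision", "bf16",
    "--accelerator", "gpu", "--devices", "8",   # assumed GPU count
], check=True)
```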
7B rocm-rwkv pth record: I called this model Tlanuwa since I added an extra training pass focusing on Cherokee after each run.
- rwkv-7BTlanuwa-1k-soup91-Final.pth: 7B model, 32 layers, embd=4096, ctx=16384. It has all the same training as the 3B, but only SlimPajama chunks 1-9 (probably more than 2T tokens), with a loss of 2.834 with respect to the full soup91 dataset. I am working on getting a lower loss.
9B rocm-rwkv pth record: 40 layers, embd=4096, ctx=16384. I am calling this model Quetzal since it is a green model that flies, and I am adding an extra training pass focusing on Spanish and the Axolotl-Spanish-Nahuatl dataset after each run.
- rwkv-9Q-stp101-N8.pth: 9B rocm-rwkv model trained on SlimPajama chunks 1-10 for the first epoch, then additional training on chunks 1-2 and a mix of multi-language and code; after that I am using the N8 dataset. I am currently 4.222 GTokens into the N8 dataset. This pth has a loss of 1.904 on the N8 dataset.
- rwkv-9Q-1k-stp307-1k-N8.pth: 9B rocm-rwkv model trained on SlimPajama chunks 1-10 for the first epoch, then additional training on chunks 1-2 and a mix of multi-language and code; after that I am using the N8 dataset. I am currently 12.706 GTokens into the N8 dataset. This pth has a loss of 1.871 on the N8 dataset.
- rwkv-9Q-Soup91-step298.pth: Using rwkv-9Q-1k-stp307-1k-N8.pth, I added 298 epoch steps of my data soup (code + math + instruct + chain of thought), 12.283 GTokens, with a loss of 2.242.
- rwkv-9Q-Soup91-Final.pth: Using rwkv-9Q-Soup91-step298.pth, I continued from step 298 to 1035 epoch steps of my data soup (code + math + instruct + chain of thought), 42.733 GTokens, with a loss of 2.222.
- rwkv-9Q-stp1447-N8.pth: Using rwkv-9Q-Soup91-Final.pth, I added 1447 steps of N8, 59.733 GTokens, with a loss of 1.827.
- rwkv-9Q-Final-N8-1k.pth: Using rwkv-9Q-stp1447-N8.pth, I added 2569 steps of N8, which are 106 GTokens, with a loss of 1.801.
- rwkv-9Q-1k-stp706-N8-0.pth: Starting from the previous checkpoint, I added 706 new steps and 29.13 GTokens of N8-0, with a loss of 1.78.
- rwkv-9Q-4k-stp248.pth: Using rwkv-9Q-1k-stp706-N8-0.pth, I added 2048 new steps with 40.66 GTokens, reaching a loss of 1.717 on the Nathan-0 dataset at ctx=4096.
- rwkv-9Q-16k-step6-0-4.pth: Using rwkv-9Q-4k-stp248.pth, I added N-0 and N-8 at ctx=16384; loss = 1.65. This model seems to be able to chat better.
- rwkv-9Q-step607-N8-3.pth: Using rwkv-9Q-16k-step6-0-4.pth, I added 100 GTokens of N8-3.
- rwkv-9Q-4k-stp662-N8-3.pth: Using rwkv-9Q-step607-N8-3.pth, I added 10 more GTokens of N8-3.