RAVE v3 training
Hey, first of all thanks for all those amazing models, they sound fantastic!
I've had trouble training with the v3 model: it always produced silence for me, whatever I did,
no matter what input signal or arguments I used.
Is there some special trick that isn't in the documentation to get v3 working? (Something with the snake activation, maybe?)
thanks!
Hi, are you also using the victor-shepardson/RAVE fork, are you doing transfer learning, and if so, from which base model?
I have noticed the latent space collapse in certain cases, but I can't recall ever getting pure silence. It's strange that this would happen with different datasets.
I might have an idea if you can provide more information about what you've tried.
Hey, thanks for the quick reply!
No, I'm not using the fork; I'll check that out first, thanks!
Hey, me again. I wanted to open an issue directly on the GitHub repo, but issues are sadly disabled. I assume you are the maintainer of this fork?
The additions sound really great. Training to 1,000,000 steps went smooth as butter, but when it hit phase 2 of training, it suddenly became very, very slow.
Each epoch now reportedly takes 42 hours, which wasn't the case with the original RAVE version. Is there something I need to change? Also, TensorBoard hasn't shown any updates in 12 hours; it seems stuck at 1,000,000 (although training continues, slowly but surely).
Is it suddenly switching to the CPU in phase 2, or is this normal behaviour?
Thanks in advance for any clues!
Hm, not sure what would cause this. Phase 2 is normally slow but not that slow. What GPU do you have? Which OS? Can you post the command you ran to train? Screenshot from nvidia-smi?
Again thanks for the quick reply! :)
I use an NVIDIA RTX 2070.
The command was:
rave train --config v3 --config noise --config causal --db_path ./output --out_path ./model --name violin --channels 1 --gpu 0 --augment gain --augment mute
(I use WSL on Windows, and Linux had problems detecting my GPU, so I had to set it manually with the argument.)
In phase 1 I got around 5-10 it/s; now in phase 2 it says:
28/315 [4:04:55<41:50:33, 524.82s/it, v_num=0], so over 500 seconds per iteration now.
I just noticed that the VRAM on my RTX 2070 is almost full; maybe that is the bottleneck?
Thinking about it, that would make sense: maybe my lousy 8 GB of VRAM aren't enough for v3?
Here is nvidia-smi:
Mon Nov 18 11:39:51 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 566.14 Driver Version: 566.14 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Driver-Model | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 2070 WDDM | 00000000:04:00.0 On | N/A |
| 69% 61C P2 87W / 175W | 7924MiB / 8192MiB | 100% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 1924 C+G ...crosoft\Edge\Application\msedge.exe N/A |
| 0 N/A N/A 3616 C+G ...__8wekyb3d8bbwe\WindowsTerminal.exe N/A |
| 0 N/A N/A 7448 C+G ...aam7r\AcrobatNotificationClient.exe N/A |
| 0 N/A N/A 8440 C+G C:\Windows\explorer.exe N/A |
| 0 N/A N/A 12044 C+G ...5n1h2txyewy\ShellExperienceHost.exe N/A |
| 0 N/A N/A 12488 C+G ....Search_cw5n1h2txyewy\SearchApp.exe N/A |
| 0 N/A N/A 13712 C+G ...m Files\Mozilla Firefox\firefox.exe N/A |
| 0 N/A N/A 14184 C+G ...t.LockApp_cw5n1h2txyewy\LockApp.exe N/A |
| 0 N/A N/A 14840 C+G ...CBS_cw5n1h2txyewy\TextInputHost.exe N/A |
| 0 N/A N/A 15284 C+G ...1.0_x64__8wekyb3d8bbwe\Video.UI.exe N/A |
| 0 N/A N/A 17016 C+G ...(x86)\CorsairLink4\CorsairLink4.exe N/A |
| 0 N/A N/A 17036 C+G ...on\HEX\Creative Cloud UI Helper.exe N/A |
| 0 N/A N/A 19052 C+G ...ejd91yc\AdobeNotificationClient.exe N/A |
| 0 N/A N/A 20032 C+G ...b3d8bbwe\Microsoft.Media.Player.exe N/A |
| 0 N/A N/A 21088 C+G ...m Files\Mozilla Firefox\firefox.exe N/A |
| 0 N/A N/A 21864 C+G ...siveControlPanel\SystemSettings.exe N/A |
| 0 N/A N/A 23096 C+G ...cal\Microsoft\OneDrive\OneDrive.exe N/A |
+-----------------------------------------------------------------------------------------+
Thanks again for taking the time! Appreciated!
Hm, I'm surprised that v3 runs at all on 8 GB with default settings. That makes me think it's somehow swapping with main memory. I've never encountered that, but maybe it's a Windows thing?
You could try reducing --n_signal to something lower than 126976 (make sure it stays a multiple of 2048, though), or reducing batch_size to something less than 8.
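For reference, a small sketch of how you might pick a reduced value that keeps the 2048-sample alignment mentioned above. (The helper name and the scale factors here are my own illustration; only the default of 126976 and the multiple-of-2048 constraint come from this thread.)

```python
# Snap a scaled-down n_signal to the nearest lower multiple of 2048,
# which is the alignment this thread says --n_signal must respect.
BLOCK = 2048
DEFAULT_N_SIGNAL = 126976  # the default mentioned above (62 * 2048)

def reduced_n_signal(factor=0.5, block=BLOCK, default=DEFAULT_N_SIGNAL):
    """Scale n_signal down by `factor`, rounded down to a multiple of `block`."""
    return max(block, (int(default * factor) // block) * block)

print(reduced_n_signal())      # 63488  (31 * 2048, half the default)
print(reduced_n_signal(0.25))  # 30720  (15 * 2048, a quarter of the default)
```

You would then pass the result on the command line, e.g. --n_signal 63488 (a hypothetical value; any lower multiple of 2048 should work).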
You say it doesn't slow down in the main repo, though? Is that with the same configs? If so, I must have done something that increases memory use. Afraid I don't have time to hunt that down right now, though.
Hey, yeah, that makes sense. The most reasonable thing is to upgrade the computer, I guess...
About the original-repo problem: sorry, false alarm! I used that on a different machine (with more VRAM), which I didn't consider as being the difference during my first report (but it probably was; not the software).
If anything, your repo ran better in phase 1 than the original!
One last thing while we are at it:
Did you change anything about the prior training in your fork? I've read a lot that it's broken and not really working. (I didn't get it to work myself on a first quick try with the original repo.)
Again, many thanks!
I haven't been working with prior models; the last time I tried, it didn't work well, if I recall correctly.
Ok, so the same experience as everyone else ;)
Thanks again for your time and patience! You helped me a lot.