200K Version?
Have you considered training this on the Yi 200K base instead of the 4K model?
Seems like this would be much better for storytelling, especially since your dataset naturally contains very long passages.
And FYI, long-context training is quite doable on a single A100, or even a 48GB GPU, especially if you use Unsloth to train. The training context doesn't have to be anywhere near 200K to get good long-context performance.
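Roughly, the Unsloth QLoRA setup looks something like the sketch below. This is untested and the model name, sequence length, and LoRA hyperparameters are just placeholders, not a tuned recipe:

```python
from unsloth import FastLanguageModel

max_seq_length = 16384  # whatever fits in VRAM; well below 200K is still fine

# Load the 200K base in 4-bit (QLoRA-style) at a reduced training context.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="01-ai/Yi-34B-200K",  # example base; swap in your own
    max_seq_length=max_seq_length,
    load_in_4bit=True,
)

# Attach LoRA adapters; gradient checkpointing is what keeps long
# sequences from OOMing.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing=True,
)
```

From there you can train with your usual TRL/transformers trainer on top of the returned `model`.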
Well, to be honest, OOM is the main enemy stopping me from doing QLoRA on the 200K version. I just took a look at Unsloth, and it does look like exactly the project I need. So, I know what I want to do next.
Again, you can just lower the context size in the config, train at whatever length you can fit, then set it back, and the model will still largely retain its 200K performance.
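Concretely, that config round-trip can be as simple as editing `max_position_embeddings` in the checkpoint's `config.json`. The path and numbers below are illustrative:

```python
import json

cfg_path = "Yi-34B-200K/config.json"  # path to your local checkpoint

with open(cfg_path) as f:
    cfg = json.load(f)

original_ctx = cfg["max_position_embeddings"]  # the full 200K window
cfg["max_position_embeddings"] = 16384         # shrink to what fits in VRAM

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)

# ... run QLoRA training here ...

# Afterwards, restore the full window so inference sees 200K again.
cfg["max_position_embeddings"] = original_ctx
with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```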