R1 fine-tuning
What if fine-tuning is done on R1 and used for this model?
I think you would have to be more specific. Using the same datasets on the DeepSeek-R1 to Qwen2.5 32B R1 distillation? Or some kind of offline logit distillation?
I'm not sure training on the 32B R1 would be worth it -- it'd likely catastrophically forget too much to keep doing the fancy CoT, and the only benefit would be a different flavor of prose. Could be interesting, but I'm not sure it's worth spending the money on.
@Kearm @Fizzarolli On the 671B model, probably either V3 or R1.
Would love to get the money for that one lol
@Fizzarolli Would love to do it from 17th Feb to 30 Feb, if it takes that much time to tune. Probably use it to distill a smaller 104B model like Command R+, maybe.
You do know that a machine good enough for this would be like $23/hr, right? Like, if you have that kind of money, please collaborate; I would love to actually do something with the big DeepSeek models.
@Fizzarolli I will have a dual AMD system (96 cores) with around 2TB of RAM and 8x MI300X.
Sorry about the previous comment, uh... that's surely not enough to train DeepSeek on, even with 1536GB of VRAM. I'm not sure that can even be done at anything higher than 2k context, and distilling is expensive, especially for a 104B model like Command R Plus 0824. 671B on its own at FP8 takes ~655GB of VRAM; at FP16 that's over 1.4TB. No way that's going to fit with training on 8x MI300X.
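For anyone who wants to sanity-check that math, here's a rough weights-only estimate (a sketch assuming 671B parameters; the exact figures above differ a bit depending on GB vs GiB and what's counted, and training adds gradients, optimizer states, and activations on top of this):

```python
# Back-of-the-envelope weight memory for a 671B-parameter model.
# Weights only: training also needs gradients, optimizer states, and
# activations, which multiply the requirement several times over.
PARAMS = 671e9

def weight_gib(num_params: float, bytes_per_param: float) -> float:
    """Weight memory in GiB at a given precision."""
    return num_params * bytes_per_param / 1024**3

for name, bytes_per_param in [("FP8", 1), ("FP16/BF16", 2)]:
    print(f"{name}: ~{weight_gib(PARAMS, bytes_per_param):,.0f} GiB")
# prints roughly: FP8: ~625 GiB, FP16/BF16: ~1,250 GiB
```

Even at FP8 the weights alone eat most of a 1536GB pool before a single gradient is stored.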
@ProdeusUnity I can arrange a 4x MI300A and an 8x H100 system too.
These AI labs did it, I think.
Not to rain on your parade, but DeepSeek-type output, and distillation training, aren't going to benefit an RP model. It's likely a waste of time and money (but I'd love to be proven wrong, so waste away).
@SerialKicked R1 is actually quite good for RP, just really finicky to use because the UIs aren't properly set up for it. If you're willing to be patient with long rerolls and not being able to reasonably use continues, it can work wonders.
That being said..... yeah I'm not convinced this is a great idea either. But with that said, @shivamb250 If you're willing to prove the concept and show it to us, we might be more open to considering the offer in the future. But right now, the math just isn't mathing.
@inflatebot I tried (not the fine-tune, trying out the RP, I mean). I'm working on a front-end for mixed use (standard and RP), and I'm currently dealing with R1-type models to integrate the "thinking" wall-o-text into the UI in a non-disruptive way. In my experience, the R1 distill models I tried (Qwen 32B, 14B and some abliterated thing) tend to give up quickly on the "thinking" / chain-of-thought part the moment the system prompt gets a bit complex (a hallmark of RP). That's the first problem: inconsistency. It can work decently for short creative tasks (the "write me a short story" type), but I can't really tell if the output is any better than if it wasn't babbling to itself for hundreds of tokens, tbh. (And the gen time / reroll ratio probably ain't in favor of R1 comparatively.)
Second problem: R1 distill datasets are overwhelmingly technical/science/logic stuff, which doesn't really apply to this model. You're not using Ink to do math (I think :p). Where it'd actually be interesting is if someone / some group were to make an R1-type dataset but specifically for RP actions. Something like:
user: An ice dragon is attacking you!
bot: [thinking] so I need to kill this beast. Beast is ice, so maybe weak to fire. Let's check inventory. Let's check my spellbook. [end_think] I cast fireball!
Then, yeah, I'd absolutely love to see that, but until then. Doubts.
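Purely as an illustration of what one such sample could look like on disk (a hypothetical sketch: the JSONL layout, field names, and file name are made up here, and the <think> tags just mirror the R1-style delimiters):

```python
# Hypothetical JSONL sample for an RP-flavored reasoning dataset.
# Field names and the <think>...</think> convention are illustrative only.
import json

sample = {
    "messages": [
        {"role": "user", "content": "An ice dragon is attacking you!"},
        {
            "role": "assistant",
            "content": (
                "<think>I need to deal with this beast. It's an ice dragon, "
                "so probably weak to fire. Check inventory, check spellbook... "
                "Fireball is prepared.</think>"
                "I cast Fireball!"
            ),
        },
    ]
}

with open("rp_reasoning.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```

The hard part isn't the format, it's producing and curating tens of thousands of traces where the "thinking" actually tracks the scene state instead of rambling.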
Mayhaps thou is working on something.
WHOOPS didn't mean to close this, clicked the wrong button, sorry
In my experience, the R1 distill models I tried (Qwen 32B, 14B and some abliterated thing) tend to give up quickly on the "thinking" / chain-of-thought part the moment the system prompt gets a bit complex (a hallmark of RP).
Yeah, existing distills are kind of a loss in my experience too. But really that's more because they're... bad distills. Using synthetic data isn't "distillation" and I'm annoyed with this redefinition that's been happening as of late. Actual distillation is a different process entirely, a much more effective and expensive one.
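To make the distinction concrete, here's a minimal sketch of what logit-level distillation actually optimizes (illustrative PyTorch, not anyone's published recipe): the student is trained to match the teacher's full output distribution, not just text the teacher happened to sample.

```python
# Illustrative logit-distillation step: the student matches the teacher's
# softened token distribution via KL divergence, rather than doing SFT on
# sampled teacher outputs. Toy tensors stand in for real model forward passes.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    t = temperature
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    teacher_p = F.softmax(teacher_logits / t, dim=-1)
    # batchmean KL, scaled by t^2 as in Hinton et al. (2015)
    return F.kl_div(student_logp, teacher_p, reduction="batchmean") * t * t

# Toy shapes: (batch, sequence, vocab)
student_logits = torch.randn(2, 16, 32000, requires_grad=True)
teacher_logits = torch.randn(2, 16, 32000)  # would come from the frozen teacher
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()
```

The expensive part is that every training token needs a forward pass through (or stored logits from) the 671B teacher, which is exactly why nobody does this casually.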
It can work decently for short creative tasks (the "write me a short story" type), but I can't really tell if the output is any better than if it wasn't babbling to itself for hundreds of tokens, tbh. (And the gen time / reroll ratio probably ain't in favor of R1 comparatively.)
Also due to frontend shenanigans I had a hard time getting a 1:1 test of this. If I could ask it "hey don't do any reasoning here at all" for a gen so I could get an apples-to-apples comparison, it would really help my opinion of the whole R1 line become more concrete.
Second problem: R1 distill datasets are overwhelmingly technical/science/logic stuff, which doesn't really apply to this model. You're not using Ink to do math (I think :p). Where it'd actually be interesting is if someone / some group were to make an R1-type dataset but specifically for RP actions.
On the one hand, see above. But on the other hand? Technical knowledge, if the data is of good quality and the ratio is well-balanced, can benefit an RP model greatly. That knowledge generalizes shockingly far, to the point where many of the best RP models are tunes and merges which integrate a significant amount of technical knowledge (an observation which influenced my own Mag Mell, which has also done really well; God, I sound like a one-hit wonder at this point lmao).
But that's the key phrase: "if the data is of good quality and well-balanced." I don't think so-called "offline distillation" fine-tuning sets are in this category. We'd need to see a proper logit distillation to really see what Whale Brain can do for smaller models.
I would also be totally down for some RP-flavored reasoning datasets. That could only help matters, I think.
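On the "good quality and well-balanced ratio" point, for what it's worth, the mixing itself is the easy part; a minimal sketch with Hugging Face datasets (the data files here are placeholders, and the 70/30 split is just an example, not a recommendation):

```python
# Sketch: blending RP data with technical/knowledge data at a fixed ratio.
# File names and the 70/30 ratio are placeholders for illustration.
from datasets import load_dataset, interleave_datasets

rp = load_dataset("json", data_files="rp_samples.jsonl", split="train")
tech = load_dataset("json", data_files="technical_samples.jsonl", split="train")

mixed = interleave_datasets(
    [rp, tech],
    probabilities=[0.7, 0.3],  # e.g. 70% RP, 30% technical/science/CS
    seed=42,
    stopping_strategy="all_exhausted",
)
```

Finding the right ratio, and data good enough to be worth mixing in, is where all the actual work lives.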
To be fair, both Mag Mell and Nitral's Captain series have been some of the most amazing RP models I have ever tested. All of them contain a good amount of science, medical, and CS data. I do believe it helps the model apply some logic in situations where it needs to think a bit before spewing out tokens.
I am rather curious whether the Llama R1 "Distill" and Qwen "Distill" models can be used as a means to improve the intelligence of the model and help it make smarter decisions in terms of consistency.
Yeah, existing distills are kind of a loss in my experience too. But really that's more because they're... bad distills. Using synthetic data isn't "distillation" and I'm annoyed with this redefinition that's been happening as of late. Actual distillation is a different process entirely, a much more effective and expensive one.
Don't get me started on the redefinition of terms to fit the latest hype narrative. I perfectly get that those "distill" models are just mimicking the process. No reinforcement learning, no actual distillation happened; it's just CoT prompting (with the prompting being implied). "As of late" is a nice way to put it; it's been rampant since ChatGPT, really. It's just getting worse. Now they're even dumbing down the definition of AGI just so they can "achieve it" faster, lol.
Also due to frontend shenanigans I had a hard time getting a 1:1 test of this. If I could ask it "hey don't do any reasoning here at all" for a gen so I could get an apples-to-apples comparison, it would really help my opinion of the whole R1 line become more concrete.
Generally, changing the first line in the system prompt to remove the "think step by step" bit can (sometimes) do that, but it's not really 1:1 anymore, and not consistent. A grammar approach preventing the output of the first [think] token might work? I haven't tried that one yet, been busy on other fronts. And yeah, no current solution makes any of that any easier yet (hopefully that changes).
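One other workaround I've seen for the apples-to-apples problem: prefill an already-closed think block at the start of the assistant turn, so the model believes the reasoning is done and answers directly. A sketch assuming a Transformers setup and a distill that uses <think>/</think> delimiters (the checkpoint name is just an example, and some chat template versions already open a <think> tag for you, so this needs checking per model):

```python
# Sketch: force a "no reasoning" generation by prefilling an empty think
# block. Assumes the model uses <think>...</think> delimiters; depending on
# the chat template version, the template may already open <think> for you,
# in which case this needs adjusting.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "An ice dragon is attacking you!"}]

prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
prompt += "<think>\n\n</think>\n\n"  # pretend the CoT already happened

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```

Not a perfect 1:1 either (the model was trained expecting its own reasoning in context), but it's closer than rerolling and hoping.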
On the one hand, see above. But on the other hand? Technical knowledge, if the data is of good quality and the ratio is well-balanced, can benefit an RP model greatly. That knowledge generalizes shockingly far, to the point where many of the best RP models are tunes and merges which integrate a significant amount of technical knowledge (an observation which influenced my own Mag Mell, which has also done really well; God, I sound like a one-hit wonder at this point lmao).
Oh, I'm not trying to denigrate "knowledge" data for RP models. I understand that fine-tuning is inherently destructive, and that you absolutely need to balance it out to make sure the base model doesn't forget quite relevant knowledge. I'm not a fine-tuner and you know better than me in that domain; I've just used a lot of models, both professionally and for fun. I agree that models that can still score decently in knowledge evals (on top of IFEval) are a lot less prone to logically inconsistent responses. Heck, old "un-modded" Mistral 22B could be absolutely wild... well, for 2 whole messages, before it started copy-pasting the exact same sentence structure and phrasing forever (but that's a Mistral thing, it doesn't detract from the point).
I vaguely remember I had a point starting this post, but I'm two beers in and rambling. Oh right! Something along the lines of a roleplay-flavored R1-type dataset would at least make more sense than a dataset that's just math equations. Also, RP and knowledge aren't necessarily exclusive; you can do both in the same dataset, as long as it's "in context". Of course, I pity the sanity of whoever wants to dedicate themselves to such a gargantuan task :D
Cheers!
We can swirl that around a little bit, see if anything comes of it. Thanks for your input! @SerialKicked
Of course, I pity the sanity of whoever wants to dedicate themselves to such a gargantuan task :D
Good intuition! I AM insane!