No, this is promising

#1 by CultriX - opened

@sometimesanotion No, this is promising :)

Now you're talking! Now that you know how LoRAs can help merges of text generation models, I invite you to consider using this LoRA on a base model:

https://huggingface.co/sometimesanotion/LoRA-256-Base-Qwenvergence

Because the LoRA captures the difference between a base model and a target model, you want the base to differ in ways that your other models, especially the consensus merges in anything TIES-based, can work off of. I know you can put the rest together. You're actually the one who got me started thinking about custom base models in the first place.
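Roughly, a TIES recipe with a LoRA-shifted base might look like the sketch below. The finetune names are placeholders, the densities and weights are just illustrative, and it assumes mergekit's `model+lora` syntax for applying an adapter on the fly:

```yaml
# Sketch only: a TIES merge against a base that has the LoRA applied on the fly.
# The two finetunes are placeholders; densities/weights are illustrative.
models:
  - model: your-org/finetune-A          # placeholder
    parameters:
      density: 0.5
      weight: 0.5
  - model: your-org/finetune-B          # placeholder
    parameters:
      density: 0.5
      weight: 0.5
merge_method: ties
# Assumes mergekit's "model+lora" syntax to overlay the LoRA on a stock base:
base_model: Qwen/Qwen2.5-14B+sometimesanotion/LoRA-256-Base-Qwenvergence
parameters:
  normalize: true
dtype: bfloat16
```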

My recipe here is complete. https://huggingface.co/sometimesanotion/Base-Chocolatine-2-14B-Instruct-v2.0b3

Thank you, I'll take a look!

Ok mate, I need help haha: it basically beats everything across the board but loses huge on IFEval... If I remember correctly, you found a way to minimize the IFEval loss, right? :)

Yes, here's the deal! Think of it as being like frozen parameters in fine-tuning. By injecting the same LoRA into the members of a merge, a tiny fraction of the model has no differences between them, which means that in most merge styles that part stays the same while everything else merges.
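As a rough illustration of what injecting the same LoRA into several members can look like in a model_stock recipe (every model and LoRA name here is a placeholder, and it again assumes mergekit's `model+lora` on-the-fly adapter syntax):

```yaml
# Sketch only: the same LoRA is overlaid on several of the stock's members,
# so that sliver of weights is identical across them and largely survives the merge.
models:
  - model: your-org/reasoning-finetune+your-org/shared-lora-128   # placeholders
  - model: your-org/prose-finetune+your-org/shared-lora-128       # placeholders
  - model: your-org/instruct-finetune                             # placeholder, left un-overlaid
merge_method: model_stock
base_model: Qwen/Qwen2.5-14B
dtype: bfloat16
```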

I guess the question then becomes: how do I know which part is responsible for that horrible IFEval loss, so I can target it directly without causing damage by freezing the parts that made it top out on the other benchmarks? One would almost start to think we're working at the frontier of a new, rapidly evolving science that isn't fully understood yet and changes day to day, with all this nonsense! :)

The way I use LoRAs to avoid IFEval loss in model_stock merges involves LoRAs of varying ranks, usually between 32 and 128, using the higher ranks where a model's contribution is moderated. The idea is that the majority of the stock's models carry overlays of some small sliver of the same model. Is that sliver the part that boosts IFEval?

I haven't dug into all the details, but I did notice the large difference in compute between extracting a LoRA and doing a DELLA merge. They determine tensor rankings with the same algorithm, yet the results differ between a rank-128 LoRA and a della model entry at density 0.075 and weight 1.0. There's some analysis to be done here that you can carry over into the other merge methods.
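For reference, the della entry I'm comparing against that rank-128 LoRA looks roughly like this (base and contributing model are placeholders):

```yaml
# Sketch only: a della entry at very low density but full weight, which keeps
# only the highest-ranked deltas from the contributing model.
models:
  - model: your-org/contributing-finetune   # placeholder
    parameters:
      density: 0.075
      weight: 1.0
merge_method: della
base_model: Qwen/Qwen2.5-14B
dtype: bfloat16
```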
