Draft model as accelerator for DeepSeek-R1?

#174
by inputout - opened

Is there a draft model compatible with llama.cpp for speculative decoding to accelerate DeepSeek-R1?
I have tested a few, but llama.cpp does not accept them. Is a draft model even possible in principle with DeepSeek-R1?

Meh, perhaps it was just my own negative experience, but when I tested speculative decoding on a different set of models, the quality of the output was actually worse for me. The model generated very low-quality words for the given context, words it would never generate by itself. It felt like chatting with a very small and dumb model, and yet the main model was a 14B model. With that small model loaded in memory, I was wasting more memory for lower quality, and the speed increase that's supposed to come with it? Nope, didn't happen either. In fact, with more stuff loaded in memory, the system was actually slower overall. I'd rather they focus on new, better models, pushing the boundaries of what's possible with smaller models. 😉

No, the quality cannot be degraded by speculative decoding; the output always corresponds to what the main model alone would have produced. If the draft model predicts incorrectly, the main model corrects it. It only affects speed and memory consumption. Speculative decoding is particularly worthwhile for large models.

So, you're basically negating everything I wrote there, huh? Look, I described my own experience, something I saw for myself in action. The quality of the output was objectively much lower with the draft model loaded alongside the big model, and the supposed speed increase was questionable. If I weren't speaking from my own experience, maybe such a flat denial would leave me doubtful, but that's obviously not the case here. Perhaps someone else had better luck with it, but my experience with speculative decoding was very bad. The big model loaded alone (without the smaller draft model) performed much better for me.

Hey, all is well. What I am writing is not a personal opinion and is not based on experience or conjecture; it is a scientific fact ;-) I didn't make this up, the information can be found in the relevant publications:

"Speculative Decoding is a widely used technique to speed up inference for Large Language Models (LLMs) without sacrificing quality. When performing inference, speculative decoding uses a smaller draft model to generate speculative tokens and then uses the target LLM to verify those draft tokens."
https://arxiv.org/html/2402.01528v3

"will speculative decoding in LLMs harm the accuracy of the original model? The simple answer is no."
https://blogs.novita.ai/will-speculative-decoding-harm-llm-inference-accuracy/ (with mathematical proof)
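To make the verification step concrete, here is a minimal sketch of greedy speculative decoding. This is my own toy illustration, not llama.cpp's actual implementation; `speculative_decode`, the stand-in `target`/`draft` functions, and the parameter `k` are invented for this example, and the toy calls the target once per verified token where a real engine would batch those calls into a single forward pass. The point it shows: any draft token the target disagrees with is replaced by the target's own token, so the final output is exactly what the target alone would have produced. (Engines that sample instead of decoding greedily use a rejection-sampling acceptance rule that likewise preserves the target distribution.)

```python
from typing import Callable, List

# A "model" here is just: sequence of token ids -> greedy next token id.
Model = Callable[[List[int]], int]


def speculative_decode(target: Model, draft: Model,
                       prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Generate n_new tokens; the draft proposes k tokens per step, the target verifies."""
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        # 1. Draft model proposes k tokens cheaply.
        proposed, ctx = [], list(out)
        for _ in range(k):
            t = draft(ctx)
            proposed.append(t)
            ctx.append(t)

        # 2. Target model verifies each proposal (a real engine does this
        #    in one batched forward pass; here we call it per token).
        for t in proposed:
            expected = target(out)      # what the target itself would emit
            if t == expected:
                out.append(t)           # accepted
            else:
                out.append(expected)    # rejected: the target's token wins
                break                   # discard the rest of the draft
            if len(out) - len(prompt) >= n_new:
                break
    return out[:len(prompt) + n_new]


if __name__ == "__main__":
    # Toy stand-ins: the target counts upward; the draft is right only most
    # of the time. The speculative output still matches the target exactly.
    def target(seq: List[int]) -> int:
        return (seq[-1] + 1) % 100

    def draft(seq: List[int]) -> int:
        return (seq[-1] + 1) % 100 if seq[-1] % 5 else 0  # sometimes wrong

    prompt = [1]
    spec = speculative_decode(target, draft, prompt, n_new=10)

    plain = list(prompt)
    for _ in range(10):
        plain.append(target(plain))

    print(spec == plain)  # True: identical output, regardless of draft quality
```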

In your case, a 14B main model is already very small; I suspect the technique only tends to be worthwhile from a certain size (32B/70B/etc.) upwards.
In my experience it pays off enormously, so it would be great if there were a draft model for R1. But I am not sure whether a draft model is even possible in principle with DeepSeek R1, hence the question here.
