Benchmarking Assisted Generation with Gemma 3 and Qwen 2.5: A Code-First Guide
Published March 12, 2025
In this blog post, we will explore the performance of assisted generation using the newly released Gemma 3 (27B), with Qwen 2.5 (0.5B) acting as the assistant model. Assisted generation leverages a smaller model to boost the throughput of a larger model. Pretty cool, right? Let’s dive into the code and results.
What is Assisted Generation?
Assisted generation (also known as speculative decoding) uses a smaller, faster model to draft candidate tokens that the larger model then verifies in parallel, improving throughput without sacrificing quality. Curious? Check out Hugging Face’s detailed explanation.
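In Transformers, the whole trick boils down to one extra argument: you pass the draft model to generate() as assistant_model, along with both tokenizers, since Gemma 3 and Qwen 2.5 do not share a vocabulary. Here is a minimal sketch using the same checkpoints as the benchmark below (the full script uses the chat template and proper timing):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.models.gemma3 import Gemma3ForCausalLM

large_tok = AutoTokenizer.from_pretrained("google/gemma-3-27b-it")
small_tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
large = Gemma3ForCausalLM.from_pretrained("google/gemma-3-27b-it", torch_dtype=torch.bfloat16).to("cuda")
small = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", torch_dtype=torch.bfloat16).to("cuda")

# Plain text prompt to keep the sketch short; the benchmark below uses the chat template
inputs = large_tok("Write me a long essay on Deep Learning", return_tensors="pt").to("cuda")

# assistant_model switches on assisted generation; the two tokenizers let Transformers
# translate the draft tokens between the two vocabularies
out = large.generate(**inputs, max_new_tokens=64, assistant_model=small,
                     tokenizer=large_tok, assistant_tokenizer=small_tok)
print(large_tok.decode(out[0], skip_special_tokens=True))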
The Setup
We’ll benchmark generation speed with and without assistance using PyTorch’s benchmark utilities and Hugging Face Transformers. Here’s the full script:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch
from torch.utils import benchmark
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.models.gemma3 import Gemma3ForCausalLM


def load_models():
    # Load Gemma 3 (27B)
    large_ckpt = "google/gemma-3-27b-it"
    large_model = Gemma3ForCausalLM.from_pretrained(large_ckpt, torch_dtype=torch.bfloat16).to("cuda")
    large_tokenizer = AutoTokenizer.from_pretrained(large_ckpt)

    # Load Qwen 2.5 (0.5B)
    small_ckpt = "Qwen/Qwen2.5-0.5B-Instruct"
    small_model = AutoModelForCausalLM.from_pretrained(small_ckpt, torch_dtype=torch.bfloat16).to("cuda")
    small_tokenizer = AutoTokenizer.from_pretrained(small_ckpt)

    return large_tokenizer, small_tokenizer, small_model, large_model


def generate_large(large_model, model_inputs):
    # eos_token_id=-1 disables early stopping, so every run produces the full 256 new tokens
    large_model.generate(**model_inputs, do_sample=False, max_new_tokens=256, eos_token_id=-1)


def generate_assisted(large_model, small_model, tokenizer, assistant_tokenizer, model_inputs):
    # Both tokenizers are passed because Gemma 3 and Qwen 2.5 use different vocabularies
    large_model.generate(
        **model_inputs, do_sample=False, max_new_tokens=256, eos_token_id=-1,
        assistant_model=small_model, tokenizer=tokenizer, assistant_tokenizer=assistant_tokenizer
    )


if __name__ == "__main__":
    large_tokenizer, small_tokenizer, small_model, large_model = load_models()

    # Use the same default cache implementation in both models for a fair comparison
    small_model.generation_config.cache_implementation = None
    large_model.generation_config.cache_implementation = None

    # Input prompt
    messages = [{"role": "user", "content": [{"type": "text", "text": "Write me a long essay on Deep Learning"}]}]
    model_inputs = large_tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
    ).to("cuda")

    # Benchmarking
    results = []
    label = "Generation"

    results.append(benchmark.Timer(
        stmt="generate_large(large_model, model_inputs)",
        setup="from __main__ import generate_large",
        globals={"large_model": large_model, "model_inputs": model_inputs},
        num_threads=torch.get_num_threads(),
        label=label, sub_label="without assistant", description="generation"
    ).blocked_autorange())

    results.append(benchmark.Timer(
        stmt="generate_assisted(large_model, small_model, tokenizer, assistant_tokenizer, model_inputs)",
        setup="from __main__ import generate_assisted",
        globals={"large_model": large_model, "small_model": small_model, "tokenizer": large_tokenizer,
                 "assistant_tokenizer": small_tokenizer, "model_inputs": model_inputs},
        num_threads=torch.get_num_threads(),
        label=label, sub_label="with assistant", description="generation"
    ).blocked_autorange())

    benchmark.Compare(results).print()
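Before looking at the numbers, it’s worth convincing yourself that the assistant doesn’t change what Gemma 3 writes: with do_sample=False, assisted generation should reproduce the large model’s greedy output token for token (up to numerical noise in bfloat16). As an optional sketch, you can append a quick comparison to the end of the __main__ block, reusing the variables it already defines:

    # Optional sanity check: greedy assisted generation should match plain greedy decoding
    check_kwargs = dict(do_sample=False, max_new_tokens=64, eos_token_id=-1)
    out_plain = large_model.generate(**model_inputs, **check_kwargs)
    out_assisted = large_model.generate(
        **model_inputs, **check_kwargs,
        assistant_model=small_model, tokenizer=large_tokenizer, assistant_tokenizer=small_tokenizer
    )
    print(torch.equal(out_plain, out_assisted))  # expected: True
    print(large_tokenizer.decode(out_assisted[0], skip_special_tokens=True))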
Results
Running this on a CUDA-enabled GPU (the timer reports 64 CPU threads) yielded:
[------------ Generation ------------]
                        |  generation
64 threads: --------------------------
      without assistant |     23.9
      with assistant    |     20.5

Times are in seconds (s).
The assisted setup (20.5 s) outperforms standalone Gemma 3 (23.9 s), cutting generation time by ~14%. Not bad for pairing it with the tiny Qwen 2.5!
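Since eos_token_id=-1 forces both runs to generate the full 256 new tokens, it’s easy to turn those timings into rough per-token throughput (a back-of-the-envelope sketch, not part of the benchmark script):

# Back-of-the-envelope throughput from the timings above
new_tokens = 256
baseline_s, assisted_s = 23.9, 20.5

print(f"without assistant: {new_tokens / baseline_s:.1f} tokens/s")  # ~10.7
print(f"with assistant:    {new_tokens / assisted_s:.1f} tokens/s")  # ~12.5
print(f"speedup:           {baseline_s / assisted_s:.2f}x")          # ~1.17x

In other words, the draft model buys roughly two extra tokens per second on this prompt; your mileage will vary with the prompt, the draft acceptance rate, and the hardware.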