Benchmarking Assisted Generation with Gemma 3 and Qwen 2.5: A Code-First Guide
Published March 12, 2025
In this blog post, we will explore the performance of assisted generation using the newly released Gemma 3 (27B), with Qwen 2.5 (0.5B) acting as the assistant model. Assisted generation leverages a smaller model to boost the throughput of a larger model. Pretty cool, right? Let’s dive into the code and results.
What is Assisted Generation?
Assisted generation (also known as speculative decoding) uses a smaller, faster model to draft candidate tokens that the larger model then verifies in parallel, improving throughput without sacrificing quality. Curious? Check out Hugging Face’s detailed explanation.
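In Transformers, the whole trick boils down to one extra argument: you pass the draft model to generate() as assistant_model, along with both tokenizers, since Gemma 3 and Qwen 2.5 do not share a vocabulary. Here is a minimal sketch using the same checkpoints as the benchmark below (the full script uses the chat template and proper timing):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.models.gemma3 import Gemma3ForCausalLM

large_tok = AutoTokenizer.from_pretrained("google/gemma-3-27b-it")
small_tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
large = Gemma3ForCausalLM.from_pretrained("google/gemma-3-27b-it", torch_dtype=torch.bfloat16).to("cuda")
small = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", torch_dtype=torch.bfloat16).to("cuda")

# Plain text prompt to keep the sketch short; the benchmark below uses the chat template
inputs = large_tok("Write me a long essay on Deep Learning", return_tensors="pt").to("cuda")

# assistant_model switches on assisted generation; the two tokenizers let Transformers
# translate the draft tokens between the two vocabularies
out = large.generate(**inputs, max_new_tokens=64, assistant_model=small,
                     tokenizer=large_tok, assistant_tokenizer=small_tok)
print(large_tok.decode(out[0], skip_special_tokens=True))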
The Setup
We’ll benchmark generation speed with and without assistance using PyTorch’s benchmark utilities and Hugging Face Transformers. Here’s the full script:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch
from torch.utils import benchmark
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers.models.gemma3 import Gemma3ForCausalLM


def load_models():
    # Load Gemma 3 (27B)
    large_ckpt = "google/gemma-3-27b-it"
    large_model = Gemma3ForCausalLM.from_pretrained(large_ckpt, torch_dtype=torch.bfloat16).to("cuda")
    large_tokenizer = AutoTokenizer.from_pretrained(large_ckpt)

    # Load Qwen 2.5 (0.5B)
    small_ckpt = "Qwen/Qwen2.5-0.5B-Instruct"
    small_model = AutoModelForCausalLM.from_pretrained(small_ckpt, torch_dtype=torch.bfloat16).to("cuda")
    small_tokenizer = AutoTokenizer.from_pretrained(small_ckpt)

    return large_tokenizer, small_tokenizer, small_model, large_model


def generate_large(large_model, model_inputs):
    # eos_token_id=-1 disables early stopping, so every run produces the full 256 new tokens
    large_model.generate(**model_inputs, do_sample=False, max_new_tokens=256, eos_token_id=-1)


def generate_assisted(large_model, small_model, tokenizer, assistant_tokenizer, model_inputs):
    # Both tokenizers are passed because Gemma 3 and Qwen 2.5 use different vocabularies
    large_model.generate(
        **model_inputs, do_sample=False, max_new_tokens=256, eos_token_id=-1,
        assistant_model=small_model, tokenizer=tokenizer, assistant_tokenizer=assistant_tokenizer
    )


if __name__ == "__main__":
    large_tokenizer, small_tokenizer, small_model, large_model = load_models()

    # Use the same default cache implementation in both models for a fair comparison
    small_model.generation_config.cache_implementation = None
    large_model.generation_config.cache_implementation = None

    # Input prompt
    messages = [{"role": "user", "content": [{"type": "text", "text": "Write me a long essay on Deep Learning"}]}]
    model_inputs = large_tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
    ).to("cuda")

    # Benchmarking
    results = []
    label = "Generation"

    results.append(benchmark.Timer(
        stmt="generate_large(large_model, model_inputs)",
        setup="from __main__ import generate_large",
        globals={"large_model": large_model, "model_inputs": model_inputs},
        num_threads=torch.get_num_threads(),
        label=label, sub_label="without assistant", description="generation"
    ).blocked_autorange())

    results.append(benchmark.Timer(
        stmt="generate_assisted(large_model, small_model, tokenizer, assistant_tokenizer, model_inputs)",
        setup="from __main__ import generate_assisted",
        globals={"large_model": large_model, "small_model": small_model, "tokenizer": large_tokenizer,
                 "assistant_tokenizer": small_tokenizer, "model_inputs": model_inputs},
        num_threads=torch.get_num_threads(),
        label=label, sub_label="with assistant", description="generation"
    ).blocked_autorange())

    benchmark.Compare(results).print()
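Before looking at the numbers, it’s worth convincing yourself that the assistant doesn’t change what Gemma 3 writes: with do_sample=False, assisted generation should reproduce the large model’s greedy output token for token (up to numerical noise in bfloat16). As an optional sketch, you can append a quick comparison to the end of the __main__ block, reusing the variables it already defines:

    # Optional sanity check: greedy assisted generation should match plain greedy decoding
    check_kwargs = dict(do_sample=False, max_new_tokens=64, eos_token_id=-1)
    out_plain = large_model.generate(**model_inputs, **check_kwargs)
    out_assisted = large_model.generate(
        **model_inputs, **check_kwargs,
        assistant_model=small_model, tokenizer=large_tokenizer, assistant_tokenizer=small_tokenizer
    )
    print(torch.equal(out_plain, out_assisted))  # expected: True
    print(large_tokenizer.decode(out_assisted[0], skip_special_tokens=True))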
Results
Running this on a CUDA-enabled GPU (the timer reports 64 CPU threads) yielded:
[------------ Generation ------------]
                        |  generation
64 threads: --------------------------
      without assistant |     23.9
      with assistant    |     20.5

Times are in seconds (s).
The assisted setup (20.5 s) outperforms standalone Gemma 3 (23.9 s), cutting generation time by ~14%. Not bad for pairing it with the tiny Qwen 2.5!
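Since eos_token_id=-1 forces both runs to generate the full 256 new tokens, it’s easy to turn those timings into rough per-token throughput (a back-of-the-envelope sketch, not part of the benchmark script):

# Back-of-the-envelope throughput from the timings above
new_tokens = 256
baseline_s, assisted_s = 23.9, 20.5

print(f"without assistant: {new_tokens / baseline_s:.1f} tokens/s")  # ~10.7
print(f"with assistant:    {new_tokens / assisted_s:.1f} tokens/s")  # ~12.5
print(f"speedup:           {baseline_s / assisted_s:.2f}x")          # ~1.17x

In other words, the draft model buys roughly two extra tokens per second on this prompt; your mileage will vary with the prompt, the draft acceptance rate, and the hardware.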