Throughput Measurement
#4 opened by jacktenyx
I benchmarked this on 4x A100 40GB and saw no latency improvement: the Medusa model got 95 tokens per second and the default model got 100 tokens per second. This is the launch config I used. Not sure what I'm missing:
volume=$PWD/data

docker run \
  -p 8080:80 \
  -e CUDA_VISIBLE_DEVICES=0,1,2,3 \
  --gpus all \
  --shm-size 5g \
  -v $volume:/data \
  ghcr.io/huggingface/text-generation-inference:2.0.1 \
  --model-id text-generation-inference/Mixtral-8x7B-Instruct-v0.1-medusa \
  --max-input-tokens 4096 \
  --max-total-tokens 8192 \
  --num-shard 4
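For context, a quick way to measure single-request tokens per second against this container is a timed call to TGI's /generate endpoint (a rough sketch, assuming jq is installed; it uses TGI's details option so the response reports how many tokens were actually generated):

start=$(date +%s.%N)
# details:true makes TGI include .details.generated_tokens in the response
tokens=$(curl -s http://localhost:8080/generate \
  -H 'Content-Type: application/json' \
  -d '{"inputs":"Write a short story about a robot.","parameters":{"max_new_tokens":512,"details":true}}' \
  | jq '.details.generated_tokens')
end=$(date +%s.%N)
# single-request decode throughput in tokens per second
echo "scale=1; $tokens / ($end - $start)" | bc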
Based on the docs here: "In order to use medusa models in TGI, simply point to a medusa enabled model, and everything will load automatically."
I also tried it with --speculate 3 and got the same performance.
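For completeness, that second run was the same command with the flag appended; anything after the image name is passed through to text-generation-launcher:

docker run \
  -p 8080:80 \
  -e CUDA_VISIBLE_DEVICES=0,1,2,3 \
  --gpus all \
  --shm-size 5g \
  -v $volume:/data \
  ghcr.io/huggingface/text-generation-inference:2.0.1 \
  --model-id text-generation-inference/Mixtral-8x7B-Instruct-v0.1-medusa \
  --max-input-tokens 4096 \
  --max-total-tokens 8192 \
  --num-shard 4 \
  --speculate 3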