Throughput Measurement
#4 opened by jacktenyx
I benchmarked this on 4x A100 40GB and saw no latency improvement: the Medusa model got 95 tokens per second and the default model got 100 tokens per second. This is the launch config I used. Not sure what I'm missing:
volume=$PWD/data

docker run \
  -p 8080:80 \
  -e CUDA_VISIBLE_DEVICES=0,1,2,3 \
  --gpus all \
  --shm-size 5g \
  -v $volume:/data \
  ghcr.io/huggingface/text-generation-inference:2.0.1 \
  --model-id text-generation-inference/Mixtral-8x7B-Instruct-v0.1-medusa \
  --max-input-tokens 4096 \
  --max-total-tokens 8192 \
  --num-shard 4
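For context, a quick way to measure single-request tokens per second against this container is a timed call to TGI's /generate endpoint (a rough sketch, assuming jq is installed; it uses TGI's details option so the response reports how many tokens were actually generated):

start=$(date +%s.%N)
# details:true makes TGI include .details.generated_tokens in the response
tokens=$(curl -s http://localhost:8080/generate \
  -H 'Content-Type: application/json' \
  -d '{"inputs":"Write a short story about a robot.","parameters":{"max_new_tokens":512,"details":true}}' \
  | jq '.details.generated_tokens')
end=$(date +%s.%N)
# single-request decode throughput in tokens per second
echo "scale=1; $tokens / ($end - $start)" | bc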
Based on the docs here: "In order to use medusa models in TGI, simply point to a medusa enabled model, and everything will load automatically."
I also tried it with --speculate 3 and got the same performance.
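For completeness, that second run was the same command with the flag appended; anything after the image name is passed through to text-generation-launcher:

docker run \
  -p 8080:80 \
  -e CUDA_VISIBLE_DEVICES=0,1,2,3 \
  --gpus all \
  --shm-size 5g \
  -v $volume:/data \
  ghcr.io/huggingface/text-generation-inference:2.0.1 \
  --model-id text-generation-inference/Mixtral-8x7B-Instruct-v0.1-medusa \
  --max-input-tokens 4096 \
  --max-total-tokens 8192 \
  --num-shard 4 \
  --speculate 3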