The original Llama 3.3 70B Instruct model quantized using AutoAWQ. Follow the instruction here.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = 'meta-llama/Llama-3.3-70B-Instruct'
quant_path = 'Llama-3.3-70B-Instruct-AWQ-4bit'
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }
# Load model
model = AutoAWQForCausalLM.from_pretrained(
model_path, **{"low_cpu_mem_usage": True, "use_cache": False}
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Quantize
model.quantize(tokenizer, quant_config=quant_config)
# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
vLLM serve
vllm serve lambdalabs/Llama-3.3-70B-Instruct-AWQ-4bit \
--swap-space 16 \
--disable-log-requests \
--tokenizer meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 2
Benchmark
python benchmark_serving.py \
--backend vllm \
--model lambdalabs/Llama-3.3-70B-Instruct-AWQ-4bit \
--tokenizer meta-llama/Meta-Llama-3-70B \
--dataset-name sharegpt \
--dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
--num-prompts 1000
============ Serving Benchmark Result ============
Successful requests: 902
Benchmark duration (s): 128.07
Total input tokens: 177877
Total generated tokens: 182359
Request throughput (req/s): 7.04
Output token throughput (tok/s): 1423.85
Total Token throughput (tok/s): 2812.71
---------------Time to First Token----------------
Mean TTFT (ms): 47225.59
Median TTFT (ms): 43313.95
P99 TTFT (ms): 105587.66
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 141.01
Median TPOT (ms): 148.94
P99 TPOT (ms): 174.16
---------------Inter-token Latency----------------
Mean ITL (ms): 131.55
Median ITL (ms): 150.82
P99 ITL (ms): 344.50
==================================================
- Downloads last month
- 376