|
---
license: llama3.1
---
|
|
|
## Introduction |
|
This is a vLLM-compatible fp8 post-training quantized (PTQ) model based on [Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct).
|
For details of the quantization scheme, refer to the official documentation of the [AMD Quark 0.2.0 quantizer](https://quark.docs.amd.com/latest/index.html).
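
As a quick sanity check of what the quantized checkpoint contains, you can list the tensors stored in `llama.safetensors` (a minimal sketch, assuming the file has been downloaded locally and that `safetensors` and a recent `torch` are installed; the exact tensor and scale naming follows whatever Quark exported and is not spelled out here):

```python
# Inspect the fp8 checkpoint: print each tensor's name, dtype, and shape.
# Quantized weights typically appear with an fp8 dtype alongside separate scale tensors.
from safetensors import safe_open

with safe_open("llama.safetensors", framework="pt", device="cpu") as f:
    for name in f.keys():
        tensor = f.get_tensor(name)
        print(f"{name}: dtype={tensor.dtype}, shape={tuple(tensor.shape)}")
```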
|
|
|
## Quickstart |
|
|
|
To run this fp8 model with the vLLM framework, follow the steps below.
|
|
|
### Model Preparation
|
1. Build the ROCm vLLM Docker image using this [Dockerfile](https://github.com/ROCm/vllm/blob/main/Dockerfile.rocm) and launch a vLLM Docker container.
|
|
|
```sh |
|
docker build -f Dockerfile.rocm -t vllm_test . |
|
docker run --rm -it --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 16G vllm_test:latest |
|
``` |
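
The `docker run` command above does not mount any host directories, so the remaining steps assume the model files are downloaded inside the container. If you would rather keep the weights on the host, one option (a sketch; adjust the mount paths to wherever your snapshots actually live) is to bind-mount that directory when launching the container:

```sh
# Bind-mount a host directory holding the model snapshots so it is visible inside the container
# (the container runs as root, so ~ resolves to /root there).
docker run --rm -it --device=/dev/kfd --device=/dev/dri --group-add video \
    --ipc=host --shm-size 16G \
    -v ~/models--meta-llama--Meta-Llama-3.1-8B-Instruct:/root/models--meta-llama--Meta-Llama-3.1-8B-Instruct \
    vllm_test:latest
```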
|
|
|
2. Clone the baseline [Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) (example download commands are shown after step 4).
|
3. Clone this [fp8 model](https://huggingface.co/amd/Meta-Llama-3.1-8B-Instruct-fp8-quark-vllm).
|
4. Copy llama.safetensors and llama.json from the [fp8 model](https://huggingface.co/amd/Meta-Llama-3.1-8B-Instruct-fp8-quark-vllm) into the local snapshot directory of [Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) using the commands below. The snapshot commit hash (8c22764a7e3675c50d4c7c9a4edb474456022b16 here) may differ on your machine.
|
```sh |
|
cp llama.json ~/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/8c22764a7e3675c50d4c7c9a4edb474456022b16/. |
|
cp llama.safetensors ~/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/8c22764a7e3675c50d4c7c9a4edb474456022b16/. |
|
``` |
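
For steps 2 and 3, any standard way of fetching the two repositories works (for example `git clone` with git-lfs, or the Hugging Face CLI). A sketch using `huggingface-cli` (requires `huggingface_hub` and, for the gated Llama weights, `huggingface-cli login`); by default the files land in the Hugging Face cache under `models--<org>--<name>/snapshots/<commit>/`, so point the paths in step 4 and in the scripts below at wherever your snapshot actually ends up:

```sh
# Download the baseline weights and the fp8 checkpoint into the local Hugging Face cache.
huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct
huggingface-cli download amd/Meta-Llama-3.1-8B-Instruct-fp8-quark-vllm
```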
|
|
|
### Running the fp8 model
|
|
|
```sh |
|
# single GPU |
|
python run_vllm_fp8.py |
|
|
|
# 8 GPUs |
|
torchrun --standalone --nproc_per_node=8 run_vllm_fp8.py |
|
``` |
|
|
|
```python
# run_vllm_fp8.py
from vllm import LLM, SamplingParams

prompt = "Write me an essay about a bear and a knight"

# Local snapshot directory prepared in the Model Preparation steps; the commit hash may differ.
model_name = "models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/8c22764a7e3675c50d4c7c9a4edb474456022b16/"

tp = 1   # single GPU
# tp = 8 # 8 GPUs

# Load the fp8 weights (llama.safetensors) that were copied into the snapshot directory.
model = LLM(model=model_name, tensor_parallel_size=tp, max_model_len=8192, trust_remote_code=True, dtype="float16", quantization="fp8", quantized_weights_path="/llama.safetensors")
sampling_params = SamplingParams(
    top_k=1,  # greedy-style decoding; top_k is an integer
    ignore_eos=True,
    max_tokens=200,
)
result = model.generate(prompt, sampling_params=sampling_params)
# result is a list of RequestOutput objects; print the generated text for the prompt
print(result[0].outputs[0].text)
```
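
Since this is an instruct-tuned model, you may get better completions by wrapping the request in the Llama 3.1 chat template instead of passing the raw string. A sketch that continues from `run_vllm_fp8.py` above and reuses its `model_name`, `model`, and `sampling_params`; `apply_chat_template` is standard `transformers` API, not something specific to this repository:

```python
# Continues from run_vllm_fp8.py: format the request with the Llama 3.1 chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
chat = [{"role": "user", "content": "Write me an essay about a bear and a knight"}]
chat_prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

result = model.generate(chat_prompt, sampling_params=sampling_params)
print(result[0].outputs[0].text)
```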
|
### Running the fp16 model (for comparison)
|
|
|
```sh |
|
# single GPU |
|
python run_vllm_fp16.py |
|
|
|
# 8 GPUs |
|
torchrun --standalone --nproc_per_node=8 run_vllm_fp16.py
|
``` |
|
|
|
```python
# run_vllm_fp16.py
from vllm import LLM, SamplingParams

prompt = "Write me an essay about a bear and a knight"

# Local snapshot directory of the unquantized baseline model; the commit hash may differ.
model_name = "models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/8c22764a7e3675c50d4c7c9a4edb474456022b16/"

tp = 1   # single GPU
# tp = 8 # 8 GPUs

model = LLM(model=model_name, tensor_parallel_size=tp, max_model_len=8192, trust_remote_code=True, dtype="bfloat16")
sampling_params = SamplingParams(
    top_k=1,  # greedy-style decoding; top_k is an integer
    ignore_eos=True,
    max_tokens=200,
)
result = model.generate(prompt, sampling_params=sampling_params)
# result is a list of RequestOutput objects; print the generated text for the prompt
print(result[0].outputs[0].text)
```
|
## fp8 GEMM tuning

This section will be updated soon.
|
|
|
#### License |
|
Copyright (c) 2018-2024 Advanced Micro Devices, Inc. All Rights Reserved. |
|
|
|
Licensed under the Apache License, Version 2.0 (the "License"); |
|
you may not use this file except in compliance with the License. |
|
You may obtain a copy of the License at |
|
|
|
http://www.apache.org/licenses/LICENSE-2.0 |
|
|
|
Unless required by applicable law or agreed to in writing, software |
|
distributed under the License is distributed on an "AS IS" BASIS, |
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
|
See the License for the specific language governing permissions and |
|
limitations under the License. |