Create README.md
README.md
ADDED
@@ -0,0 +1,97 @@
---
license: llama3.1
---

## Introduction
This is a vLLM-compatible FP8 post-training quantization (PTQ) model based on [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct).
For details of the quantization scheme, refer to the official documentation of the [AMD Quark 0.2.0 quantizer](https://quark.docs.amd.com/latest/index.html).

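As a rough illustration of what FP8 weights mean at this scale, the sketch below estimates weight-only memory for a 405B-parameter model at 16-bit and 8-bit precision, split evenly across 8 GPUs. It assumes exactly 405e9 parameters all stored at the same width, and it ignores the KV cache, activations, quantization scales, and any layers kept in higher precision, so the numbers are back-of-the-envelope estimates only.

```python
# Back-of-the-envelope weight memory estimate.
# Assumptions: 405e9 parameters, every weight stored at the given width,
# an even split across GPUs; KV cache and activations are ignored.
PARAMS = 405e9
NUM_GPUS = 8  # hypothetical 8-GPU tensor-parallel setup


def weight_gb(bytes_per_param: float, num_gpus: int = 1) -> float:
    """Return weight memory per GPU in GB (10**9 bytes)."""
    return PARAMS * bytes_per_param / num_gpus / 1e9


for name, width in (("fp16/bf16", 2), ("fp8", 1)):
    print(f"{name}: {weight_gb(width):.0f} GB total, "
          f"{weight_gb(width, NUM_GPUS):.1f} GB per GPU")
```
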
## Quickstart

To run this FP8 model with the vLLM framework, prepare the model and then launch one of the scripts below.

### Model Preparation
1. Build the rocm-vllm Docker image using this [Dockerfile](https://github.com/ROCm/vllm/blob/main/Dockerfile.rocm) and launch a vLLM Docker container.

```sh
# The linked file is named Dockerfile.rocm; pass whatever filename you saved it under to -f.
docker build -f Dockerfile_amd -t vllm_test .
docker run --rm -it --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 16G vllm_test:latest
```

2. Clone the baseline [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct).
3. Clone this [fp8 model](https://huggingface.co/amd/Meta-Llama-3.1-405B-Instruct-fp8-quark-vllm). Inside the cloned fp8 model folder, run the following command to merge the split `llama-*.safetensors` files into a single `llama.safetensors`.

```sh
python merge.py
```

4. Once the merged `llama.safetensors` is created, copy it and `llama.json` into the snapshot directory of the saved Meta-Llama-3.1-405B-Instruct model with the commands below. The snapshot commit hash (`069992c75aed59df00ec06c17177e76c63296a26` here) may differ on your machine.
```sh
cp llama.json ~/models--meta-llama--Meta-Llama-3.1-405B-Instruct/snapshots/069992c75aed59df00ec06c17177e76c63296a26/.
cp llama.safetensors ~/models--meta-llama--Meta-Llama-3.1-405B-Instruct/snapshots/069992c75aed59df00ec06c17177e76c63296a26/.
```

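Because the snapshot commit hash can differ between downloads, the copy in step 4 can also be scripted so the hash is discovered rather than typed by hand. The following is a minimal sketch, not part of this repository: the helper names `find_snapshot_dir` and `copy_quantized_files` are hypothetical, and it assumes the baseline model directory contains exactly one subdirectory under `snapshots/`.

```python
import shutil
from pathlib import Path


def find_snapshot_dir(model_cache: Path) -> Path:
    """Return the single snapshot directory under <model_cache>/snapshots."""
    snapshots = sorted(p for p in (model_cache / "snapshots").iterdir() if p.is_dir())
    if len(snapshots) != 1:
        # Refuse to guess when zero or several snapshots are present.
        raise RuntimeError(f"expected exactly one snapshot, found {len(snapshots)}")
    return snapshots[0]


def copy_quantized_files(fp8_dir: Path, model_cache: Path) -> Path:
    """Copy llama.json and llama.safetensors from fp8_dir into the snapshot."""
    target = find_snapshot_dir(model_cache)
    for name in ("llama.json", "llama.safetensors"):
        shutil.copy2(fp8_dir / name, target / name)
    return target
```

It could be called from the fp8 model folder as, for example, `copy_quantized_files(Path("."), Path.home() / "models--meta-llama--Meta-Llama-3.1-405B-Instruct")`, with both paths adjusted to your setup.
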
### Running fp8 model

```sh
# 8 GPUs
torchrun --standalone --nproc_per_node=8 run_vllm_fp8.py
```

```python
# run_vllm_fp8.py
from vllm import LLM, SamplingParams

prompt = "Write me an essay about bear and knight"

model_name = "/workspace/models--meta-llama--Meta-Llama-3.1-405B-Instruct/snapshots/069992c75aed59df00ec06c17177e76c63296a26/"
tp = 8  # 8 GPUs

model = LLM(
    model=model_name,
    tensor_parallel_size=tp,
    max_model_len=8192,
    trust_remote_code=True,
    dtype="float16",
    quantization="fp8",
    quantized_weights_path="/llama.safetensors",
)
sampling_params = SamplingParams(
    top_k=1,  # integer; keeps only the most likely token (effectively greedy)
    ignore_eos=True,  # keep generating until max_tokens is reached
    max_tokens=200,
)
result = model.generate(prompt, sampling_params=sampling_params)
print(result)
```
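The script prints the raw result objects, which can be hard to read. A small helper such as the one below could pull out just the generated text. This is a sketch: `completion_texts` is a hypothetical name, and it only assumes that each result exposes `.prompt` and `.outputs[0].text`, as vLLM's `RequestOutput` objects do.

```python
def completion_texts(results):
    """Collect (prompt, generated text) pairs from vLLM-style results.

    Assumes each item exposes `.prompt` and `.outputs[0].text`.
    """
    return [(r.prompt, r.outputs[0].text) for r in results]


# Possible usage at the end of the script, after model.generate(...):
# for prompt, text in completion_texts(result):
#     print(f"{prompt!r} -> {text!r}")
```
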
### Running fp16 model (For comparison)

```sh
# 8 GPUs
torchrun --standalone --nproc_per_node=8 run_vllm_fp16.py
```

```python
# run_vllm_fp16.py
from vllm import LLM, SamplingParams

prompt = "Write me an essay about bear and knight"

model_name = "/workspace/models--meta-llama--Meta-Llama-3.1-405B-Instruct/snapshots/069992c75aed59df00ec06c17177e76c63296a26/"
tp = 8  # 8 GPUs
model = LLM(
    model=model_name,
    tensor_parallel_size=tp,
    max_model_len=8192,
    trust_remote_code=True,
    dtype="bfloat16",
)
sampling_params = SamplingParams(
    top_k=1,  # integer; keeps only the most likely token (effectively greedy)
    ignore_eos=True,  # keep generating until max_tokens is reached
    max_tokens=200,
)
result = model.generate(prompt, sampling_params=sampling_params)
print(result)
```
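Since both scripts use the same prompt, a `top_k` of 1, and the same `max_tokens`, one simple way to compare the two runs is to check how far their generated token sequences agree. The sketch below is an illustration only and not a method described by this model card. It assumes you have collected the token ids from each run yourself (for example from `result[0].outputs[0].token_ids`), and the helper names are hypothetical.

```python
def common_prefix_len(a, b):
    """Number of leading positions where two token-id sequences agree."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def agreement(a, b):
    """Fraction of aligned positions with identical token ids."""
    length = min(len(a), len(b))
    if length == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / length
```

The common prefix length is usually the more informative of the two numbers: once the sequences diverge, later positions are conditioned on different text, so position-wise agreement after that point says little about quantization quality.
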
## fp8 gemm_tuning


#### License
Copyright (c) 2018-2024 Advanced Micro Devices, Inc. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.