seungrok81 committed · verified · Commit dfb27fd · 1 Parent(s): 0cb4189

Create README.md

Files changed (1): README.md (+97 -0)

README.md (new file):
---
license: llama3.1
---

## Introduction
This is a vLLM-compatible FP8 PTQ (post-training quantization) model based on [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct).
For details of the quantization scheme, refer to the official documentation of the [AMD Quark 0.2.0 quantizer](https://quark.docs.amd.com/latest/index.html).

## Quickstart

To run this FP8 model on the vLLM framework:

### Model Preparation
1. Build the ROCm vLLM docker image from this [dockerfile](https://github.com/ROCm/vllm/blob/main/Dockerfile.rocm) and launch a vLLM docker container.

```sh
docker build -f Dockerfile.rocm -t vllm_test .
docker run --rm -it --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 16G vllm_test:latest
```

2. Clone the baseline [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct); a hedged download sketch is given at the end of this section.
3. Clone this [fp8 model](https://huggingface.co/amd/Meta-Llama-3.1-405B-Instruct-fp8-quark-vllm) and, inside its folder, run the command below to merge the split llama-*.safetensors files into a single llama.safetensors (a sketch of what the merge does also follows this list).

```sh
python merge.py
```

4. Once the merged llama.safetensors has been created, copy it together with llama.json into the snapshot directory of Meta-Llama-3.1-405B-Instruct using the commands below. The snapshot commit hash (069992c75aed59df00ec06c17177e76c63296a26 here) may differ on your system.
```sh
cp llama.json ~/models--meta-llama--Meta-Llama-3.1-405B-Instruct/snapshots/069992c75aed59df00ec06c17177e76c63296a26/.
cp llama.safetensors ~/models--meta-llama--Meta-Llama-3.1-405B-Instruct/snapshots/069992c75aed59df00ec06c17177e76c63296a26/.
```
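
Step 2 above assumes the baseline checkpoint ends up in the Hugging Face cache layout (`models--meta-llama--Meta-Llama-3.1-405B-Instruct/snapshots/<commit>/`) referenced in step 4. As a minimal, hedged sketch of one way to obtain it with the standard `huggingface-cli` tool (the repository is gated, so a Hugging Face account with access and a login token are assumed; the cache directory is illustrative):

```sh
# One-time login with a token that has access to the gated Llama 3.1 repo
huggingface-cli login

# Download into a chosen cache dir; this creates
#   <cache dir>/models--meta-llama--Meta-Llama-3.1-405B-Instruct/snapshots/<commit hash>/
huggingface-cli download meta-llama/Meta-Llama-3.1-405B-Instruct --cache-dir /workspace
```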
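
The merge in step 3 is performed by the `merge.py` shipped with this repository. Purely as an illustration of what such a merge amounts to (this is not the actual script), here is a minimal sketch using the `safetensors` API, assuming the shards follow the `llama-*.safetensors` naming and that tensor names are unique across shards:

```python
# merge_sketch.py -- illustrative only; use the merge.py provided in this repository
import glob

from safetensors.torch import load_file, save_file

merged = {}
for shard in sorted(glob.glob("llama-*.safetensors")):
    # Each shard maps tensor names to tensors; names are assumed unique across shards.
    merged.update(load_file(shard))

save_file(merged, "llama.safetensors")
print(f"Wrote {len(merged)} tensors to llama.safetensors")
```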

### Running the fp8 model

```sh
# 8 GPUs
torchrun --standalone --nproc_per_node=8 run_vllm_fp8.py
```

```python
# run_vllm_fp8.py
from vllm import LLM, SamplingParams

prompt = "Write me an essay about bear and knight"

model_name = "/workspace/models--meta-llama--Meta-Llama-3.1-405B-Instruct/snapshots/069992c75aed59df00ec06c17177e76c63296a26/"
tp = 8  # 8 GPUs

model = LLM(model=model_name, tensor_parallel_size=tp, max_model_len=8192, trust_remote_code=True, dtype="float16", quantization="fp8", quantized_weights_path="/llama.safetensors")
sampling_params = SamplingParams(
    top_k=1,
    ignore_eos=True,
    max_tokens=200,
)
result = model.generate(prompt, sampling_params=sampling_params)
print(result)
```
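
Note that `model.generate` returns a list of vLLM `RequestOutput` objects, so `print(result)` dumps the full objects. To print only the generated text (the same applies to the fp16 script below), you can iterate over the results:

```python
# Each RequestOutput holds one or more completions; print the first completion's text.
for request_output in result:
    print(request_output.outputs[0].text)
```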

### Running the fp16 model (for comparison)

```sh
# 8 GPUs
torchrun --standalone --nproc_per_node=8 run_vllm_fp16.py
```

```python
# run_vllm_fp16.py
from vllm import LLM, SamplingParams

prompt = "Write me an essay about bear and knight"

model_name = "/workspace/models--meta-llama--Meta-Llama-3.1-405B-Instruct/snapshots/069992c75aed59df00ec06c17177e76c63296a26/"
tp = 8  # 8 GPUs

model = LLM(model=model_name, tensor_parallel_size=tp, max_model_len=8192, trust_remote_code=True, dtype="bfloat16")
sampling_params = SamplingParams(
    top_k=1,
    ignore_eos=True,
    max_tokens=200,
)
result = model.generate(prompt, sampling_params=sampling_params)
print(result)
```

## fp8 gemm_tuning

## License
Copyright (c) 2018-2024 Advanced Micro Devices, Inc. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.