amd/Meta-Llama-3.1-405B-Instruct-fp8-quark-vllm


seungrok81 committed on Aug 14, 2024

Commit

dfb27fd


1 Parent(s): 0cb4189

Create README.md


Files changed (1)

README.md +97 -0

README.md ADDED

	@@ -0,0 +1,97 @@

+---
+license: llama3.1
+---
+## Introduction
+This is a vLLM-compatible FP8 post-training quantized (PTQ) model based on [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct).
+For the detailed quantization scheme, refer to the official documentation of the [AMD Quark 0.2.0 quantizer](https://quark.docs.amd.com/latest/index.html).
+## Quickstart
+To run this FP8 model on the vLLM framework, follow the steps below.
+### Model Preparation
+1. Build the ROCm vLLM Docker image using this [dockerfile](https://github.com/ROCm/vllm/blob/main/Dockerfile.rocm) and launch a vLLM Docker container. The commands below assume the Dockerfile was saved locally as `Dockerfile_amd`.
+```sh
+docker build -f Dockerfile_amd -t vllm_test .
+docker run --rm -it --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 16G vllm_test:latest
+```
+2. Clone the baseline [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct).
+3. Clone this [fp8 model](https://huggingface.co/amd/Meta-Llama-3.1-405B-Instruct-fp8-quark-vllm) and, inside the cloned folder, run the following to merge the split llama-*.safetensors files into a single llama.safetensors.
+```sh
+python merge.py
+```
+4. Once the merged llama.safetensors is created, copy it and llama.json into the snapshot directory of the baseline Meta-Llama-3.1-405B-Instruct with the commands below. The snapshot commit hash (069992c75aed59df00ec06c17177e76c63296a26 here) may differ on your machine.
+```sh
+cp llama.json ~/models--meta-llama--Meta-Llama-3.1-405B-Instruct/snapshots/069992c75aed59df00ec06c17177e76c63296a26/.
+cp llama.safetensors ~/models--meta-llama--Meta-Llama-3.1-405B-Instruct/snapshots/069992c75aed59df00ec06c17177e76c63296a26/.
+```
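Because the snapshot hash may differ, the copy step above can also be scripted so the hash is not hard-coded. The following is a minimal sketch, not part of this repository: the helper names `find_snapshot_dir` and `copy_fp8_files` are hypothetical, and it assumes the baseline model directory contains exactly one entry under `snapshots/`.

```python
import shutil
from pathlib import Path


def find_snapshot_dir(model_cache_dir):
    """Return the only snapshot directory under <model_cache_dir>/snapshots.

    Assumption: exactly one snapshot of the baseline model was downloaded.
    """
    snapshots = Path(model_cache_dir).expanduser() / "snapshots"
    candidates = sorted(p for p in snapshots.iterdir() if p.is_dir())
    if len(candidates) != 1:
        raise RuntimeError(
            f"expected exactly one snapshot in {snapshots}, found {len(candidates)}"
        )
    return candidates[0]


def copy_fp8_files(fp8_dir, snapshot_dir, names=("llama.json", "llama.safetensors")):
    """Copy the merged FP8 files into the baseline snapshot directory."""
    copied = []
    for name in names:
        src = Path(fp8_dir) / name
        if not src.is_file():
            raise FileNotFoundError(f"missing {src}; run merge.py first")
        # shutil.copy2 returns the path of the newly created file
        copied.append(Path(shutil.copy2(src, snapshot_dir)))
    return copied
```

Run from inside the fp8 model folder, a call such as `copy_fp8_files(".", find_snapshot_dir("~/models--meta-llama--Meta-Llama-3.1-405B-Instruct"))` would mirror the two `cp` commands.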
+### Running fp8 model
+```sh
+# 8 GPUs
+torchrun --standalone --nproc_per_node=8 run_vllm_fp8.py
+```
+```python
+# run_vllm_fp8.py
+from vllm import LLM, SamplingParams
+prompt = "Write me an essay about bear and knight"
+model_name="/workspace/models--meta-llama--Meta-Llama-3.1-405B-Instruct/snapshots/069992c75aed59df00ec06c17177e76c63296a26/"
+tp=8 # 8 GPUs
+model = LLM(model=model_name, tensor_parallel_size=tp, max_model_len=8192, trust_remote_code=True, dtype="float16", quantization="fp8", quantized_weights_path="/llama.safetensors")
+sampling_params = SamplingParams(
+                  top_k=1,  # integer: keep only the most likely token (greedy decoding)
+                  ignore_eos=True,
+                  max_tokens=200,
+                  )
+result = model.generate(prompt, sampling_params=sampling_params)
+print(result)
+```
+### Running fp16 model (For comparison)
+```sh
+# 8 GPUs
+torchrun --standalone --nproc_per_node=8 run_vllm_fp16.py
+```
+```python
+# run_vllm_fp16.py
+from vllm import LLM, SamplingParams
+prompt = "Write me an essay about bear and knight"
+model_name="/workspace/models--meta-llama--Meta-Llama-3.1-405B-Instruct/snapshots/069992c75aed59df00ec06c17177e76c63296a26/"
+tp=8 # 8 GPUs
+model = LLM(model=model_name, tensor_parallel_size=tp, max_model_len=8192, trust_remote_code=True, dtype="bfloat16")
+sampling_params = SamplingParams(
+                  top_k=1,  # integer: keep only the most likely token (greedy decoding)
+                  ignore_eos=True,
+                  max_tokens=200,
+                  )
+result = model.generate(prompt, sampling_params=sampling_params)
+print(result)
+```
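Since both scripts keep only the most likely token at each step, one simple way to compare the two runs is to measure how long their generated token sequences agree before diverging. The helper below is a hedged sketch and not part of this repository; `common_prefix_len` and `compare_outputs` are hypothetical names, and the inputs are plain lists of token IDs saved from each run.

```python
def common_prefix_len(a, b):
    """Number of leading positions at which sequences a and b are equal."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def compare_outputs(fp8_ids, ref_ids):
    """Summarize how far the FP8 output follows the reference output."""
    shared = common_prefix_len(fp8_ids, ref_ids)
    longest = max(len(fp8_ids), len(ref_ids))
    return {
        "shared_prefix": shared,
        # fraction of the longer sequence covered by the shared prefix
        "fraction": shared / longest if longest else 1.0,
    }
```

In each script, the generated IDs should be available as `result[0].outputs[0].token_ids`. An early divergence does not by itself indicate a quality problem, since small numerical differences can change a single token and every token after it.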
+## fp8 gemm_tuning
+#### License
+Copyright (c) 2018-2024 Advanced Micro Devices, Inc. All Rights Reserved.
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+    http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.