|
---
license: llama3.1
---
|
|
|
## Introduction |
|
This is a vLLM-compatible fp8 post-training quantized (PTQ) model based on [Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct).
|
For details of the quantization scheme, refer to the official documentation of the [AMD Quark 0.2.0 quantizer](https://quark.docs.amd.com/latest/index.html).
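
As a quick sanity check of what the quantized checkpoint contains, you can list the tensors stored in `llama.safetensors` (a minimal sketch, assuming the file has been downloaded locally and that `safetensors` and a recent `torch` are installed; the exact tensor and scale naming follows whatever Quark exported and is not spelled out here):

```python
# Inspect the fp8 checkpoint: print each tensor's name, dtype, and shape.
# Quantized weights typically appear with an fp8 dtype alongside separate scale tensors.
from safetensors import safe_open

with safe_open("llama.safetensors", framework="pt", device="cpu") as f:
    for name in f.keys():
        tensor = f.get_tensor(name)
        print(f"{name}: dtype={tensor.dtype}, shape={tuple(tensor.shape)}")
```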
|
|
|
## Quickstart |
|
|
|
To run this fp8 model with the vLLM framework, follow the steps below.
|
|
|
### Model Preparation
|
1. Build the ROCm vLLM Docker image using this [Dockerfile](https://github.com/ROCm/vllm/blob/main/Dockerfile.rocm) and launch a vLLM Docker container.
|
|
|
```sh |
|
docker build -f Dockerfile.rocm -t vllm_test . |
|
docker run --rm -it --device=/dev/kfd --device=/dev/dri --group-add video --ipc=host --shm-size 16G vllm_test:latest |
|
``` |
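
The `docker run` command above does not mount any host directories, so the remaining steps assume the model files are downloaded inside the container. If you would rather keep the weights on the host, one option (a sketch; adjust the mount paths to wherever your snapshots actually live) is to bind-mount that directory when launching the container:

```sh
# Bind-mount a host directory holding the model snapshots so it is visible inside the container
# (the container runs as root, so ~ resolves to /root there).
docker run --rm -it --device=/dev/kfd --device=/dev/dri --group-add video \
    --ipc=host --shm-size 16G \
    -v ~/models--meta-llama--Meta-Llama-3.1-8B-Instruct:/root/models--meta-llama--Meta-Llama-3.1-8B-Instruct \
    vllm_test:latest
```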
|
|
|
2. Clone the baseline [Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) (example download commands are shown after step 4).
|
3. Clone this [fp8 model](https://huggingface.co/amd/Meta-Llama-3.1-8B-Instruct-fp8-quark-vllm).
|
4. Copy llama.safetensors and llama.json from the [fp8 model](https://huggingface.co/amd/Meta-Llama-3.1-8B-Instruct-fp8-quark-vllm) into the local snapshot directory of [Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct) using the commands below. The snapshot commit hash (8c22764a7e3675c50d4c7c9a4edb474456022b16 here) may differ on your machine.
|
```sh |
|
cp llama.json ~/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/8c22764a7e3675c50d4c7c9a4edb474456022b16/. |
|
cp llama.safetensors ~/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/8c22764a7e3675c50d4c7c9a4edb474456022b16/. |
|
``` |
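
For steps 2 and 3, any standard way of fetching the two repositories works (for example `git clone` with git-lfs, or the Hugging Face CLI). A sketch using `huggingface-cli` (requires `huggingface_hub` and, for the gated Llama weights, `huggingface-cli login`); by default the files land in the Hugging Face cache under `models--<org>--<name>/snapshots/<commit>/`, so point the paths in step 4 and in the scripts below at wherever your snapshot actually ends up:

```sh
# Download the baseline weights and the fp8 checkpoint into the local Hugging Face cache.
huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct
huggingface-cli download amd/Meta-Llama-3.1-8B-Instruct-fp8-quark-vllm
```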
|
|
|
### Running the fp8 model
|
|
|
```sh |
|
# single GPU |
|
python run_vllm_fp8.py |
|
|
|
# 8 GPUs |
|
torchrun --standalone --nproc_per_node=8 run_vllm_fp8.py |
|
``` |
|
|
|
```python
# run_vllm_fp8.py
from vllm import LLM, SamplingParams

prompt = "Write me an essay about a bear and a knight"

# Local snapshot directory prepared in the Model Preparation steps; the commit hash may differ.
model_name = "models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/8c22764a7e3675c50d4c7c9a4edb474456022b16/"

tp = 1   # single GPU
# tp = 8 # 8 GPUs

# Load the fp8 weights (llama.safetensors) that were copied into the snapshot directory.
model = LLM(model=model_name, tensor_parallel_size=tp, max_model_len=8192, trust_remote_code=True, dtype="float16", quantization="fp8", quantized_weights_path="/llama.safetensors")
sampling_params = SamplingParams(
    top_k=1,  # greedy-style decoding; top_k is an integer
    ignore_eos=True,
    max_tokens=200,
)
result = model.generate(prompt, sampling_params=sampling_params)
# result is a list of RequestOutput objects; print the generated text for the prompt
print(result[0].outputs[0].text)
```
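
Since this is an instruct-tuned model, you may get better completions by wrapping the request in the Llama 3.1 chat template instead of passing the raw string. A sketch that continues from `run_vllm_fp8.py` above and reuses its `model_name`, `model`, and `sampling_params`; `apply_chat_template` is standard `transformers` API, not something specific to this repository:

```python
# Continues from run_vllm_fp8.py: format the request with the Llama 3.1 chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name)
chat = [{"role": "user", "content": "Write me an essay about a bear and a knight"}]
chat_prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

result = model.generate(chat_prompt, sampling_params=sampling_params)
print(result[0].outputs[0].text)
```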
|
### Running the fp16 model (for comparison)
|
|
|
```sh |
|
# single GPU |
|
python run_vllm_fp16.py |
|
|
|
# 8 GPUs |
|
torchrun --standalone --nproc_per_node=8 run_vllm_fp16.py
|
``` |
|
|
|
```python
# run_vllm_fp16.py
from vllm import LLM, SamplingParams

prompt = "Write me an essay about a bear and a knight"

# Local snapshot directory of the unquantized baseline model; the commit hash may differ.
model_name = "models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/8c22764a7e3675c50d4c7c9a4edb474456022b16/"

tp = 1   # single GPU
# tp = 8 # 8 GPUs

model = LLM(model=model_name, tensor_parallel_size=tp, max_model_len=8192, trust_remote_code=True, dtype="bfloat16")
sampling_params = SamplingParams(
    top_k=1,  # greedy-style decoding; top_k is an integer
    ignore_eos=True,
    max_tokens=200,
)
result = model.generate(prompt, sampling_params=sampling_params)
# result is a list of RequestOutput objects; print the generated text for the prompt
print(result[0].outputs[0].text)
```
|
## fp8 GEMM tuning

This section will be updated soon.
|
|
|
#### License |
|
Copyright (c) 2018-2024 Advanced Micro Devices, Inc. All Rights Reserved. |
|
|
|
Licensed under the Apache License, Version 2.0 (the "License"); |
|
you may not use this file except in compliance with the License. |
|
You may obtain a copy of the License at |
|
|
|
http://www.apache.org/licenses/LICENSE-2.0 |
|
|
|
Unless required by applicable law or agreed to in writing, software |
|
distributed under the License is distributed on an "AS IS" BASIS, |
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
|
See the License for the specific language governing permissions and |
|
limitations under the License. |