Syed-Hasan-8503/PaluLlama-3-8B-Instruct


	---
	license: apache-2.0
	language:
	- en
	library_name: transformers
	---

	# Compressed Meta Llama-3-8B-Instruct with Palu

	## Overview
	This repository contains a compressed version of the Meta Llama-3-8B-Instruct model that uses the Palu framework for KV-Cache compression. Palu applies low-rank decomposition to reduce the hidden dimension of the KV-Cache, which shrinks the memory footprint at inference time while aiming to preserve the base model's accuracy.
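
To make the low-rank idea concrete, the toy sketch below shows why caching a small latent is sufficient when a projection factors as `W = A @ B`. It is plain Python for illustration only and is not Palu's implementation: the function name `low_rank_cache_demo`, the matrix sizes, and the exactly low-rank `W` are assumptions chosen for clarity. With trained weights the factorization is only approximate, so some reconstruction error is expected.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def random_matrix(rng, rows, cols):
    """Matrix with entries drawn uniformly from [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def low_rank_cache_demo(d_in=8, d_out=8, rank=2, tokens=4, seed=0):
    """Compare a full projection cache with a rank-`rank` latent cache.

    Hypothetical toy setup: W is built as A @ B, so it is exactly low rank.
    """
    rng = random.Random(seed)
    A = random_matrix(rng, d_in, rank)   # down-projection applied before caching
    B = random_matrix(rng, rank, d_out)  # reconstruction applied when the cache is read
    W = matmul(A, B)                     # equivalent full projection
    X = random_matrix(rng, tokens, d_in)

    full = matmul(X, W)        # what an uncompressed cache would store (tokens x d_out)
    latent = matmul(X, A)      # what a low-rank cache stores (tokens x rank)
    rebuilt = matmul(latent, B)

    max_err = max(
        abs(f - r)
        for full_row, rebuilt_row in zip(full, rebuilt)
        for f, r in zip(full_row, rebuilt_row)
    )
    return len(full[0]), len(latent[0]), max_err


full_dim, latent_dim, max_err = low_rank_cache_demo()
print(f"cached width: {full_dim} -> {latent_dim}, max reconstruction error: {max_err:.2e}")
```

In this toy setting each cached token shrinks from 8 values to 2, and the reconstruction differs from the full projection only by floating-point rounding.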

	## Key Features
	- Model: Meta Llama-3-8B-Instruct
	- Compression Framework: Palu
	- Compression Rate: Up to 91.25% KV-Cache memory reduction
	- Accuracy: Maintained or improved perplexity compared to the base model
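
As a rough sense of scale, the following back-of-the-envelope calculation estimates the KV-Cache size for a long prompt and what the upper-bound 91.25% reduction would leave. The model dimensions (32 layers, 8 key/value heads, head dimension 128) and the 16-bit cache are assumptions taken from the publicly available Llama-3-8B configuration rather than from this card, and `kv_cache_bytes` is a hypothetical helper, not part of Palu.

```python
def kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128, bytes_per_value=2):
    """Bytes needed to cache keys and values for `tokens` positions.

    Defaults are assumed Llama-3-8B dimensions with a 16-bit cache.
    """
    keys_and_values = 2  # one tensor for K, one for V
    return keys_and_values * layers * kv_heads * head_dim * bytes_per_value * tokens


GIB = 1024 ** 3
baseline = kv_cache_bytes(65536)        # 65536-token prompt
compressed = baseline * (1 - 0.9125)    # upper-bound reduction quoted above

print(f"baseline:   {baseline / GIB:.2f} GiB")    # baseline:   8.00 GiB
print(f"compressed: {compressed / GIB:.2f} GiB")  # compressed: 0.70 GiB
```

Under these assumptions the cache costs 128 KiB per token, so the savings grow linearly with context length.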

	## Installation

	### Clone the Repository
	Ensure you have Git and Conda installed on your system.
	```bash
	git clone --recurse-submodules https://github.com/shadowpa0327/Palu.git
	cd Palu
	```

	### Set Up the Environment
	Create and activate a Conda environment.
	```bash
	conda create -n Palu python=3.10
	conda activate Palu
	pip install -r requirements.txt
	```

	### Install Third-Party Libraries
	```bash
	pip install -e 3rdparty/lm-evaluation-harness
	pip install -e 3rdparty/fast-hadamard-transform
	```

	## Usage

	### Compress the Model
	To compress Meta Llama-3-8B-Instruct using Palu's low-rank decomposition, use the following command:

	```bash
	python compress.py \
	--model_id="meta-llama/Meta-Llama-3-8B-Instruct" \
	--calib_dataset wikitext2 \
	--param_ratio_target 0.7 \
	--search_method fisher_uniform \
	--head_group_size 4 \
	--dump_huggingface_model \
	--use_cache
	```

	The compressed model is saved in Hugging Face format in the `Meta-Llama-3-8B-Instruct_ratio-0.7_gs-4-fisher_uniform` directory.

	### Evaluate the Compressed Model

	#### Perplexity
	To evaluate the perplexity on the `wikitext2` dataset with sequence length 2048, run:

	```bash
	python run_ppl_eval.py \
	--model_name_or_path /Path/To/Palu/Model \
	--datasets wikitext2 \
	--seqlen 2048
	```
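
For readers unfamiliar with the metric, the sketch below shows the standard definition of perplexity as the exponential of the mean negative log-likelihood per token. It is not the code inside `run_ppl_eval.py`; the `perplexity` function and its input format (natural-log token probabilities) are assumptions made for illustration.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood per token).

    `token_log_probs` holds the natural-log probability the model
    assigned to each reference token.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every token behaves like a
# uniform guess over four options, so its perplexity is about 4.
print(perplexity([math.log(0.25)] * 4))
```

Lower values are better, so a compressed model is usually judged by how little its perplexity rises relative to the uncompressed baseline.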

	To evaluate with 3-bit low-rank-aware quantization (this example uses sequence length 4096 and enables the Hadamard transform via `--lt_hadamard`), use:
	```bash
	python run_ppl_eval.py \
	--model_name_or_path /Path/To/Palu/Model \
	--datasets wikitext2 \
	--seqlen 4096 \
	--lt_bits 3 \
	--lt_hadamard
	```
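
To illustrate what a 3-bit budget means, the sketch below applies generic uniform min/max quantization, where each value is stored as one of 2^3 = 8 integer codes. This is a deliberately simplified stand-in and not Palu's low-rank-aware scheme, which also involves the Hadamard transform; the `quantize` and `dequantize` helpers are hypothetical.

```python
def quantize(values, bits=3):
    """Map floats to integer codes in [0, 2**bits - 1] using min/max scaling."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Clamp guards against rounding pushing a code just outside the valid range.
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize(codes, scale, lo):
    """Recover approximate floats from integer codes."""
    return [c * scale + lo for c in codes]


values = [-1.0, -0.5, 0.0, 0.3, 1.0]
codes, scale, lo = quantize(values, bits=3)
restored = dequantize(codes, scale, lo)
worst = max(abs(v - r) for v, r in zip(values, restored))
print(codes, f"worst-case error: {worst:.3f} (step size {scale:.3f})")
```

With this scheme the rounding error per value is bounded by half the step size, which is why fewer bits trade memory for accuracy.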

	#### Zero-shot Evaluation
	For zero-shot evaluations, use the following command:
	```bash
	CUDA_VISIBLE_DEVICES=0 python run_lm_eval.py \
	--model_name_or_path "/Path/To/Palu/Model" \
	--tasks "openbookqa,hellaswag,piqa,arc_easy,arc_challenge,winogrande"
	```

	#### LongBench Evaluation
	Evaluate the compressed model on LongBench tasks:
	```bash
	CUDA_VISIBLE_DEVICES=0 python run_long_bench.py \
	--model_name_or_path /Path/To/Palu/Model
	```

	## Latency Evaluation

	### Attention Module
	Evaluate the latency of the Palu-compressed attention module:
	```bash
	CUDA_VISIBLE_DEVICES=0 python run_latency_attention.py \
	--rank_k 1024 --rank_v 3072 --group_size 4 \
	--prompt_len 65536 --palu
	```

	### Reconstruction Kernel
	Evaluate the latency of the reconstruction kernel:
	```bash
	CUDA_VISIBLE_DEVICES=0 python run_latency_kernel.py \
	--total_rank 1024 --group_size 4
	```

	## Conclusion
	This Palu-compressed version of Meta Llama-3-8B-Instruct targets KV-Cache memory efficiency. It is intended for long-context inference and for deployment in memory-constrained environments; use the evaluation commands above to confirm accuracy and latency for your own workload.