Syed-Hasan-8503
/

PaluLlama-3-8B-Instruct

+---
+license: apache-2.0
+language:
+- en
+library_name: transformers
+---
+# Compressed Meta Llama-3-8B-Instruct with Palu
+## Overview
+This repository contains a compressed version of the Meta Llama-3-8B-Instruct model, built with the Palu framework for KV-Cache compression. Palu applies low-rank decomposition to shrink the hidden dimension of the cached keys and values, which reduces the KV-Cache memory footprint while aiming to preserve the base model's accuracy.
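The idea of caching a low-rank latent instead of full keys can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not Palu's implementation: the helper names (`matmul`, `rand_matrix`), the toy sizes, and the premise that the key projection factors exactly as `A @ B` are all hypothetical. How Palu actually obtains the factors is outside this sketch.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rand_matrix(rows, cols, rng):
    """Random matrix with entries in [-1, 1] (toy data only)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, d_kv, rank, n_tokens = 16, 8, 2, 5  # toy sizes, not Llama-3 dimensions

# Assumption: the key projection W_k (d_model x d_kv) factors exactly as
# A (d_model x rank) @ B (rank x d_kv).
a = rand_matrix(d_model, rank, rng)
b = rand_matrix(rank, d_kv, rng)
w_k = matmul(a, b)

x = rand_matrix(n_tokens, d_model, rng)  # hidden states for n_tokens tokens
latent = matmul(x, a)                    # (n_tokens x rank): what would be cached
keys = matmul(latent, b)                 # (n_tokens x d_kv): rebuilt when attention needs it
full = matmul(x, w_k)                    # keys from the uncompressed projection

max_err = max(abs(p - q) for rk, rf in zip(keys, full) for p, q in zip(rk, rf))
print(f"cached values per token: {len(latent[0])} instead of {len(full[0])}")
print(f"max reconstruction error: {max_err:.2e}")
```

In this toy setup the cache holds `rank` values per token instead of `d_kv`, and the reconstructed keys match the uncompressed ones up to floating-point rounding because the factorization is exact by construction.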
+## Key Features
+- **Model**: Meta Llama-3-8B-Instruct
+- **Compression Framework**: Palu
+- **Compression Rate**: Up to 91.25% KV-Cache memory reduction
+- **Accuracy**: Perplexity intended to stay on par with the base model; measure it with the perplexity commands under Usage
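To put the compression rate in concrete terms, the arithmetic below estimates KV-Cache size per token. It is a rough sketch: the function names are hypothetical, the layer and head counts are the published Llama-3-8B configuration values, and the pairing of a 0.7 rank ratio with 2-bit storage is only one illustrative combination that yields 91.25%. Whether that combination is the setting behind the figure above is an assumption.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bits):
    """Bytes needed to store keys and values for one token across all layers."""
    values_per_token = 2 * num_layers * num_kv_heads * head_dim  # 2 = keys + values
    return values_per_token * bits / 8


def memory_reduction(rank_ratio, quant_bits, base_bits=16):
    """Fraction of KV-Cache memory saved when keeping `rank_ratio` of the hidden
    dimension and storing each value in `quant_bits` instead of `base_bits`."""
    return 1.0 - rank_ratio * quant_bits / base_bits


# Llama-3-8B: 32 layers, 8 key/value heads, head dimension 128, FP16 baseline.
baseline = kv_cache_bytes_per_token(32, 8, 128, bits=16)
print(f"FP16 baseline: {baseline / 1024:.0f} KiB per token")

# Illustrative assumption: keep 70% of the rank and store values in 2 bits.
reduction = memory_reduction(rank_ratio=0.7, quant_bits=2)
print(f"memory reduction: {reduction:.2%}")
print(f"compressed size: {baseline * (1 - reduction) / 1024:.1f} KiB per token")
```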
+## Installation
+### Clone the Repository
+Ensure you have Git and Conda installed on your system.
+```bash
+git clone --recurse-submodules https://github.com/shadowpa0327/Palu.git
+cd Palu
+```
+### Set Up the Environment
+Create and activate a Conda environment.
+```bash
+conda create -n Palu python=3.10
+conda activate Palu
+pip install -r requirements.txt
+```
+### Install Third-Party Libraries
+```bash
+pip install -e 3rdparty/lm-evaluation-harness
+pip install -e 3rdparty/fast-hadamard-transform
+```
+## Usage
+### Compress the Model
+To compress Meta Llama-3-8B-Instruct using Palu's low-rank decomposition, use the following command:
+```bash
+python compress.py \
+--model_id="meta-llama/Meta-Llama-3-8B-Instruct" \
+--calib_dataset wikitext2 \
+--param_ratio_target 0.7 \
+--search_method fisher_uniform \
+--head_group_size 4 \
+--dump_huggingface_model \
+--use_cache
+```
+The compressed model will be saved in the `Meta-Llama-3-8B-Instruct_ratio-0.7_gs-4-fisher_uniform` directory in Hugging Face format.
+### Evaluate the Compressed Model
+#### Perplexity
+To evaluate the perplexity on the `wikitext2` dataset with sequence length 2048, run:
+```bash
+python run_ppl_eval.py \
+--model_name_or_path /Path/To/Palu/Model \
+--datasets wikitext2 \
+--seqlen 2048
+```
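For readers unfamiliar with the metric, the sketch below shows how perplexity follows from per-token log-probabilities evaluated in fixed-length windows. It is not the code in `run_ppl_eval.py`: the `perplexity` helper is hypothetical, and dropping trailing tokens that do not fill a whole window is an assumption about how such evaluations are commonly set up.

```python
import math


def perplexity(token_log_probs, seqlen):
    """Perplexity over non-overlapping windows of `seqlen` tokens.

    token_log_probs: natural-log probability the model assigned to each target token.
    Trailing tokens that do not fill a whole window are dropped (an assumption).
    """
    n_windows = len(token_log_probs) // seqlen
    if n_windows == 0:
        raise ValueError("need at least `seqlen` tokens")
    used = token_log_probs[: n_windows * seqlen]
    mean_nll = -sum(used) / len(used)  # average negative log-likelihood
    return math.exp(mean_nll)


# Sanity check: a model that spreads probability uniformly over a 50-token
# vocabulary should score a perplexity of about 50.
vocab_size = 50
log_probs = [math.log(1.0 / vocab_size)] * 4096
ppl = perplexity(log_probs, seqlen=2048)
print(f"perplexity: {ppl:.4f}")
```

Lower perplexity means the model assigns higher probability to the reference text, so comparing the compressed and base models at the same `--seqlen` gives a like-for-like accuracy check.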
+To evaluate with 3-bit low-rank-aware quantization and the Hadamard transform enabled (`--lt_hadamard`), here at sequence length 4096, use:
+```bash
+python run_ppl_eval.py \
+--model_name_or_path /Path/To/Palu/Model \
+--datasets wikitext2 \
+--seqlen 4096 \
+--lt_bits 3 \
+--lt_hadamard
+```
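The general rotate-then-quantize pattern behind flags like `--lt_bits` and `--lt_hadamard` can be illustrated as follows. This is a generic sketch, not Palu's kernel: every function name here is hypothetical, the per-vector asymmetric quantizer is an assumption, and the toy latent vector is made up for the example.

```python
import math


def hadamard(n):
    """Orthonormal n x n Hadamard matrix (Sylvester construction); n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1.0]]
    while len(h) < n:
        # H_2n = [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in row] for row in h]


def rotate(vec, h):
    """Multiply a row vector by the matrix h."""
    return [sum(v * w for v, w in zip(vec, col)) for col in zip(*h)]


def quantize(vec, bits):
    """Asymmetric uniform quantization to integer codes in [0, 2**bits - 1]."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / (2 ** bits - 1) or 1.0  # fall back to 1.0 for constant vectors
    return [round((v - lo) / scale) for v in vec], scale, lo


def dequantize(codes, scale, lo):
    """Map integer codes back to approximate real values."""
    return [c * scale + lo for c in codes]


latent = [8.0, 0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1]  # toy latent with one outlier
h = hadamard(len(latent))

codes, scale, lo = quantize(rotate(latent, h), bits=3)
# The normalized Sylvester matrix is symmetric and orthonormal, so it is its own inverse.
restored = rotate(dequantize(codes, scale, lo), h)

print("3-bit codes:", codes)
print("restored:", [round(v, 2) for v in restored])
```

The rotation spreads the outlier's energy across all coordinates before the values are mapped onto the eight available levels, and applying the same matrix again undoes the rotation after dequantization.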
+#### Zero-shot Evaluation
+For zero-shot evaluations, use the following command:
+```bash
+CUDA_VISIBLE_DEVICES=0 python run_lm_eval.py \
+--model_name_or_path "/Path/To/Palu/Model" \
+--tasks "openbookqa,hellaswag,piqa,arc_easy,arc_challenge,winogrande"
+```
+#### LongBench Evaluation
+Evaluate the compressed model on LongBench tasks:
+```bash
+CUDA_VISIBLE_DEVICES=0 python run_long_bench.py \
+--model_name_or_path /Path/To/Palu/Model
+```
+## Latency Evaluation
+### Attention Module
+Evaluate the latency of the Palu-compressed attention module:
+```bash
+CUDA_VISIBLE_DEVICES=0 python run_latency_attention.py \
+--rank_k 1024 --rank_v 3072 --group_size 4 \
+--prompt_len 65536 --palu
+```
+### Reconstruction Kernel
+Evaluate the latency of the reconstruction kernel:
+```bash
+CUDA_VISIBLE_DEVICES=0 python run_latency_kernel.py \
+--total_rank 1024  --group_size 4
+```
+## Conclusion
+This Palu-compressed version of Meta Llama-3-8B-Instruct targets a smaller KV-Cache memory footprint, which matters most for long-context inference and memory-constrained deployments. Use the evaluation commands above to confirm accuracy and latency for your own workload.