---
license: apache-2.0
language:
- en
library_name: transformers
---
# Compressed Meta Llama-3-8B-Instruct with Palu
## Overview
This repository contains a compressed version of the Meta Llama-3-8B-Instruct model, built with the Palu framework for KV-Cache compression. Palu shrinks the hidden dimensions of the KV-Cache through low-rank decomposition, substantially reducing the model's memory footprint while largely preserving downstream accuracy.
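To make the idea concrete, the minimal sketch below (illustrative only, not Palu's actual implementation; all names and shapes are hypothetical) factorizes a value projection with a truncated SVD so that the cache stores low-rank latent states and full-dimensional values are reconstructed on demand:
```python
import torch

def low_rank_factorize(W: torch.Tensor, rank: int):
    """Factorize a (d_out, d_in) projection W into A @ B with A: (d_out, rank), B: (rank, d_in)."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]          # (d_out, rank)
    B = Vh[:rank, :]                    # (rank, d_in)
    return A, B

# Toy example: a value projection compressed to a quarter of its hidden dimension.
d_model, rank = 1024, 256
W_v = torch.randn(d_model, d_model) / d_model**0.5
A, B = low_rank_factorize(W_v, rank)

x = torch.randn(8, d_model)             # 8 token hidden states
latent = x @ B.T                        # (8, rank)    -> what the compressed KV-Cache stores
v_reconstructed = latent @ A.T          # (8, d_model) -> reconstructed on demand
v_full = x @ W_v.T

print("cache size ratio:", rank / d_model)
print("relative reconstruction error:", ((v_full - v_reconstructed).norm() / v_full.norm()).item())
```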
## Palu Compression Results
## Perplexity (PPL)
| Model | PPL |
|----------------------------------------|-----------------|
| **meta-llama-3-8b-instruct-palu** | **8.8309** |
| **meta-llama-3-8b-instruct (Base)** | **8.2845** |
## Zero-shot Evaluation
### meta-llama-3-8b-instruct-palu
| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-----------------|---------|--------|--------|---------|--------|---------|
| winogrande | 1 | none | 0 | acc | 0.7277 | ±0.0125 |
| arc_challenge | 1 | none | 0 | acc | 0.4949 | ±0.0146 |
| | | | 0 | acc_norm| 0.5427 | ±0.0146 |
| arc_easy | 1 | none | 0 | acc | 0.7942 | ±0.0083 |
| | | | 0 | acc_norm| 0.7551 | ±0.0088 |
| piqa | 1 | none | 0 | acc | 0.7655 | ±0.0099 |
| | | | 0 | acc_norm| 0.7644 | ±0.0099 |
| hellaswag | 1 | none | 0 | acc | 0.5664 | ±0.0049 |
| | | | 0 | acc_norm| 0.7511 | ±0.0043 |
| openbookqa | 1 | none | 0 | acc | 0.3360 | ±0.0211 |
| | | | 0 | acc_norm| 0.4380 | ±0.0222 |
### meta-llama-3-8b-instruct (Base)
| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-----------------|---------|--------|--------|---------|--------|---------|
| winogrande | 1 | none | 0 | acc | 0.7206 | ±0.0126 |
| arc_challenge | 1 | none | 0 | acc | 0.5299 | ±0.0146 |
| | | | 0 | acc_norm| 0.5683 | ±0.0145 |
| arc_easy | 1 | none | 0 | acc | 0.8161 | ±0.0079 |
| | | | 0 | acc_norm| 0.7976 | ±0.0082 |
| piqa | 1 | none | 0 | acc | 0.7867 | ±0.0096 |
| | | | 0 | acc_norm| 0.7856 | ±0.0096 |
| hellaswag | 1 | none | 0 | acc | 0.5769 | ±0.0049 |
| | | | 0 | acc_norm| 0.7581 | ±0.0043 |
| openbookqa | 1 | none | 0 | acc | 0.3420 | ±0.0212 |
| | | | 0 | acc_norm| 0.4320 | ±0.0222 |
## Long-Bench Evaluation
### triviaqa
| Model | Score |
|----------------------------------------|--------|
| **meta-llama-3-8b-instruct-palu** | 89.45 |
| **meta-llama-3-8b-instruct (Base)** | 90.56 |
### qasper
| Model | Score |
|----------------------------------------|--------|
| **meta-llama-3-8b-instruct-palu** | 34.92 |
| **meta-llama-3-8b-instruct (Base)** | 31.74 |
---
## Key Features
- **Model**: Meta Llama-3-8B-Instruct
- **Compression Framework**: Palu
- **Compression Rate**: Up to 91.25% KV-Cache memory reduction
- **Accuracy**: Near-baseline quality; perplexity increases only slightly (8.83 vs. 8.28), and zero-shot and LongBench scores remain comparable to the base model
## Installation
### Clone the Repository
Ensure you have Git and Conda installed on your system.
```bash
git clone --recurse-submodules https://github.com/shadowpa0327/Palu.git
cd Palu
```
### Set Up the Environment
Create and activate a Conda environment.
```bash
conda create -n Palu python=3.10
conda activate Palu
pip install -r requirements.txt
```
### Install Third-Party Libraries
```bash
pip install -e 3rdparty/lm-evaluation-harness
pip install -e 3rdparty/fast-hadamard-transform
```
## Usage
### Compress the Model
To compress Meta Llama-3-8B-Instruct using Palu's low-rank decomposition, use the following command:
```bash
python compress.py \
--model_id="meta-llama/Meta-Llama-3-8B-Instruct" \
--calib_dataset wikitext2 \
--param_ratio_target 0.7 \
--search_method fisher_uniform \
--head_group_size 4 \
--dump_huggingface_model \
--use_cache
```
The compressed model will be saved in the `Meta-Llama-3-8b-instruct_ratio-0.7_gs-4-fisher_uniform` directory in Hugging Face format.
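The dumped directory can then be loaded like any other Hugging Face checkpoint. A minimal sketch (the local path mirrors the directory name above; `trust_remote_code=True` is an assumption, only needed if the dump bundles Palu's custom modeling code):
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "Meta-Llama-3-8b-instruct_ratio-0.7_gs-4-fisher_uniform"  # output of compress.py

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,   # assumption: only needed if Palu's modeling code is bundled
)

prompt = "Explain KV-Cache compression in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```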
### Evaluate the Compressed Model
#### Perplexity
To evaluate the perplexity on the `wikitext2` dataset with sequence length 2048, run:
```bash
python run_ppl_eval.py \
--model_name_or_path /Path/To/Palu/Model \
--datasets wikitext2 \
--seqlen 2048
```
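For reference, the measurement boils down to averaging cross-entropy over fixed-length windows of the wikitext2 test split. A minimal stand-alone sketch (illustrative, not the exact logic of `run_ppl_eval.py`):
```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/Path/To/Palu/Model"   # placeholder
seqlen = 2048

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, device_map="auto")
model.eval()

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids

nlls = []
for i in range(0, ids.size(1) - seqlen, seqlen):
    chunk = ids[:, i : i + seqlen].to(model.device)
    with torch.no_grad():
        # labels=chunk makes the model return the mean cross-entropy over the window
        loss = model(chunk, labels=chunk).loss
    nlls.append(loss.float())

print("wikitext2 ppl:", torch.exp(torch.stack(nlls).mean()).item())
```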
To evaluate with 3-bit low-rank-aware quantization, run:
```bash
python run_ppl_eval.py \
--model_name_or_path /Path/To/Palu/Model \
--datasets wikitext2 \
--seqlen 4096 \
--lt_bits 3 \
--lt_hadamard
```
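The `--lt_bits`/`--lt_hadamard` flags quantize the low-rank latent KV states, with the Hadamard transform helping to flatten outliers before quantization. The sketch below shows only plain symmetric group-wise 3-bit fake-quantization to make the bit-width concrete; it is not Palu's low-rank-aware kernel, and all names are hypothetical:
```python
import torch

def quantize_groupwise(x: torch.Tensor, bits: int = 3, group_size: int = 128):
    """Symmetric per-group fake-quantization (illustrative only, not Palu's kernel)."""
    orig_shape = x.shape
    g = x.reshape(-1, group_size)                       # one scale per group of `group_size` values
    qmax = 2 ** (bits - 1) - 1                          # 3 for 3-bit symmetric quantization
    scale = g.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(g / scale), -qmax - 1, qmax)
    return (q * scale).reshape(orig_shape)              # dequantized values for accuracy evaluation

latent_kv = torch.randn(4, 1024)                        # toy latent KV states (rank 1024)
deq = quantize_groupwise(latent_kv, bits=3)
print("quantization MSE:", torch.mean((latent_kv - deq) ** 2).item())
```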
#### Zero-shot Evaluation
For zero-shot evaluations, use the following command:
```bash
CUDA_VISIBLE_DEVICES=0 python run_lm_eval.py \
--model_name_or_path "/Path/To/Palu/Model" \
--tasks "openbookqa,hellaswag,piqa,arc_easy,arc_challenge,winogrande"
```
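Equivalently, the bundled lm-evaluation-harness can be driven from Python; a sketch assuming the 0.4.x `simple_evaluate` API (argument names may differ across harness versions, and the model path is a placeholder):
```python
import lm_eval  # installed from 3rdparty/lm-evaluation-harness

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=/Path/To/Palu/Model,trust_remote_code=True,dtype=float16",
    tasks=["openbookqa", "hellaswag", "piqa", "arc_easy", "arc_challenge", "winogrande"],
    num_fewshot=0,
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```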
#### Long-Bench Evaluation
Evaluate the compressed model on long-bench tasks:
```bash
CUDA_VISIBLE_DEVICES=0 python run_long_bench.py \
--model_name_or_path /Path/To/Palu/Model
```
## Latency Evaluation
### Attention Module
Evaluate the latency of the Palu-compressed attention module:
```bash
CUDA_VISIBLE_DEVICES=0 python run_latency_attention.py \
--rank_k 1024 --rank_v 3072 --group_size 4 \
--prompt_len 65536 --palu
```
### Reconstruction Kernel
Evaluate the latency of the reconstruction kernel:
```bash
CUDA_VISIBLE_DEVICES=0 python run_latency_kernel.py \
--total_rank 1024 --group_size 4
```
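Both latency scripts boil down to CUDA-event timing of a repeated forward pass. A generic version of that measurement (illustrative; a toy matmul stands in for the reconstruction step):
```python
import torch

def benchmark(fn, warmup: int = 10, iters: int = 100) -> float:
    """Return the average latency of `fn()` in milliseconds using CUDA events."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start, end = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# Toy workload standing in for reconstruction: latent states (rank 1024) mapped back to 4096 dims.
A = torch.randn(4096, 1024, device="cuda", dtype=torch.float16)
latent = torch.randn(4, 1024, device="cuda", dtype=torch.float16)
print(f"avg latency: {benchmark(lambda: latent @ A.T):.4f} ms")
```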
## Conclusion
This compressed version of Meta Llama-3-8B-Instruct, powered by Palu, trades a small amount of accuracy for a much smaller KV-Cache memory footprint. It is well suited to long-context workloads and deployment in memory-constrained environments.