---
license: apache-2.0
language:
- en
library_name: transformers
---

# Compressed Meta Llama-3-8B-Instruct with Palu

## Overview
This repository contains a compressed version of the Meta Llama-3-8B-Instruct model, produced with the Palu framework for KV-Cache compression. Palu reduces the hidden dimension of the KV-Cache through low-rank decomposition, which shrinks the memory footprint at inference time while keeping accuracy close to the base model (see Evaluation Results).
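
The idea can be illustrated with a toy example. The sketch below is not Palu's implementation; it assumes a key projection that already factors exactly as `W = A @ B` (the names `A`, `B`, and `matmul` are illustrative), and shows that caching the rank-`r` latent `x @ A` instead of the full key `x @ W` stores fewer values per token while the key can still be rebuilt on demand.

```python
# Toy illustration of low-rank KV-Cache compression (pure Python, no dependencies).
# Assumption: the projection W (d_in x d_out) factors exactly as A (d_in x r) @ B (r x d_out).

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]    # d_in=4, r=2
B = [[1.0, 2.0, 0.0, -1.0, 3.0, 0.5],
     [0.0, 1.0, 1.0, 2.0, -2.0, 1.0]]                    # r=2, d_out=6
W = matmul(A, B)                                         # full projection, 4 x 6

x = [[0.5, -1.0, 2.0, 1.0]]          # hidden state of one token (1 x d_in)

full_key = matmul(x, W)              # what an uncompressed cache would store (6 values)
latent = matmul(x, A)                # what a low-rank cache stores instead (2 values)
reconstructed = matmul(latent, B)    # key rebuilt from the latent when attention needs it

print(len(full_key[0]), "values per token without compression")
print(len(latent[0]), "values per token with the rank-2 latent")
```

In a real model the weight is not exactly low rank, so the factorization (for example via truncated SVD) is an approximation, which is where the small accuracy differences under Evaluation Results come from.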


## Evaluation Results

## Perplexity (PPL)

| Model                                  | PPL             |
|----------------------------------------|-----------------|
| **meta-llama-3-8b-instruct-palu**       | **8.8309**      |
| **meta-llama-3-8b-instruct (Base)**     | **8.2845**      |
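
As a quick reading aid, the relative perplexity change can be computed directly from the two values in the table above (lower perplexity is better). The variable names are illustrative.

```python
# Relative perplexity change of the Palu model versus the base model.
ppl_palu = 8.8309
ppl_base = 8.2845

relative_increase = (ppl_palu - ppl_base) / ppl_base
print(f"Perplexity increase: {relative_increase:.2%}")
```

This works out to an increase of roughly 6.6% over the base model.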

## Zero-shot Evaluation

### meta-llama-3-8b-instruct-palu

|    Tasks        | Version | Filter | n-shot | Metric  | Value  | Stderr  |
|-----------------|---------|--------|--------|---------|--------|---------|
| winogrande      |      1  | none   |    0   | acc     | 0.7277 | ±0.0125 |
| arc_challenge   |      1  | none   |    0   | acc     | 0.4949 | ±0.0146 |
|                 |         |        |    0   | acc_norm| 0.5427 | ±0.0146 |
| arc_easy        |      1  | none   |    0   | acc     | 0.7942 | ±0.0083 |
|                 |         |        |    0   | acc_norm| 0.7551 | ±0.0088 |
| piqa            |      1  | none   |    0   | acc     | 0.7655 | ±0.0099 |
|                 |         |        |    0   | acc_norm| 0.7644 | ±0.0099 |
| hellaswag       |      1  | none   |    0   | acc     | 0.5664 | ±0.0049 |
|                 |         |        |    0   | acc_norm| 0.7511 | ±0.0043 |
| openbookqa      |      1  | none   |    0   | acc     | 0.3360 | ±0.0211 |
|                 |         |        |    0   | acc_norm| 0.4380 | ±0.0222 |

### meta-llama-3-8b-instruct (Base)

|    Tasks        | Version | Filter | n-shot | Metric  | Value  | Stderr  |
|-----------------|---------|--------|--------|---------|--------|---------|
| winogrande      |      1  | none   |    0   | acc     | 0.7206 | ±0.0126 |
| arc_challenge   |      1  | none   |    0   | acc     | 0.5299 | ±0.0146 |
|                 |         |        |    0   | acc_norm| 0.5683 | ±0.0145 |
| arc_easy        |      1  | none   |    0   | acc     | 0.8161 | ±0.0079 |
|                 |         |        |    0   | acc_norm| 0.7976 | ±0.0082 |
| piqa            |      1  | none   |    0   | acc     | 0.7867 | ±0.0096 |
|                 |         |        |    0   | acc_norm| 0.7856 | ±0.0096 |
| hellaswag       |      1  | none   |    0   | acc     | 0.5769 | ±0.0049 |
|                 |         |        |    0   | acc_norm| 0.7581 | ±0.0043 |
| openbookqa      |      1  | none   |    0   | acc     | 0.3420 | ±0.0212 |
|                 |         |        |    0   | acc_norm| 0.4320 | ±0.0222 |
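
To compare the two tables at a glance, the sketch below computes the per-task difference in `acc` (Palu minus base) from the values reported above. The dictionary names are illustrative, and the differences are not adjusted for the reported standard errors.

```python
# Zero-shot `acc` values copied from the two tables above.
palu = {"winogrande": 0.7277, "arc_challenge": 0.4949, "arc_easy": 0.7942,
        "piqa": 0.7655, "hellaswag": 0.5664, "openbookqa": 0.3360}
base = {"winogrande": 0.7206, "arc_challenge": 0.5299, "arc_easy": 0.8161,
        "piqa": 0.7867, "hellaswag": 0.5769, "openbookqa": 0.3420}

# Positive delta: the compressed model scores higher than the base model.
deltas = {task: palu[task] - base[task] for task in palu}
mean_delta = sum(deltas.values()) / len(deltas)

for task, delta in deltas.items():
    print(f"{task:<14} {delta:+.4f}")
print(f"{'mean':<14} {mean_delta:+.4f}")
```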

## Long-Bench Evaluation

### triviaqa

| Model                                  | Score  |
|----------------------------------------|--------|
| **meta-llama-3-8b-instruct-palu**       | 89.45  |
| **meta-llama-3-8b-instruct (Base)**     | 90.56  |

### qasper

| Model                                  | Score  |
|----------------------------------------|--------|
| **meta-llama-3-8b-instruct-palu**       | 34.92  |
| **meta-llama-3-8b-instruct (Base)**     | 31.74  |



## Key Features
- **Model**: Meta Llama-3-8B-Instruct
- **Compression Framework**: Palu
- **Compression Rate**: Up to 91.25% memory reduction
- **Accuracy**: Perplexity rises from 8.2845 (base) to 8.8309, while zero-shot and Long-Bench scores stay close to the base model (see Evaluation Results)
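
For a sense of scale, the sketch below estimates the FP16 KV-Cache size and applies the 91.25% reduction quoted above. The architecture constants (32 layers, 8 key/value heads, head dimension 128) are taken from the published Llama-3-8B configuration rather than from this repository, and the 8,192-token context is an arbitrary example.

```python
# Back-of-the-envelope KV-Cache size for Llama-3-8B in FP16.
# Assumed architecture constants (from the public model config, not from this repo).
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2          # FP16

def kv_cache_bytes(num_tokens, reduction=0.0):
    """Bytes needed to cache keys and values for `num_tokens` tokens."""
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE  # K and V
    return num_tokens * per_token * (1.0 - reduction)

context = 8192               # example context length (assumption)
baseline = kv_cache_bytes(context)
compressed = kv_cache_bytes(context, reduction=0.9125)
print(f"baseline:   {baseline / 2**20:.1f} MiB")
print(f"compressed: {compressed / 2**20:.1f} MiB")
```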

## Installation

### Clone the Repository
Ensure you have Git and Conda installed on your system.
```bash
git clone --recurse-submodules https://github.com/shadowpa0327/Palu.git
cd Palu
```

### Set Up the Environment
Create and activate a Conda environment.
```bash
conda create -n Palu python=3.10
conda activate Palu
pip install -r requirements.txt
```

### Install Third-Party Libraries
```bash
pip install -e 3rdparty/lm-evaluation-harness
pip install -e 3rdparty/fast-hadamard-transform
```

## Usage

### Compress the Model
To compress Meta Llama-3-8B-Instruct using Palu's low-rank decomposition, use the following command:

```bash
python compress.py \
--model_id="meta-llama/Meta-Llama-3-8B-Instruct" \
--calib_dataset wikitext2 \
--param_ratio_target 0.7 \
--search_method fisher_uniform \
--head_group_size 4 \
--dump_huggingface_model \
--use_cache
```

The compressed model will be saved in the `Meta-Llama-3-8B-Instruct_ratio-0.7_gs-4-fisher_uniform` directory in Hugging Face format.

### Evaluate the Compressed Model

#### Perplexity
To evaluate the perplexity on the `wikitext2` dataset with sequence length 2048, run:

```bash
python run_ppl_eval.py \
--model_name_or_path /Path/To/Palu/Model \
--datasets wikitext2 \
--seqlen 2048
```
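
Perplexity is the exponential of the average negative log-likelihood per token. The following is a minimal sketch of that definition, not the code in `run_ppl_eval.py`; the function name and the example log-probabilities are hypothetical.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities assigned to each target token."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Hypothetical example: a model that gives every target token probability 0.5.
log_probs = [math.log(0.5)] * 4
print(perplexity(log_probs))
```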

To evaluate with 3-bit low-rank aware quantization, use:
```bash
python run_ppl_eval.py \
--model_name_or_path /Path/To/Palu/Model \
--datasets wikitext2 \
--seqlen 4096 \
--lt_bits 3 \
--lt_hadamard 
```
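
Judging by their names, the two extra flags enable low-bit quantization of the low-rank latents together with a Hadamard transform. As rough intuition for why the two are paired, and not a reproduction of Palu's kernels, the sketch below applies a normalized Walsh-Hadamard transform to a vector with one outlier and then a plain uniform 3-bit quantizer. All function names are hypothetical.

```python
import math

def hadamard(vec):
    """Normalized fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    h = 1
    while h < len(out):
        for i in range(0, len(out), 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(len(out))
    return [v * scale for v in out]

def quantize_3bit(vec):
    """Uniform asymmetric quantization to 8 levels, returned in dequantized form."""
    lo, hi = min(vec), max(vec)
    step = (hi - lo) / 7 or 1.0      # 2**3 - 1 intervals; avoid a zero step
    return [lo + round((v - lo) / step) * step for v in vec]

def max_error(a, b):
    """Largest absolute element-wise difference between two vectors."""
    return max(abs(x - y) for x, y in zip(a, b))

x = [8.0, 0.3, -0.2, 0.1]            # one large outlier dominates the range
rotated = hadamard(x)                # energy is spread across all entries

err_plain = max_error(x, quantize_3bit(x))
# Quantize in the rotated basis, then rotate back (the transform is its own inverse).
err_rotated = max_error(x, hadamard(quantize_3bit(rotated)))
print(err_plain, err_rotated)
```

In this toy case the rotated vector has a much narrower range, so the same 8 levels give a smaller reconstruction error.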

#### Zero-shot Evaluation
For zero-shot evaluations, use the following command:
```bash
CUDA_VISIBLE_DEVICES=0 python run_lm_eval.py \
--model_name_or_path "/Path/To/Palu/Model" \
--tasks "openbookqa,hellaswag,piqa,arc_easy,arc_challenge,winogrande"
```

#### Long-Bench Evaluation
Evaluate the compressed model on long-bench tasks:
```bash
CUDA_VISIBLE_DEVICES=0 python run_long_bench.py \
--model_name_or_path /Path/To/Palu/Model
```

## Latency Evaluation

### Attention Module
Evaluate the latency of the Palu-compressed attention module:
```bash
CUDA_VISIBLE_DEVICES=0 python run_latency_attention.py \
--rank_k 1024 --rank_v 3072 --group_size 4 \
--prompt_len 65536 --palu
```

### Reconstruction Kernel
Evaluate the latency of the reconstruction kernel:
```bash
CUDA_VISIBLE_DEVICES=0 python run_latency_kernel.py \
--total_rank 1024 --group_size 4
```

## Conclusion
This compressed version of Meta Llama-3-8B-Instruct trades a small loss in perplexity and zero-shot accuracy for a substantially smaller KV-Cache. It is intended for long-context inference and for deployment in memory-constrained environments.