---
license: apache-2.0
language:
- en
library_name: transformers
---

# Compressed Meta Llama-3-8B-Instruct with Palu

## Overview
This repository contains a compressed version of the Meta Llama-3-8B-Instruct model, utilizing the Palu framework for KV-Cache compression. Palu reduces the hidden dimension of the KV-Cache through low-rank decomposition, significantly reducing the KV-Cache memory footprint while maintaining or even improving perplexity relative to the base model.
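
The core idea can be illustrated with a small, self-contained sketch. This is purely conceptual (it is not Palu's actual code, and the sizes are made up): a key/value projection `W` is factorized into low-rank factors `A` and `B`, only the low-dimensional latent `h @ A` is cached, and the full hidden dimension is reconstructed from it on demand. Palu applies this idea to the key/value projections with grouped heads and fused reconstruction kernels.

```python
# Conceptual sketch of low-rank KV-Cache compression (not Palu's implementation).
# A d x d projection W is approximated as A @ B with rank r << d, so only an
# r-dimensional latent per token needs to be cached instead of d dimensions.
import torch

d, r, seq_len = 4096, 1024, 8                  # hypothetical sizes
W = torch.randn(d, d) / d ** 0.5               # stand-in for a K/V projection weight
h = torch.randn(seq_len, d)                    # hidden states entering the projection

# Truncated SVD yields the factors A (d x r) and B (r x d).
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
A = U[:, :r] * S[:r]
B = Vh[:r, :]

latent = h @ A                                 # what would be stored in the KV-Cache
reconstructed = latent @ B                     # recovered full-dimension states

full = h @ W
print("cached size vs. original:", r / d)
print("relative reconstruction error:", ((full - reconstructed).norm() / full.norm()).item())
```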

## Key Features
- **Model**: Meta Llama-3-8B-Instruct
- **Compression Framework**: Palu
- **Compression Rate**: Up to 91.25% memory reduction
- **Accuracy**: Maintained or improved perplexity compared to the base model

## Installation

### Clone the Repository
Ensure you have Git and Conda installed on your system.
```bash
git clone --recurse-submodules https://github.com/shadowpa0327/Palu.git
cd Palu
```

### Set Up the Environment
Create and activate a Conda environment.
```bash
conda create -n Palu python=3.10
conda activate Palu
pip install -r requirements.txt
```

### Install Third-Party Libraries
```bash
pip install -e 3rdparty/lm-evaluation-harness
pip install -e 3rdparty/fast-hadamard-transform
```

## Usage

### Compress the Model
To compress Meta Llama-3-8B-Instruct using Palu's low-rank decomposition, use the following command:

```bash
python compress.py \
    --model_id="meta-llama/Meta-Llama-3-8B-Instruct" \
    --calib_dataset wikitext2 \
    --param_ratio_target 0.7 \
    --search_method fisher_uniform \
    --head_group_size 4 \
    --dump_huggingface_model \
    --use_cache
```

The compressed model is saved in Hugging Face format under the `Meta-Llama-3-8B-Instruct_ratio-0.7_gs-4-fisher_uniform` directory.
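
If the dumped checkpoint is self-contained, it should load through the standard `transformers` API. The snippet below is a hedged sketch rather than a verified recipe: the local path is the output directory from the command above, and `trust_remote_code=True` is an assumption in case the dump includes custom Palu modeling code. If direct loading fails, the evaluation scripts below are the reference way to run the model.

```python
# Hedged sketch: loading the dumped Hugging Face-format checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "Meta-Llama-3-8B-Instruct_ratio-0.7_gs-4-fisher_uniform"  # output dir from compress.py

tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(
    path,
    torch_dtype="auto",          # keep the dtype stored in the checkpoint
    device_map="auto",           # requires `accelerate`
    trust_remote_code=True,      # assumption: the dump may include custom Palu modules
)

prompt = "Low-rank KV-Cache compression works by"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```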

### Evaluate the Compressed Model

#### Perplexity
To evaluate the perplexity on the `wikitext2` dataset with sequence length 2048, run:

```bash
python run_ppl_eval.py \
    --model_name_or_path /Path/To/Palu/Model \
    --datasets wikitext2 \
    --seqlen 2048
```
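
For reference, the perplexity reported here is the exponentiated average token-level negative log-likelihood over fixed-length windows of the `wikitext2` test split. A minimal sketch of that computation (illustrative only, not `run_ppl_eval.py` itself) looks like this:

```python
# Minimal perplexity sketch over non-overlapping 2048-token windows (illustrative).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

path, seqlen = "/Path/To/Palu/Model", 2048          # path is a placeholder
tok = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype="auto", device_map="auto").eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids

nlls = []
for i in range(0, ids.shape[1] - seqlen, seqlen):   # non-overlapping windows
    chunk = ids[:, i : i + seqlen].to(model.device)
    with torch.no_grad():
        # Passing labels=chunk returns the mean next-token cross-entropy for the window.
        nlls.append(model(chunk, labels=chunk).loss)

print("perplexity:", torch.exp(torch.stack(nlls).mean()).item())
```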

To evaluate with 3-bit low-rank-aware quantization, use:
```bash
python run_ppl_eval.py \
    --model_name_or_path /Path/To/Palu/Model \
    --datasets wikitext2 \
    --seqlen 4096 \
    --lt_bits 3 \
    --lt_hadamard
```
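
Conceptually, low-rank-aware quantization stores the cached low-rank latents in 3 bits, and the `--lt_hadamard` flag suggests a Hadamard rotation is applied first so that outlier channels are spread out before quantization. The toy sketch below illustrates per-group 3-bit quantization with and without such a rotation; it is an assumption-laden illustration, not Palu's kernel.

```python
# Toy sketch: per-group symmetric 3-bit quantization of low-rank latents, optionally
# preceded by an orthonormal Hadamard rotation (conceptual only, not Palu's code).
import torch
from scipy.linalg import hadamard

def fake_quant_3bit(x: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    """Quantize/dequantize to signed 3-bit values per group along the last dimension."""
    g = x.reshape(-1, group_size)
    scale = g.abs().amax(dim=1, keepdim=True) / 3          # map max magnitude to +/-3
    q = torch.clamp(torch.round(g / scale), -4, 3)
    return (q * scale).reshape(x.shape)

rank = 1024
latent = torch.randn(64, rank)
latent[:, ::97] *= 20                                      # inject a few outlier channels

H = torch.tensor(hadamard(rank), dtype=latent.dtype) / rank ** 0.5   # orthonormal rotation

direct = fake_quant_3bit(latent)
rotated = fake_quant_3bit(latent @ H) @ H.T                # quantize in rotated basis, rotate back

# With outlier-heavy latents, the rotated variant typically shows a lower error.
print("error, direct :", (latent - direct).norm().item())
print("error, rotated:", (latent - rotated).norm().item())
```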

#### Zero-shot Evaluation
For zero-shot evaluations, use the following command:
```bash
CUDA_VISIBLE_DEVICES=0 python run_lm_eval.py \
    --model_name_or_path "/Path/To/Palu/Model" \
    --tasks "openbookqa,hellaswag,piqa,arc_easy,arc_challenge,winogrande"
```

#### LongBench Evaluation
Evaluate the compressed model on LongBench tasks:
```bash
CUDA_VISIBLE_DEVICES=0 python run_long_bench.py \
    --model_name_or_path /Path/To/Palu/Model
```

## Latency Evaluation

### Attention Module
Evaluate the latency of the Palu-compressed attention module:
```bash
CUDA_VISIBLE_DEVICES=0 python run_latency_attention.py \
    --rank_k 1024 --rank_v 3072 --group_size 4 \
    --prompt_len 65536 --palu
```

### Reconstruction Kernel
Evaluate the latency of the reconstruction kernel:
```bash
CUDA_VISIBLE_DEVICES=0 python run_latency_kernel.py \
    --total_rank 1024 --group_size 4
```

## Conclusion
This compressed version of Meta Llama-3-8B-Instruct, powered by Palu, is optimized for KV-Cache memory efficiency while keeping accuracy close to, or better than, the base model, making it well suited to long-context workloads and memory-constrained deployments.