Commit 7d2c4e2 (1 parent: ea78f8c), committed by Chat-UniVi

Update README.md

Files changed (1): README.md (+48 -3)
---
license: apache-2.0
---
# MoH: Multi-Head Attention as Mixture-of-Head Attention

**Paper or resources for more information:**
[[Paper]()] [[Code](https://github.com/SkyworkAI/MoH)]

## ⚡ Overview
We propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts within the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages (see the code sketch after this list):
* First, MoH enables each token to select the appropriate attention heads, improving inference efficiency without compromising accuracy or increasing the number of parameters.
* Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility into the attention mechanism and unlocking extra performance potential.
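
The two ideas can be sketched in a few lines of PyTorch. This is a minimal illustration under our own naming (`MoHAttention`, `num_activated_heads`), not the official implementation, which involves further details (e.g. shared always-on heads and a load-balancing loss) that are omitted here:

```python
# Minimal sketch of Mixture-of-Head (MoH) attention in PyTorch. Illustrative only:
# the class and argument names are assumptions, and details of the official
# SkyworkAI/MoH implementation (shared always-on heads, load-balancing loss,
# skipping computation for inactive heads) are omitted for clarity.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoHAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, num_activated_heads: int = 6):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.top_k = num_activated_heads              # heads kept per token (e.g. 75%)
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.router = nn.Linear(dim, num_heads)       # per-token scores over heads

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)          # each: (B, H, N, head_dim)

        # Ordinary scaled dot-product attention, computed per head.
        out = F.scaled_dot_product_attention(q, k, v)  # (B, H, N, head_dim)

        # Routing: every token keeps only its top-k heads; the selected scores are
        # renormalized with a softmax and all other heads are gated to zero.
        scores = self.router(x)                        # (B, N, H)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        gates = torch.zeros_like(scores).scatter(-1, topk_idx, topk_scores.softmax(dim=-1))

        # Weighted summation of head outputs (a real implementation would skip the
        # attention computation for gated-off heads to realize the efficiency gain).
        out = out.permute(0, 2, 1, 3) * gates.unsqueeze(-1)   # (B, N, H, head_dim)
        return self.proj(out.reshape(B, N, C))


if __name__ == "__main__":
    layer = MoHAttention(dim=64, num_heads=8, num_activated_heads=6)
    print(layer(torch.randn(2, 16, 64)).shape)        # torch.Size([2, 16, 64])
```

Because unused heads receive a zero gate they contribute nothing to the output projection, while the softmax over the selected routing scores realizes the weighted summation described above.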

## 😮 Highlights
### 💡 General Framework
We evaluate our proposed MoH across various popular model frameworks, including Vision Transformers (ViT) for image classification, Diffusion models with Transformers (DiT) for class-conditional image generation, and Large Language Models (LLMs) for language tasks. The released checkpoints are listed below, followed by a hypothetical loading example.

<div align="center">

| Code | HuggingFace Model |
|:---:|:---:|
| **[MoH-ViT](https://github.com/SkyworkAI/MoH/tree/main/MoH-ViT)** | 🤗 [MoH-ViT-B-75](https://huggingface.co/Chat-UniVi/MoH-ViT-B-75), [MoH-ViT-B-50](https://huggingface.co/Chat-UniVi/MoH-ViT-B-50), [MoH-ViT-S-80](https://huggingface.co/Chat-UniVi/MoH-ViT-S-80), [MoH-ViT-S-75](https://huggingface.co/Chat-UniVi/MoH-ViT-S-75) |
| **[MoH-DiT](https://github.com/SkyworkAI/MoH/tree/main/MoH-DiT)** | 😊 [MoH-DiT-90](https://huggingface.co/Chat-UniVi/MoH-DiT-XL-90) |
| **[MoH-LLaMA3-8B](https://github.com/SkyworkAI/MoH/tree/main/MoH-LLaMA3)** | 😊 [MoH-LLaMA3-8B](https://huggingface.co/Chat-UniVi/MoH-LLaMA3-8B) |

</div>
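
A hypothetical loading snippet for the LLM checkpoint with Hugging Face Transformers is shown below. Whether the Hub repo ships the custom MoH modeling code needed for `trust_remote_code=True` is an assumption; the ViT and DiT checkpoints are instead loaded through the training code in the GitHub folders linked above.

```python
# Hypothetical usage sketch for the MoH-LLaMA3-8B checkpoint with Hugging Face
# Transformers. Assumes the Hub repo provides custom modeling code, hence
# trust_remote_code=True; adjust if the official model card says otherwise.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Chat-UniVi/MoH-LLaMA3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",      # use the dtype stored in the checkpoint
    device_map="auto",       # requires `accelerate`; places weights on available GPUs
    trust_remote_code=True,
)

prompt = "Mixture-of-Head attention treats each attention head as"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```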

### 🔥 High Performance
Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention while using only **50%-90%** of the attention heads.

### 🤗 Support for Continue-Tuning from Multi-Head Attention Models
We demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be further continue-tuned into our MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% while utilizing only 75% of the attention heads.

The MoH model quickly recovers to over **95%** of the original model's performance within a training budget of 10B tokens, and its performance then improves gradually as more training tokens are used.

## ✏️ Citation
If you find this paper useful, please consider starring 🌟 this repo and citing 📑 our paper:
```bibtex
@article{jin2024moh,
  title={MoH: Multi-Head Attention as Mixture-of-Head Attention},
  author={Peng Jin and Bo Zhu and Li Yuan and Shuicheng Yan},
  year={2024}
}
```