---
license: apache-2.0
tags:
- gqa
- mla
- TransMLA
---
# Llama 3.2 1B MLA - Multi-head Latent Attention Model (Experimental)

This repository contains a version of Llama 3.2 1B converted to use Multi-head Latent Attention (MLA) in place of Grouped-Query Attention (GQA).

## Model Details

- **Base Model**: [Meta-Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B)
- **Attention Mechanism**: Multi-head Latent Attention (MLA)
- **Performance Improvement**: Up to roughly 70% faster inference than the GQA baseline in our tests, with the same KV cache size

## What is MLA?

Multi-head Latent Attention (MLA) is an attention mechanism introduced in the DeepSeek-V2 paper and further explored in the [TransMLA paper](https://arxiv.org/abs/2502.07864). MLA uses low-rank factorization to compress Key (K) and Value (V) representations during attention, significantly reducing the KV cache size while maintaining or even improving model expressivity.

Unlike Grouped-Query Attention (GQA), which simply reduces the number of KV heads, MLA retains the expressivity of a unique K and V representation for each query head by using factorized projection matrices.
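To make this concrete, here is a minimal, self-contained sketch of the idea with toy dimensions and made-up weight names (it is not the converted model's actual implementation): the hidden state is compressed into a small latent, only that latent is cached, and per-head keys and values are reconstructed from it on the fly.

```python
import torch

# Toy dimensions, chosen only for illustration
hidden = 2048      # model hidden size
n_heads = 32       # query heads
head_dim = 64
latent = 512       # size of the compressed KV latent that is actually cached

# Factorized projections: one shared compression, per-head decompression
W_down_kv = torch.randn(latent, hidden)            # compress hidden -> latent
W_up_k = torch.randn(n_heads * head_dim, latent)   # latent -> per-head keys
W_up_v = torch.randn(n_heads * head_dim, latent)   # latent -> per-head values

h = torch.randn(1, hidden)                         # one token's hidden state

# Only c_kv (size `latent` per token) goes into the KV cache, yet every
# query head still gets its own K and V after decompression.
c_kv = h @ W_down_kv.T                             # (1, latent)
k = (c_kv @ W_up_k.T).view(1, n_heads, head_dim)   # unique K per head
v = (c_kv @ W_up_v.T).view(1, n_heads, head_dim)   # unique V per head
```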

## Advantages over GQA

- **Same KV Cache Size**: the converted MLA layers keep the same KV cache footprint as the original GQA layers
- **Greater Expressivity**: each query head gets its own K and V representation (unlike GQA, where heads in a group share them)
- **Better Performance**: significantly faster generation in practice, attributed to better memory utilization
- **No Retraining Required**: Conversion can be performed post-training using SVD

## Implementation Details

The model was converted using SVD (Singular Value Decomposition) to factorize the weight matrices. The process:

1. Decomposes the original K and V matrices into low-rank approximations
2. Creates compression and decompression layers that maintain the same KV cache size as GQA
3. Preserves the original model's knowledge while improving inference efficiency
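
As a rough illustration of step 1, the snippet below factorizes a single projection matrix into a low-rank pair with truncated SVD. The helper and its dimensions are hypothetical, not the actual TransMLA conversion code:

```python
import torch

def low_rank_factorize(weight: torch.Tensor, rank: int):
    """Split a (out_features, in_features) weight into B @ A with inner
    dimension `rank`, using truncated SVD: W ≈ U_r S_r V_r^T."""
    U, S, Vh = torch.linalg.svd(weight.float(), full_matrices=False)
    sqrt_S = torch.diag(S[:rank].sqrt())
    A = sqrt_S @ Vh[:rank, :]      # "compression" half:   (rank, in_features)
    B = U[:, :rank] @ sqrt_S       # "decompression" half: (out_features, rank)
    return A, B

# Toy example: factorize a random projection and check the approximation error
W = torch.randn(512, 2048)
A, B = low_rank_factorize(W, rank=128)
print((B @ A - W).norm() / W.norm())   # relative reconstruction error
```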

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    "BarraHome/llama3.2-1b-mla",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("BarraHome/llama3.2-1b-mla")

# Example prompt. Note: these role markers are plain text rather than Llama 3's
# official chat-template tokens, and the base 1B model is not instruction-tuned.
prompt = """<|begin_of_text|><|system|>
You are a helpful, respectful, and honest assistant.
<|user|>
What is Multi-head Latent Attention (MLA)?
<|assistant|>"""

# Generate response
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)

# Print response
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Performance Benchmarks

When compared to the original Llama 3.2 1B model with GQA, our performance tests show:


![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b6afa756f1af7b46f1b513/ODqCFMR_hNH_EgfV6zsPu.png)

The variation in measured performance likely depends on factors such as GPU utilization, batch size, and system load. In general, the MLA version is at least on par with the GQA version, with significant speed improvements possible under favorable conditions.

Both models maintain the same KV cache memory footprint, while the MLA version offers greater expressivity by giving each query head its own key and value representations.
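
To reproduce this kind of comparison on your own hardware, a simple tokens-per-second measurement along the following lines is sufficient (a rough sketch; it assumes access to the gated meta-llama/Llama-3.2-1B baseline, and results will vary with GPU, batch size, and load):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def tokens_per_second(repo_id: str, prompt: str, new_tokens: int = 200) -> float:
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(
        repo_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    model.generate(**inputs, max_new_tokens=16, do_sample=False)  # warm-up
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False)
    elapsed = time.perf_counter() - start
    return (out.shape[1] - inputs["input_ids"].shape[1]) / elapsed

prompt = "Explain Multi-head Latent Attention in one paragraph."
for repo in ["meta-llama/Llama-3.2-1B", "BarraHome/llama3.2-1b-mla"]:
    print(f"{repo}: {tokens_per_second(repo, prompt):.1f} tokens/s")
```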

## Conversion Method

The conversion from GQA to MLA was performed using the approach described in the [TransMLA: Multi-Head Latent Attention Is All You Need](https://arxiv.org/abs/2502.07864) paper. The key insight is that GQA can always be represented by MLA with the same KV cache overhead, but MLA offers greater expressivity.
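
That insight is easy to check numerically: with a latent dimension equal to the GQA KV width, GQA is exactly an MLA whose decompression matrix just copies each KV head to its group of query heads. The sketch below, with toy dimensions and random weights, is illustrative only:

```python
import torch

hidden, head_dim = 2048, 64
n_q_heads, n_kv_heads = 32, 8
group = n_q_heads // n_kv_heads

W_k_gqa = torch.randn(n_kv_heads * head_dim, hidden)     # GQA key projection
h = torch.randn(1, hidden)

# GQA: per-KV-head keys, each shared by `group` query heads
k_gqa = (h @ W_k_gqa.T).view(1, n_kv_heads, head_dim)
k_gqa = k_gqa.repeat_interleave(group, dim=1)            # (1, n_q_heads, head_dim)

# MLA view: the cached latent *is* the GQA key, and "decompression" is a
# block matrix that copies KV head q // group to query head q
W_up = torch.zeros(n_q_heads * head_dim, n_kv_heads * head_dim)
for q in range(n_q_heads):
    kv = q // group
    W_up[q * head_dim:(q + 1) * head_dim,
         kv * head_dim:(kv + 1) * head_dim] = torch.eye(head_dim)

c = h @ W_k_gqa.T                                        # cached latent, same size as GQA cache
k_mla = (c @ W_up.T).view(1, n_q_heads, head_dim)

print(torch.allclose(k_gqa, k_mla))                      # True
```

The SVD-based conversion goes one step further: instead of merely copying KV heads, the learned up-projection can mix information across heads, which is where the extra expressivity comes from.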

## Citation

If you use this model in your research or projects, please cite:

```bibtex
@misc{ferrer2025llama32mla,
  title={Llama 3.2 1B MLA - Multi-head Latent Attention},
  author={Ferrer, Alberto},
  year={2025},
  howpublished={\url{https://huggingface.co/BarraHome/llama3.2-1b-mla}}
}
```

Also consider citing the underlying TransMLA methodology:

```bibtex
@article{meng2025transmla,
  title={TransMLA: Multi-Head Latent Attention Is All You Need},
  author={Meng, Fanxu and Yao, Zengwei and Zhang, Muhan},
  journal={arXiv preprint arXiv:2502.07864},
  year={2025}
}
```

## License

This model is subject to the same license as the original Meta-Llama-3.2-1B model. Please refer to Meta's licensing terms for usage restrictions.

## Acknowledgements

- Developed by Alberto Ferrer (BarraHome)
- Thanks to the authors of the TransMLA paper for their insights on converting GQA to MLA
- Thanks to DeepSeek AI for the original introduction of MLA in their DeepSeek-V2 model
- Thanks to Meta for releasing the Llama 3.2 models