## Performance

Zamba2-2.7B-Instruct punches dramatically above its weight, achieving extremely strong instruction-following benchmark scores, significantly outperforming the similarly sized Gemma2-2B-Instruct and outperforming Mistral-7B-Instruct on most metrics.

<img src="https://cdn-uploads.huggingface.co/production/uploads/64e40335c0edca443ef8af3e/wXFMLXZA2-xz2PDyUMwTI.png" width="600"/>

Moreover, due to its unique hybrid SSM-attention architecture, Zamba2-2.7B-Instruct achieves extremely low inference latency and rapid generation with a significantly smaller memory footprint than comparable transformer-based models.

<center>
<img src="https://cdn-uploads.huggingface.co/production/uploads/65c05e75c084467acab2f84a/U7VD9PYLj3XcEjgV08sP5.png" width="700" alt="Zamba performance">
</center>

Time to First Token (TTFT) | Output Generation
:-------------------------:|:-------------------------:
![](https://cdn-uploads.huggingface.co/production/uploads/65bc13717c6ad1994b6619e9/BmE8X6tDNVw5OJcbZt8sZ.png) | ![](https://cdn-uploads.huggingface.co/production/uploads/65bc13717c6ad1994b6619e9/wECc9cItK1FW1MOMGSLrp.png)

<center>
<img src="https://cdn-uploads.huggingface.co/production/uploads/65bc13717c6ad1994b6619e9/nhoss41xlzfEBZzcQXI6z.png" width="700" alt="Zamba inference and memory cost">
</center>
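
If you want to check latency numbers like these on your own hardware, a minimal sketch is below. It assumes a CUDA GPU and the environment from the Usage section above; the model id, prompt, token counts, and timing method are illustrative choices, not the methodology behind the plots.

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model id on the Hugging Face Hub; adjust if your copy lives elsewhere.
model_id = "Zyphra/Zamba2-2.7B-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="cuda", torch_dtype=torch.bfloat16
)

inputs = tokenizer(
    "What factors contributed to the fall of Rome?", return_tensors="pt"
).to(model.device)

# Warm up kernels so one-time setup cost doesn't pollute the measurement.
model.generate(**inputs, max_new_tokens=8)

# TTFT: wall-clock time for the prefill plus a single decode step.
torch.cuda.synchronize()
start = time.perf_counter()
model.generate(**inputs, max_new_tokens=1)
torch.cuda.synchronize()
ttft_ms = (time.perf_counter() - start) * 1000

# Throughput: force exactly n_tokens new tokens and divide by elapsed time.
n_tokens = 256
torch.cuda.synchronize()
start = time.perf_counter()
model.generate(**inputs, max_new_tokens=n_tokens, min_new_tokens=n_tokens)
torch.cuda.synchronize()
tokens_per_s = n_tokens / (time.perf_counter() - start)

print(f"TTFT: {ttft_ms:.1f} ms | generation: {tokens_per_s:.1f} tokens/s")
```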

Zamba2-2.7B-Instruct's high performance, strong instruction-following and reasoning capabilities, and small inference compute and memory footprint render it an ideal generalist model for on-device applications.

## Model Details

Zamba2-2.7B-Instruct utilizes and extends our original Zamba hybrid SSM-attention architecture. The core Zamba architecture consists of a backbone of Mamba layers interleaved with one or more shared attention layers (one shared attention block in Zamba1, two in Zamba2). The weights of this attention block are shared to minimize the parameter cost of the model. We find that concatenating the original model embeddings to the input of this attention block improves performance, likely due to better maintenance of information across depth. The Zamba2 architecture also applies LoRA projection matrices to the shared MLP to gain some additional expressivity in each block, allowing each shared block to specialize slightly to its own unique position while keeping the additional parameter overhead small.
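
To make the two ideas above concrete, here is a minimal PyTorch sketch of a shared attention block that takes the concatenated embeddings as input, and a shared MLP with per-position LoRA projections. All module names, shapes, and initialization choices are illustrative assumptions, not the actual Zamba2 implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedAttentionBlock(nn.Module):
    """One attention block whose weights are reused at several depths.

    Its input is the current hidden state concatenated with the original
    token embeddings, so information from the embedding layer is available
    at every depth the block is applied.
    """

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        # Concatenation doubles the width, hence 2 * d_model.
        self.attn = nn.MultiheadAttention(2 * d_model, n_heads, batch_first=True)
        self.out_proj = nn.Linear(2 * d_model, d_model)

    def forward(self, hidden: torch.Tensor, embeddings: torch.Tensor) -> torch.Tensor:
        x = torch.cat([hidden, embeddings], dim=-1)
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        return hidden + self.out_proj(attn_out)  # residual back to d_model


class SharedMLPWithLoRA(nn.Module):
    """A shared MLP plus a cheap low-rank correction per shared position."""

    def __init__(self, d_model: int, d_ff: int, n_positions: int, rank: int = 8):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)    # shared across all positions
        self.down = nn.Linear(d_ff, d_model)  # shared across all positions
        # Per-position low-rank factors; A is zero-initialized so each
        # position starts as the purely shared MLP and specializes in training.
        self.lora_a = nn.Parameter(torch.zeros(n_positions, d_model, rank))
        self.lora_b = nn.Parameter(torch.randn(n_positions, rank, d_ff) * 0.01)

    def forward(self, x: torch.Tensor, position: int) -> torch.Tensor:
        delta = self.lora_a[position] @ self.lora_b[position]  # (d_model, d_ff)
        h = F.gelu(self.up(x) + x @ delta)
        return x + self.down(h)


# The same modules are invoked at multiple depths, e.g.:
#   attn = SharedAttentionBlock(d_model=1024, n_heads=8)
#   mlp = SharedMLPWithLoRA(d_model=1024, d_ff=4096, n_positions=2)
#   hidden = attn(hidden, embeddings)
#   hidden = mlp(hidden, position=0)   # position 1 reuses the same weights
```

The point of the sketch is the parameter accounting: the attention and MLP weights are paid for once no matter how many depths reuse them, while each additional shared position adds only rank × (d_model + d_ff) low-rank parameters.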