jeffra committed
Commit 76ca3c1 · verified
1 Parent(s): eefe7ac

Update README.md

Files changed (1)
  1. README.md +24 -1
README.md CHANGED
@@ -2,4 +2,27 @@
  license: llama3.1
  base_model:
  - meta-llama/Llama-3.1-8B-Instruct
- ---
+ ---
+
+ # SwiftKV
+
+ The Snowflake AI Research team is releasing a series of SwiftKV-optimized Llama-3.1 models. [SwiftKV](https://arxiv.org/abs/2410.03960) is a set of inference optimizations that goes beyond traditional key-value (KV) cache compression. It reduces computational overhead during prompt processing by combining model rewiring with knowledge-preserving self-distillation, allowing prefill tokens to skip up to half of the model's layers. SwiftKV achieves up to 2x improvements in throughput, latency, and cost efficiency with minimal accuracy loss, making LLM deployments more performant and economically viable.
+
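+ The sketch below is a minimal, hypothetical illustration of the prefill-skipping idea, not the released implementation: hidden states for prompt tokens are computed only through the early layers, and the KV cache for every later layer is projected from a single intermediate hidden state. The layer structure, projection names, and skip point are illustrative assumptions, and the self-distillation step that retrains the later KV projections is not shown.
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class ToyLayer(nn.Module):
+     """Stand-in decoder layer: only the pieces needed to show where compute happens."""
+     def __init__(self, d_model: int):
+         super().__init__()
+         self.k_proj = nn.Linear(d_model, d_model)   # KV projections feed the cache
+         self.v_proj = nn.Linear(d_model, d_model)
+         self.mlp = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU())
+
+     def forward(self, h: torch.Tensor) -> torch.Tensor:
+         return h + self.mlp(h)                      # placeholder for attention + MLP
+
+ def swiftkv_style_prefill(layers: nn.ModuleList, h: torch.Tensor, skip_from: int):
+     """Build a per-layer KV cache for the prompt while running only layers < skip_from."""
+     kv_cache = []
+     for i, layer in enumerate(layers):
+         # Every layer still gets KV entries, but layers >= skip_from project them
+         # from the hidden state produced by layer skip_from - 1 instead of running
+         # their own prefill compute (in SwiftKV those projections are distilled
+         # to work with that earlier hidden state).
+         kv_cache.append((layer.k_proj(h), layer.v_proj(h)))
+         if i < skip_from:
+             h = layer(h)
+     return kv_cache
+
+ layers = nn.ModuleList([ToyLayer(64) for _ in range(8)])
+ prompt_hidden = torch.randn(1, 16, 64)              # (batch, prompt_len, d_model)
+ cache = swiftkv_style_prefill(layers, prompt_hidden, skip_from=4)
+ print(len(cache))                                   # KV entries exist for all 8 layers
+ ```
+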
+ For more details about the technique:
+ * Blog: <!-- add link here -->
+ * arXiv paper: https://arxiv.org/abs/2410.03960
+
+ ## Eval metrics
+
+ | Model | Arc-Challenge | MMLU | MMLU-CoT | GSM-8k-CoT |
+ |----------|--------------|--------------|--------------|--------------|
+ | [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | | | | |
+ | [Snowflake/Llama-3.1-SwiftKV-8B-Instruct](https://huggingface.co/Snowflake/Llama-3.1-SwiftKV-8B-Instruct) | | | | |
+ | [meta-llama/Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-405B-Instruct) | | | | |
+ | [Snowflake/Llama-3.1-SwiftKV-405B-Instruct-FP8](https://huggingface.co/Snowflake/Llama-3.1-SwiftKV-405B-Instruct-FP8) | | | | |
+
+
+ ## How to use the models
+
+ Instructions for using vLLM for both evaluation and performance benchmarks are available at:
+ https://github.com/Snowflake-Labs/vllm/tree/swiftkv/examples/swiftkv
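+
+ As a quick start, the snippet below is a minimal sketch of offline generation with vLLM's standard `LLM` API. It assumes a vLLM build from the Snowflake-Labs `swiftkv` branch linked above so the SwiftKV model definition is available; the prompt and sampling settings are illustrative.
+
+ ```python
+ from vllm import LLM, SamplingParams
+
+ # Assumes vLLM is installed from the Snowflake-Labs `swiftkv` branch linked above.
+ llm = LLM(model="Snowflake/Llama-3.1-SwiftKV-8B-Instruct")
+
+ sampling_params = SamplingParams(temperature=0.0, max_tokens=128)
+
+ prompts = ["Explain what a KV cache is in one paragraph."]
+ outputs = llm.generate(prompts, sampling_params)
+
+ for output in outputs:
+     print(output.outputs[0].text)
+ ```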