---
library_name: transformers
datasets:
- cerebras/SlimPajama-627B
language:
- en
---
# LCKV
This is a pretrained model for research purposes, described in the paper "[Layer-Condensed KV Cache for Efficient Inference of Large Language Models](https://arxiv.org/abs/2405.10637)".
## About
Layer-Condensed KV Cache (LCKV) is a variant of the transformer decoder in which the queries of all layers are paired with the keys and values of only the top layer. This reduces memory and computation costs, reduces the number of parameters, and significantly improves inference throughput, with comparable or better task performance. See our GitHub repo for more details: https://github.com/whyNLP/LCKV
## Quick Start
```python
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-generation", model="whynlp/tinyllama-lckv-w10-ft-250b", trust_remote_code=True)
# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("whynlp/tinyllama-lckv-w10-ft-250b", trust_remote_code=True)
```
Sample text generation script:
```python
# This is consistent with the `run_generation.py` script in the github repo: https://github.com/whyNLP/LCKV
import torch
from accelerate.utils import set_seed
from transformers import pipeline
set_seed(42)
pipe = pipeline(
    "text-generation",
    model="whynlp/tinyllama-lckv-w10-ft-250b",
    torch_dtype=torch.bfloat16,
    device="cuda",
    trust_remote_code=True,
    model_kwargs={"attn_implementation": "flash_attention_2"},
)
response = pipe(
    "the meaning of life is",
    add_special_tokens=False,
    max_new_tokens=50,
    temperature=1.0,
    top_k=0,
    top_p=0.9,
    repetition_penalty=1.0,
    do_sample=True,
)
print(response[0]["generated_text"])
# the meaning of life is the whole point: this is it.
# The truth of the matter is that the world does not lie. It is not who we are.
# Therefore, it is more than a religious principle. It is a moral issue of this world and
```
## The LCKV Collection
The model has 10 warmup layers, i.e., it uses half the KV cache of a standard TinyLlama.
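The 1/2 figure can be checked with a rough back-of-the-envelope calculation. The sketch below is illustrative only and rests on assumptions not stated in this card: that a standard TinyLlama has 22 decoder layers with 4 key/value heads of dimension 64, that values are stored in bfloat16 (2 bytes), and that LCKV with `w` warmup layers keeps a KV cache for `w + 1` layers (the warmup layers plus the top layer). The names `CacheConfig` and `kv_bytes_per_token` are hypothetical helpers, not part of the LCKV codebase.

```python
from dataclasses import dataclass


@dataclass
class CacheConfig:
    # Assumed TinyLlama-1.1B shape; verify against the model's config.json.
    num_layers: int = 22
    num_kv_heads: int = 4
    head_dim: int = 64
    bytes_per_value: int = 2  # bfloat16


def kv_bytes_per_token(cached_layers: int, cfg: CacheConfig) -> int:
    """Bytes of KV cache per token: keys + values for each cached layer."""
    return 2 * cached_layers * cfg.num_kv_heads * cfg.head_dim * cfg.bytes_per_value


cfg = CacheConfig()
warmup_layers = 10

# Standard decoder: every layer keeps its own keys and values.
standard = kv_bytes_per_token(cfg.num_layers, cfg)

# Assumption: LCKV caches the warmup layers plus the single top layer.
lckv = kv_bytes_per_token(warmup_layers + 1, cfg)

ratio = lckv / standard
print(f"standard: {standard} B/token, LCKV (w=10): {lckv} B/token, ratio: {ratio:.2f}")
```

Under these assumptions, 11 of 22 layers hold a cache, which gives the 1/2 ratio. The same helper with `warmup_layers = 2` gives a rough estimate for the `w2` models in the table below.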
This model was first initialized from the [TinyLlama 2.5T checkpoint](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T), then continued pre-training on 250B tokens from [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B).
Because the model structure differs from that of TinyLlama, the initialized model does not inherit the performance of the TinyLlama checkpoint. The initialization still speeds up training compared to pre-training from scratch.
The evaluation follows that of TinyLlama. Refer to [our paper](https://arxiv.org/abs/2405.10637) for more details.
| Model | Paper Section | Dev ppl. | Common-sense Reasoning |
| ------------------------------------------------------------------------------------------- | ------------------------------ | -------- | ---------------------- |
| **whynlp/tinyllama-lckv-w10-ft-250b** | -- | 7.939 | 50.86 |
| [whynlp/tinyllama-lckv-w2-ft-100b](https://huggingface.co/whynlp/tinyllama-lckv-w2-ft-100b) | Appendix C.1, Table 7 (line 5) | 8.514 | 49.55 |
| [whynlp/tinyllama-lckv-w10-100b](https://huggingface.co/whynlp/tinyllama-lckv-w10-100b) | Section 3.2, Table 2 (line 3) | 9.265 | 46.84 |
| [whynlp/tinyllama-lckv-w2-100b](https://huggingface.co/whynlp/tinyllama-lckv-w2-100b) | Section 3.2, Table 2 (line 2) | 9.746 | 45.45 |