---
license: llama2
language:
- en
library_name: transformers
---
## llama-2-7b-chat-marlin
An example of converting a GPTQ model to Marlin format for fast batched decoding with the [Marlin kernels](https://github.com/IST-DASLab/marlin).
### Install Marlin
```bash
pip install torch
git clone https://github.com/IST-DASLab/marlin.git
cd marlin
pip install -e .
```
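The Marlin kernels target recent NVIDIA GPUs (Ampere or newer, i.e. compute capability 8.0+), so a quick import and GPU check after installation can save a failed conversion later. This is an optional sanity check, not a step from the original instructions:
```bash
# Optional: confirm the package imports and the GPU is Ampere-class (sm_80 or newer).
python -c "import marlin, torch; print(torch.cuda.get_device_capability())"
```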
### Convert Model
Convert the model from GPTQ to Marlin format. Note that this requires:
- `sym=true`
- `group_size=128`
- `desc_activations=false`
```bash
pip install -U transformers accelerate auto-gptq optimum
```
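Before converting, it can help to confirm that the source GPTQ checkpoint was quantized with the settings listed above. The snippet below is a hypothetical sanity check, not part of this repo; it assumes the model's `config.json` embeds a `quantization_config` dict, in which the GPTQ key corresponding to `desc_activations` is typically named `desc_act`.
```python
# Hypothetical pre-conversion check: verify the GPTQ settings Marlin requires.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("TheBloke/Llama-2-7B-Chat-GPTQ")
q = getattr(cfg, "quantization_config", None)  # dict from config.json, if present

if q is None:
    print("No quantization_config in config.json; inspect quantize_config.json instead")
else:
    assert q.get("sym") is True, "Marlin needs symmetric quantization (sym=true)"
    assert q.get("group_size") == 128, "Marlin needs group_size=128"
    assert q.get("desc_act") is False, "Marlin needs activation reordering disabled"
    print("GPTQ settings look compatible with Marlin conversion")
```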
Convert with the `convert.py` script in this repo:
```bash
python3 convert.py --model-id "TheBloke/Llama-2-7B-Chat-GPTQ" --save-path "./marlin-model" --do-generation
```
### Run Model
Load with the `load.load_model` utility from this repo and run inference as usual.
```python
from load import load_model
from transformers import AutoTokenizer

# Load model from disk.
model_path = "./marlin-model"
model = load_model(model_path).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Generate text.
inputs = tokenizer("My favorite song is", return_tensors="pt")
inputs = {k: v.to("cuda") for k, v in inputs.items()}
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.batch_decode(outputs)[0])
```
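Since Marlin is aimed at fast batched decoding, a batched variant of the same call is a natural next step. The following is a hypothetical sketch building on the objects above; the padding settings (`pad_token`, left padding) are the usual choices for decoder-only Llama models rather than something this repo prescribes.
```python
# Batched decoding sketch (hypothetical usage): feed several prompts at once.
prompts = [
    "My favorite song is",
    "The capital of France is",
    "In machine learning, quantization means",
]
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default
tokenizer.padding_side = "left"            # left-pad for decoder-only generation
batch = tokenizer(prompts, return_tensors="pt", padding=True)
batch = {k: v.to("cuda") for k, v in batch.items()}
outputs = model.generate(**batch, max_new_tokens=50, do_sample=False)
for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)
```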