---
license: mit
language: en
tags:
- LLM
- baichuan
---
## Special Notes

We have released **7 lyraBaichuan models**: lyraBaichuan-7B, lyraBaichuan-13B-Base, lyraBaichuan-13B-Chat, lyraBaichuan2-7B-Base, lyraBaichuan2-7B-Chat, lyraBaichuan2-13B-Base, and lyraBaichuan2-13B-Chat.

These highly optimized Baichuan models run on Ampere (A100/A10) and Volta (V100) GPUs.

If you like our work and are considering joining us, feel free to drop us a line at [email protected].

## Model Card for lyraBaichuan

lyraBaichuan is currently the **fastest implementation of the Baichuan models** (Baichuan-7B, Baichuan-13B, Baichuan2-7B, Baichuan2-13B) available. Its inference speed reaches **4300+ tokens/s** on A100, up to **2.4x** faster than the Torch version.

Its main features are:
- device: Nvidia GPU with Ampere or Volta architecture (A100 or higher, V100).
- batch_size: compiled with dynamic batch size; the maximum depends on the device.
- MEMOPT mode: significantly reduces VRAM usage and increases speed.

We use the Baichuan2-7B-Base and Baichuan2-13B-Base models for measurement, but the optimized inference also applies to other Baichuan models, including Baichuan-7B and Baichuan-13B.

## Speed

* Measured in tokens/s (number of input and output tokens divided by inference time)
* Tested on A100 40G
* MEMOPT mode enabled
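
The metric above can be sketched as a small timing helper. This is only an illustration under stated assumptions, not the benchmark script behind the tables: `measure_throughput`, `generate_fn`, and `count_tokens` are hypothetical names, and the outputs are assumed to contain only newly generated text.

```python
import time
from typing import Callable, List


def measure_throughput(
    generate_fn: Callable[[List[str]], List[str]],
    prompts: List[str],
    count_tokens: Callable[[str], int],
) -> float:
    """Return tokens/s: (input tokens + output tokens) / wall-clock seconds.

    Assumption: generate_fn returns only the generated continuation,
    so input and output tokens are counted separately.
    """
    start = time.perf_counter()
    outputs = generate_fn(prompts)
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        raise RuntimeError("elapsed time too small to measure")

    n_input = sum(count_tokens(p) for p in prompts)
    n_output = sum(count_tokens(o) for o in outputs)
    return (n_input + n_output) / elapsed
```

In practice `generate_fn` would wrap the model's `generate` call and `count_tokens` would use the model's tokenizer; a warm-up run before timing usually gives steadier numbers.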

### Baichuan2-7B-Base

| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
| --- | --- | --- | --- | --- | --- |
| Torch 2.0.1 | 41.2 | 323.2 | 640.0 | 1256.8 | 2231.0 |
| lyraBaichuan | 125.9 | 948.1 | 1749.3 | 2974.0 | 4370.1 |

### Baichuan2-13B-Base

| Version | Batch Size 1 | Batch Size 8 | Batch Size 16 | Batch Size 32 | Batch Size 64 |
| --- | --- | --- | --- | --- | --- |
| Torch 2.0.1 | 40.9 | 307.9 | 555.6 | 1010.4 | 1601.0 |
| lyraBaichuan | 80.0 | 568.2 | 1124.4 | 1942.6 | 2828.0 |
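
The speedup at each batch size follows from dividing the lyraBaichuan row by the Torch row. A minimal sketch of that arithmetic, using the numbers copied from the two tables above (the names `THROUGHPUT` and `speedups` are illustrative only):

```python
BATCH_SIZES = [1, 8, 16, 32, 64]

# Tokens/s copied from the tables above (A100 40G, MEMOPT mode).
THROUGHPUT = {
    "Baichuan2-7B-Base": {
        "torch": [41.2, 323.2, 640.0, 1256.8, 2231.0],
        "lyra": [125.9, 948.1, 1749.3, 2974.0, 4370.1],
    },
    "Baichuan2-13B-Base": {
        "torch": [40.9, 307.9, 555.6, 1010.4, 1601.0],
        "lyra": [80.0, 568.2, 1124.4, 1942.6, 2828.0],
    },
}


def speedups(model: str) -> dict:
    """Map each batch size to lyraBaichuan throughput / Torch throughput."""
    rows = THROUGHPUT[model]
    return {
        bs: lyra / torch
        for bs, torch, lyra in zip(BATCH_SIZES, rows["torch"], rows["lyra"])
    }


if __name__ == "__main__":
    for model in THROUGHPUT:
        for bs, ratio in speedups(model).items():
            print(f"{model} batch_size={bs}: {ratio:.2f}x")
```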

## Docker Environment Recommendation

- For CUDA 11.X: we recommend ```nvcr.io/nvidia/pytorch:22.12-py3```
- For CUDA 12.0: we recommend ```nvcr.io/nvidia/pytorch:23.02-py3```

```bash
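# Pull the recommended image (CUDA 12.0)
docker pull nvcr.io/nvidia/pytorch:23.02-py3
# Mount the current directory at /lyraBaichuan and start the container there
docker run --rm -it --gpus all -v "$(pwd)":/lyraBaichuan -w /lyraBaichuan nvcr.io/nvidia/pytorch:23.02-py3

# Inside the container
pip install -r requirements.txt
python demo.py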
```

## Uses

```python
from lyra_baichuan import lyraBaichuan7B, lyraBaichuan13B

model_path = "./models/Baichuan2-13B-lyra"
tokenizer_path = "./models/Baichuan2-13B-lyra"
inference_dtype = 'fp16'
prompt = "登鹳雀楼->王之涣\n夜雨寄北->"

memopt_mode = 1
max_output_length = 64
arch = "Ampere" # Ampere or Volta
cuda_version = 12 # cuda version, we currently support 11 and 12

# To use a 7B model, initialize with lyraBaichuan7B instead
model = lyraBaichuan13B(model_path, 
                        tokenizer_path = tokenizer_path, 
                        dtype = inference_dtype,
                        memopt_mode = memopt_mode,
                        arch = arch,
                        cuda_version = cuda_version)

bs = 1
prompts = [prompt, ] * bs
output_texts = model.generate(
        prompts, output_length=max_output_length,
        top_k=30, top_p=0.85, temperature=1.0, repetition_penalty=1.0, do_sample=False)

print(output_texts)
```
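
Because `arch` and `cuda_version` are passed as plain values, it can help to check them before loading a model. A minimal sketch, assuming only the values named in the comments of the example above (`"Ampere"` or `"Volta"`, CUDA 11 or 12); `check_runtime_config` is a hypothetical helper and not part of `lyra_baichuan`.

```python
SUPPORTED_ARCHS = ("Ampere", "Volta")
SUPPORTED_CUDA_VERSIONS = (11, 12)


def check_runtime_config(arch: str, cuda_version: int) -> None:
    """Raise ValueError if arch or cuda_version is not a documented value."""
    if arch not in SUPPORTED_ARCHS:
        raise ValueError(
            f"unsupported arch {arch!r}; expected one of {SUPPORTED_ARCHS}"
        )
    if cuda_version not in SUPPORTED_CUDA_VERSIONS:
        raise ValueError(
            f"unsupported cuda_version {cuda_version!r}; "
            f"expected one of {SUPPORTED_CUDA_VERSIONS}"
        )


# Passes silently for the configuration used in the example above.
check_runtime_config("Ampere", 12)
```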

## Demo Outputs

### Baichuan2-13B-Base
#### input

登鹳雀楼->王之涣

夜雨寄北->

#### output

李商隐

望洞庭->刘禹锡

黄鹤楼送孟浩然之广陵->李白

登岳阳楼->杜甫

秋词->刘禹锡

枫桥夜泊->张继

饮湖上初晴后雨->苏轼

浪淘沙->刘禹锡

## TODO
1. Support for int4.
2. Inference with longer contexts.
3. Streaming inference mode.

## Citation
``` bibtex
@Misc{lyraBaichuan2023,
  author =       {Haoxiong Su and Kangjian Wu and Zhengtao Wang and Yibo Lu and Bin Wu},
  title =        {lyraBaichuan: Accelerating Baichuan models to 4300+ tokens/s},
  howpublished = {\url{https://huggingface.co/TMElyralab/lyraBaichuan}},
  year =         {2023}
}
```

## Report bug
- Start a discussion to report any bugs: https://huggingface.co/TMElyralab/lyraBaichuan
- Mark bug reports with `[bug]` in the title.