Commit a44ba12 (parent: 56e8190) · Update README.md

README.md
---
license: apache-2.0
base_model: tiiuae/falcon-7b
language:
- en
tags:
- falcon-7b
- falcon
- onnxruntime
- onnx
- llm
---

# falcon-7b for ONNX Runtime

## Introduction

This repository hosts the optimized version of **falcon-7b** to accelerate inference with the ONNX Runtime CUDA execution provider.

See the [usage instructions](#usage-example) for how to run inference on this model with the ONNX files hosted in this repository.

## Model Description

- **Developed by:** TIIUAE
- **Model type:** Pretrained generative text model
- **License:** Apache 2.0
- **Model Description:** This is a conversion of the [falcon-7b](https://huggingface.co/tiiuae/falcon-7b) model for [ONNX Runtime](https://github.com/microsoft/onnxruntime) inference with the CUDA execution provider.

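To run the usage example below, the ONNX files first have to be available locally. One way to fetch everything in this repository is with `huggingface_hub`; this is a minimal sketch, and the repository id shown is a placeholder for this repo's actual id on the Hub:

```python
from huggingface_hub import snapshot_download

# Download all files from the model repository into the local cache.
# Replace the repo_id placeholder with this repository's actual id.
local_dir = snapshot_download(repo_id="<org>/<falcon-7b-onnx-repo>")
print(local_dir)
```
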
## Performance Comparison

#### Latency for token generation

Below is the average latency of generating a token for prompts of varying length, measured on an NVIDIA A100-SXM4-80GB GPU:

| Prompt Length | Batch Size | PyTorch 2.1 torch.compile | ONNX Runtime CUDA |
|---------------|------------|---------------------------|-------------------|
| 16            | 1          | N/A                       | N/A               |
| 256           | 1          | N/A                       | N/A               |
| 1024          | 1          | N/A                       | N/A               |
| 2048          | 1          | N/A                       | N/A               |
| 16            | 4          | N/A                       | N/A               |
| 256           | 4          | N/A                       | N/A               |
| 1024          | 4          | N/A                       | N/A               |
| 2048          | 4          | N/A                       | N/A               |

## Usage Example

1. Clone the onnxruntime repository.
```shell
git clone https://github.com/microsoft/onnxruntime
cd onnxruntime
```

2. Install the required dependencies.
```shell
python3 -m pip install -r onnxruntime/python/tools/transformers/models/llama/requirements-cuda.txt
```
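Before loading the model, you can sanity-check that the GPU build of ONNX Runtime is installed and that the CUDA execution provider is available (this check is an addition to the original steps):

```python
import onnxruntime as ort

# CUDAExecutionProvider must appear in this list for GPU inference to work;
# if it does not, install the onnxruntime-gpu package matching your CUDA version.
print(ort.get_available_providers())
```
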
3. Inference using the custom model API, or use Hugging Face's `ORTModelForCausalLM`:
```python
from optimum.onnxruntime import ORTModelForCausalLM
from onnxruntime import InferenceSession
from transformers import AutoConfig, AutoTokenizer

# Load the optimized ONNX model with the CUDA execution provider.
sess = InferenceSession("falcon-7b.onnx", providers=["CUDAExecutionProvider"])
config = AutoConfig.from_pretrained("tiiuae/falcon-7b")

# Wrap the session in Optimum's causal LM interface (the class imported above).
model = ORTModelForCausalLM(sess, config, use_cache=True, use_io_binding=True)

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")

inputs = tokenizer("Instruct: What is the Fermi paradox?\nOutput:", return_tensors="pt")

outputs = model.generate(**inputs)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
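As a follow-up, the same ONNX Runtime-backed model can also be used through a standard `transformers` text-generation pipeline. This is a sketch that reuses the `model` and `tokenizer` objects created above; exact compatibility depends on the installed `optimum` and `transformers` versions:

```python
from transformers import pipeline

# Build a text-generation pipeline on top of the ORTModelForCausalLM instance.
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
result = generator("Instruct: What is the Fermi paradox?\nOutput:", max_new_tokens=64)
print(result[0]["generated_text"])
```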