---
license: apache-2.0
base_model: mistralai/Mistral-7B-v0.1
language:
- en
tags:
- mistral
- onnxruntime
- onnx
- llm
---

# Mistral-7b for ONNX Runtime

## Introduction

This repository hosts the optimized versions of **Mistral-7B-v0.1** to accelerate inference with the ONNX Runtime CUDA execution provider.

See the [usage instructions](#usage-example) for how to run inference with the ONNX files hosted in this repository.

## Model Description

- **Developed by:** MistralAI
- **Model type:** Pretrained generative text model
- **License:** Apache 2.0 License
- **Model Description:** This is a conversion of [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) for [ONNX Runtime](https://github.com/microsoft/onnxruntime) inference with the CUDA execution provider.

## Performance Comparison

#### Latency of token generation

Below is the average latency of generating a token for prompts of varying length, measured on an NVIDIA A100-SXM4-80GB GPU:

| Prompt Length | Batch Size | PyTorch 2.1 torch.compile | ONNX Runtime CUDA |
|---------------|------------|---------------------------|-------------------|
| 16            | 1          | N/A                       | N/A               |
| 256           | 1          | N/A                       | N/A               |
| 1024          | 1          | N/A                       | N/A               |
| 2048          | 1          | N/A                       | N/A               |
| 16            | 4          | N/A                       | N/A               |
| 256           | 4          | N/A                       | N/A               |
| 1024          | 4          | N/A                       | N/A               |
| 2048          | 4          | N/A                       | N/A               |

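As a rough illustration of how per-token latency can be measured, the sketch below times `generate` on a synthetic prompt of a given length and batch size. It assumes `model` is an `ORTModelForCausalLM` loaded as in the [usage example](#usage-example) below; the prompt construction and token counts are illustrative, not the exact benchmarking methodology.

```python
import time

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

def average_token_latency(model, prompt_length, batch_size, new_tokens=32):
    # `model` is assumed to be an ORTModelForCausalLM loaded as in the usage example below.
    # Build a synthetic prompt of the requested shape (token content is irrelevant for timing).
    input_ids = torch.full((batch_size, prompt_length), tokenizer.eos_token_id, dtype=torch.long).to(model.device)
    attention_mask = torch.ones_like(input_ids)

    start = time.perf_counter()
    model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_new_tokens=new_tokens,
        min_new_tokens=new_tokens,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Wall-clock seconds per generated token (prefill cost is included).
    return (time.perf_counter() - start) / new_tokens

for prompt_length in (16, 256, 1024, 2048):
    print(prompt_length, average_token_latency(model, prompt_length, batch_size=1))
```
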
## Usage Example

Follow the [benchmarking instructions](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/models/llama/README.md#mistral). Example steps:

1. Clone the onnxruntime repository.
```shell
git clone https://github.com/microsoft/onnxruntime
cd onnxruntime
```

2. Install the required dependencies.
```shell
python3 -m pip install -r onnxruntime/python/tools/transformers/models/llama/requirements-cuda.txt
```

3. Run inference, either through ONNX Runtime's session API directly or through Hugging Face's `ORTModelForCausalLM`:
```python
from optimum.onnxruntime import ORTModelForCausalLM
from onnxruntime import InferenceSession
from transformers import AutoConfig, AutoTokenizer

# Load the ONNX model with the CUDA execution provider.
sess = InferenceSession("Mistral-7B-v0.1.onnx", providers=["CUDAExecutionProvider"])
config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")

# Wrap the session so it can be used like a transformers model.
model = ORTModelForCausalLM(sess, config, use_cache=True, use_io_binding=True)

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

inputs = tokenizer("Instruct: What is the Fermi paradox?\nOutput:", return_tensors="pt")

outputs = model.generate(**inputs)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
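
For the manual path mentioned in step 3, a reasonable first step is to inspect the session's inputs and outputs before wiring up a generation loop yourself. The snippet below is a minimal sketch; the exact input names (including past key/value tensors) depend on how the model was exported.

```python
from onnxruntime import InferenceSession

sess = InferenceSession("Mistral-7B-v0.1.onnx", providers=["CUDAExecutionProvider"])

# A decoder exported with KV cache typically expects input_ids, attention_mask
# and past key/value tensors, but the exact names vary with the export settings.
for inp in sess.get_inputs():
    print("input :", inp.name, inp.shape, inp.type)
for out in sess.get_outputs():
    print("output:", out.name, out.shape, out.type)
```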