petermcaughan committed
Commit 08cd86f · 1 Parent(s): fcff7bc

Update README with usage

Files changed (1):
  1. README.md +76 -0
README.md CHANGED
@@ -1,3 +1,79 @@
  ---
  license: apache-2.0
+ base_model: mistralai/Mistral-7B-v0.1
+ language:
+ - en
+ tags:
+ - mistral
+ - onnxruntime
+ - onnx
+ - llm
  ---
+
+ # Mistral-7b for ONNX Runtime
+
+ ## Introduction
+
+ This repository hosts optimized versions of **Mistral-7B-v0.1** that accelerate inference with the ONNX Runtime CUDA execution provider.
+
+ See the [usage instructions](#usage-example) for how to run Mistral-7B with the ONNX files hosted in this repository.
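+
+ As a quick sanity check, you can confirm that your ONNX Runtime build exposes the CUDA execution provider before loading the model (a minimal sketch; assumes the `onnxruntime-gpu` package is installed):
+ ```python
+ import onnxruntime as ort
+
+ # "CUDAExecutionProvider" should appear in this list for GPU inference.
+ print(ort.get_available_providers())
+ ```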
+
+ ## Model Description
+
+ - **Developed by:** MistralAI
+ - **Model type:** Pretrained generative text model
+ - **License:** Apache 2.0 License
+ - **Model Description:** This is a conversion of the [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) model for [ONNX Runtime](https://github.com/microsoft/onnxruntime) inference with the CUDA execution provider.
+
+ ## Performance Comparison
+
+ #### Per-Token Generation Latency
+
+ Below is the average latency of generating a token for prompts of varying length, measured on an NVIDIA A100-SXM4-80GB GPU:
+
+ | Prompt Length | Batch Size | PyTorch 2.1 torch.compile | ONNX Runtime CUDA |
+ |---------------|------------|---------------------------|-------------------|
+ | 16            | 1          | N/A                       | N/A               |
+ | 256           | 1          | N/A                       | N/A               |
+ | 1024          | 1          | N/A                       | N/A               |
+ | 2048          | 1          | N/A                       | N/A               |
+ | 16            | 4          | N/A                       | N/A               |
+ | 256           | 4          | N/A                       | N/A               |
+ | 1024          | 4          | N/A                       | N/A               |
+ | 2048          | 4          | N/A                       | N/A               |
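+
+ Per-token latency can be estimated by timing a fixed number of generated tokens and averaging. A minimal sketch, assuming `model` and `tokenizer` are set up as in the usage example below (an illustration only, not the benchmark harness used for the table above):
+ ```python
+ import time
+
+ def per_token_latency(model, tokenizer, prompt, new_tokens=32):
+     inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
+     start = time.perf_counter()
+     # Force exactly `new_tokens` generated tokens so the average is well-defined.
+     model.generate(**inputs, max_new_tokens=new_tokens, min_new_tokens=new_tokens)
+     return (time.perf_counter() - start) / new_tokens
+ ```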
+
+ ## Usage Example
+
+ Follow the [benchmarking instructions](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/models/llama/README.md#mistral). Example steps:
+
+ 1. Clone the onnxruntime repository.
+ ```shell
+ git clone https://github.com/microsoft/onnxruntime
+ cd onnxruntime
+ ```
+
+ 2. Install the required dependencies.
+ ```shell
+ python3 -m pip install -r onnxruntime/python/tools/transformers/models/llama/requirements-cuda.txt
+ ```
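+
+ 3. Download the ONNX model files from this repository. A minimal sketch using the `huggingface_hub` API (the repository id below is a placeholder; substitute this repo's actual id on the Hub):
+ ```python
+ from huggingface_hub import snapshot_download
+
+ # Hypothetical repo id -- replace with this repository's id on the Hub.
+ local_dir = snapshot_download(repo_id="<this-repo-id>", allow_patterns=["*.onnx", "*.onnx.data"])
+ print(local_dir)
+ ```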
+
+ 4. Run inference, either through the InferenceSession API directly or, as below, through Hugging Face Optimum's ORTModelForCausalLM wrapper:
+ ```python
+ from optimum.onnxruntime import ORTModelForCausalLM
+ from onnxruntime import InferenceSession
+ from transformers import AutoConfig, AutoTokenizer
+
+ # Create an ONNX Runtime session on the CUDA execution provider.
+ sess = InferenceSession("Mistral-7B-v0.1.onnx", providers=["CUDAExecutionProvider"])
+ config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")
+
+ # Wrap the session so it exposes the familiar generate() API.
+ new_model = ORTModelForCausalLM(sess, config, use_cache=True, use_io_binding=True)
+
+ tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
+
+ # With use_io_binding=True, inputs must live on the same CUDA device as the session.
+ inputs = tokenizer("Instruct: What is the Fermi paradox?\nOutput:", return_tensors="pt").to("cuda")
+
+ outputs = new_model.generate(**inputs)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```