---
license: apache-2.0
base_model: mistralai/Mistral-7B-v0.1
language:
- en
tags:
- mistral
- onnxruntime
- onnx
- llm
---

# Mistral-7b for ONNX Runtime

## Introduction

This repository hosts the optimized versions of **Mistral-7B-v0.1** to accelerate inference with the ONNX Runtime CUDA execution provider.

See the [usage instructions](#usage-example) for how to run inference with the ONNX files hosted in this repository.

## Model Description

- **Developed by:** MistralAI
- **Model type:** Pretrained generative text model
- **License:** Apache 2.0 License
- **Model Description:** This is a conversion of [Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) for [ONNX Runtime](https://github.com/microsoft/onnxruntime) inference with the CUDA execution provider.

## Performance Comparison

#### Latency of token generation

Below is the average latency of generating a token for prompts of varying length, measured on an NVIDIA A100-SXM4-80GB GPU:

| Prompt Length | Batch Size | PyTorch 2.1 torch.compile | ONNX Runtime CUDA |
|---------------|------------|---------------------------|-------------------|
| 16            | 1          | N/A                       | N/A               |
| 256           | 1          | N/A                       | N/A               |
| 1024          | 1          | N/A                       | N/A               |
| 2048          | 1          | N/A                       | N/A               |
| 16            | 4          | N/A                       | N/A               |
| 256           | 4          | N/A                       | N/A               |
| 1024          | 4          | N/A                       | N/A               |
| 2048          | 4          | N/A                       | N/A               |

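As a rough illustration of how per-token latency can be measured, the sketch below times `generate` on a synthetic prompt of a given length and batch size. It assumes `model` is an `ORTModelForCausalLM` loaded as in the [usage example](#usage-example) below; the prompt construction and token counts are illustrative, not the exact benchmarking methodology.

```python
import time

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

def average_token_latency(model, prompt_length, batch_size, new_tokens=32):
    # `model` is assumed to be an ORTModelForCausalLM loaded as in the usage example below.
    # Build a synthetic prompt of the requested shape (token content is irrelevant for timing).
    input_ids = torch.full((batch_size, prompt_length), tokenizer.eos_token_id, dtype=torch.long).to(model.device)
    attention_mask = torch.ones_like(input_ids)

    start = time.perf_counter()
    model.generate(
        input_ids=input_ids,
        attention_mask=attention_mask,
        max_new_tokens=new_tokens,
        min_new_tokens=new_tokens,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Wall-clock seconds per generated token (prefill cost is included).
    return (time.perf_counter() - start) / new_tokens

for prompt_length in (16, 256, 1024, 2048):
    print(prompt_length, average_token_latency(model, prompt_length, batch_size=1))
```
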
## Usage Example

Follow the [benchmarking instructions](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/models/llama/README.md#mistral). Example steps:

1. Clone the onnxruntime repository.
```shell
git clone https://github.com/microsoft/onnxruntime
cd onnxruntime
```

2. Install the required dependencies.
```shell
python3 -m pip install -r onnxruntime/python/tools/transformers/models/llama/requirements-cuda.txt
```

3. Run inference, either through ONNX Runtime's session API directly or through Hugging Face's `ORTModelForCausalLM`:
```python
from optimum.onnxruntime import ORTModelForCausalLM
from onnxruntime import InferenceSession
from transformers import AutoConfig, AutoTokenizer

# Load the ONNX model with the CUDA execution provider.
sess = InferenceSession("Mistral-7B-v0.1.onnx", providers=["CUDAExecutionProvider"])
config = AutoConfig.from_pretrained("mistralai/Mistral-7B-v0.1")

# Wrap the session so it can be used like a transformers model.
model = ORTModelForCausalLM(sess, config, use_cache=True, use_io_binding=True)

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

inputs = tokenizer("Instruct: What is the Fermi paradox?\nOutput:", return_tensors="pt")

outputs = model.generate(**inputs)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
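
For the manual path mentioned in step 3, a reasonable first step is to inspect the session's inputs and outputs before wiring up a generation loop yourself. The snippet below is a minimal sketch; the exact input names (including past key/value tensors) depend on how the model was exported.

```python
from onnxruntime import InferenceSession

sess = InferenceSession("Mistral-7B-v0.1.onnx", providers=["CUDAExecutionProvider"])

# A decoder exported with KV cache typically expects input_ids, attention_mask
# and past key/value tensors, but the exact names vary with the export settings.
for inp in sess.get_inputs():
    print("input :", inp.name, inp.shape, inp.type)
for out in sess.get_outputs():
    print("output:", out.name, out.shape, out.type)
```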