This is a 2-bit quantization of mistralai/Mixtral-8x7B-Instruct-v0.1 using QuIP#.

Model loading

Please follow the instructions in QuIP-for-all for usage.
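As a rough sketch of what loading can look like, here is a minimal example that assumes the checkpoint can be loaded through transformers with remote code enabled; the repo id below is a placeholder and the exact loading path is defined by QuIP-for-all, so defer to that repository's README if it differs.

```python
# Minimal loading sketch. The trust_remote_code path and the repo id are
# assumptions -- follow QuIP-for-all's README for the exact loading code.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "keyfan/Mixtral-8x7B-Instruct-v0.1-quip-2bit"  # hypothetical repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # custom QuIP# layers may live in remote code
    device_map="auto",
)

prompt = "[INST] Explain 2-bit quantization in one sentence. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```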

As an alternative, you can use the vLLM branch for faster inference. QuIP# has to launch about five kernels per linear layer, so vLLM's CUDA graph capture is very helpful for reducing launch overhead. If you have trouble installing fast-hadamard-transform from pip, you can also install it from source.
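A minimal sketch of serving the model with that branch is shown below. The `quantization="quip"` flag and the repo id are assumptions specific to the branch, not part of upstream vLLM; check the branch's documentation for the actual argument.

```python
# Sketch for the vLLM branch with QuIP support. The quantization flag name
# and repo id are assumptions -- consult the branch's docs before use.
from vllm import LLM, SamplingParams

llm = LLM(
    model="keyfan/Mixtral-8x7B-Instruct-v0.1-quip-2bit",  # hypothetical repo id
    quantization="quip",   # assumed flag exposed by the QuIP branch
    enforce_eager=False,   # keep CUDA graphs on to amortize kernel launches
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["[INST] What is QuIP#? [/INST]"], params)
print(outputs[0].outputs[0].text)
```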

Perplexity

Measured on WikiText with a context length of 4096.

| fp16   | 2-bit  |
|--------|--------|
| 3.8825 | 5.2799 |
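For reference, a common way to measure this kind of number is to score non-overlapping 4096-token windows of the WikiText test split and exponentiate the mean negative log-likelihood. The sketch below follows that protocol under stated assumptions (placeholder repo id, WikiText-2 raw split, non-overlapping windows); the exact setup used for the figures above may differ.

```python
# Hedged sketch of a WikiText perplexity measurement at 4096-token context.
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "keyfan/Mixtral-8x7B-Instruct-v0.1-quip-2bit"  # hypothetical repo id
seq_len = 4096

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, device_map="auto"
).eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids

nlls = []
with torch.no_grad():
    for start in range(0, ids.size(1) - seq_len, seq_len):
        chunk = ids[:, start : start + seq_len].to(model.device)
        # Labels equal inputs; the model shifts them internally.
        loss = model(chunk, labels=chunk).loss
        nlls.append(loss.float())

print("perplexity:", torch.exp(torch.stack(nlls).mean()).item())
```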

Speed

Measured with the examples/benchmark_latency.py script from the vLLM repo. At batch size 1, it generates 16.3 tokens/s on a single RTX 3090.
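If you just want a quick sanity check rather than the full benchmark script, a minimal stand-in is to time batch-size-1 decoding directly with the vLLM API, as sketched below. The repo id and quantization flag are the same assumptions as above, and fixed-length greedy decoding is used so the token count is known.

```python
# Minimal stand-in for the latency benchmark: time batch-size-1 decoding
# and report decode tokens/s. Repo id and quantization flag are assumptions.
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="keyfan/Mixtral-8x7B-Instruct-v0.1-quip-2bit",  # hypothetical repo id
    quantization="quip",                                  # assumed branch flag
)

out_tokens = 256
params = SamplingParams(temperature=0.0, max_tokens=out_tokens, ignore_eos=True)

start = time.perf_counter()
llm.generate(["Benchmark prompt"], params)
elapsed = time.perf_counter() - start
print(f"{out_tokens / elapsed:.1f} tokens/s at batch size 1")
```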
