Update README.md
README.md CHANGED
@@ -45,11 +45,12 @@ This repo contains AWQ model files for [Mistral AI's Mistral 7B v0.1](https://hu

AWQ is an efficient, accurate and blazing-fast low-bit weight quantization method, currently supporting 4-bit quantization. Compared to GPTQ, it offers faster Transformers-based inference.

-

-

-Note that, at the time of writing, overall throughput is still lower than running vLLM or TGI with unquantised models, however using AWQ enables using much smaller GPUs which can lead to easier deployment and overall cost savings. For example, a 70B model can be run on 1 x 48GB GPU instead of 2 x 80GB.
+### Mistral AWQs
+
+These are experimental first AWQs for the brand-new model format, Mistral.
+
+They will not work from vLLM or TGI. They can only be used from AutoAWQ, and they require installing both AutoAWQ and Transformers from Github. More details are below.

<!-- description end -->
<!-- repositories-available start -->
## Repositories available
@@ -64,7 +65,6 @@ Note that, at the time of writing, overall throughput is still lower than runnin

```
{prompt}
-
```

<!-- prompt-template end -->
@@ -83,74 +83,23 @@ Models are released as sharded safetensors files.

<!-- README_AWQ.md-provided-files end -->

-<!-- README_AWQ.md-use-from-vllm start -->
-## Serving this model from vLLM
-
-Documentation on installing and using vLLM [can be found here](https://vllm.readthedocs.io/en/latest/).
-
-- When using vLLM as a server, pass the `--quantization awq` parameter, for example:
-
-```shell
-python3 python -m vllm.entrypoints.api_server --model TheBloke/Mistral-7B-v0.1-AWQ --quantization awq --dtype half
-```
-
-Note: at the time of writing, vLLM has not yet done a new release with support for the `quantization` parameter.
-
-If you try the code below and get an error about `quantization` being unrecognised, please install vLLM from Github source.
-
-When using vLLM from Python code, pass the `quantization=awq` parameter, for example:
-
-```python
-from vllm import LLM, SamplingParams
-
-prompts = [
-    "Hello, my name is",
-    "The president of the United States is",
-    "The capital of France is",
-    "The future of AI is",
-]
-sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
-
-llm = LLM(model="TheBloke/Mistral-7B-v0.1-AWQ", quantization="awq", dtype="half")
-
-outputs = llm.generate(prompts, sampling_params)
-
-# Print the outputs.
-for output in outputs:
-    prompt = output.prompt
-    generated_text = output.outputs[0].text
-    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
-```
-<!-- README_AWQ.md-use-from-vllm start -->

<!-- README_AWQ.md-use-from-python start -->
-## Serving this model from TGI
-
-TGI merged support for AWQ on September 25th, 2023. At the time of writing you need to use the `:latest` Docker container: `ghcr.io/huggingface/text-generation-inference:latest`
-
-Add the parameter `--quantize awq` for AWQ support.
-
-Example parameters:
-```shell
---model-id TheBloke/Mistral-7B-v0.1-AWQ --port 3000 --quantize awq --max-input-length 3696 --max-total-tokens 4096 --max-batch-prefill-tokens 4096
-```
-
## How to use this AWQ model from Python code

### Install the necessary packages

-Requires:
+Requires:

-
-
-```
-
-If you have problems installing [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) using the pre-built wheels, install it from source instead:
+- Transformers from [commit 72958fcd3c98a7afdc61f953aa58c544ebda2f79](https://github.com/huggingface/transformers/commit/72958fcd3c98a7afdc61f953aa58c544ebda2f79)
+- [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) from [PR #79](https://github.com/casper-hansen/AutoAWQ/pull/79).

```shell
-pip3
+pip3 install git+https://github.com/huggingface/transformers.git@72958fcd3c98a7afdc61f953aa58c544ebda2f79
+
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
+git checkout mistral
pip3 install .
```

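The install steps above pin Transformers to a specific commit and AutoAWQ to its `mistral` branch. For context, the kind of Python usage this enables looks roughly like the following. This is a minimal sketch assuming the AutoAWQ `from_quantized()` API from the branch referenced above; the generation settings are illustrative, not something this repo prescribes.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Minimal sketch: assumes AutoAWQ (mistral branch) and the pinned Transformers
# commit are installed as shown in the diff above; exact argument names may
# differ between AutoAWQ versions.
model_name_or_path = "TheBloke/Mistral-7B-v0.1-AWQ"

# Load the AWQ-quantised model and its tokenizer.
model = AutoAWQForCausalLM.from_quantized(
    model_name_or_path,
    fuse_layers=True,       # use AutoAWQ's fused modules for faster inference
    safetensors=True,       # this repo ships sharded safetensors files
    trust_remote_code=False,
)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=False)

# Mistral 7B v0.1 is a base model, so the prompt template is just {prompt}.
prompt = "Tell me about AI"
tokens = tokenizer(prompt, return_tensors="pt").input_ids.cuda()

# Generate a completion (sampling settings here are illustrative).
generation_output = model.generate(
    tokens,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    max_new_tokens=512,
)

print(tokenizer.decode(generation_output[0], skip_special_tokens=True))
```

Per the description hunk above, these AWQs can only be used from AutoAWQ with both packages installed from Github, so the pinned commit and branch are required rather than optional.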
@@ -220,10 +169,6 @@ print(pipe(prompt_template)[0]['generated_text'])
The files provided are tested to work with:

- [AutoAWQ](https://github.com/casper-hansen/AutoAWQ)
-- [vLLM](https://github.com/vllm-project/vllm)
-- [Huggingface Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference)
-
-TGI merged AWQ support on September 25th, 2023: [TGI PR #1054](https://github.com/huggingface/text-generation-inference/pull/1054). Use the `:latest` Docker container until the next TGI release is made.

<!-- README_AWQ.md-compatibility end -->