TheBloke committed on
Commit
a68dbfd
·
1 Parent(s): b1c0cfa

Update README.md

Files changed (1)
  1. README.md +39 -1
README.md CHANGED
@@ -51,7 +51,7 @@ These are experimental first AWQs for the brand-new model format, Mistral.
51
 
52
  As of September 29th 2023, they are supported by AutoAWQ, and vLLM (version 0.2).
53
 
54
- To use from AutoAW£Q requires installing both AutoAWQ and Transformers from Github. More details are below.
55
 
56
  <!-- description end -->
57
  <!-- repositories-available start -->
@@ -86,6 +86,44 @@ Models are released as sharded safetensors files.
86
 
87
  <!-- README_AWQ.md-provided-files end -->
88
 
89
  <!-- README_AWQ.md-use-from-python start -->
90
  ## How to use this AWQ model from Python code
91
 
 
51
 
52
  As of September 29th 2023, they are supported by AutoAWQ, and vLLM (version 0.2).
53
 
54
+ Using them with AutoAWQ requires installing both AutoAWQ and Transformers from GitHub. More details are below.
55
 
56
  <!-- description end -->
57
  <!-- repositories-available start -->
 
86
 
87
  <!-- README_AWQ.md-provided-files end -->
88
 
89
+ <!-- README_AWQ.md-use-from-vllm start -->
90
+ ## Serving this model from vLLM
91
+
92
+ Make sure you are using vLLM version 0.2.
93
+
94
+ Documentation on installing and using vLLM [can be found here](https://vllm.readthedocs.io/en/latest/).
95
+
96
+ - When using vLLM as a server, pass the `--quantization awq` parameter, for example:
97
+
98
+ ```shell
99
+ python3 -m vllm.entrypoints.api_server --model TheBloke/Mistral-7B-Instruct-v0.1-AWQ --quantization awq --dtype float16
100
+ ```
101
+
102
+ - When using vLLM from Python code, pass the `quantization="awq"` argument, for example:
103
+
104
+ ```python
105
+ from vllm import LLM, SamplingParams
106
+
107
+ prompts = [
108
+ "Hello, my name is",
109
+ "The president of the United States is",
110
+ "The capital of France is",
111
+ "The future of AI is",
112
+ ]
113
+ sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
114
+
115
+ llm = LLM(model="TheBloke/Mistral-7B-Instruct-v0.1-AWQ", quantization="awq", dtype="float16")
116
+
117
+ outputs = llm.generate(prompts, sampling_params)
118
+
119
+ # Print the outputs.
120
+ for output in outputs:
121
+     prompt = output.prompt
122
+     generated_text = output.outputs[0].text
123
+     print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
124
+ ```
125
+ <!-- README_AWQ.md-use-from-vllm end -->
126
+
127
  <!-- README_AWQ.md-use-from-python start -->
128
  ## How to use this AWQ model from Python code
129
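The README text added in this diff shows how to start the vLLM API server but not how to call it. The following is a minimal client sketch and is not part of the commit. It assumes the server listens on `localhost:8000` and exposes a `POST /generate` endpoint that accepts a JSON body with a `prompt` field plus sampling parameters and returns JSON of the form `{"text": [...]}`, as the demo API server shipped with vLLM 0.2 appears to do. The helper names `build_generate_request` and `generate` are hypothetical, chosen only for illustration.

```python
import json
import urllib.request

# Assumption: default host and port of the demo API server; adjust as needed.
DEFAULT_URL = "http://localhost:8000/generate"


def build_generate_request(prompt, url=DEFAULT_URL, max_tokens=64,
                           temperature=0.8, top_p=0.95):
    """Build (but do not send) a JSON POST request for the /generate endpoint."""
    payload = {
        "prompt": prompt,
        # Assumption: the remaining fields are forwarded to SamplingParams.
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the request and return the list of generated strings."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        # Assumption: the response body looks like {"text": ["...", ...]}.
        return json.loads(response.read().decode("utf-8"))["text"]
```

With the server running, `generate("The capital of France is")` should return a list of completions. Building the request separately from sending it keeps the payload easy to inspect without a live server.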