Commit 465927c by MBMMurad · verified · 1 Parent(s): f73f2cd

Update README.md

Files changed (1)
  1. README.md +13 -13
README.md CHANGED
@@ -33,15 +33,6 @@ This repository provides the **AWQ 4-bit quantized** version of the **QwQ-32B-Pr
33
 
34
  ---
35
 
36
- ## Requirements
37
-
38
- Ensure you are using the latest version of Hugging Face Transformers, as the code for Qwen2.5 is integrated there. Using a version earlier than **4.37.0** may result in the following error:
39
-
40
- ```plaintext
41
- KeyError: 'qwen2'
42
- ```
43
- ---
44
-
45
  ## Steps to deploying the solution to Inference Endpoints (dedicated)
46
  Use this approach if you want to try out the approach from my <a href="https://www.kaggle.com/code/mbmmurad/lb-20-qwq-32b-preview-optimized-inference">Kaggle notebook</a>, but you don't feel comfortable with coding.
47
 
@@ -67,6 +58,15 @@ Other values should be left at default values.
67
 
68
  ---
69
 
70
  ## Quickstart
71
 
72
  Here's how to load the tokenizer and model, and generate content using the quantized model:
@@ -112,7 +112,7 @@ print(response)
112
 
113
  ---
114
 
115
- Parameter values to get a better response and reproducible values similar to the Kaggle Notebook:
116
 
117
  `model.generate(
118
  **model_inputs,
@@ -124,9 +124,9 @@ Use the sampling method here setting the following parameters:
124
  * `temperature = 1`
125
  * `top_k = 50`
126
 
127
- Setting the `max_new_tokens` to `4096*8` would increase the performance, but it will take a lot of time for inference. Using faster inference engines (e.g. vllm, TGI) would make the inference faster.
128
 
129
- To get the most optimal performance, it is suggested to use the Kaggle Notebook mentioned above.
130
 
131
 
132
  ## Original Model
@@ -140,7 +140,7 @@ https://huggingface.co/Qwen/QwQ-32B-Preview
140
 
141
  ## Citation
142
 
143
- If you find the original model helpful, please consider citing the original authors as well as the Kaggle notebook on which this model is based on:
144
 
145
  ```bibtext
146
  @misc{qwq-32b-preview,
 
33
 
34
  ---
35
 
36
  ## Steps to deploying the solution to Inference Endpoints (dedicated)
37
  Use this approach if you want to try out the approach from my <a href="https://www.kaggle.com/code/mbmmurad/lb-20-qwq-32b-preview-optimized-inference">Kaggle notebook</a>, but you don't feel comfortable with coding.
38
 
 
58
 
59
  ---
60
 
61
+ ## Requirements
62
+
63
+ Use Hugging Face Transformers **4.37.0** or later, which integrates the code for Qwen2.5. Earlier versions may raise the following error:
64
+
65
+ ```plaintext
66
+ KeyError: 'qwen2'
67
+ ```
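The version requirement above can be checked before loading the model. The following is a minimal sketch and not part of the README: the helper names (`release_tuple`, `meets_minimum`, `transformers_is_recent_enough`) are hypothetical, and the comparison looks only at the numeric release components, so a pre-release such as `4.37.0rc1` would be treated as sufficient.

```python
import re
from importlib import metadata

# Minimum version stated in the Requirements section.
MIN_TRANSFORMERS = "4.37.0"


def release_tuple(version: str) -> tuple:
    """Turn '4.40.0.dev0' into (4, 40, 0); stop at the first non-numeric piece."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad so that '4.37' compares equal to '4.37.0'.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = MIN_TRANSFORMERS) -> bool:
    """Return True if the installed version is at least the minimum."""
    return release_tuple(installed) >= release_tuple(minimum)


def transformers_is_recent_enough() -> bool:
    """Check the locally installed transformers package, if there is one."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return False
    return meets_minimum(installed)
```

Calling `transformers_is_recent_enough()` before `from_pretrained` gives a clearer failure than the `KeyError: 'qwen2'` shown above.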
68
+ ---
69
+
70
  ## Quickstart
71
 
72
  Here's how to load the tokenizer and model, and generate content using the quantized model:
 
112
 
113
  ---
114
 
115
+ Use the following parameter values to get better responses and to reproduce results similar to those in the Kaggle notebook:
116
 
117
  `model.generate(
118
  **model_inputs,
 
124
  * `temperature = 1`
125
  * `top_k = 50`
126
 
127
+ Setting `max_new_tokens` to `4096*8` would improve performance, but inference will take much longer. Faster inference engines (e.g. vLLM, TGI) can reduce inference time.
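The sampling settings listed above can be collected in one place so the token budget is easy to switch. This is a sketch under assumptions: `build_generation_kwargs` is a hypothetical helper, the default of `4096` tokens is an illustrative choice not taken from the README, and the README's full `model.generate(...)` call is elided in this diff, so it may pass other arguments as well.

```python
def build_generation_kwargs(max_new_tokens: int = 4096) -> dict:
    """Keyword arguments for `model.generate`, using the values named in the README.

    The default token budget is an assumption made for this sketch.
    """
    if max_new_tokens <= 0:
        raise ValueError("max_new_tokens must be positive")
    return {
        "do_sample": True,        # "Use the sampling method"
        "temperature": 1.0,       # temperature = 1
        "top_k": 50,              # top_k = 50
        "max_new_tokens": max_new_tokens,
    }


# Shorter budget for quick runs, and the larger 4096*8 budget mentioned above.
quick = build_generation_kwargs()
long_budget = build_generation_kwargs(max_new_tokens=4096 * 8)

# With a loaded model and tokenized inputs (not shown here):
# generated_ids = model.generate(**model_inputs, **long_budget)
```

Keeping the two budgets side by side makes the trade-off explicit: the larger one allows longer outputs at the cost of inference time.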
128
 
129
+ For the best performance, use the <a href="https://www.kaggle.com/code/mbmmurad/lb-20-qwq-32b-preview-optimized-inference">Kaggle notebook</a> mentioned above.
130
 
131
 
132
  ## Original Model
 
140
 
141
  ## Citation
142
 
143
+ If you find the original model helpful, please consider citing the original authors as well as the <a href="https://www.kaggle.com/code/mbmmurad/lb-20-qwq-32b-preview-optimized-inference">Kaggle notebook</a> on which this model is based:
144
 
145
  ```bibtext
146
  @misc{qwq-32b-preview,