Commit 125d167 by guangy10 · verified · 1 Parent(s): e9c5df5

Add quant recipe

  1. README.md +135 -0
README.md CHANGED
@@ -40,6 +40,141 @@ optimum-cli export executorch \
  --output_dir ./smollm3_3b
  ```
 
+ # Quantization Recipe
+
+ First, install the required packages:
+ ```Shell
+ pip install git+https://github.com/huggingface/transformers@main
+ pip install torchao
+ ```
+
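Because `transformers` is installed from the `main` branch, it can help to confirm what actually landed in the environment before running the recipe. The following is a minimal sketch that uses only the standard library; the helper name `installed_versions` is hypothetical and is not part of the recipe.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Packages the recipe relies on; prints "not installed" for anything missing.
for name, version in installed_versions(["transformers", "torchao", "torch"]).items():
    print(f"{name}: {version or 'not installed'}")
```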
+ ## Untie Embedding Weights
+ We want to quantize the embedding and `lm_head` differently. Since those layers are tied, we first need to untie the model:
+
+ ```Py
+ from transformers import (
+     AutoModelForCausalLM,
+     AutoTokenizer,
+ )
+ from transformers.modeling_utils import find_tied_parameters
+ import torch
+
+ model_id = "HuggingFaceTB/SmolLM3-3B"
+ untied_model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ print(untied_model)
+ print("tied weights:", find_tied_parameters(untied_model))
+
+ # Turn off weight tying in the config so it is not re-applied when the model is reloaded
+ text_config = untied_model.config.get_text_config(decoder=True)
+ if getattr(text_config, "tie_word_embeddings", False):
+     text_config.tie_word_embeddings = False
+
+ # Give lm_head its own copy of the weight instead of sharing the embedding's tensor
+ untied_model._tied_weights_keys = []
+ untied_model.lm_head.weight = torch.nn.Parameter(untied_model.lm_head.weight.clone())
+
+ print("tied weights:", find_tied_parameters(untied_model))
+
+ USER_ID = "YOUR_USER_ID"
+ MODEL_NAME = model_id.split("/")[-1]
+ save_to = f"{USER_ID}/{MODEL_NAME}-untied-weights"
+
+ untied_model.push_to_hub(save_to)
+ tokenizer.push_to_hub(save_to)
+
+ # or save locally
+ save_to_local_path = f"{MODEL_NAME}-untied-weights"
+ untied_model.save_pretrained(save_to_local_path)
+ tokenizer.save_pretrained(save_to_local_path)
+ ```
+
+ Note: to use `push_to_hub`, you first need to run
+ ```Shell
+ pip install -U "huggingface_hub[cli]"
+ huggingface-cli login
+ ```
+ and log in with a token that has write access, from https://huggingface.co/settings/tokens
+
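To illustrate why the untying step clones the `lm_head` weight, here is a toy, standard-library-only analogy. It is not how `transformers` or `find_tied_parameters` is implemented; `Param`, `TinyLM`, and `find_tied` are hypothetical names invented for this sketch. The idea it shows is that tied layers hold the same underlying object, so they cannot be given different quantization until one of them owns a separate copy.

```python
class Param:
    """Hypothetical stand-in for a weight tensor."""

    def __init__(self, data):
        self.data = list(data)

    def clone(self):
        # Return an independent copy of the underlying data
        return Param(self.data)


class TinyLM:
    """Toy model whose output head initially shares the embedding's weight."""

    def __init__(self):
        self.embed_tokens = Param([0.1, 0.2, 0.3])
        self.lm_head = self.embed_tokens  # tied: same object, not a copy


def find_tied(model):
    """Group attribute names that point at the same Param object."""
    by_identity = {}
    for name, value in vars(model).items():
        if isinstance(value, Param):
            by_identity.setdefault(id(value), []).append(name)
    return [names for names in by_identity.values() if len(names) > 1]


model = TinyLM()
print(find_tied(model))  # [['embed_tokens', 'lm_head']]

# Untie by giving lm_head its own copy, mirroring the .clone() in the recipe
model.lm_head = model.lm_head.clone()
print(find_tied(model))  # []

# Changing one weight no longer affects the other
model.lm_head.data[0] = 9.9
print(model.embed_tokens.data[0])  # 0.1
```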
+ ## Quantization
+
+ We used the following code to get the quantized model:
+
+ ```Py
+ from transformers import (
+     AutoModelForCausalLM,
+     AutoTokenizer,
+     TorchAoConfig,
+ )
+ from torchao.quantization.quant_api import (
+     IntxWeightOnlyConfig,
+     Int8DynamicActivationIntxWeightConfig,
+     ModuleFqnToConfig,
+ )
+ from torchao.quantization.granularity import PerGroup, PerAxis
+ import torch
+
+ # we start from the model with untied weights
+ model_id = "HuggingFaceTB/SmolLM3-3B"
+ USER_ID = "YOUR_USER_ID"
+ MODEL_NAME = model_id.split("/")[-1]
+ untied_model_id = f"{USER_ID}/{MODEL_NAME}-untied-weights"
+ untied_model_local_path = f"{MODEL_NAME}-untied-weights"
+
+ # Embedding: int8 weight-only quantization with quantization parameters per row (axis 0)
+ embedding_config = IntxWeightOnlyConfig(
+     weight_dtype=torch.int8,
+     granularity=PerAxis(0),
+ )
+ # Linear layers: int8 dynamic activations, int4 weights in groups of 32 with bfloat16 scales
+ linear_config = Int8DynamicActivationIntxWeightConfig(
+     weight_dtype=torch.int4,
+     weight_granularity=PerGroup(32),
+     weight_scale_dtype=torch.bfloat16,
+ )
+ quant_config = ModuleFqnToConfig({"_default": linear_config, "model.embed_tokens": embedding_config})
+ quantization_config = TorchAoConfig(quant_type=quant_config, include_embedding=True, untie_embedding_weights=True, modules_to_not_convert=[])
+
+ # either use `untied_model_id` or `untied_model_local_path`
+ quantized_model = AutoModelForCausalLM.from_pretrained(untied_model_id, torch_dtype=torch.float32, device_map="auto", quantization_config=quantization_config)
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ # Push to hub
+ save_to = f"{USER_ID}/{MODEL_NAME}-8da4w"
+ quantized_model.push_to_hub(save_to, safe_serialization=False)
+ tokenizer.push_to_hub(save_to)
+
+ # Manual testing
+ prompt = "Hey, are you conscious? Can you talk to me?"
+ messages = [
+     {
+         "role": "system",
+         "content": "",
+     },
+     {"role": "user", "content": prompt},
+ ]
+ templated_prompt = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True,
+ )
+ print("Prompt:", prompt)
+ print("Templated prompt:", templated_prompt)
+ inputs = tokenizer(
+     templated_prompt,
+     return_tensors="pt",
+ ).to(quantized_model.device)
+ generated_ids = quantized_model.generate(**inputs, max_new_tokens=128)
+ # Decode only the newly generated tokens, not the echoed prompt
+ new_token_ids = generated_ids[:, inputs["input_ids"].shape[1]:]
+ output_text = tokenizer.batch_decode(
+     new_token_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
+ )
+ print("Response:", output_text[0])
+ ```
+
+ The response from the manual testing is:
+
+ ```
+ Okay, the user is asking if I can talk to them. First, I need to clarify that I can't communicate like a human because I don't have consciousness or emotions. I'm an AI model created by Hugging Face.
+ ```
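
To make the `PerGroup(32)` and `PerAxis(0)` settings in the recipe more concrete, the following standard-library sketch quantizes a single weight row both ways. It assumes plain symmetric quantization with one scale and no zero point; torchao's actual scale computation, rounding, and packing may differ, and every function name here is hypothetical. The last helper estimates storage per weight from the weight bits and the per-group scale only, ignoring any packing overhead.

```python
import math


def quantize_symmetric(values, bits):
    """Quantize a list of floats that share one scale (symmetric, no zero point)."""
    qmax = 2 ** (bits - 1) - 1   # 7 for int4, 127 for int8
    qmin = -(2 ** (bits - 1))    # -8 for int4, -128 for int8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(qmin, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def quantize_per_group(row, group_size, bits):
    """Split one weight row into groups and quantize each group with its own scale."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    groups = [row[i:i + group_size] for i in range(0, len(row), group_size)]
    return [quantize_symmetric(group, bits) for group in groups]


def dequantize(groups):
    """Rebuild approximate float values from (codes, scale) pairs."""
    return [code * scale for codes, scale in groups for code in codes]


def bits_per_weight(weight_bits, scale_bits, group_size):
    """Storage per weight when every group of weights carries one scale."""
    return weight_bits + scale_bits / group_size


# A made-up row of 64 weights whose magnitude grows along the row
row = [math.sin(i) * (1 + i / 16) for i in range(64)]

# Linear-layer style: int4 codes, one scale per group of 32 weights
int4_groups = quantize_per_group(row, group_size=32, bits=4)
# Embedding style: int8 codes, one scale for the whole row
int8_row = [quantize_symmetric(row, bits=8)]

for label, groups in (("int4 per-group", int4_groups), ("int8 per-row", int8_row)):
    restored = dequantize(groups)
    worst = max(abs(a - b) for a, b in zip(row, restored))
    print(f"{label}: {len(groups)} scale(s), max abs error {worst:.4f}")

print("bits per weight (int4, bf16 scale, group 32):", bits_per_weight(4, 16, 32))
```

Smaller groups track local weight magnitudes more closely at the cost of storing more scales, which is the trade-off behind choosing a group size such as 32.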
 
  # Disclaimer
  PyTorch has not performed safety evaluations or red teamed the quantized models. Performance characteristics, outputs, and behaviors may differ from the original models. Users are solely responsible for selecting appropriate use cases, evaluating and mitigating for accuracy, safety, and fairness, ensuring security, and complying with all applicable laws and regulations.