---
license: apache-2.0
base_model:
- Qwen/Qwen3-32B
- Qwen/Qwen2.5-72B-Instruct
tags:
- merge
- frankenmerge
- qwen
---

# Qwen3-72B-Synthesis

A Qwen3-Architecture 72B Model Forged from `Qwen3-32B` and `Qwen2.5-72B-Instruct`.

## Model Description

**Qwen3-72B-Synthesis** is an experimental, 80-layer, 72-billion-parameter large language model. It represents a novel approach to model creation, designed to produce a model with the pure, modern **Qwen3 architecture** while inheriting the vast, high-quality knowledge of the 72B-scale **Qwen2.5-Instruct** model.

This was not a simple merge. It was a multi-phase surgical procedure involving dimensional up-scaling, architectural alignment, and a strategic "knowledge transplant" using `MergeKit`. The result is a unique checkpoint that serves as an ideal starting point for further fine-tuning.

The core philosophy was to use `Qwen/Qwen3-32B` as the architectural "foundation" and `Qwen/Qwen2.5-72B-Instruct` as the "knowledge donor."

## Model Details

* **Architecture:** Qwen3 (RMSNorm, SwiGLU, no biases, includes `q_norm` and `k_norm`)
* **Parameters:** ~72 billion
* **Layers:** 80
* **Foundation:** `Qwen/Qwen3-32B`
* **Donor:** `Qwen/Qwen2.5-72B-Instruct`
* **Tokenizer:** `Qwen/Qwen3-32B` tokenizer (`vocab_size: 151936`)

## Model Creation Process

The creation of this model was a deliberate, three-phase process designed to overcome significant architectural incompatibilities.

### Phase 1: Foundation Upscaling

First, the `Qwen/Qwen3-32B` model (64 layers, 5120 hidden dim) was up-scaled to match the target 72B dimensions. This was done using a **self-interpolation** script, in which new dimensions were created by averaging different slices of the existing weights rather than by simple tiling. This produced `Qwen3-32B-Upscaled`, a 64-layer model with the correct 72B tensor shapes and Qwen3 architecture.
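
The upscaling script itself is not published here, but the core idea can be sketched. In this minimal illustration, `upscale_dim` is a hypothetical helper (an assumption, not the actual script) that grows one dimension of a weight tensor by appending averages of adjacent existing rows:

```python
import torch

def upscale_dim(weight: torch.Tensor, dim: int, new_size: int) -> torch.Tensor:
    """Grow `weight` along `dim` by self-interpolation: each new row is the
    mean of a small window of existing rows, not a tiled copy."""
    old_size = weight.shape[dim]
    w = weight.movedim(dim, 0)
    # Evenly spaced two-row windows over the original rows; averaging each
    # window synthesizes one new row.
    starts = torch.linspace(0, old_size - 2, new_size - old_size).long().tolist()
    new_rows = torch.stack([w[s : s + 2].mean(dim=0) for s in starts])
    return torch.cat([w, new_rows], dim=0).movedim(0, dim)

# Example: widen a 5120-wide projection to the 72B width of 8192.
w_up = upscale_dim(torch.randn(5120, 5120), 0, 8192)
assert w_up.shape == (8192, 5120)
```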

### Phase 2: Donor Alignment

The `Qwen/Qwen2.5-72B-Instruct` model was architecturally incompatible with the Qwen3 target. To solve this, a new donor model, `Qwen2.5-72B-Instruct-Aligned`, was created. The process, sketched in code after this list, involved:

1. Creating an empty 80-layer model shell with the pure Qwen3 architecture.
2. Surgically removing all `.bias` tensors from the Qwen2.5 weights.
3. Truncating the Qwen2.5 embedding and language-model head from a vocabulary of 152064 to match Qwen3's 151936.
4. Loading the modified Qwen2.5 weights into the pure Qwen3 shell, resulting in a fully compatible donor model.
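
A condensed sketch of steps 2-4, assuming the donor weights are held in a standard `state_dict` with the usual Qwen key names (an illustration, not the actual alignment script):

```python
import torch
from transformers import AutoModelForCausalLM

QWEN3_VOCAB = 151936  # Qwen3 tokenizer vocabulary size

donor = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-72B-Instruct", torch_dtype=torch.bfloat16
)
state = donor.state_dict()

# Step 2: drop every `.bias` tensor (the Qwen3 architecture has no biases).
state = {k: v for k, v in state.items() if not k.endswith(".bias")}

# Step 3: truncate the embedding and LM head from 152064 to 151936 rows.
for key in ("model.embed_tokens.weight", "lm_head.weight"):
    state[key] = state[key][:QWEN3_VOCAB]

# Step 4: load into an empty 80-layer Qwen3 shell. strict=False is needed
# because the shell's `q_norm`/`k_norm` weights have no donor counterpart
# and keep their initialized values.
# qwen3_shell.load_state_dict(state, strict=False)
```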

### Phase 3: Knowledge Transplant via MergeKit

With two architecturally compatible models, the final merge was performed using `MergeKit`. A "Knowledge Bridge" strategy was employed to transplant a stable reasoning core from the donor while blending the rest.

The following `MergeKit` configuration was used:

```yaml
merge_method: linear
base_model: ./Qwen3-32B-Upscaled
dtype: bfloat16

slices:
  # Slice 1: Blend the bottom 32 layers
  - merge_method: linear
    sources:
      - model: ./Qwen3-32B-Upscaled
        layer_range: [0, 32]
        parameters:
          weight: 0.3
      - model: ./Qwen2.5-72B-Instruct-Aligned
        layer_range: [0, 32]
        parameters:
          weight: 0.7

  # Slice 2: The "Knowledge Bridge" - transplant a pure block from the donor
  - merge_method: passthrough
    sources:
      - model: ./Qwen2.5-72B-Instruct-Aligned
        layer_range: [32, 48]

  # Slice 3: Blend the top layers
  - merge_method: linear
    sources:
      - model: ./Qwen3-32B-Upscaled
        layer_range: [32, 64]
        parameters:
          weight: 0.3
      - model: ./Qwen2.5-72B-Instruct-Aligned
        layer_range: [48, 80]
        parameters:
          weight: 0.7

tokenizer_source: ./Qwen3-32B-Upscaled
```
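
Assuming a standard `MergeKit` installation, a configuration like this is typically executed with `mergekit-yaml config.yaml ./Qwen3-72B-Synthesis --cuda`, where the output path is illustrative.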

## How to Use

This model uses the standard Qwen ChatML prompt format.
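
For reference, the prompt string produced by `apply_chat_template` in the example below follows the ChatML layout (shown schematically; the exact template, including any thinking-related tokens, comes from the bundled Qwen3 tokenizer):

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Explain the importance of the LLaMA paper in one paragraph.<|im_end|>
<|im_start|>assistant
```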
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-username/Qwen3-72B-Synthesis"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain the importance of the LLaMA paper in one paragraph."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

## Intended Use and Limitations

**This is an experimental model and should be considered a high-quality checkpoint, not a finished product.**

* **Fine-tuning is highly recommended.** While the model inherits knowledge from a powerful instruction model, the merging process can create slight incoherence between layers. A round of fine-tuning on a high-quality instruction dataset is necessary to harmonize the weights and unlock the model's full potential; a minimal LoRA sketch follows this list.
* The model may exhibit unexpected behaviors, including repetitiveness or nonsensical outputs, prior to fine-tuning.
* This model has not been aligned for safety and may produce problematic, biased, or otherwise undesirable content. The user assumes all responsibility for generated output.
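
As a starting point for the recommended fine-tuning, a minimal LoRA setup with `peft` might look like the following sketch (the hyperparameters are illustrative and not tuned for this model; `model` is the checkpoint loaded in the usage example above):

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Qwen attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only adapter weights are trainable
```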

## Acknowledgements

This model would not have been possible without the foundational work of Alibaba Cloud on the Qwen models, and the powerful, flexible `MergeKit` toolkit created by Charles Goddard.