ehartford committed
Commit cd11b2f · verified · 1 Parent(s): bf15251

Update README.md

Files changed (1): README.md +106 -27
README.md CHANGED
@@ -1,67 +1,146 @@
- ---
- base_model: []
- library_name: transformers
- tags:
- - mergekit
- - merge
- ---
-
- # Qwen3-72B-Instruct
-
- Still testing it! Not sure if it works. Uploading a GGUF to https://huggingface.co/cognitivecomputations/Qwen3-72B-Instruct-gguf
-
- This is a merge of pre-trained language models created using [mergekit](https://github.com/cg123/mergekit).
-
- ## Merge Details
- ### Merge Method
-
- This model was merged using the [Linear](https://arxiv.org/abs/2203.05482) merge method with ./Qwen3-32B-Upscaled as the base.
-
- ### Models Merged
-
- The following models were included in the merge:
- * ./Qwen2.5-72B-Instruct-Aligned
-
- ### Configuration
-
- The following YAML configuration was used to produce this model:
-
- ```yaml
- merge_method: linear
- base_model: ./Qwen3-32B-Upscaled
- dtype: bfloat16
- slices:
-   - merge_method: linear
-     sources:
-       - model: ./Qwen3-32B-Upscaled
-         layer_range: [0, 32]
-         parameters:
-           weight: 0.5
-       - model: ./Qwen2.5-72B-Instruct-Aligned
-         layer_range: [0, 32]
-         parameters:
-           weight: 0.5
-   - merge_method: linear
-     sources:
-       - model: ./Qwen3-32B-Upscaled
-         layer_range: [32, 48]
-         parameters:
-           weight: 0.0
-       - model: ./Qwen2.5-72B-Instruct-Aligned
-         layer_range: [32, 48]
-         parameters:
-           weight: 1.0
-   - merge_method: linear
-     sources:
-       - model: ./Qwen3-32B-Upscaled
-         layer_range: [32, 64]
-         parameters:
-           weight: 0.5
-       - model: ./Qwen2.5-72B-Instruct-Aligned
-         layer_range: [48, 80]
-         parameters:
-           weight: 0.5
- tokenizer_source: ./Qwen3-32B-Upscaled
- ```
 
---
license: apache-2.0
base_model:
- Qwen/Qwen3-32B
- Qwen/Qwen2.5-72B-Instruct
tags:
- merge
- frankenmerge
- qwen
---

# Qwen3-72B-Synthesis

A Qwen3-Architecture 72B Model Forged from `Qwen3-32B` and `Qwen2.5-72B-Instruct`.

## Model Description

**Qwen3-72B-Synthesis** is an experimental, 80-layer, 72-billion-parameter large language model. It represents a novel approach to model creation, designed to produce a model with the pure, modern **Qwen3 architecture** while inheriting the vast, high-quality knowledge of the 72B-scale **Qwen2.5-Instruct** model.

This was not a simple merge but a multi-phase surgical procedure involving dimensional up-scaling, architectural alignment, and a strategic "knowledge transplant" using `MergeKit`. The result is a unique checkpoint intended as a starting point for further fine-tuning.

The core philosophy was to use `Qwen/Qwen3-32B` as the architectural "foundation" and `Qwen/Qwen2.5-72B-Instruct` as the "knowledge donor."

## Model Details

* **Architecture:** Qwen3 (RMSNorm, SwiGLU, no biases, includes `q_norm` and `k_norm`)
* **Parameters:** ~72 billion
* **Layers:** 80
* **Foundation:** `Qwen/Qwen3-32B`
* **Donor:** `Qwen/Qwen2.5-72B-Instruct`
* **Tokenizer:** `Qwen/Qwen3-32B` tokenizer (`vocab_size: 151936`)
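
These numbers are easy to verify from the shipped config. A minimal sanity check (the repo id is a placeholder, matching the usage example further down):

```python
from transformers import AutoConfig, AutoTokenizer

repo = "your-username/Qwen3-72B-Synthesis"  # placeholder id

config = AutoConfig.from_pretrained(repo)
tokenizer = AutoTokenizer.from_pretrained(repo)

# The transplant should yield an 80-layer Qwen3-architecture model that
# kept the Qwen3 embedding size rather than Qwen2.5's 152064.
assert config.model_type == "qwen3"
assert config.num_hidden_layers == 80
assert config.vocab_size == 151936
print(config.hidden_size, len(tokenizer))
```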

## Model Creation Process

The creation of this model was a deliberate, three-phase process designed to overcome significant architectural incompatibilities.

### Phase 1: Foundation Upscaling

First, the `Qwen/Qwen3-32B` model (64 layers, 5120 hidden dim) was up-scaled to the target 72B dimensions using a **self-interpolation** script: new dimensions were filled by averaging different slices of the existing weights rather than by simply tiling them. This produced `Qwen3-32B-Upscaled`, a 64-layer model with the correct 72B tensor shapes and the Qwen3 architecture.
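
The upscaling script itself is not included here; the snippet below is only an illustrative sketch of the self-interpolation idea for a single 2-D weight. The 5120 and 8192 sizes are the Qwen3-32B and 72B-scale hidden dims; the function name and the exact interpolation scheme are assumptions:

```python
import torch

def interpolate_rows(weight: torch.Tensor, new_rows: int) -> torch.Tensor:
    """Grow a weight's first dimension by linear self-interpolation.

    Each new row is a weighted average of the two source rows that bracket
    its fractional position in the original matrix, not a tiled copy.
    """
    old_rows = weight.shape[0]
    # Fractional positions of the new rows inside the old index space.
    pos = torch.linspace(0, old_rows - 1, new_rows, dtype=torch.float32)
    lo = pos.floor().long()
    hi = pos.ceil().clamp(max=old_rows - 1).long()
    frac = (pos - lo.float()).unsqueeze(1).to(weight.dtype)
    return weight[lo] * (1 - frac) + weight[hi] * frac

# Example: grow a 5120x5120 projection to 8192x8192, one axis at a time.
w = torch.randn(5120, 5120, dtype=torch.bfloat16)
w_up = interpolate_rows(interpolate_rows(w, 8192).T, 8192).T
print(w_up.shape)  # torch.Size([8192, 8192])
```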

### Phase 2: Donor Alignment

The `Qwen/Qwen2.5-72B-Instruct` model was architecturally incompatible with the Qwen3 target. To solve this, a new donor model, `Qwen2.5-72B-Instruct-Aligned`, was created through the following steps (sketched in code after the list):

1. Creating an empty 80-layer model shell with the pure Qwen3 architecture.
2. Surgically removing all `.bias` tensors from the Qwen2.5 weights.
3. Truncating the Qwen2.5 embedding and language-model-head matrices from a vocabulary of 152064 to match Qwen3's 151936.
4. Loading the modified Qwen2.5 weights into the pure Qwen3 shell, resulting in a fully compatible donor model.
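
The alignment script is not published with this card; the following is a minimal sketch of steps 2 and 3, assuming the weights are handled as a plain PyTorch state dict (file names are illustrative):

```python
import torch

QWEN3_VOCAB = 151936  # Qwen3 vocabulary size; Qwen2.5 uses 152064

def align_to_qwen3(state_dict: dict) -> dict:
    """Drop Qwen2.5 bias tensors and trim vocab-sized matrices to Qwen3's size."""
    aligned = {}
    for name, tensor in state_dict.items():
        if name.endswith(".bias"):
            continue  # Qwen3 layers carry no bias terms
        if name in ("model.embed_tokens.weight", "lm_head.weight"):
            tensor = tensor[:QWEN3_VOCAB, :]  # cut the 128 surplus rows
        aligned[name] = tensor
    return aligned

# Usage (illustrative): load a shard, align it, save it back.
shard = torch.load("qwen2.5-shard-00001.bin", map_location="cpu")
torch.save(align_to_qwen3(shard), "aligned-shard-00001.bin")
```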

### Phase 3: Knowledge Transplant via MergeKit

With two architecturally compatible models, the final merge was performed using `MergeKit`. A "Knowledge Bridge" strategy was employed: a stable reasoning core is transplanted intact from the donor while the surrounding layers are blended.

The following `MergeKit` configuration was used:

```yaml
merge_method: linear
base_model: ./Qwen3-32B-Upscaled
dtype: bfloat16

slices:
  # Slice 1: Blend the bottom 32 layers
  - merge_method: linear
    sources:
      - model: ./Qwen3-32B-Upscaled
        layer_range: [0, 32]
        parameters:
          weight: 0.3
      - model: ./Qwen2.5-72B-Instruct-Aligned
        layer_range: [0, 32]
        parameters:
          weight: 0.7

  # Slice 2: The "Knowledge Bridge" - transplant a pure block from the donor
  - merge_method: passthrough
    sources:
      - model: ./Qwen2.5-72B-Instruct-Aligned
        layer_range: [32, 48]

  # Slice 3: Blend the top layers
  - merge_method: linear
    sources:
      - model: ./Qwen3-32B-Upscaled
        layer_range: [32, 64]
        parameters:
          weight: 0.3
      - model: ./Qwen2.5-72B-Instruct-Aligned
        layer_range: [48, 80]
        parameters:
          weight: 0.7

tokenizer_source: ./Qwen3-32B-Upscaled
```
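
To reproduce the merge, this configuration (saved as, say, `merge.yaml`) can be fed to mergekit's `mergekit-yaml` CLI or to its Python API. A minimal sketch of the latter, assuming a recent `mergekit` release:

```python
import yaml

from mergekit.config import MergeConfiguration
from mergekit.merge import MergeOptions, run_merge

# Parse the YAML config shown above.
with open("merge.yaml", "r", encoding="utf-8") as fp:
    merge_config = MergeConfiguration.model_validate(yaml.safe_load(fp))

# Write the merged checkpoint; copy_tokenizer honors tokenizer_source.
run_merge(
    merge_config,
    "./Qwen3-72B-Synthesis",
    options=MergeOptions(copy_tokenizer=True, lazy_unpickle=True),
)
```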

## How to Use

This model uses the standard Qwen ChatML prompt format.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-username/Qwen3-72B-Synthesis"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain the importance of the LLaMA paper in one paragraph."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

# Pass the attention mask along with the input ids.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
# Strip the prompt tokens, keeping only the newly generated ones.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

## Intended Use and Limitations

**This is an experimental model and should be considered a high-quality checkpoint, not a finished product.**

* **Fine-tuning is highly recommended.** While the model inherits knowledge from a powerful instruction model, the merging process can create slight incoherence between layers. A round of fine-tuning on a high-quality instruction dataset is necessary to harmonize the weights and unlock its full potential.
* The model may exhibit unexpected behaviors, including repetitiveness or nonsensical outputs, prior to fine-tuning.
* The model has not been aligned for safety and may produce problematic, biased, or otherwise undesirable content. The user assumes all responsibility for generated output.

## Acknowledgements

This model would not have been possible without the foundational work of Alibaba Cloud on the Qwen models and the powerful, flexible `MergeKit` toolkit created by Charles Goddard.