---
license: apache-2.0
base_model:
- Qwen/Qwen3-32B
- Qwen/Qwen2.5-72B-Instruct
tags:
- merge
- frankenmerge
- qwen
---
|
|
|
# Qwen3-72B-Synthesis |
|
|
|
**Note:** This model does not work yet; a fix is in progress.
|
|
|
A Qwen3-architecture 72B model forged from `Qwen3-32B` and `Qwen2.5-72B-Instruct`.
|
|
|
## Model Description |
|
|
|
**Qwen3-72B-Synthesis** is an experimental, 80-layer, 72-billion-parameter large language model. It represents a novel approach to model creation, designed to produce a model with the pure, modern **Qwen3 architecture** while inheriting the vast, high-quality knowledge of the 72B-scale **Qwen2.5-Instruct** model. |
|
|
|
This was not a simple merge. It was a multi-phase surgical procedure involving dimensional up-scaling, architectural alignment, and a strategic "knowledge transplant" using `MergeKit`. The result is a unique checkpoint that serves as an ideal starting point for further fine-tuning. |
|
|
|
The core philosophy was to use `Qwen/Qwen3-32B` as the architectural "foundation" and `Qwen/Qwen2.5-72B-Instruct` as the "knowledge donor." |
|
|
|
## Model Details |
|
|
|
* **Architecture:** Qwen3 (RMSNorm, SwiGLU, no biases, includes `q_norm` and `k_norm`) |
|
* **Parameters:** ~72 Billion |
|
* **Layers:** 80 |
|
* **Foundation:** `Qwen/Qwen3-32B` |
|
* **Donor:** `Qwen/Qwen2.5-72B-Instruct` |
|
* **Tokenizer:** `Qwen/Qwen3-32B` tokenizer (`vocab_size: 151936`); a quick sanity check of this geometry is sketched below
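
The target geometry can be verified against the published checkpoint without pulling the full weights, since only `config.json` is needed. A minimal sketch (the repository id is taken from the usage example further down; adjust it if the checkpoint lives elsewhere):

```python
from transformers import AutoConfig

# Fetch only the configuration, not the 72B weights.
config = AutoConfig.from_pretrained(
    "cognitivecomputations/Qwen3-72B-Synthesis",
    trust_remote_code=True,
)

assert config.num_hidden_layers == 80   # 80 transformer layers
assert config.vocab_size == 151936      # Qwen3 tokenizer vocabulary
print(config.model_type, config.hidden_size)  # should report a Qwen3-family config
```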
|
|
|
## Model Creation Process |
|
|
|
The creation of this model was a deliberate, three-phase process designed to overcome significant architectural incompatibilities. |
|
|
|
### Phase 1: Foundation Upscaling |
|
|
|
First, the `Qwen/Qwen3-32B` model (64 layers, hidden size 5120) was up-scaled to the tensor shapes of the 72B target. This was done with a **self-interpolation** script: new rows and columns were created by averaging slices of the existing weights rather than by simply tiling them. The result, `Qwen3-32B-Upscaled`, is a 64-layer model with 72B-scale tensor shapes and the Qwen3 architecture.
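
The upscaling script itself is not reproduced here, but the idea can be illustrated with a short sketch. The function name and the linear row-blending scheme below are illustrative assumptions rather than the exact code used; 8192 is Qwen2.5-72B's hidden size, the target of the up-scaling.

```python
import torch

def self_interpolate(weight: torch.Tensor, new_rows: int) -> torch.Tensor:
    """Grow the first dimension of `weight` by blending neighbouring rows
    of the original tensor, rather than tiling copies of them."""
    old_rows = weight.shape[0]
    # Fractional position of each new row inside the old index range.
    positions = torch.linspace(0, old_rows - 1, new_rows)
    lower = positions.floor().long()
    upper = positions.ceil().long()
    frac = (positions - lower.float()).unsqueeze(-1).to(weight.dtype)
    # Each new row is a weighted average of its two nearest original rows.
    return weight[lower] * (1 - frac) + weight[upper] * frac

# Example: grow a 5120 x 5120 projection to 8192 x 8192.
w = torch.randn(5120, 5120, dtype=torch.bfloat16)
w = self_interpolate(w, 8192)        # rows:    5120 -> 8192
w = self_interpolate(w.T, 8192).T    # columns: 5120 -> 8192
print(w.shape)                       # torch.Size([8192, 8192])
```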
|
|
|
### Phase 2: Donor Alignment |
|
|
|
The `Qwen/Qwen2.5-72B-Instruct` model was architecturally incompatible with the Qwen3 target: it carries bias tensors that Qwen3 omits and uses a larger vocabulary. To solve this, a new donor model, `Qwen2.5-72B-Instruct-Aligned`, was created. The process involved the following steps (steps 2 and 3 are sketched in code after the list):
|
1. Creating an empty 80-layer model shell with the pure Qwen3 architecture. |
|
2. Surgically removing all `.bias` tensors from the Qwen2.5 weights. |
|
3. Truncating the Qwen2.5 embedding and language model head layers from a vocabulary of 152064 to match Qwen3's 151936. |
|
4. Loading the modified Qwen2.5 weights into the pure Qwen3 shell, resulting in a perfectly compatible donor model. |
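
Steps 2 and 3 amount to straightforward state-dict surgery. The sketch below is illustrative, assuming the standard Qwen2.5 tensor names; shard loading and saving, and the shell creation of step 1, are omitted:

```python
import torch

def align_qwen25_to_qwen3(state_dict: dict[str, torch.Tensor],
                          target_vocab: int = 151936) -> dict[str, torch.Tensor]:
    """Drop Qwen2.5 bias tensors and truncate the vocabulary so the weights
    can be loaded into a bias-free Qwen3 shell."""
    aligned = {}
    for name, tensor in state_dict.items():
        if name.endswith(".bias"):
            continue                        # Qwen3 projections carry no biases
        if name in ("model.embed_tokens.weight", "lm_head.weight"):
            tensor = tensor[:target_vocab]  # 152064 -> 151936 rows
        aligned[name] = tensor
    return aligned
```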
|
|
|
### Phase 3: Knowledge Transplant via MergeKit |
|
|
|
With two architecturally compatible models, the final merge was performed using `MergeKit`. A "Knowledge Bridge" strategy was employed to transplant a stable reasoning core from the donor while blending the rest.
|
|
|
The following `MergeKit` configuration was used: |
|
|
|
```yaml
merge_method: linear
base_model: ./Qwen3-32B-Upscaled
dtype: bfloat16

slices:
  # Slice 1: Blend the bottom 32 layers
  - merge_method: linear
    sources:
      - model: ./Qwen3-32B-Upscaled
        layer_range: [0, 32]
        parameters:
          weight: 0.5
      - model: ./Qwen2.5-72B-Instruct-Aligned
        layer_range: [0, 32]
        parameters:
          weight: 0.5

  # Slice 2: The "Knowledge Bridge" - transplant a pure block from the donor
  - merge_method: passthrough
    sources:
      - model: ./Qwen2.5-72B-Instruct-Aligned
        layer_range: [32, 48]

  # Slice 3: Blend the top layers
  - merge_method: linear
    sources:
      - model: ./Qwen3-32B-Upscaled
        layer_range: [32, 64]
        parameters:
          weight: 0.5
      - model: ./Qwen2.5-72B-Instruct-Aligned
        layer_range: [48, 80]
        parameters:
          weight: 0.5

tokenizer_source: ./Qwen3-32B-Upscaled
```
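
With both intermediate models on disk, the merge can then be run with MergeKit's CLI, for example `mergekit-yaml merge.yaml ./Qwen3-72B-Synthesis`, where `merge.yaml` is simply the configuration above saved to a file.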
|
|
|
## How to Use |
|
|
|
This model uses the standard Qwen ChatML prompt format. |
|
|
|
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cognitivecomputations/Qwen3-72B-Synthesis"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain the importance of the LLaMA paper in one paragraph."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
|
|
|
## Intended Use and Limitations |
|
|
|
**This is an experimental model and should be considered a high-quality checkpoint, not a finished product.** |
|
|
|
* **Fine-tuning is highly recommended.** While it inherits knowledge from a powerful instruction model, the merging process can create slight incoherence between layers. A round of fine-tuning on a high-quality instruction dataset is necessary to harmonize the weights and unlock its full potential (a minimal LoRA setup is sketched after this list).
|
* The model may exhibit unexpected behaviors, including repetitiveness or nonsensical outputs, prior to fine-tuning. |
|
* This model has not been aligned for safety and may produce problematic, biased, or otherwise undesirable content. The user assumes all responsibility for the output generated. |
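
For the recommended harmonization pass, a parameter-efficient method such as LoRA keeps memory requirements manageable at this scale. The sketch below uses `peft`; the rank, alpha, and target modules are illustrative defaults, not tuned values:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "cognitivecomputations/Qwen3-72B-Synthesis",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Adapt only the attention projections; values here are placeholders.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# From here, train with transformers.Trainer or TRL's SFTTrainer on a
# high-quality instruction dataset.
```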
|
|
|
## Acknowledgements |
|
|
|
This model would not have been possible without the foundational work of Alibaba Cloud on the Qwen models, and the powerful, flexible `MergeKit` toolkit created by Charles Goddard and Arcee.ai. |