---
license: apache-2.0
base_model:
- Qwen/Qwen3-32B
- Qwen/Qwen2.5-72B-Instruct
tags:
- merge
- frankenmerge
- qwen
---

# Qwen3-72B-Synthesis

A Qwen3-Architecture 72B Model Forged from `Qwen3-32B` and `Qwen2.5-72B-Instruct`.

## Model Description

**Qwen3-72B-Synthesis** is an experimental, 80-layer, 72-billion-parameter large language model. It represents a novel approach to model creation, designed to produce a model with the pure, modern **Qwen3 architecture** while inheriting the vast, high-quality knowledge of the 72B-scale **Qwen2.5-Instruct** model.

This was not a simple merge. It was a multi-phase surgical procedure involving dimensional up-scaling, architectural alignment, and a strategic "knowledge transplant" using `MergeKit`. The result is a unique checkpoint that serves as an ideal starting point for further fine-tuning.

The core philosophy was to use `Qwen/Qwen3-32B` as the architectural "foundation" and `Qwen/Qwen2.5-72B-Instruct` as the "knowledge donor."

## Model Details

* **Architecture:** Qwen3 (RMSNorm, SwiGLU, no biases, includes `q_norm` and `k_norm`)
* **Parameters:** ~72 billion
* **Layers:** 80
* **Foundation:** `Qwen/Qwen3-32B`
* **Donor:** `Qwen/Qwen2.5-72B-Instruct`
* **Tokenizer:** `Qwen/Qwen3-32B` tokenizer (`vocab_size: 151936`)

## Model Creation Process

The creation of this model was a deliberate, three-phase process designed to overcome significant architectural incompatibilities.

### Phase 1: Foundation Upscaling

First, the `Qwen/Qwen3-32B` model (64 layers, 5120 hidden dim) was up-scaled to match the target 72B dimensions. This was done using a **self-interpolation** script, in which new dimensions were created by averaging different slices of the existing weights rather than by simple tiling. This produced `Qwen3-32B-Upscaled`, a 64-layer model with the correct 72B tensor shapes and Qwen3 architecture.
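
The upscaling script itself is not published here, but the core idea can be sketched. In this minimal illustration, `upscale_dim` is a hypothetical helper (an assumption, not the actual script) that grows one dimension of a weight tensor by appending averages of adjacent existing rows:

```python
import torch

def upscale_dim(weight: torch.Tensor, dim: int, new_size: int) -> torch.Tensor:
    """Grow `weight` along `dim` by self-interpolation: each new row is the
    mean of a small window of existing rows, not a tiled copy."""
    old_size = weight.shape[dim]
    w = weight.movedim(dim, 0)
    # Evenly spaced two-row windows over the original rows; averaging each
    # window synthesizes one new row.
    starts = torch.linspace(0, old_size - 2, new_size - old_size).long().tolist()
    new_rows = torch.stack([w[s : s + 2].mean(dim=0) for s in starts])
    return torch.cat([w, new_rows], dim=0).movedim(0, dim)

# Example: widen a 5120-wide projection to the 72B width of 8192.
w_up = upscale_dim(torch.randn(5120, 5120), 0, 8192)
assert w_up.shape == (8192, 5120)
```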

### Phase 2: Donor Alignment

The `Qwen/Qwen2.5-72B-Instruct` model was architecturally incompatible with the Qwen3 target. To solve this, a new donor model, `Qwen2.5-72B-Instruct-Aligned`, was created. The process, sketched in code after this list, involved:

1. Creating an empty 80-layer model shell with the pure Qwen3 architecture.
2. Surgically removing all `.bias` tensors from the Qwen2.5 weights.
3. Truncating the Qwen2.5 embedding and language-model head from a vocabulary of 152064 to match Qwen3's 151936.
4. Loading the modified Qwen2.5 weights into the pure Qwen3 shell, resulting in a fully compatible donor model.
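
A condensed sketch of steps 2-4, assuming the donor weights are held in a standard `state_dict` with the usual Qwen key names (an illustration, not the actual alignment script):

```python
import torch
from transformers import AutoModelForCausalLM

QWEN3_VOCAB = 151936  # Qwen3 tokenizer vocabulary size

donor = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-72B-Instruct", torch_dtype=torch.bfloat16
)
state = donor.state_dict()

# Step 2: drop every `.bias` tensor (the Qwen3 architecture has no biases).
state = {k: v for k, v in state.items() if not k.endswith(".bias")}

# Step 3: truncate the embedding and LM head from 152064 to 151936 rows.
for key in ("model.embed_tokens.weight", "lm_head.weight"):
    state[key] = state[key][:QWEN3_VOCAB]

# Step 4: load into an empty 80-layer Qwen3 shell. strict=False is needed
# because the shell's `q_norm`/`k_norm` weights have no donor counterpart
# and keep their initialized values.
# qwen3_shell.load_state_dict(state, strict=False)
```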

### Phase 3: Knowledge Transplant via MergeKit

With two architecturally compatible models, the final merge was performed using `MergeKit`. A "Knowledge Bridge" strategy was employed to transplant a stable reasoning core from the donor while blending the rest.

The following `MergeKit` configuration was used:

```yaml
merge_method: linear
base_model: ./Qwen3-32B-Upscaled
dtype: bfloat16

slices:
  # Slice 1: Blend the bottom 32 layers
  - merge_method: linear
    sources:
      - model: ./Qwen3-32B-Upscaled
        layer_range: [0, 32]
        parameters:
          weight: 0.3
      - model: ./Qwen2.5-72B-Instruct-Aligned
        layer_range: [0, 32]
        parameters:
          weight: 0.7

  # Slice 2: The "Knowledge Bridge" - transplant a pure block from the donor
  - merge_method: passthrough
    sources:
      - model: ./Qwen2.5-72B-Instruct-Aligned
        layer_range: [32, 48]

  # Slice 3: Blend the top layers
  - merge_method: linear
    sources:
      - model: ./Qwen3-32B-Upscaled
        layer_range: [32, 64]
        parameters:
          weight: 0.3
      - model: ./Qwen2.5-72B-Instruct-Aligned
        layer_range: [48, 80]
        parameters:
          weight: 0.7

tokenizer_source: ./Qwen3-32B-Upscaled
```
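
Assuming a standard `MergeKit` installation, a configuration like this is typically executed with `mergekit-yaml config.yaml ./Qwen3-72B-Synthesis --cuda`, where the output path is illustrative.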

## How to Use

This model uses the standard Qwen ChatML prompt format.
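
For reference, the prompt string produced by `apply_chat_template` in the example below follows the ChatML layout (shown schematically; the exact template, including any thinking-related tokens, comes from the bundled Qwen3 tokenizer):

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Explain the importance of the LLaMA paper in one paragraph.<|im_end|>
<|im_start|>assistant
```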
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-username/Qwen3-72B-Synthesis"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain the importance of the LLaMA paper in one paragraph."}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

## Intended Use and Limitations

**This is an experimental model and should be considered a high-quality checkpoint, not a finished product.**

* **Fine-tuning is highly recommended.** While the model inherits knowledge from a powerful instruction model, the merging process can create slight incoherence between layers. A round of fine-tuning on a high-quality instruction dataset is necessary to harmonize the weights and unlock the model's full potential; a minimal LoRA sketch follows this list.
* The model may exhibit unexpected behaviors, including repetitiveness or nonsensical outputs, prior to fine-tuning.
* This model has not been aligned for safety and may produce problematic, biased, or otherwise undesirable content. The user assumes all responsibility for generated output.
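
As a starting point for the recommended fine-tuning, a minimal LoRA setup with `peft` might look like the following sketch (the hyperparameters are illustrative and not tuned for this model; `model` is the checkpoint loaded in the usage example above):

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Qwen attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only adapter weights are trainable
```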

## Acknowledgements

This model would not have been possible without the foundational work of Alibaba Cloud on the Qwen models, and the powerful, flexible `MergeKit` toolkit created by Charles Goddard.