---
license: apache-2.0
base_model:
- Qwen/Qwen3-32B
- Qwen/Qwen2.5-72B-Instruct
tags:
- merge
- frankenmerge
- qwen
---
# Qwen3-72B-Synthesis
**Note:** This still doesn't work; I'm trying to fix it.
A Qwen3-Architecture 72B Model Forged from `Qwen3-32B` and `Qwen2.5-72B-Instruct`.
## Model Description
**Qwen3-72B-Synthesis** is an experimental, 80-layer, 72-billion-parameter large language model. It represents a novel approach to model creation, designed to produce a model with the pure, modern **Qwen3 architecture** while inheriting the vast, high-quality knowledge of the 72B-scale **Qwen2.5-Instruct** model.
This was not a simple merge. It was a multi-phase surgical procedure involving dimensional up-scaling, architectural alignment, and a strategic "knowledge transplant" using `MergeKit`. The result is a unique checkpoint that serves as an ideal starting point for further fine-tuning.
The core philosophy was to use `Qwen/Qwen3-32B` as the architectural "foundation" and `Qwen/Qwen2.5-72B-Instruct` as the "knowledge donor."
## Model Details
* **Architecture:** Qwen3 (RMSNorm, SwiGLU, no biases, includes `q_norm` and `k_norm`)
* **Parameters:** ~72 Billion
* **Layers:** 80
* **Foundation:** `Qwen/Qwen3-32B`
* **Donor:** `Qwen/Qwen2.5-72B-Instruct`
* **Tokenizer:** `Qwen/Qwen3-32B` Tokenizer (`vocab_size: 151936`)
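
If the checkpoint is available, these details can be sanity-checked from the model config. A minimal sketch; the expected values in the comments simply restate the list above:

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("cognitivecomputations/Qwen3-72B-Synthesis")
print(cfg.num_hidden_layers)  # expected: 80
print(cfg.vocab_size)         # expected: 151936 (Qwen3 tokenizer)
```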
## Model Creation Process
The creation of this model was a deliberate, three-phase process designed to overcome significant architectural incompatibilities.
### Phase 1: Foundation Upscaling
First, the `Qwen/Qwen3-32B` model (64 layers, hidden size 5120) was up-scaled to match the target 72B dimensions. This was done with a **self-interpolation** script: new dimensions were created by averaging different slices of the existing weights, rather than by simple tiling. The result, `Qwen3-32B-Upscaled`, is a 64-layer model with the correct 72B tensor shapes and the Qwen3 architecture.
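
The actual up-scaling script is not included here, but the core idea can be sketched. The function below is a hypothetical illustration, not the script that was used: it widens one weight matrix by linearly interpolating between neighboring slices, and the 8192 target width is assumed to match the Qwen2.5-72B hidden size.

```python
import torch

def interpolate_dim(weight: torch.Tensor, new_dim: int) -> torch.Tensor:
    """Widen `weight` along dim 0 by averaging neighboring rows
    (interpolation between slices, not tiling). Hypothetical sketch."""
    old_dim = weight.shape[0]
    # Map each new row to a fractional position among the old rows,
    # then blend the two nearest old rows.
    positions = torch.linspace(0, old_dim - 1, new_dim)
    lo = positions.floor().long()
    hi = positions.ceil().long()
    frac = (positions - lo.float()).unsqueeze(-1).to(weight.dtype)
    return weight[lo] * (1 - frac) + weight[hi] * frac

# Illustration: widen a square 32B projection (hidden 5120) to 72B width (8192),
# first along the rows, then along the columns via a transpose.
w = torch.randn(5120, 5120, dtype=torch.bfloat16)
w_up = interpolate_dim(interpolate_dim(w, 8192).T, 8192).T.contiguous()
print(w_up.shape)  # torch.Size([8192, 8192])
```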
### Phase 2: Donor Alignment
The `Qwen/Qwen2.5-72B-Instruct` model was architecturally incompatible with the Qwen3 target. To solve this, a new donor model, `Qwen2.5-72B-Instruct-Aligned`, was created. The process involved the following steps (sketched in code below):
1. Creating an empty 80-layer model shell with the pure Qwen3 architecture.
2. Surgically removing all `.bias` tensors from the Qwen2.5 weights.
3. Truncating the Qwen2.5 embedding and language model head layers from a vocabulary of 152064 to match Qwen3's 151936.
4. Loading the modified Qwen2.5 weights into the pure Qwen3 shell, resulting in a perfectly compatible donor model.
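
A minimal sketch of steps 2 and 3, assuming safetensors shards and the standard Qwen tensor names (`embed_tokens.weight`, `lm_head.weight`); the helper and file names are hypothetical. Qwen3's extra `q_norm`/`k_norm` tensors are presumably left at the shell's initialization, since Qwen2.5 has no counterpart.

```python
from safetensors.torch import load_file, save_file

QWEN3_VOCAB = 151936  # Qwen2.5 checkpoints use 152064; drop the trailing 128 rows

def align_shard(path_in: str, path_out: str) -> None:
    state = load_file(path_in)
    aligned = {}
    for name, tensor in state.items():
        if name.endswith(".bias"):
            continue  # the Qwen3 architecture carries no bias terms
        if name.endswith(("embed_tokens.weight", "lm_head.weight")):
            tensor = tensor[:QWEN3_VOCAB]  # truncate the vocabulary dimension
        aligned[name] = tensor
    save_file(aligned, path_out)

align_shard("qwen2.5-shard-00001.safetensors", "aligned-shard-00001.safetensors")
```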
### Phase 3: Knowledge Transplant via MergeKit
With two architecturally compatible models, the final merge was performed using `MergeKit`. A "Knowledge Bridge" strategy was employed to transplant a stable reasoning core from the donor while blending the rest.
The following `MergeKit` configuration was used:
```yaml
merge_method: linear
base_model: ./Qwen3-32B-Upscaled
dtype: bfloat16
slices:
  # Slice 1: Blend the bottom 32 layers
  - merge_method: linear
    sources:
      - model: ./Qwen3-32B-Upscaled
        layer_range: [0, 32]
        parameters:
          weight: 0.5
      - model: ./Qwen2.5-72B-Instruct-Aligned
        layer_range: [0, 32]
        parameters:
          weight: 0.5
  # Slice 2: The "Knowledge Bridge" - transplant a pure block from the donor
  - merge_method: passthrough
    sources:
      - model: ./Qwen2.5-72B-Instruct-Aligned
        layer_range: [32, 48]
  # Slice 3: Blend the top layers
  - merge_method: linear
    sources:
      - model: ./Qwen3-32B-Upscaled
        layer_range: [32, 64]
        parameters:
          weight: 0.5
      - model: ./Qwen2.5-72B-Instruct-Aligned
        layer_range: [48, 80]
        parameters:
          weight: 0.5
tokenizer_source: ./Qwen3-32B-Upscaled
```
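With `MergeKit` installed, merge configs are typically run with its `mergekit-yaml` command-line tool, e.g. `mergekit-yaml config.yaml ./output-dir`, which writes the merged checkpoint into the output directory.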
## How to Use
This model uses the standard Qwen ChatML prompt format.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cognitivecomputations/Qwen3-72B-Synthesis"
device = "cuda"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain the importance of the LLaMA paper in one paragraph."},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512,
)
# Drop the prompt tokens so only the newly generated text is decoded.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
## Intended Use and Limitations
**This is an experimental model and should be considered a high-quality checkpoint, not a finished product.**
* **Fine-tuning is highly recommended.** While it inherits knowledge from a powerful instruction model, the merging process can create slight incoherence between layers. A round of fine-tuning on a high-quality instruction dataset is necessary to harmonize the weights and unlock its full potential.
* The model may exhibit unexpected behaviors, including repetitiveness or nonsensical outputs, prior to fine-tuning.
* This model has not been aligned for safety and may produce problematic, biased, or otherwise undesirable content. The user assumes all responsibility for the output generated.
## Acknowledgements
This model would not have been possible without the foundational work of Alibaba Cloud on the Qwen models, and the powerful, flexible `MergeKit` toolkit created by Charles Goddard and Arcee.ai.