---
license: apache-2.0
tags:
  - mixtral
  - dense
  - mistral
  - expert
---

# Unmixtraled 22B 8x linear merge

**This model outputs gibberish, as it was never trained in this dense configuration. Fine-tuning or merging is needed to make it useful.**

This is a 22B Mistral model recycling weights from [mistral-community/Mixtral-8x22B-v0.1](https://huggingface.co/mistral-community/Mixtral-8x22B-v0.1). The model was adapted from a Mixtral architecture to a dense Mistral architecture with the same number of layers, attention heads, and hidden dimensions.
Embeddings, attention, layer norms, and LM head weights were taken directly from the 8x22B model; MLP weights are a linear merge of the weights of experts 0 through 7.
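
For illustration, the shape-preserving part of this adaptation can be sketched as below. This is hypothetical code, not the script actually used; it assumes the `transformers` `MixtralConfig`/`MistralConfig` classes:

```python
# Sketch: deriving the dense Mistral config from the Mixtral config.
# Illustrative only; the actual conversion script is not included in this repo.
from transformers import MistralConfig, MixtralConfig

moe_cfg = MixtralConfig.from_pretrained("mistral-community/Mixtral-8x22B-v0.1")

dense_cfg = MistralConfig(
    vocab_size=moe_cfg.vocab_size,
    hidden_size=moe_cfg.hidden_size,                  # same hidden dimensions
    intermediate_size=moe_cfg.intermediate_size,      # per-expert MLP width
    num_hidden_layers=moe_cfg.num_hidden_layers,      # same number of layers
    num_attention_heads=moe_cfg.num_attention_heads,  # same attention heads
    num_key_value_heads=moe_cfg.num_key_value_heads,
    max_position_embeddings=moe_cfg.max_position_embeddings,
    rms_norm_eps=moe_cfg.rms_norm_eps,
    rope_theta=moe_cfg.rope_theta,
)
```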

The following weight name correspondence was used:

| Mistral weight | Mixtral weight              |
| -------------- | --------------------------- |
| `gate_proj`    | `experts.{expert_num}.w1`   |
| `down_proj`    | `experts.{expert_num}.w2`   |
| `up_proj`      | `experts.{expert_num}.w3`   |
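
A minimal sketch of this remapping for one decoder layer follows. This is hypothetical code assuming Hugging Face Mixtral/Mistral state dict key names; the router weights (`block_sparse_moe.gate`) have no dense counterpart and are dropped:

```python
# Sketch: linear (uniform) merge of the 8 expert MLPs into dense MLP weights.
# Illustrative only; the per-expert models use a single expert's weights instead.
import torch

NUM_EXPERTS = 8

def dense_mlp_from_experts(mixtral_sd: dict, layer: int) -> dict:
    """Map one Mixtral MoE layer to dense Mistral MLP weights."""
    prefix = f"model.layers.{layer}.block_sparse_moe.experts"
    merged = {}
    for mistral_name, mixtral_name in [
        ("gate_proj", "w1"),
        ("down_proj", "w2"),
        ("up_proj", "w3"),
    ]:
        stacked = torch.stack([
            mixtral_sd[f"{prefix}.{e}.{mixtral_name}.weight"]
            for e in range(NUM_EXPERTS)
        ])
        # a uniform linear merge is the mean over experts
        merged[f"model.layers.{layer}.mlp.{mistral_name}.weight"] = stacked.mean(dim=0)
    return merged
```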

This mergekit configuration was used to merge the experts:

```yaml
models:
  - model: thomasgauthier/Unmixtraled-22B-v0.1-expert-0
  - model: thomasgauthier/Unmixtraled-22B-v0.1-expert-1
  - model: thomasgauthier/Unmixtraled-22B-v0.1-expert-2
  - model: thomasgauthier/Unmixtraled-22B-v0.1-expert-3
  - model: thomasgauthier/Unmixtraled-22B-v0.1-expert-4
  - model: thomasgauthier/Unmixtraled-22B-v0.1-expert-5
  - model: thomasgauthier/Unmixtraled-22B-v0.1-expert-6
  - model: thomasgauthier/Unmixtraled-22B-v0.1-expert-7
merge_method: linear
dtype: float16
```
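
With mergekit installed, a config like this is applied with its standard CLI, e.g. `mergekit-yaml config.yml ./Unmixtraled-22B-v0.1-lerp` (file and output paths here are illustrative).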

## Unmixtraled models

| Expert | Source | Wikitext perplexity |
| ------ | ------ | ------------------- |
| [Unmixtraled-22B-v0.1-expert-0](https://huggingface.co/thomasgauthier/Unmixtraled-22B-v0.1-expert-0) | Mixtral 8x22B embed, attn, layernorm, lm_head + expert 0 MLPs | 696.6932983398438 |
| [Unmixtraled-22B-v0.1-expert-1](https://huggingface.co/thomasgauthier/Unmixtraled-22B-v0.1-expert-1) | Mixtral 8x22B embed, attn, layernorm, lm_head + expert 1 MLPs | 6853.04248046875 |
| [Unmixtraled-22B-v0.1-expert-2](https://huggingface.co/thomasgauthier/Unmixtraled-22B-v0.1-expert-2) | Mixtral 8x22B embed, attn, layernorm, lm_head + expert 2 MLPs | 4689.181640625 |
| [Unmixtraled-22B-v0.1-expert-3](https://huggingface.co/thomasgauthier/Unmixtraled-22B-v0.1-expert-3) | Mixtral 8x22B embed, attn, layernorm, lm_head + expert 3 MLPs | 782.3755493164062 |
| [Unmixtraled-22B-v0.1-expert-4](https://huggingface.co/thomasgauthier/Unmixtraled-22B-v0.1-expert-4) | Mixtral 8x22B embed, attn, layernorm, lm_head + expert 4 MLPs | 2844.943603515625 |
| [Unmixtraled-22B-v0.1-expert-5](https://huggingface.co/thomasgauthier/Unmixtraled-22B-v0.1-expert-5) | Mixtral 8x22B embed, attn, layernorm, lm_head + expert 5 MLPs | 1099.32373046875 |
| [Unmixtraled-22B-v0.1-expert-6](https://huggingface.co/thomasgauthier/Unmixtraled-22B-v0.1-expert-6) | Mixtral 8x22B embed, attn, layernorm, lm_head + expert 6 MLPs | 341.5309753417969 |
| [Unmixtraled-22B-v0.1-expert-7](https://huggingface.co/thomasgauthier/Unmixtraled-22B-v0.1-expert-7) | Mixtral 8x22B embed, attn, layernorm, lm_head + expert 7 MLPs | 2099.63818359375 |
| [Unmixtraled-22B-v0.1-lerp](https://huggingface.co/thomasgauthier/Unmixtraled-22B-v0.1-lerp) | Mixtral 8x22B embed, attn, layernorm, lm_head + linear merge of expert 0-7 MLPs | 1873.9874267578125 |
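
The exact evaluation setup is not documented here; the sketch below shows one common way such Wikitext perplexity numbers are computed. The `wikitext-2-raw-v1` test split, 2048-token window, and 512-token stride are assumptions, not the confirmed setup:

```python
# Sketch: sliding-window Wikitext perplexity. Illustrative only; the dataset
# config, window size, and stride used for the table above are assumptions.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "thomasgauthier/Unmixtraled-22B-v0.1-lerp"  # any model from the table
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
input_ids = tokenizer(text, return_tensors="pt").input_ids

max_len, stride = 2048, 512
nlls, prev_end = [], 0
for begin in range(0, input_ids.size(1), stride):
    end = min(begin + max_len, input_ids.size(1))
    trg_len = end - prev_end  # only count tokens not scored in a previous window
    ids = input_ids[:, begin:end].to(model.device)
    labels = ids.clone()
    labels[:, :-trg_len] = -100  # mask the overlapping context
    with torch.no_grad():
        # loss is the mean NLL over unmasked labels; rescale to a sum (approximate)
        nlls.append(model(ids, labels=labels).loss * trg_len)
    prev_end = end
    if end == input_ids.size(1):
        break

print("perplexity:", torch.exp(torch.stack(nlls).sum() / prev_end).item())
```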