---
pipeline_tag: any-to-any
---

This is the Chameleon-7b checkpoint, converted using the script [convert_chameleon_weights_to_hf.py](https://github.com/Alpha-VLLM/Lumina-mGPT/blob/main/lumina_mgpt/model/chameleon/convert_chameleon_weights_to_hf.py) from the [Lumina-mGPT](https://github.com/Alpha-VLLM/Lumina-mGPT) repository. This release is intended to ease the initialization of Lumina-mGPT training. Before using this model, please ensure you have obtained permission to access the official Chameleon checkpoints available at [Hugging Face](https://huggingface.co/facebook/chameleon-7b). Usage of this model is at the user's own risk.

## Differences from the official chameleon-7B release

This model is **almost the same** as the official chameleon-7B release, with one important difference in the *qk-norm* implementation.

For unknown reasons, in the 34B Chameleon model, which was trained with 8-way model parallelism, the weights of the qk-norm layers, which are expected to be identical across model-parallel ranks, turn out to differ (see [here](https://github.com/huggingface/transformers/pull/31534#issuecomment-2207354677) for details). Intuitively, this means the attention heads can be divided into 1 group for the 7B model and 8 groups for the 34B model, where the qk-norm parameters are identical within a group but differ between groups. To mitigate this, the `transformers` implementation copies the qk-norm parameters to shape `num_heads * head_dim`. However, if the Chameleon model is further finetuned, as in the case of Lumina-mGPT, the qk-norm parameters will diverge further, until they differ between every pair of attention heads, which is not ideal.

To solve this problem, we slightly change the implementation so that the qk-norm parameters instead have shape `model_parallel_size * head_dim`, where `model_parallel_size` is 1 for the 7B model and 8 for the 34B model, and they are expanded to `num_heads * head_dim` at forward time through `repeat_interleave`. This modification ensures that the qk-norm parameters always remain consistent within the existing groups.
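
The expansion can be illustrated with a minimal PyTorch sketch. The sizes below (64 heads of dimension 128 under 8-way model parallelism) are illustrative assumptions, not the exact configuration or code used in Lumina-mGPT:

```python
import torch

# Assumed sizes for illustration: 64 attention heads of dimension 128,
# trained with 8-way model parallelism (model_parallel_size is 1 for the 7B model).
num_heads = 64
head_dim = 128
model_parallel_size = 8

# Stored qk-norm weight: one set of parameters per model-parallel rank.
q_norm_weight = torch.nn.Parameter(torch.ones(model_parallel_size * head_dim))

# At forward time, expand to one set of parameters per attention head.
heads_per_rank = num_heads // model_parallel_size
expanded = q_norm_weight.view(model_parallel_size, head_dim).repeat_interleave(
    heads_per_rank, dim=0
)  # shape: (num_heads, head_dim)

# Heads that belonged to the same model-parallel rank share identical parameters,
# so the group structure is preserved even if the stored weights are finetuned.
assert expanded.shape == (num_heads, head_dim)
```

Because only `model_parallel_size * head_dim` parameters are stored, finetuning updates a whole group's shared parameters at once, so the qk-norm weights cannot drift apart between heads of the same group.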