---
library_name: transformers
tags:
- mergekit
- block expansion
- progressive mistral
- arcee cpt
---

# Mistral-7B-Instruct-v0.2-expanded

This model uses mergekit's passthrough merge method to expand the blocks of "mistralai/Mistral-7B-Instruct-v0.2". After every fourth layer of the base model, a copy of the preceding layer is inserted, growing the stack from 32 to 40 layers so that every 5th layer of the expanded model is a newly added block. The `o_proj` and `down_proj` parameters of these added layers are initialized to zero, mirroring the approach used in LLaMA Pro.
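
For reference, the layer bookkeeping works out as follows. This is a purely illustrative sketch; the variable names are arbitrary and the numbers come directly from the configuration below:

```python
# Each of the 8 slices keeps 4 original layers and appends 1 duplicated layer,
# so the expanded model has 8 * 5 = 40 layers and the new blocks sit at
# 0-indexed positions 4, 9, 14, ..., 39 -- i.e. every 5th layer.
n_slices, layers_per_slice = 8, 4
new_layer_indices = [s * (layers_per_slice + 1) + layers_per_slice for s in range(n_slices)]
print(new_layer_indices)  # [4, 9, 14, 19, 24, 29, 34, 39]
```
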
### Note: this configuration has not undergone fine-tuning, so it is not meant to be used as-is. When fine-tuning, ensure that only every 5th layer is trainable while all other layers remain frozen.
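
To see which blocks are the freshly inserted ones before training, you can inspect the `o_proj` and `down_proj` weights of every 5th layer; in this checkpoint they should be exactly zero. This is a minimal sketch assuming the standard `transformers` Mistral module layout (`model.model.layers[i].self_attn.o_proj` and `model.model.layers[i].mlp.down_proj`):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "arcee-ai/Mistral-7B-Instruct-v0.2-expanded", torch_dtype=torch.bfloat16
)

for i, layer in enumerate(model.model.layers):
    if (i + 1) % 5 == 0:  # the duplicated blocks: indices 4, 9, ..., 39
        o_sum = layer.self_attn.o_proj.weight.abs().sum().item()
        d_sum = layer.mlp.down_proj.weight.abs().sum().item()
        print(f"layer {i:2d}: |o_proj| = {o_sum:.1f}, |down_proj| = {d_sum:.1f}")  # expect 0.0
```
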
## 🧩 Configuration

```yaml
slices:
  - sources:
      - model: mistralai/Mistral-7B-Instruct-v0.2
        layer_range: [0, 4]
  - sources:
      - model: mistralai/Mistral-7B-Instruct-v0.2
        layer_range: [3, 4]
        parameters:
          scale:
            - filter: o_proj
              value: 0.0
            - filter: down_proj
              value: 0.0
            - value: 1.0

  - sources:
      - model: mistralai/Mistral-7B-Instruct-v0.2
        layer_range: [4, 8]
  - sources:
      - model: mistralai/Mistral-7B-Instruct-v0.2
        layer_range: [7, 8]
        parameters:
          scale:
            - filter: o_proj
              value: 0.0
            - filter: down_proj
              value: 0.0
            - value: 1.0

  - sources:
      - model: mistralai/Mistral-7B-Instruct-v0.2
        layer_range: [8, 12]
  - sources:
      - model: mistralai/Mistral-7B-Instruct-v0.2
        layer_range: [11, 12]
        parameters:
          scale:
            - filter: o_proj
              value: 0.0
            - filter: down_proj
              value: 0.0
            - value: 1.0

  - sources:
      - model: mistralai/Mistral-7B-Instruct-v0.2
        layer_range: [12, 16]
  - sources:
      - model: mistralai/Mistral-7B-Instruct-v0.2
        layer_range: [15, 16]
        parameters:
          scale:
            - filter: o_proj
              value: 0.0
            - filter: down_proj
              value: 0.0
            - value: 1.0

  - sources:
      - model: mistralai/Mistral-7B-Instruct-v0.2
        layer_range: [16, 20]
  - sources:
      - model: mistralai/Mistral-7B-Instruct-v0.2
        layer_range: [19, 20]
        parameters:
          scale:
            - filter: o_proj
              value: 0.0
            - filter: down_proj
              value: 0.0
            - value: 1.0

  - sources:
      - model: mistralai/Mistral-7B-Instruct-v0.2
        layer_range: [20, 24]
  - sources:
      - model: mistralai/Mistral-7B-Instruct-v0.2
        layer_range: [23, 24]
        parameters:
          scale:
            - filter: o_proj
              value: 0.0
            - filter: down_proj
              value: 0.0
            - value: 1.0

  - sources:
      - model: mistralai/Mistral-7B-Instruct-v0.2
        layer_range: [24, 28]
  - sources:
      - model: mistralai/Mistral-7B-Instruct-v0.2
        layer_range: [27, 28]
        parameters:
          scale:
            - filter: o_proj
              value: 0.0
            - filter: down_proj
              value: 0.0
            - value: 1.0

  - sources:
      - model: mistralai/Mistral-7B-Instruct-v0.2
        layer_range: [28, 32]
  - sources:
      - model: mistralai/Mistral-7B-Instruct-v0.2
        layer_range: [31, 32]
        parameters:
          scale:
            - filter: o_proj
              value: 0.0
            - filter: down_proj
              value: 0.0
            - value: 1.0

merge_method: passthrough
dtype: bfloat16
```
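
To reproduce the merge, save the YAML above to a file (`expand.yaml` below is a placeholder name) and run it through mergekit, either with the `mergekit-yaml` CLI or from Python. The sketch below follows mergekit's documented Python usage, but the exact options and API surface can differ between mergekit versions, so treat it as an assumption and check the mergekit documentation:

```python
import yaml
from mergekit.config import MergeConfiguration
from mergekit.merge import MergeOptions, run_merge

# "expand.yaml" is a placeholder path for the configuration shown above.
with open("expand.yaml", encoding="utf-8") as fp:
    merge_config = MergeConfiguration.model_validate(yaml.safe_load(fp))

# Writes the expanded model to the given output directory.
run_merge(
    merge_config,
    "./Mistral-7B-Instruct-v0.2-expanded",
    options=MergeOptions(copy_tokenizer=True, lazy_unpickle=True),
)
```
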
## Function to freeze layers

```python
from transformers import AutoModelForCausalLM


def enable_grad_only_every_nth(model, n):
    """
    Enable gradient computation only for every nth decoder layer (0-indexed positions
    n-1, 2n-1, ...), which are the newly added blocks, while freezing the token
    embeddings, the LM head, and all remaining decoder layers. This is intended for
    fine-tuning setups where only the newly integrated layers should be updated and
    the pre-trained components should keep their original behavior.
    """
    # Freeze the token embeddings.
    for param in model.model.embed_tokens.parameters():
        param.requires_grad = False

    # Freeze the LM head.
    for param in model.lm_head.parameters():
        param.requires_grad = False

    # Walk over the decoder layers and unfreeze only every nth one.
    layers = model.model.layers  # ModuleList containing the decoder layers
    for index, layer in enumerate(layers):
        if (index + 1) % n == 0:  # layers at indices n-1, 2n-1, ... are the added blocks
            for param in layer.parameters():
                param.requires_grad = True
        else:
            for param in layer.parameters():
                param.requires_grad = False


model = AutoModelForCausalLM.from_pretrained(
    "arcee-ai/Mistral-7B-Instruct-v0.2-expanded"
)

# A new block was added after every 4 original layers, so every 5th layer is new.
n = 5
enable_grad_only_every_nth(model, n)
```
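
As a quick sanity check after freezing (continuing from the snippet above), you can count the trainable parameters; only the unfrozen blocks (plus the final norm, which the function leaves untouched) should still require gradients:

```python
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")
```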