---
tags:
- merge
- abacaj/phi-2-super
base_model:
- abacaj/phi-2-super
license: apache-2.0
---
# phi-2-DLEC
The DLEC (Distributive Layer Expansion Curve) methodology offers a novel approach to improving neural network models by focusing on the strategic duplication of certain effective layers. Developed with the aim of enhancing model performance, DLEC carefully identifies and amplifies the impact of key layers within the model's architecture.
## Code overview:
Setting Up:
First, the script ensures all necessary components are in place, from libraries to the model and dataset.
Database for Activations:
A SQLite database records layer activations, showing how individual neurons respond and which layers are most influential. These are our 'beneficial layers.'
Analyzing and Identifying:
By analyzing activation data, the script pinpoints which layers are most valuable to the model's performance.
Configuring DLEC:
A configuration is then created, guiding how the model should incorporate duplicates of these beneficial layers to boost effectiveness without unnecessarily increasing complexity.
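
The database and analysis steps above can be sketched roughly as follows. This is a minimal illustration, not the actual DLEC script: the table schema, the function names, and the choice of mean absolute activation as the score are all assumptions made for the example, and the toy numbers stand in for hidden states that would normally be captured during a forward pass.

```python
import sqlite3


def create_db(path=":memory:"):
    """Open a SQLite database with one row per (layer, sample) activation score."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS activations ("
        "layer INTEGER NOT NULL, sample INTEGER NOT NULL, score REAL NOT NULL)"
    )
    return conn


def log_activation(conn, layer, sample, values):
    """Store the mean absolute activation of one layer for one input sample.

    The scoring rule is an assumption; the real script may use another metric.
    """
    score = sum(abs(v) for v in values) / len(values)
    conn.execute("INSERT INTO activations VALUES (?, ?, ?)", (layer, sample, score))


def top_layers(conn, k):
    """Return the k layers with the highest average score, sorted by layer index."""
    rows = conn.execute(
        "SELECT layer FROM activations GROUP BY layer "
        "ORDER BY AVG(score) DESC, layer ASC LIMIT ?",
        (k,),
    ).fetchall()
    return sorted(layer for (layer,) in rows)


# Toy activations (hypothetical values) for a four-layer model.
conn = create_db()
toy = {0: [0.1, -0.2], 1: [1.5, -2.5], 2: [0.3, 0.3], 3: [-1.0, 1.0]}
for layer, values in toy.items():
    log_activation(conn, layer, sample=0, values=values)

beneficial = top_layers(conn, k=2)
print(beneficial)
```

Keeping the scores in SQLite rather than in memory means the ranking query can be rerun with a different `k` or a different aggregate without repeating the forward passes.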
## Key Features:
Selective Layer Duplication:
DLEC doesn't just add more layers; it doubles down on the ones that really matter. This methodical selection ensures we're making the most of the model's capabilities without wasteful expansion.
Smart Resource Management:
By homing in on specific areas for improvement, DLEC aims to make better use of computational and memory resources, promoting more efficient learning without adding undue complexity to the model.
This approach is about making informed, strategic enhancements to model architecture, prioritizing efficiency and effectiveness in utilizing neural network capabilities.
## Information Loss:
It is common to see benchmark scores drop when merging models, especially with passthrough merging, which typically costs around 3 points per billion parameters duplicated, assuming the merge is done correctly. If the merge is suboptimal, the loss can be much larger, ranging from 3 to 8 points or more per billion parameters duplicated. With DLEC, however, I was able to expand Phi-2 from 2.78B to 3.25B parameters with an average loss of only about 0.44 points.
DLEC Expanded Model:
[TheSkullery/phi-2-DLEC](https://huggingface.co/TheSkullery/phi-2-DLEC)
2.78B -> 3.25B parameters, a ~17% increase in size
```
Metric -> Value
Avg. 46.72
AGIEval 29.64
GPT4All 69.48
TruthfulQA 50.29
```
Original Model:
[abacaj/phi-2-super](https://huggingface.co/abacaj/phi-2-super)
```
Metric -> Value
Avg. 47.16
AGIEval 31.95
GPT4All 70.81
TruthfulQA 48.39
```
Loss or Increase:
Avg. -0.44
AGIEval -2.31
GPT4All -1.33
TruthfulQA +1.90
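
The differences above, and the size increase quoted earlier, can be reproduced from the two score tables with a few lines of Python. The dictionary and function names here are just illustrative; the numbers are copied from the tables.

```python
# Scores copied from the two tables above.
dlec = {"Avg.": 46.72, "AGIEval": 29.64, "GPT4All": 69.48, "TruthfulQA": 50.29}
base = {"Avg.": 47.16, "AGIEval": 31.95, "GPT4All": 70.81, "TruthfulQA": 48.39}


def score_deltas(new, old):
    """Per-metric difference (new minus old), rounded to two decimals."""
    return {name: round(new[name] - old[name], 2) for name in old}


deltas = score_deltas(dlec, base)

# Relative growth in parameter count, in percent (2.78B -> 3.25B).
size_increase_pct = round((3.25 - 2.78) / 2.78 * 100, 1)

for name, delta in deltas.items():
    print(f"{name:<12}{delta:+.2f}")
print(f"size increase: {size_increase_pct}%")
```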
Example of loss:
[Steelskull/Etheria-55b-v0.1](https://huggingface.co/Steelskull/Etheria-55b-v0.1)
```
Metric -> Value
Avg. 64.69
AI2 Reasoning Challenge 65.10
HellaSwag 81.93
MMLU 73.66
TruthfulQA 56.16
Winogrande 76.09
GSM8k 35.18
```
[Yi-34B-200K-DARE-megamerge-v8](https://huggingface.co/brucethemoose/Yi-34B-200K-DARE-megamerge-v8)
```
Metric -> Value
Avg. 72.56
AI2 Reasoning Challenge 67.75
HellaSwag 86.06
MMLU 77.03
TruthfulQA 56.31
Winogrande 82.79
GSM8k 65.43
```
Merge Loss (Etheria-55b-v0.1 compared to Yi-34B-200K-DARE-megamerge-v8):
Avg. -7.87
AI2 Reasoning Challenge -2.65
HellaSwag -4.13
MMLU -3.37
TruthfulQA -0.15
Winogrande -6.70
GSM8k -30.25
Comparing Etheria-55b-v0.1 with Yi-34B-200K-DARE-megamerge-v8, Etheria scores lower on every metric, and its average drops by 7.87 points. The largest gap is on GSM8k, where Yi-34B-200K-DARE-megamerge-v8 outperforms Etheria-55b-v0.1 by 30.25 points; the smallest is on TruthfulQA, at 0.15 points.
This method is still in active development, and I am currently tweaking the algorithm to improve the layer selection process.
I am also working on a single-layer duplication script, because mergekit does not currently support this. As a result, the current merge includes layers that are not needed, which degrades performance.
## 🧩 Configuration
```yaml
dtype: bfloat16
merge_method: passthrough
slices:
  - sources:
      - model: abacaj/phi-2-super
        layer_range: [0, 3] # layers 0-2
  - sources:
      - model: abacaj/phi-2-super
        layer_range: [3, 8] # layers 3-7
  - sources:
      - model: abacaj/phi-2-super
        layer_range: [7, 12] # repeats layer 7, then layers 8-11
  - sources:
      - model: abacaj/phi-2-super
        layer_range: [11, 16] # repeats layer 11, then layers 12-15
  - sources:
      - model: abacaj/phi-2-super
        layer_range: [15, 20] # repeats layer 15, then layers 16-19
  - sources:
      - model: abacaj/phi-2-super
        layer_range: [19, 24] # repeats layer 19, then layers 20-23
  - sources:
      - model: abacaj/phi-2-super
        layer_range: [23, 28] # repeats layer 23, then layers 24-27
  - sources:
      - model: abacaj/phi-2-super
        layer_range: [27, 32] # repeats layer 27, then layers 28-31
```
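
To check what a config like this produces, the slices can be expanded into the ordered list of source layers. The sketch below assumes each `layer_range` is a half-open `[start, end)` range, as in Python slicing, which is my understanding of how mergekit reads it; the helper name is made up for the example.

```python
from collections import Counter


def expand_slices(slices):
    """Flatten half-open [start, end) layer ranges into an ordered list of layer indices."""
    layers = []
    for start, end in slices:
        layers.extend(range(start, end))
    return layers


# The layer_range values from the configuration above.
slices = [(0, 3), (3, 8), (7, 12), (11, 16), (15, 20), (19, 24), (23, 28), (27, 32)]

stacked = expand_slices(slices)
counts = Counter(stacked)
duplicated = sorted(layer for layer, n in counts.items() if n > 1)

print(len(stacked))   # total layers in the expanded model
print(duplicated)     # source layers that appear more than once
```

Under that assumption, the 32 source layers become 38, with six layers repeated once each, which is roughly in line with the ~17% size increase reported above.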
## 💻 Usage
```python
# Install dependencies first, e.g. in a notebook cell:
#   !pip install -qU transformers accelerate
from transformers import AutoTokenizer
import transformers
import torch

model = "TheSkullery/phi-2-DLEC"
messages = [{"role": "user", "content": "What is a large language model?"}]

# Format the conversation with the model's chat template.
tokenizer = AutoTokenizer.from_pretrained(model)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Load the model in half precision and place it on the available device(s).
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Sample a response of up to 256 new tokens.
outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
```