---
tags:
- merge
- abacaj/phi-2-super
base_model:
- abacaj/phi-2-super
license: apache-2.0
---

# phi-2-DLEC

The DLEC (Distributive Layer Expansion Curve) methodology offers a novel approach to improving neural network models by focusing on the strategic duplication of certain effective layers. 
Developed with the aim of enhancing model performance, DLEC carefully identifies and amplifies the impact of key layers within the model's architecture. 

Below is an overview of the method and its implementation, particularly how it integrates with the Hugging Face Transformers library and uses PyTorch and BitsAndBytes for efficient operation.

## Overview

1. Setting up: the script first ensures all necessary components are in place, from libraries to the model and dataset.
2. Database for activations: a SQLite database is established to track layer activations, providing a clear view into how individual neurons react and which layers are most influential; these are our 'beneficial layers' (see the sketch after this list).
3. Analyzing and identifying: by analyzing the activation data, the script pinpoints which layers are most valuable to the model's performance.
4. Configuring DLEC: a configuration is then created, guiding how the model should incorporate duplicates of these beneficial layers to boost effectiveness without unnecessarily increasing complexity.
5. Reconfiguring and running the model: finally, the model is adjusted according to DLEC's insights, focusing enhancement on the identified layers.
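
The analysis script itself is not published in this card, so the following is only a minimal sketch of what steps 2 and 3 could look like: PyTorch forward hooks log a simple per-layer activation statistic into SQLite, and a query then ranks the layers by it. The function names, the `activations` table, the mean-absolute-activation statistic, and the top-30% cutoff are all hypothetical and are not the actual DLEC selection criterion.

```python
# Minimal sketch of steps 2-3 (hypothetical, not the original DLEC script):
# log a per-layer activation statistic into SQLite, then rank the layers.
import sqlite3

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def log_layer_activations(model_id, prompts, db_path="activations.db"):
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )

    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS activations (layer INTEGER, mean_abs REAL)")

    def make_hook(layer_idx):
        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            db.execute(
                "INSERT INTO activations VALUES (?, ?)",
                (layer_idx, hidden.detach().abs().mean().item()),
            )
        return hook

    # Phi-2 exposes its decoder blocks under model.model.layers.
    handles = [
        layer.register_forward_hook(make_hook(i))
        for i, layer in enumerate(model.model.layers)
    ]

    with torch.no_grad():
        for prompt in prompts:
            model(**tok(prompt, return_tensors="pt").to(model.device))

    for handle in handles:
        handle.remove()
    db.commit()
    return db

def beneficial_layers(db, top_fraction=0.3):
    # Rank layers by average activation magnitude and keep the top fraction.
    rows = db.execute(
        "SELECT layer, AVG(mean_abs) FROM activations GROUP BY layer ORDER BY AVG(mean_abs) DESC"
    ).fetchall()
    keep = max(1, int(len(rows) * top_fraction))
    return sorted(layer for layer, _ in rows[:keep])
```

Running `beneficial_layers(log_layer_activations("abacaj/phi-2-super", prompts))` over a small calibration set would yield a ranked layer list; the real DLEC curve presumably uses a more involved statistic than this single average.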

## Key Features

- Selective layer duplication: DLEC doesn't just add more layers; it doubles down on the ones that really matter. This methodical selection makes the most of the model's capabilities without wasteful expansion.

- Smart resource management: by focusing on specific areas for improvement, DLEC aims to make better use of computational and memory resources, promoting more efficient learning without adding undue complexity to the model.

This approach is about making informed, strategic enhancements to the model architecture, prioritizing efficiency and effectiveness in utilizing the network's capabilities.

```yaml
Possible_Beneficial_layers: # Layers of significance
  - 0
  - 3
  - 4
  - 7
  - 8
  - 11
  - 12
  - 15
  - 16
  - 19
  - 20
  - 23
  - 24
  - 27
  - 28
  - 31
  - 32
```
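
The card does not show how the beneficial-layer list becomes a merge configuration, so here is one hypothetical illustration of step 4: a small helper that emits a Mergekit passthrough config from a list of slice boundaries (hard-coded below to match the Configuration section further down; PyYAML is assumed to be installed).

```python
# Hypothetical helper for step 4: emit a Mergekit passthrough config
# from chosen slice boundaries (ranges copied from the config below).
import yaml

def build_passthrough_config(model_id, slice_ranges):
    config = {
        "dtype": "bfloat16",
        "merge_method": "passthrough",
        "slices": [
            {"sources": [{"model": model_id, "layer_range": [start, end]}]}
            for start, end in slice_ranges
        ],
    }
    return yaml.safe_dump(config, sort_keys=False)

ranges = [(0, 3), (3, 8), (7, 12), (11, 16), (15, 20), (19, 24), (23, 28), (27, 32)]
print(build_passthrough_config("abacaj/phi-2-super", ranges))
```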
Currently, this method is still limited to Mergekit, which does not support duplicating a single layer on its own; this may have an impact on performance.

# This method is still in development. I do not expect game-changing results, nor will I oversell it; it is purely done for fun. Please let me know how the model works for you.

## ⚙️ Evals

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64545af5ec40bbbd01242ca6/pS7KFYDheWmFEaGybxr3K.png)

[My Leaderboard](https://huggingface.co/spaces/Steelskull/YALL-Leaderboard)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64545af5ec40bbbd01242ca6/CF9_p8AWMFraCnfiMa_no.png)

As you may know, there is usually a loss of intelligence with model merges, especially passthrough merges: on the order of roughly 3 points per billion parameters duplicated if you get the right merge, and a much larger loss (anywhere from 3-8 points per billion duplicated) if you don't.
Using DLEC, I was able to grow Phi-2 from 2.78B to 3.25B parameters with around a single point of loss or less.

This method is still in active development, and I am currently tweaking the algorithm to improve the layer selection process.
I am also working on a single-layer duplication script, since Mergekit does not currently support this; as a result I am merging layers that are not needed, which degrades performance.

## 🧩 Configuration

```yaml
dtype: bfloat16
merge_method: passthrough
slices:
  - sources:
      - model: abacaj/phi-2-super
        layer_range: [0, 3] # Introduces 0, 3
  - sources:
      - model: abacaj/phi-2-super
        layer_range: [3, 8] # Duplicates 3, introduces 4, 7, 8
  - sources:
      - model: abacaj/phi-2-super
        layer_range: [7, 12] # Duplicates 7, 8, introduces 11, 12
  - sources:
      - model: abacaj/phi-2-super
        layer_range: [11, 16] # Duplicates 11, 12, introduces 15, 16
  - sources:
      - model: abacaj/phi-2-super
        layer_range: [15, 20] # Duplicates 15, 16, introduces 19, 20
  - sources:
      - model: abacaj/phi-2-super
        layer_range: [19, 24] # Duplicates 19, 20, introduces 23, 24
  - sources:
      - model: abacaj/phi-2-super
        layer_range: [23, 28] # Duplicates 23, 24, introduces 27, 28
  - sources:
      - model: abacaj/phi-2-super
        layer_range: [27, 32] # Duplicates 27, 28, introduces 31, 32

```
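
The merge invocation itself is not shown in the card. Assuming Mergekit is installed from its GitHub repository and the config above is saved as `config.yaml`, a notebook-style run could look like the following; the output directory name and the `--copy-tokenizer` flag are assumptions, not part of the original recipe.

```python
!pip install -qU git+https://github.com/arcee-ai/mergekit.git

# Run the passthrough merge defined in config.yaml; the merged model
# is written to ./phi-2-DLEC.
!mergekit-yaml config.yaml ./phi-2-DLEC --copy-tokenizer
```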

## 💻 Usage

```python
!pip install -qU transformers accelerate

from transformers import AutoTokenizer
import transformers
import torch

model = "TheSkullery/phi-2-DLEC"
messages = [{"role": "user", "content": "What is a large language model?"}]

tokenizer = AutoTokenizer.from_pretrained(model)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
```