cgus committed
Commit 742c82c
1 Parent(s): ba2e130

Upload 7 files

README.md ADDED
---
license: apache-2.0
license_name: a
license_link: LICENSE
datasets:
- m-a-p/Code-Feedback
- HuggingFaceTB/cosmopedia-100k
- LDJnr/Capybara
- vicgalle/alpaca-gpt4
- glaiveai/glaive-code-assistant-v2
- WhiteRabbitNeo/WRN-Chapter-1
- WhiteRabbitNeo/WRN-Chapter-2
- m-a-p/CodeFeedback-Filtered-Instruction
- jondurbin/airoboros-3.2
- euclaise/WritingPrompts_curated
- derek-thomas/squad-v1.1-t5-question-generation
- reinforz/question_generation_data
- teknium/GPTeacher-General-Instruct
- dim/roleplay_instruct_v2_final
- TIGER-Lab/MathInstruct
- abacusai/SystemChat
- Mihaiii/OpenHermes-2.5-1k-longest-curated
language:
- en
library_name: transformers
tags:
- code
---
# Model Card for NinjaMouse-2.4B-32L-danube

A lanky version of [h2o-danube](https://huggingface.co/h2oai/h2o-danube-1.8b-chat)'s tiny language model, stretched from 24 layers to 32. I have done this in steps, adding 2 new layers per step and training them on different datasets. This seems to have made it a quick learner, and it easily fits on an 8GB GPU for finetuning when using Unsloth for optimizations. This model is designed to be a gateway into bigger language models.

This model is sponsored by Advanced Vintage Memeatics. A powerful dopaminergic with ties to the Holy Roman Empire, the ghost of Richard Feynman, a radiator from the Radiator planet, and the gods-defying Babel Fish. Consult your shaman before use. If their voodoo is strong you can find the even longer and even more uncut 3B model [here](https://huggingface.co/trollek/NinjaMouse-3B-40L-danube).
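
The 24→32 layer stretch can be sanity-checked against the shapes in this repo's config.json (hidden size 2560, intermediate size 6912, 32 attention heads, 8 KV heads, 32k vocab, untied embeddings). A rough back-of-envelope parameter count, assuming a standard Mistral-style decoder block, lands on the advertised sizes:

```python
# Rough parameter count for a Mistral-style decoder, using the shapes
# from this repo's config.json (assumption: standard block layout,
# layer norms omitted as they are negligible at ~5k params each).
hidden, inter, vocab = 2560, 6912, 32000
n_heads, n_kv_heads = 32, 8
head_dim = hidden // n_heads       # 80
kv_dim = head_dim * n_kv_heads     # 640 (grouped-query attention)

attn = 2 * hidden * hidden + 2 * hidden * kv_dim  # q/o + k/v projections
mlp = 3 * hidden * inter                          # gate, up, down
per_layer = attn + mlp
embeddings = 2 * vocab * hidden                   # untied input + output

def total(layers: int) -> float:
    """Total parameters in billions for a given layer count."""
    return (layers * per_layer + embeddings) / 1e9

print(f"24 layers: {total(24):.2f}B")  # ~1.83B -> the original danube-1.8b
print(f"32 layers: {total(32):.2f}B")  # ~2.39B -> NinjaMouse-2.4B
```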

## Model Details

### Model Description

Two of the datasets I used to train this model were WhiteRabbitNeo [chapter 1](https://huggingface.co/datasets/WhiteRabbitNeo/WRN-Chapter-1) and [chapter 2](https://huggingface.co/datasets/WhiteRabbitNeo/WRN-Chapter-2), thereby agreeing to their extended Apache 2 license. If you use this model, or a derivative, you really have to read their terms and agree with them, which is an easy task as they are quite reasonable. They could also be called the "Don't be a dick" clause (see the out-of-scope section).

With the important things covered, let us cover the model.

I wanted a model that could construct Stable Diffusion prompts without the "trending on artstation, 8k uhd, dramatic lighting, detailed, masterpiece" spam. My solution got a little out of hand when trying out [*deep* block expansion](https://arxiv.org/abs/2401.02415), [DoRA](https://arxiv.org/abs/2402.09353) and QLoRA on the TinyLlama model, which failed but led to this. It has a natty 16k context window, can be trained using Unsloth, and seems to be a lot more coherent than both TinyLlama and Phi-2.

My thoughts going into this were "If I use WRN in the training I get to call it something related to The Matrix" and "These Stable Diffusion prompt datasets need Geepus." After weeks of looking intensely at small numbers decreasing by very small amounts, I present to you a tiny language model that can generate image prompts, and it has got a funny name.

- **Developed by:** Trolle Karlsson (Pseudonym Anonymous)
- **Model type:** Mistral
- **Language(s) (NLP):** English
- **License:** Apache-2.0 + WhiteRabbitNeo Extended Version
- **Finetuned from model:** [h2o-danube](https://huggingface.co/h2oai/h2o-danube-1.8b-chat)

## Uses

Imagine having a model go through an entire book, page by page, creating SDXL prompts for the highlights. I want that! I would think that such a task would require some solid training data, which I do not have. What I do have is my own set of about 700 instructions, ranging from "write an SD(XL) prompt where something, something, something dark side shit is going on" through "Convert this image prompt from SD to SDXL" to "Inspiration: crocs."

The small size of the model, the diverse open datasets used in training, and the large context size could make it great for RAG applications, but that is also the reason additional finetuning is sort of required for this model to work in a consistent manner.

I think SOLAR and Llama Pro show us that our current models benefit from being stretched a bit. That quantization works at all is also an implication that our models are too precise. However, rounding errors can introduce unforeseen bugs, like suddenly being unable to spell, or in my case where the responses became **#**'s. It might have been brainfuck, but I barely write Python. Use vanilla at your own risk.

### Direct Use

Here is what it can do with Stable Diffusion text prompts:

- Make SD image prompts by asking it nicely
- Transform those from SD to SDXL and back
- Improve prompts by removing legacy tags
- Inspire from only a single word
- TODO: Story/Lyrics to image prompt
- TODO: Reverse image prompt (for further dataset development reasons)
- TODO: Have all the other stuff work, besides SD prompting, at most temperatures
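
For tasks like these, the chat format baked into the tokenizer is a simple Human/Assistant template (see the `chat_template` in tokenizer_config.json). A minimal sketch of building such a prompt string by hand, with a made-up SD request as the user turn:

```python
# Hand-rolled version of the chat template from tokenizer_config.json:
# system message on its own line, then "Human: ..." and a trailing
# "Assistant: " for the model to complete. The request text is just an
# illustrative example, not from the training data.
def build_prompt(system: str, user: str) -> str:
    return f"{system}\nHuman: {user}\nAssistant: "

prompt = build_prompt(
    "You are a very clever and helpful AI assistant called NinjaMouse.",
    "Write an SDXL prompt of a mouse meditating under a bonsai tree.",
)
print(prompt)
```

In practice `tokenizer.apply_chat_template` (shown in the getting-started section below this card's Uses section) does this rendering for you; the sketch only shows what the rendered string looks like.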

### Downstream Use

This isn't even the final form; 40 layers will be. Adding more at that point is just silly. By then it will have gone from 1.8B parameters to 3B. You can expand it even further, and I would like to know the results. I urge you to do your own training though. Small language models are prone to going off the rails and doing whatever.

### Out-of-Scope Use

**Do NOT use this model, or any derivatives, to be an ass.**

```
You agree not to use the Model or Derivatives of the Model:

- In any way that violates any applicable national or international law or regulation or infringes upon the lawful rights and interests of any third party;
- For military use in any way;
- For the purpose of exploiting, harming or attempting to exploit or harm minors in any way;
- To generate or disseminate verifiably false information and/or content with the purpose of harming others;
- To generate or disseminate inappropriate content subject to applicable regulatory requirements;
- To generate or disseminate personal identifiable information without due authorization or for unreasonable use;
- To defame, disparage or otherwise harass others;
- For fully automated decision making that adversely impacts an individual’s legal rights or otherwise creates or modifies a binding, enforceable obligation;
- For any use intended to or which has the effect of discriminating against or harming individuals or groups based on online or offline social behavior or known or predicted personal or personality characteristics;
- To exploit any of the vulnerabilities of a specific group of persons based on their age, social, physical or mental characteristics, in order to materially distort the behavior of a person pertaining to that group in a manner that causes or is likely to cause that person or another person physical or psychological harm;
- For any use intended to or which has the effect of discriminating against individuals or groups based on legally protected characteristics or categories.
```

I do, however, want you to explore the realms of extreme language compression. AGI is not that far away. Even if the required compute takes time to allocate, DistilAGI or something similar would surface.

## Bias, Risks, and Limitations

There are some sultry prompts in my proprietary dataset, but I'm not *high* enough on the spectrum to delve into Pony prompting. Filtering space worms and worse out of SD datasets took its toll.

I am hesitant to upload my dataset because of that. I also feel that, even though it's only about 700 samples, it makes the responses a bit weird. It could also stem from the reddit writing prompts. Let's get professor Tegmark on the case!

## How to Get Started with the Model

Use the code below to get started with the model.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model_name = "trollek/NinjaMouse-2.4B-32L-danube"

# Load the tokenizer and the model in bfloat16 on the GPU
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to("cuda")

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device=0,
)

system_prompt = "You are a very clever and helpful AI assistant called NinjaMouse."
intro_prompt = "Please introduce yourself."
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": intro_prompt},
]
# Render the chat into the model's prompt format before generating
prompt = pipeline.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

outputs = pipeline(prompt, max_new_tokens=512, do_sample=True, temperature=0.65, top_k=45, top_p=0.90)

print(outputs[0]["generated_text"])
```

## Training Details

### Training Data

The datasets I've used to train this model are diverse, with the new layers sandwiched in as the middle and last layers during each expansion. They are the following:

#### Step 1: 24->26

- LDJnr/Capybara
- vicgalle/alpaca-gpt4
- jondurbin/airoboros-3.2
- teknium/GPTeacher-General-Instruct
- WhiteRabbitNeo/WRN-Chapter-1
- WhiteRabbitNeo/WRN-Chapter-2

#### Step 1.5

- Mihaiii/OpenHermes-2.5-1k-longest-curated

**Notes**: As I understand it, for models to fully use their context windows they have to be trained on long texts. I suppose when this one gets to a size around 3B it will have an easier time with long, complex texts.

#### Step 2: 26->28

- abacusai/SystemChat
- TIGER-Lab/MathInstruct
- reinforz/question_generation_data

#### Step 3: 28->30

- euclaise/WritingPrompts_curated (heavily filtered - 6k)
- HuggingFaceTB/cosmopedia-100k (textbooks and stories)
- derek-thomas/squad-v1.1-t5-question-generation
- dim/roleplay_instruct_v2_final

**Notes**: This step was somewhat of a failure. I see good results when aiming for a training loss in the range of 0.5-0.9, but this type of writing is not within its grasp yet.

#### Step 4: 30->32

- m-a-p/Code-Feedback
- m-a-p/CodeFeedback-Filtered-Instruction
- glaiveai/glaive-code-assistant-v2
- Toolcall 10k

#### Step 4.5

I figured a last once-over with the datasets from step 1 and the SD one wouldn't hurt.

### Training Procedure

I'll be honest with you splendid folks. This has taken a great deal of waiting patiently for Llama Factory to do what can be considered magic, but also a 4060 Ti and a lot of time clicking around this site to find that datasets like [Capybara](https://huggingface.co/datasets/LDJnr/Capybara) are considered of the highest quality. With that, and "baby steps", in mind, I selected training data that emulated our own knowledge progression, from right after the "what if dirt tastes amazing though?" stage of our lives.

We learn to speak, we relate the 10 wiggly appendages on our hands to a number system, and finally we stare in awe at electrons defying time. With that said: Cosmopedia was a hassle. I strive for a training loss below 1, and Cosmopedia + WritingPrompts + question generation would not get below 1.1 in the step where I expanded the model from 28 to 30 layers. It started with a loss of 2, so it wasn't all bad.

#### Preprocessing

If a dataset is not compatible with Llama Factory I just open up a new Jupyter Notebook and work my way through analyzing the data and formatting it in a way I can use. For the writing prompts I filtered out subreddit mentions, weird formatting like "\[WP\]", "\*\*\*\*\*\*\*\*\*", and "\_\_\_\_\_\_\_", and filtered by upvotes. I did the same for Cosmopedia-100k by removing reddit and children's stories.
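
The writing-prompt cleanup described above can be sketched roughly like this; note the patterns and the upvote threshold are illustrative assumptions, not the exact values used:

```python
import re

# Illustrative version of the filtering described above: drop samples
# with subreddit tags like [WP], long */_ separator runs, or too few
# upvotes. The threshold and exact patterns are assumptions.
TAG_RE = re.compile(r"\[[A-Z]{2,3}\]")   # e.g. [WP], [EU]
SEPARATOR_RE = re.compile(r"[*_]{4,}")   # ********* / _______ runs

def keep(sample: dict, min_upvotes: int = 20) -> bool:
    text = sample["text"]
    if TAG_RE.search(text) or SEPARATOR_RE.search(text):
        return False
    return sample.get("upvotes", 0) >= min_upvotes

samples = [
    {"text": "[WP] A mouse learns ninjutsu.", "upvotes": 500},
    {"text": "A quiet story about rain.", "upvotes": 42},
    {"text": "Filler ********* filler.", "upvotes": 99},
]
print([s["text"] for s in samples if keep(s)])  # only the rain story survives
```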

#### Training Hyperparameters

- **Training regime:**
  - *LR*: 0.0001-0.0004
  - *LR Scheduler*: cosine
  - *Warmup*: 1%
  - *Batch size*: 2-4
  - *Gradient accumulation*: 2-4
  - *Epochs*: 2-8

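The cosine schedule with 1% warmup can be sketched in plain Python; the peak LR and step count below are example values from the ranges above, not the exact training configuration:

```python
import math

# Minimal sketch of a cosine LR schedule with linear warmup, matching
# the regime above (1% warmup, cosine decay to zero). Peak LR and
# total steps are example values.
def lr_at(step: int, total_steps: int, peak_lr: float = 4e-4,
          warmup_frac: float = 0.01) -> float:
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear ramp up
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay

total = 1000
print(lr_at(0, total))     # 0.0 (start of warmup)
print(lr_at(10, total))    # 4e-4 (peak, warmup done)
print(lr_at(1000, total))  # ~0.0 (fully decayed)
```
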
I yolo the parameters from educated guesstimates. Reading through some of the code for Unsloth and Huggingface I got an idea of what to write after `--` in the terminal. `--help` also *helps* out a lot.

## Roadmap (The Olsenbanden plan)

#### 32->34
**Logic**: Synthia, AutoCoT, STEM (from CamelAI)

#### 34->36
**Math**: MetaMath

#### 36->38
**Writing/Translation**: Writing Prompts, Cosmopedia, xP3x (Danish question and command translations)

**Notes**: Before starting to finetune on a Danish dataset like [danish-OpenHermes](https://huggingface.co/datasets/Mabeck/danish-OpenHermes) I would like to try out teaching it translation tasks first. I still feel that when modelling algorithms after our own brains, we can think of NNs in the same way we do our own meaty prediction machines. Motion -> Language -> Lies -> Math/Reason -> Life is the pathway I'm trying with this model, but without the robotics.

#### 38->40
**Repetition**: CodeFeedback, question gen, RAG (perhaps LongAlpaca), Roleplay

### MoE Rodents

This is inspired by [Beyonder](https://huggingface.co/mlabonne/Beyonder-4x7B-v2). I think that carefully selected positive prompting, and a finetune on a large diverse dataset like Hercules, Hyperion, or OpenHermes, will make a big difference when using the same model x4. Four trained individually seems to work, but being able to test hypotheses on a smaller scale would be great to see more of.

#### 4x3B NinjaMice

**Expert 1:** Chat/Assist

**Expert 2:** Programming

**Expert 3:** Writing/Creative

**Expert 4:** Reason/Math

config.json ADDED

{
  "_name_or_path": "trollek/NinjaMouse-2.4B-32L-danube",
  "architectures": [
    "MistralForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 2560,
  "initializer_range": 0.02,
  "intermediate_size": 6912,
  "max_position_embeddings": 16384,
  "model_type": "mistral",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "sliding_window": 4096,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.38.2",
  "unsloth_version": "2024.3",
  "use_cache": false,
  "vocab_size": 32000
}

generation_config.json ADDED

{
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 0,
  "transformers_version": "4.38.2",
  "use_cache": false
}

output.safetensors ADDED

version https://git-lfs.github.com/spec/v1
oid sha256:b3b8c59f679c7657c63395d61f5ff2749e957e860176a41d2c41dd58c8c93ffd
size 1340387798

special_tokens_map.json ADDED

{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "cls_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}

tokenizer.model ADDED

version https://git-lfs.github.com/spec/v1
oid sha256:9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347
size 499723

tokenizer_config.json ADDED

{
  "add_bos_token": true,
  "add_eos_token": false,
  "add_prefix_space": true,
  "added_tokens_decoder": {
    "0": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "<s>",
  "chat_template": "{% if messages[0]['role'] == 'system' %}{% set system_message = messages[0]['content'] %}{% endif %}{% if system_message is defined %}{{ system_message + '\\n' }}{% endif %}{% for message in messages %}{% set content = message['content'] %}{% if message['role'] == 'user' %}{{ 'Human: ' + content + '\\nAssistant: ' }}{% elif message['role'] == 'assistant' %}{{ content + '</s>' + '\\n' }}{% endif %}{% endfor %}",
  "clean_up_tokenization_spaces": false,
  "cls_token": "</s>",
  "eos_token": "</s>",
  "legacy": false,
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": "<unk>",
  "padding_side": "left",
  "sep_token": "</s>",
  "sp_model_kwargs": {},
  "spaces_between_special_tokens": false,
  "split_special_tokens": false,
  "tokenizer_class": "LlamaTokenizer",
  "unk_token": "<unk>",
  "use_default_system_prompt": false
}