Upload sd_token_similarity_calculator.ipynb

sd_token_similarity_calculator.ipynb   (+332 -26)   CHANGED
@@ -17,12 +17,23 @@
   {
    "cell_type": "markdown",
    "source": [
-    "This Notebook is a Stable-diffusion tool which allows you to find similiar tokens from the SD 1.5 vocab.json that you can use for text-to-image generation. Try this Free online SD 1.5 generator with the results: https://perchance.org/fusion-ai-image-generator"
    ],
    "metadata": {
     "id": "L7JTcbOdBPfh"
    }
   },
   {
    "cell_type": "code",
    "source": [
@@ -88,6 +99,144 @@
    "execution_count": null,
    "outputs": []
   },
   {
    "cell_type": "code",
    "source": [
@@ -107,22 +256,10 @@
    "#You can leave the 'prompt' field empty to get a random value tensor. Since the tensor is random value, it will not correspond to any tensor in the vocab.json list , and this it will have no ID."
    ],
    "metadata": {
-    "id": "RPdkYzT2_X85"
-    "colab": {
-     "base_uri": "https://localhost:8080/"
-    },
-    "outputId": "86f2f01e-6a04-4292-cee7-70fd8398e07f"
    },
    "execution_count": null,
-    "outputs": [
-     {
-      "output_type": "stream",
-      "name": "stdout",
-      "text": [
-       "[49406, 8922, 49407]\n"
-      ]
-     }
-    ]
   },
   {
    "cell_type": "code",
@@ -353,21 +490,20 @@
    "source": [
     "\n",
     "\n",
-    "
     "\n",
     "Similiar vectors = similiar output in the SD 1.5 / SDXL / FLUX model\n",
     "\n",
-    "CLIP converts the prompt text to vectors (“tensors”) , with float32 values usually ranging from -1 to 1
     "\n",
-    "Dimensions are [ 1x768 ] tensors for SD 1.5 , and a [ 1x768 , 1x1024 ] tensor for SDXL and FLUX.\n",
     "\n",
     "The SD models and FLUX converts these vectors to an image.\n",
     "\n",
-    "This notebook takes an input string , tokenizes it and matches the first token against the 49407 token vectors in the vocab.json :
     "\n",
     "It finds the “most similiar tokens” in the list. Similarity is the theta angle between the token vectors.\n",
     "\n",
-    "\n",
     "<div>\n",
     "<img src=\"https://huggingface.co/datasets/codeShare/sd_tokens/resolve/main/cosine.jpeg\" width=\"300\"/>\n",
     "</div>\n",
@@ -376,19 +512,189 @@
     "\n",
     "Negative similarity is also possible.\n",
     "\n",
-    "
     "\n",
-    "
     "\n",
-    "
     "\n",
-    "
     "\n",
     "So TLDR; vector direction = “what to generate” , vector magnitude = “prompt weights”\n",
     "\n",
-    "
     "\n",
-    "
     ],
     "metadata": {
      "id": "njeJx_nSSA8H"
   {
    "cell_type": "markdown",
    "source": [
+    "This notebook is a Stable Diffusion tool which allows you to find similar tokens from the SD 1.5 vocab.json that you can use for text-to-image generation. Try this free online SD 1.5 generator with the results: https://perchance.org/fusion-ai-image-generator\n",
+    "\n",
+    "Scroll to the bottom of the notebook to see the guide for how this works."
    ],
    "metadata": {
     "id": "L7JTcbOdBPfh"
    }
   },
+  {
+   "cell_type": "code",
+   "source": [],
+   "metadata": {
+    "id": "PBwVIuAjEdHA"
+   },
+   "execution_count": null,
+   "outputs": []
+  },
   {
    "cell_type": "code",
    "source": [
    "execution_count": null,
    "outputs": []
   },
+  {
+   "cell_type": "code",
+   "source": [
+    "# @title ⚡ Get similar tokens\n",
+    "from transformers import AutoTokenizer\n",
+    "tokenizer = AutoTokenizer.from_pretrained(\"openai/clip-vit-large-patch14\", clean_up_tokenization_spaces = False)\n",
+    "\n",
+    "prompt = \"banana\" # @param {type:'string'}\n",
+    "\n",
+    "tokenizer_output = tokenizer(text = prompt)\n",
+    "input_ids = tokenizer_output['input_ids']\n",
+    "print(input_ids)\n",
+    "\n",
+    "\n",
+    "#The prompt will be enclosed with the <|start-of-text|> and <|end-of-text|> tokens, which is why output will be [49406, ... , 49407].\n",
+    "\n",
+    "#You can leave the 'prompt' field empty to get a random value tensor. Since the tensor is random valued, it will not correspond to any tensor in the vocab.json list , and thus it will have no ID.\n",
+    "\n",
+    "id_A = input_ids[1]\n",
+    "A = token[id_A]\n",
+    "_A = LA.vector_norm(A, ord=2)\n",
+    "\n",
+    "#if no input exists we just randomize the entire thing\n",
+    "if (prompt == \"\"):\n",
+    "  id_A = -1\n",
+    "  print(\"Tokenized prompt tensor A is a random valued tensor with no ID\")\n",
+    "  R = torch.rand(768)\n",
+    "  _R = LA.vector_norm(R, ord=2)\n",
+    "  A = R*(_A/_R)\n",
+    "\n",
+    "\n",
+    "mix_with = \"\" # @param {\"type\":\"string\",\"placeholder\":\"(optional) write something else\"}\n",
+    "mix_method = \"None\" # @param [\"None\" , \"Average\", \"Subtract\"] {allow-input: true}\n",
+    "w = 0.5 # @param {type:\"slider\", min:0, max:1, step:0.01}\n",
+    "\n",
+    "tokenizer_output = tokenizer(text = mix_with)\n",
+    "input_ids = tokenizer_output['input_ids']\n",
+    "id_C = input_ids[1]\n",
+    "C = token[id_C]\n",
+    "_C = LA.vector_norm(C, ord=2)\n",
+    "\n",
+    "#if no input exists we just randomize the entire thing\n",
+    "if (mix_with == \"\"):\n",
+    "  id_C = -1\n",
+    "  print(\"Tokenized prompt 'mix_with' tensor C is a random valued tensor with no ID\")\n",
+    "  R = torch.rand(768)\n",
+    "  _R = LA.vector_norm(R, ord=2)\n",
+    "  C = R*(_C/_R)\n",
+    "\n",
+    "if (mix_method == \"None\"):\n",
+    "  print(\"No operation\")\n",
+    "\n",
+    "if (mix_method == \"Average\"):\n",
+    "  A = w*A + (1-w)*C\n",
+    "  _A = LA.vector_norm(A, ord=2)\n",
+    "  print(\"Tokenized prompt tensor A has been recalculated as A = w*A + (1-w)*C , where C is the tokenized prompt 'mix_with' tensor C\")\n",
+    "\n",
+    "if (mix_method == \"Subtract\"):\n",
+    "  tmp = (A/_A) - (C/_C)\n",
+    "  _tmp = LA.vector_norm(tmp, ord=2)\n",
+    "  A = tmp*((w*_A + (1-w)*_C)/_tmp)\n",
+    "  _A = LA.vector_norm(A, ord=2)\n",
+    "  print(\"Tokenized prompt tensor A has been recalculated as A = (w*_A + (1-w)*_C) * norm(w*A - (1-w)*C) , where C is the tokenized prompt 'mix_with' tensor C\")\n",
+    "\n",
+    "#OPTIONAL : Add/subtract + normalize above result with another token. Leave field empty to get a random value tensor\n",
+    "\n",
+    "dots = torch.zeros(NUM_TOKENS)\n",
+    "for index in range(NUM_TOKENS):\n",
+    "  id_B = index\n",
+    "  B = token[id_B]\n",
+    "  _B = LA.vector_norm(B, ord=2)\n",
+    "  result = torch.dot(A,B)/(_A*_B)\n",
+    "  #result = absolute_value(result.item())\n",
+    "  result = result.item()\n",
+    "  dots[index] = result\n",
+    "\n",
+    "name_A = \"A of random type\"\n",
+    "if (id_A>-1):\n",
+    "  name_A = vocab[id_A]\n",
+    "\n",
+    "name_C = \"token C of random type\"\n",
+    "if (id_C>-1):\n",
+    "  name_C = vocab[id_C]\n",
+    "\n",
+    "\n",
+    "sorted, indices = torch.sort(dots, dim=0, descending=True)\n",
+    "#----#\n",
+    "if (mix_method == \"Average\"):\n",
+    "  print(f'Calculated all cosine-similarities between the average of token {name_A} and {name_C} with Id_A = {id_A} and mixed Id_C = {id_C} as a 1x{sorted.shape[0]} tensor')\n",
+    "if (mix_method == \"Subtract\"):\n",
+    "  print(f'Calculated all cosine-similarities between the subtract of token {name_A} and {name_C} with Id_A = {id_A} and mixed Id_C = {id_C} as a 1x{sorted.shape[0]} tensor')\n",
+    "if (mix_method == \"None\"):\n",
+    "  print(f'Calculated all cosine-similarities between the token {name_A} with Id_A = {id_A} and the rest of the {NUM_TOKENS} tokens as a 1x{sorted.shape[0]} tensor')\n",
+    "\n",
+    "#Produce a list of IDs that are most similar to the prompt ID at position 1 based on the above result\n",
+    "\n",
+    "list_size = 100 # @param {type:'number'}\n",
+    "print_ID = False # @param {type:\"boolean\"}\n",
+    "print_Similarity = True # @param {type:\"boolean\"}\n",
+    "print_Name = True # @param {type:\"boolean\"}\n",
+    "print_Divider = True # @param {type:\"boolean\"}\n",
+    "\n",
+    "\n",
+    "if (print_Divider):\n",
+    "  print('//---//')\n",
+    "\n",
+    "print('')\n",
+    "print('Here is the result : ')\n",
+    "print('')\n",
+    "\n",
+    "for index in range(list_size):\n",
+    "  id = indices[index].item()\n",
+    "  if (print_Name):\n",
+    "    print(f'{vocab[id]}') # vocab item\n",
+    "  if (print_ID):\n",
+    "    print(f'ID = {id}') # IDs\n",
+    "  if (print_Similarity):\n",
+    "    print(f'similarity = {round(sorted[index].item()*100,2)} %') # % value\n",
+    "  if (print_Divider):\n",
+    "    print('--------')\n",
+    "\n",
+    "#Print the sorted list from above result"
+   ],
+   "metadata": {
+    "id": "iWeFnT1gAx6A"
+   },
+   "execution_count": null,
+   "outputs": []
+  },
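The cell above relies on `token`, `vocab`, `NUM_TOKENS`, `torch` and `LA` being defined by setup cells that are unchanged in this diff. A minimal, hypothetical stand-in for those prerequisites, assuming you pull the token embeddings straight from the CLIP text encoder instead of the notebook's pre-saved files (variable names chosen to match the cell above; the 49407 value follows the count used in the guide below):

```python
# Hypothetical setup sketch -- not the notebook's actual setup cell.
import json
import torch
from torch import linalg as LA
from huggingface_hub import hf_hub_download
from transformers import CLIPTextModel

NUM_TOKENS = 49407

# token[id] -> the 1x768 embedding vector for that vocab entry
text_model = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
token = text_model.get_input_embeddings().weight.detach()

# vocab[id] -> the token string (vocab.json maps string -> id, so invert it)
vocab_path = hf_hub_download("openai/clip-vit-large-patch14", "vocab.json")
with open(vocab_path, encoding="utf-8") as f:
    vocab = {i: s for s, i in json.load(f).items()}
```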
+  {
+   "cell_type": "markdown",
+   "source": [
+    "# ↓ Sub modules (use these to build your own projects) ↓"
+   ],
+   "metadata": {
+    "id": "_d8WtPgtAymM"
+   }
+  },
   {
    "cell_type": "code",
    "source": [

    "#You can leave the 'prompt' field empty to get a random value tensor. Since the tensor is random value, it will not correspond to any tensor in the vocab.json list , and this it will have no ID."
    ],
    "metadata": {
+    "id": "RPdkYzT2_X85"
    },
    "execution_count": null,
+   "outputs": []
   },
   {
    "cell_type": "code",
    "source": [
     "\n",
     "\n",
+    "# How does this notebook work?\n",
     "\n",
     "Similiar vectors = similiar output in the SD 1.5 / SDXL / FLUX model\n",
     "\n",
+    "CLIP converts the prompt text to vectors (“tensors”) , with float32 values usually ranging from -1 to 1.\n",
     "\n",
+    "Dimensions are \\[ 1x768 ] tensors for SD 1.5 , and a \\[ 1x768 , 1x1024 ] tensor for SDXL and FLUX.\n",
     "\n",
     "The SD models and FLUX converts these vectors to an image.\n",
     "\n",
+    "This notebook takes an input string , tokenizes it and matches the first token against the 49407 token vectors in the vocab.json : https://huggingface.co/black-forest-labs/FLUX.1-dev/tree/main/tokenizer\n",
     "\n",
     "It finds the “most similiar tokens” in the list. Similarity is the theta angle between the token vectors.\n",
     "\n",
     "<div>\n",
     "<img src=\"https://huggingface.co/datasets/codeShare/sd_tokens/resolve/main/cosine.jpeg\" width=\"300\"/>\n",
     "</div>\n",

     "\n",
     "Negative similarity is also possible.\n",
     "\n",
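As a minimal sketch of that similarity measure (this is not the notebook's exact code; it pulls the token vectors straight from the SD 1.5 text encoder rather than from a pre-saved file):

```python
# Cosine similarity (cos of the theta angle) between two CLIP token vectors.
import torch
from transformers import AutoTokenizer, CLIPTextModel

tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_model = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
emb = text_model.get_input_embeddings().weight.detach()   # ~49k x 768 token vectors

id_a = tokenizer(text="banana")["input_ids"][1]            # [1] skips <|startoftext|>
id_b = tokenizer(text="fruit")["input_ids"][1]
A, B = emb[id_a], emb[id_b]

cos = torch.dot(A, B) / (A.norm() * B.norm())
print(f"similarity = {cos.item() * 100:.2f} %")             # can also be negative
```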
+    "# How can I use it?\n",
     "\n",
+    "If you are bored of prompting “girl” and want something similar, you can run this notebook and use the “chick” token at 21.88% similarity , for example\n",
     "\n",
+    "You can also run a mixed search , like “cute+girl”/2 , where for example “kpop” has a 16.71% similarity\n",
     "\n",
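A rough sketch of such a mixed search (not the notebook's code; the 0.5/0.5 weighting mirrors the Average option in the cell further up, and the token embeddings are pulled directly from the CLIP text encoder):

```python
# Mix two token vectors and rank the whole vocab by cosine similarity to the mix.
import torch
from transformers import AutoTokenizer, CLIPTextModel

tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_model = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
emb = text_model.get_input_embeddings().weight.detach()

def first_id(text):
    return tokenizer(text=text)["input_ids"][1]

mix = 0.5 * emb[first_id("cute")] + 0.5 * emb[first_id("girl")]
sims = torch.nn.functional.cosine_similarity(mix.unsqueeze(0), emb)   # one score per token
values, indices = sims.topk(10)
for v, i in zip(values.tolist(), indices.tolist()):
    print(f"{tokenizer.convert_ids_to_tokens([i])[0]}  {v * 100:.2f} %")
```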
+    "There are some strange tokens the further down the list you go. Example: tokens similar to the token \"pewdiepie</w>\" (yes, this is an actual token that exists in CLIP)\n",
+    "\n",
+    "<div>\n",
+    "<img src=\"https://lemmy.world/pictrs/image/a1cd284e-3341-4284-9949-5f8b58d3bd1f.jpeg\" width=\"300\"/>\n",
+    "</div>\n",
+    "\n",
+    "Each of these corresponds to a unique 1x768 token vector.\n",
+    "\n",
+    "The higher the ID value , the less often the token appeared in the CLIP training data.\n",
+    "\n",
+    "To reiterate; this is the CLIP model training data , not the SD-model training data.\n",
+    "\n",
+    "So for certain models , tokens with high ID can give very consistent results , if the SD model is trained to handle them.\n",
+    "\n",
+    "An example of this is anime models , where Japanese artist names can affect the output greatly.\n",
+    "\n",
+    "Tokens with high ID will often give the \"fun\" output when used in very short prompts.\n",
+    "\n",
+
"# What about token vector length?\n",
|
540 |
+
"\n",
|
541 |
+
"If you are wondering about token magnitude,\n",
|
542 |
+
"Prompt weights like (banana:1.2) will scale the magnitude of the corresponding 1x768 tensor(s) by 1.2 . So thats how prompt token magnitude works.\n",
|
543 |
+
"\n",
|
544 |
+
"Source: [https://huggingface.co/docs/diffusers/main/en/using-diffusers/weighted\\_prompts](https://www.google.com/url?q=https%3A%2F%2Fhuggingface.co%2Fdocs%2Fdiffusers%2Fmain%2Fen%2Fusing-diffusers%2Fweighted_prompts)\\*\n",
|
545 |
"\n",
|
546 |
"So TLDR; vector direction = “what to generate” , vector magnitude = “prompt weights”\n",
|
547 |
"\n",
|
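A tiny self-contained illustration of that claim (the 1.2 factor is the assumed scaling behaviour described in the weighted-prompts doc linked above):

```python
import torch

v = torch.randn(768)        # stand-in for one 1x768 token vector
w = 1.2 * v                 # (token:1.2) -> same direction, 1.2x the magnitude

print(torch.nn.functional.cosine_similarity(w, v, dim=0).item())  # 1.0 -> direction unchanged
print((w.norm() / v.norm()).item())                               # 1.2 -> magnitude scaled
```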
+    "# How prompting works (technical summary)\n",
+    "\n",
+    "1. There is no correct way to prompt.\n",
+    "\n",
+    "2. Stable diffusion reads your prompt left to right, one token at a time, finding association _from_ the previous token _to_ the current token _and to_ the image generated thus far (Cross Attention Rule)\n",
+    "\n",
+    "3. Stable Diffusion is an optimization problem that seeks to maximize similarity to the prompt and minimize similarity to the negatives (Optimization Rule)\n",
+    "\n",
+    "Reference material (covers all of SD , so not a focused source , but the info is there) : https://youtu.be/sFztPP9qPRc?si=ge2Ty7wnpPGmB0gi\n",
+    "\n",
+    "# The SD pipeline\n",
+    "\n",
+    "For every step (20 in total by default) for SD1.5 :\n",
+    "\n",
+    "1. Prompt text => (tokenizer)\n",
+    "2. => Nx768 token vectors => (CLIP model) =>\n",
+    "3. 1x768 encoding => ( the SD model / Unet ) =>\n",
+    "4. => _Desired_ image per Rule 3 => (sampler)\n",
+    "5. => Paint a section of the image => (image)\n",
+    "\n",
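For reference, the same flow through diffusers' off-the-shelf SD 1.5 pipeline; steps 1-5 happen inside the single `pipe(...)` call (the model id and step count here are the common defaults, not something this notebook ships):

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# tokenizer -> CLIP text encoder -> UNet denoising loop -> scheduler/sampler -> image
image = pipe("photo of a banana", num_inference_steps=20).images[0]
image.save("banana.png")
```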
+    "# Disclaimer / Trivia\n",
+    "\n",
+    "This notebook should be seen as a \"dictionary search tool\" for the vocab.json , which is the same for SD1.5 , SDXL and FLUX. Feel free to verify this by checking the 'tokenizer' folder under each model.\n",
+    "\n",
+    "vocab.json in the FLUX model , for example (1 of 2 copies) : https://huggingface.co/black-forest-labs/FLUX.1-dev/tree/main/tokenizer\n",
+    "\n",
+    "I'm using Clip-vit-large-patch14 , which is used in SD 1.5 , and is one of the two tokenizers for SDXL and FLUX : https://huggingface.co/openai/clip-vit-large-patch14/blob/main/README.md\n",
+    "\n",
+    "This set of tokens has dimension 1x768.\n",
+    "\n",
+    "SDXL and FLUX use an additional set of tokens of dimension 1x1024.\n",
+    "\n",
+    "These are not included in this notebook. Feel free to include them yourselves (I would appreciate that).\n",
+    "\n",
+    "To do so, you will have to download a FLUX and/or SDXL model ,\n",
+    "\n",
+    "and copy the 49407x1024 tensor list that is stored within the model and then save it as a .pt file.\n",
+    "\n",
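If you want to attempt that, a hypothetical starting point is sketched here: it loads SDXL's second text encoder from the model's `text_encoder_2` subfolder and dumps its token-embedding matrix (the output filename is made up, and the exact shape depends on the encoder):

```python
import torch
from transformers import CLIPTextModelWithProjection

enc2 = CLIPTextModelWithProjection.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="text_encoder_2"
)
emb2 = enc2.get_input_embeddings().weight.detach()   # [vocab_size, hidden_dim]
print(emb2.shape)
torch.save(emb2, "sdxl_text_encoder_2_token_embeddings.pt")
```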
+    "//---//\n",
+    "\n",
+    "I am aware it is actually the 1x768 text_encoding being processed into an image for the SD models + FLUX.\n",
+    "\n",
+    "As such , I've included a text_encoding comparison at the bottom of the Notebook.\n",
+    "\n",
+    "I am also aware that SDXL and FLUX use additional encodings , which are not included in this notebook.\n",
+    "\n",
+    "* Clip-vit-bigG for SDXL: https://huggingface.co/laion/CLIP-ViT-bigG-14-laion2B-39B-b160k/blob/main/README.md\n",
+    "\n",
+    "* And the T5 text encoder for FLUX. I have 0% understanding of the FLUX T5 text_encoder.\n",
+    "\n",
+    "//---//\n",
+    "\n",
+    "If you want them , feel free to include them yourself and share the results (cuz I probably won't) :)!\n",
+    "\n",
+    "That being said , being an encoding , I reckon the CLIP Nx768 => 1x768 mapping should be roughly \"linear\" (or whatever one might call it) ,\n",
+    "\n",
+    "so exchange a few tokens in the Nx768 for something similar , and the resulting 1x768 ought to be kinda similar to the 1x768 we had earlier. Hopefully.\n",
+    "\n",
+    "I feel it's important to mention this , in case some wonder why the token-to-token similarity doesn't match the text-encoding to text-encoding similarity.\n",
+    "\n",
+    "# Note regarding CLIP text encoding vs. token\n",
+    "\n",
+    "*To make this disclaimer clear; token-to-token similarity is not the same as text_encoding similarity.*\n",
+    "\n",
+    "I have to say this , since it will otherwise get (even more) confusing , as both the individual tokens and the text_encoding have dimensions 1x768.\n",
+    "\n",
+    "They are separate things. Separate results. etc.\n",
+    "\n",
+    "As such , you will not get anything useful if you start comparing similarity between a token and a text-encoding. So don't do that :)!\n",
+    "\n",
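To make the distinction concrete, a small sketch showing both objects side by side (same shape, produced by completely different parts of the model; this is illustrative, not part of the notebook):

```python
import torch
from transformers import AutoTokenizer, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-large-patch14")

# 1x768 text_encoding: the whole prompt pushed through the text transformer + projection
inputs = tokenizer(text=["photo of a banana"], return_tensors="pt", padding=True)
with torch.no_grad():
    text_encoding = model.get_text_features(**inputs)

# 1x768 token vector: a single row of the raw token-embedding table
banana_id = tokenizer(text="banana")["input_ids"][1]
token_vector = model.text_model.embeddings.token_embedding.weight[banana_id]

print(text_encoding.shape, token_vector.shape)   # same shape, different things
```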
+    "# What about the CLIP image encoding?\n",
+    "\n",
+    "The CLIP model can also do an image_encoding of an image, where the output will be a 1x768 tensor. These _can_ be compared with the text_encoding.\n",
+    "\n",
+    "Comparing the CLIP image_encoding with the CLIP text_encoding for a bunch of random prompts until you find the \"highest similarity\" is the method used in the CLIP interrogator : https://huggingface.co/spaces/pharmapsychotic/CLIP-Interrogator\n",
+    "\n",
+    "The list of random prompts for the CLIP interrogator can be found here , for reference : https://github.com/pharmapsychotic/clip-interrogator/tree/main/clip_interrogator/data\n",
+    "\n",
+    "The CLIP image_encoding is not included in this Notebook.\n",
+    "\n",
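A minimal sketch of that image-to-text comparison (the image path and candidate prompts are placeholders; this shows the general pattern, not the CLIP interrogator's actual code):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("my_image.jpg")                       # placeholder image
prompts = ["photo of a banana", "photo of a cat"]        # placeholder candidates

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    image_encoding = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_encodings = model.get_text_features(input_ids=inputs["input_ids"],
                                             attention_mask=inputs["attention_mask"])

# cosine similarity of the image_encoding against each candidate text_encoding
sims = torch.nn.functional.cosine_similarity(image_encoding, text_encodings)
print(dict(zip(prompts, sims.tolist())))
```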
+    "If you spot errors / have ideas for improvements ; feel free to fix the code in your own notebook and post the results.\n",
+    "\n",
+    "I'd appreciate that over people saying \"your math is wrong you n00b!\" with no constructive feedback.\n",
+    "\n",
+    "//---//\n",
+    "\n",
+    "Regarding the output :\n",
+    "\n",
+    "# What are the </w> symbols?\n",
+    "\n",
+    "The whitespace symbol indicates whether the tokenized item ends with whitespace ( the suffix \"banana</w>\" => \"banana \" ) or not (the prefix \"post\" in \"post-apocalyptic \")\n",
+    "\n",
+    "For ease of reference , I call them prefix-tokens and suffix-tokens.\n",
+    "\n",
+    "Sidenote:\n",
+    "\n",
+    "Prefix tokens have the unique property that they \"mutate\" suffix tokens.\n",
+    "\n",
+    "Example: \"photo of a #prefix#-banana\"\n",
+    "\n",
+    "where #prefix# is a randomly selected prefix-token from the vocab.json\n",
+    "\n",
+    "The hyphen \"-\" exists to guarantee the tokenized text splits into the written #prefix# and #suffix# tokens respectively. The \"-\" hyphen symbol can be replaced by any other special character of your choosing.\n",
+    "\n",
+    "Capital letters work too , e.g. \"photo of a #prefix#Abanana\" , since the capital letters A-Z are only listed once in the entire vocab.json.\n",
+    "\n",
+    "You can also choose to omit any separator and just rawdog it with the prompt \"photo of a #prefix#banana\" , however know that this may , on occasion , be tokenized as completely different tokens of lower IDs.\n",
+    "\n",
+    "Curiously , common NSFW terms found online have been purposefully fragmented into separate #prefix# and #suffix# counterparts in the CLIP vocab.json. Likely for PR reasons.\n",
+    "\n",
+    "You can verify the results using this online tokenizer: https://sd-tokenizer.rocker.boo/\n",
+    "\n",
+    "<div>\n",
+    "<img src=\"https://lemmy.world/pictrs/image/43467d75-7406-4a13-93ca-cdc469f944fc.jpeg\" width=\"300\"/>\n",
+    "<img src=\"https://lemmy.world/pictrs/image/c0411565-0cb3-47b1-a788-b368924d6f17.jpeg\" width=\"300\"/>\n",
+    "<img src=\"https://lemmy.world/pictrs/image/c27c6550-a88b-4543-9bd7-067dff016be2.jpeg\" width=\"300\"/>\n",
+    "</div>\n",
+    "\n",
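You can also inspect the same thing locally with the notebook's tokenizer; the exact splits depend on the CLIP BPE merges, so treat the printed pieces as something to explore rather than a fixed answer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-large-patch14")

for text in ["banana", "post-apocalyptic", "photo of a #prefix#-banana"]:
    ids = tokenizer(text=text)["input_ids"][1:-1]   # drop <|startoftext|> / <|endoftext|>
    print(f"{text!r} -> {tokenizer.convert_ids_to_tokens(ids)}")
    # items ending in </w> are suffix-tokens, the rest are prefix-tokens
```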
+    "# What are those gibberish tokens that show up?\n",
+    "\n",
+    "The gibberish tokens like \"ðŁĺħ\\</w>\" are actually emojis!\n",
+    "\n",
+    "Try writing some emojis in this online tokenizer to see the results: https://sd-tokenizer.rocker.boo/\n",
+    "\n",
+    "It is a bit borked as it can't process capital letters properly.\n",
+    "\n",
+    "Also note that this is not reversible.\n",
+    "\n",
+    "If tokenization \"😅\" => \"ðŁĺħ</w>\" ,\n",
+    "\n",
+    "then you can't prompt \"ðŁĺħ\" and expect to get the same result as the tokenized original emoji , \"😅\".\n",
+    "\n",
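A quick way to see this with the same tokenizer instead of the online tool (the exact pieces printed depend on the CLIP byte-level BPE):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-large-patch14")

print(tokenizer.tokenize("😅"))            # byte-level BPE pieces, e.g. something like 'ðŁĺħ</w>'
print(tokenizer(text="😅")["input_ids"])   # the IDs those pieces map to
```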
+    "SD 1.5 models actually have training for emojis.\n",
+    "\n",
+    "But you have to set CLIP skip to 1 for this to work as intended.\n",
+    "\n",
+    "For example, this is the result from \"photo of a 🧔🏻♂️\"\n",
+    "\n",
+    "\n",
+    "<div>\n",
+    "<img src=\"https://lemmy.world/pictrs/image/e2b51aea-6960-4ad0-867e-8ce85f2bd51e.jpeg\" width=\"300\"/>\n",
+    "</div>\n",
+    "\n",
+    "That concludes the tutorial on stuff you can do with the vocab.json.\n",
+    "\n",
+    "Anyways, have fun with the notebook.\n",
+    "\n",
+    "There might be some updates in the future with features not mentioned here.\n",
     "\n",
+    "//---//"
    ],
    "metadata": {
     "id": "njeJx_nSSA8H"