DavidAU committed · Commit 73dced2 · verified · 1 Parent(s): 6bf4853

Update README.md

Files changed (1): README.md (+326 -275)

README.md CHANGED
@@ -30,6 +30,8 @@ tags:
30
 
31
 <h3>Maximizing Model Performance for All Quant Types And Full-Precision using Samplers, Advanced Samplers and Parameters Guide</h3>
32
 
 
 
33
  This document includes detailed information, references, and notes for general parameters, samplers and
34
 advanced samplers to get the most out of your model's abilities, including notes / settings for the most popular AI/LLM apps in use (LLAMACPP, KoboldCPP, Text-Generation-WebUI, LMStudio, SillyTavern, Ollama and others).
35
 
@@ -78,209 +80,234 @@ You will get higher quality operation overall - stronger prose, better answers,
78
 
79
  ---
80
 
81
- <h2>TESTING / Generation Example PARAMETERS AND SAMPLERS</h2>
82
 
83
  ---
84
 
85
- Primary Testing Parameters I use, including use for output generation examples at my repo:
86
 
87
- <B>Ranged Parameters:</B>
88
 
89
- temperature: 0 to 5 ("temp")
90
 
91
- repetition_penalty : 1.02 to 1.15 ("rep pen")
92
 
93
- <B>Set parameters:</B>
 
 
 
 
 
 
94
 
95
- top_k:40
 
 
 
 
 
 
96
 
97
- min_p:0.05
98
 
99
- top_p: 0.95
100
 
101
- repeat-last-n: 64 (also called: "repetition_penalty_range" / "rp range" )
 
 
 
102
 
103
- I do not set any other settings, parameters or have samplers activated when generating examples.
 
 
 
 
 
 
 
 
104
 
105
- Everything else is "zeroed" / "disabled".
106
-
107
- These parameters/settings are considered both safe and default and in most cases available to all users in all AI/LLM apps.
108
-
109
- Note for Class 3/Class 4 models (discussed below) "repeat-last-n" is a CRITICAL setting.
110
 
111
  ---
112
 
113
- <h2>SOURCE FILES for my Models / APPS to Run LLMs / AIs:</h2>
114
 
115
  ---
116
 
117
- Source files / Source models of my models are located here (also upper right menu on this page):
118
-
119
- [ https://huggingface.co/collections/DavidAU/d-au-source-files-for-gguf-exl2-awq-gptq-hqq-etc-etc-66b55cb8ba25f914cbf210be ]
120
-
121
- You will need the config files to use "llamacpp_HF" loader ("text-generation-webui") [ https://github.com/oobabooga/text-generation-webui ]
122
-
123
- You can also use the full source in "text-generation-webui" too.
124
-
125
- As an alternative you can use GGUFs directly in "KOBOLDCPP" / "SillyTavern" without the "config files" and still use almost all the parameters, samplers and advanced samplers.
126
-
127
- <B>Parameters, Samplers and Advanced Samplers</B>
128
-
129
- Sections 1a, 1b, and 1c below cover all the LLAMA_CPP parameters and samplers.
130
-
131
- I have added notes below each one for adjustment / enhancement(s) for specific use cases.
132
-
133
- TEXT-GENERATION-WEBUI
134
-
135
- Section 2 covers additional samplers, which become available when using the "llamacpp_HF" loader in https://github.com/oobabooga/text-generation-webui
136
- AND/OR https://github.com/LostRuins/koboldcpp ("KOBOLDCPP").
137
-
138
- The "llamacpp_HF" (for "text-generation-webui") only requires the GGUF you want to use plus a few config files from "source repo" of the model.
139
-
140
- (this process is automated with this program, just enter the repo(s) urls -> it will fetch everything for you)
141
-
142
- This allows access to very advanced samplers in addition to all the parameters / samplers here.
143
-
144
- KOBOLDCPP:
145
-
146
- Note that https://github.com/LostRuins/koboldcpp also allows access to all LLAMACPP parameters/samplers, as well as additional advanced samplers.
147
-
148
- You can use almost all parameters, samplers and advanced samplers using "KOBOLDCPP" without the need to get the source config files (the "llamacpp_HF" step).
149
 
150
- Note: This program has one of the newest samplers called "Anti-slop" which allows phrase/word banning at the generation level.
151
 
152
- SILLYTAVERN:
153
 
154
- Note that https://github.com/SillyTavern/SillyTavern also allows access to all LLAMACPP parameters/samplers, as well as additional advanced samplers.
155
 
156
- You can use almost all parameters, samplers and advanced samplers using "SILLYTAVERN" without the need to get the source config files (the "llamacpp_HF" step).
157
 
158
- For CLASS3 and CLASS4 the most important setting is "SMOOTHING FACTOR" (Quadratic Smoothing) ; information is located on this page:
159
 
160
- https://docs.sillytavern.app/usage/common-settings/
161
 
162
- Critical Note:
163
 
164
- Silly Tavern allows you to "connect" (via API) to different AI programs/apps like Koboldcpp, Llamacpp (server), Text Generation Webui, Lmstudio, Ollama ... etc etc.
165
 
166
- You "load" a model in one of these, then connect Silly Tavern to the App via API. This way you can use any model, and Sillytavern becomes the interface between
167
- the AI model and you directly. Sillytavern opens an interface in your browser.
168
 
169
- In Sillytavern you can then adjust parameters, samplers and advanced samplers ; there are also PRESET parameter/samplers too and you can save your favorites too.
170
 
171
- Currently, at the time of this writing, connecting Silly Tavern via KoboldCPP or Text Generation Webui will provide the most samplers/parameters.
172
 
173
- However for some, connecting to Lmstudio, LlamaCPP, or Ollama may be preferred.
174
 
175
- NOTE:
176
 
177
- It appears that Silly Tavern also supports "DRY" and "XTC" too ; but it is not yet in the documentation at the time of writing.
 
 
 
 
 
 
 
 
 
 
 
178
 
179
- You may also want to check out how to connect SillyTavern to local AI "apps" running on your pc here:
 
 
180
 
181
- https://docs.sillytavern.app/usage/api-connections/
182
 
 
183
 
184
- OTHER PROGRAMS:
185
 
186
- Other programs like https://www.LMStudio.ai allow access to most of the STANDARD samplers, whereas for others (llamacpp only here) you may need to add to the json file(s) for a model and/or template preset.
187
 
188
- In most cases all llama_cpp parameters/samplers are available when using API / headless / server mode in "text-generation-webui", "koboldcpp", "SillyTavern", "Ollama", and "LMStudio" (as well as other apps).
 
189
 
190
- You can also use llama_cpp directly too. (IE: llama-server.exe) ; see :
191
 
192
- https://github.com/ggerganov/llama.cpp
193
 
194
- (scroll down on the main page for more apps/programs to use GGUFs too that connect to / use the LLAMA-CPP package.)
195
 
196
- Special note:
197
 
198
- It appears "DRY" / "XTC" samplers has been added to LLAMACPP and SILLYTAVERN.
199
 
200
- They are available (Llamacpp) via "server.exe / llama-server.exe". Likely these samplers will also become available "downstream" in applications that use LLAMACPP in due time.
201
 
202
- [ https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md ]
203
 
204
- Operating Systems:
205
 
206
- Most AI/LLM apps operate on Windows, Mac, and Linux.
207
 
208
- Mobile devices (and O/S) are in many cases also supported.
209
 
210
- ---
211
 
212
- <h2>DETAILED NOTES ON PARAMETERS, SAMPLERS and ADVANCED SAMPLERS:</h2>
213
 
214
- ---
215
 
216
- Most AI / LLM apps allow saving a "profile" of parameters and samplers - your "favorite" settings.
217
 
218
- Text Generation Web Ui, Koboldcpp, Silly Tavern all have this feature and also "presets" (parameters/samplers set already) too.
219
 
220
- Other AI/LLM apps also have this feature to varying degrees too.
221
 
222
- DETAILS on PARAMETERS / SAMPLERS:
223
 
224
- For additional details on these sampler settings (including advanced ones) you may also want to check out:
225
 
226
- https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab
227
 
228
- (NOTE: Not all of these "options" are available for GGUFS, including when you use "llamacpp_HF" loader in "text-generation-webui" )
 
229
 
230
- Additional Links (on parameters, samplers and advanced samplers):
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
231
 
232
- DRY
233
- - https://github.com/oobabooga/text-generation-webui/pull/5677
234
- - https://www.reddit.com/r/KoboldAI/comments/1e49vpt/dry_sampler_questionsthat_im_sure_most_of_us_are/
235
- - https://www.reddit.com/r/KoboldAI/comments/1eo4r6q/dry_settings_questions/
236
 
237
- Samplers:
238
 
239
- https://gist.github.com/kalomaze/4473f3f975ff5e5fade06e632498f73e
240
 
241
- (see also community tab for more discussions here)
 
 
242
 
243
- https://huggingface.co/LWDCLS/LLM-Discussions/discussions/2
244
 
245
- Silly Tavern / Samplers:
246
 
247
- (see also community tab for more discussions here)
248
 
249
- https://huggingface.co/Virt-io/SillyTavern-Presets
250
 
251
- Creative Writing :
252
 
253
- https://www.reddit.com/r/LocalLLaMA/comments/1c36ieb/comparing_sampling_techniques_for_creative/
254
 
255
- General Parameters:
256
 
257
- https://arxiv.org/html/2408.13586v1
258
 
259
- https://www.reddit.com/r/LocalLLaMA/comments/17vonjo/your_settings_are_probably_hurting_your_model_why/
260
 
261
- Benchmarking-and-Guiding-Adaptive-Sampling-Decoding
 
 
 
 
262
 
263
- https://github.com/ZhouYuxuanYX/Benchmarking-and-Guiding-Adaptive-Sampling-Decoding-for-LLMs
264
 
265
- LLAMACPP-SERVER EXE:
266
 
267
- https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
268
 
269
- The Local LLM Settings Guide/Rant (covers a lot of parameters/samplers - lots of detail)
270
 
271
- https://rentry.org/llm-settings
272
 
273
- A Visual Guide of some top parameters / Samplers in action which you can play with and see how they interact:
274
 
275
- https://artefact2.github.io/llm-sampling/index.xhtml
276
 
277
- NOTE:
278
 
279
- I have also added notes too in the sections below for almost all parameters, samplers, and advanced samplers as well.
280
 
281
- OTHER:
282
 
283
- Depending on the AI/LLM "apps" you are using, additional reference material for parameters / samplers may also exist.
284
 
285
  ---
286
 
@@ -335,187 +362,141 @@ There are no "Class 5" models published... yet.
335
 
336
  ---
337
 
338
- <h2>QUANTS:</h2>
339
 
340
  ---
341
 
342
- Please note that smaller quant(s) IE: Q2K, IQ1s, IQ2s and some IQ3s (especially for models of 8B parameters or less) may require additional adjustment(s). For these quants
343
- you may need to increase the "penalty" sampler(s) and/or advanced sampler(s) to compensate for the compression damage of the model.
344
-
345
- For models of 20B parameters and higher, generally this is not a major concern as the parameters can make up for compression damage at lower quant levels (IE Q2K+, but at least Q3 ; IQ2+, but at least IQ3+).
346
-
347
- IQ1s: Generally IQ1_S rarely works for models less than 30B parameters. IQ1_M is however almost twice as stable/usable relative to IQ1_S.
348
 
349
- Generally it is recommended to run the highest quant(s) you can on your machine ; but at least Q4KM/IQ4XS as a minimum for models 20B and lower.
350
 
351
- The smaller the size of model, the greater the contrast between the smallest quant and largest quant in terms of operation, quality, nuance and general overall function.
352
 
353
- There is an exception to this , see "Neo Imatrix" below and "all quants" (cpu only operation).
354
 
355
- IMATRIX:
356
 
357
- Imatrix quants generally improve all quants, and also allow you to use smaller quants (less memory, more context space) and retain quality of operation.
358
 
359
- IE: Instead of using a q4KM, you might be able to run an IQ3_M and get close to Q4KM's quality, but at a higher token per second speed and have more VRAM for context.
360
 
 
361
 
362
- <B>Recommended Quants - ALL:</B>
363
 
364
- This covers both Imatrix and regular quants.
 
365
 
366
- Imatrix can be applied to any quant - "Q" or "IQ" - however, IQ1s to IQ3_S REQUIRE an imatrix dataset / imatrixing process before quanting.
367
 
368
- This chart shows the order in terms of "BPW" for each quant (mapped below with relative "strength" to one another) with "IQ1_S" with the least, and "Q8_0" (F16 is full precision) with the most:
369
 
370
- <small>
371
- <PRE>
372
- IQ1_S | IQ1_M
373
- IQ2_XXS | IQ2_XS | Q2_K_S | IQ2_S | Q2_K | IQ2_M
374
- IQ3_XXS | Q3_K_S | IQ3_XS | IQ3_S | IQ3_M | Q3_K_M | Q3_K_L
375
- Q4_K_S | IQ4_XS | IQ4_NL | Q4_K_M
376
- Q5_K_S | Q5_K_M
377
- Q6_K
378
- Q8_0
379
- F16
380
- </pre>
381
- </small>
382
 
383
- More BPW means better quality, but higher VRAM requirements (and larger file size) and lower tokens per second.
384
- The larger the model in terms of parameters, the lower the quant size you can run with fewer quality losses.
385
- Note that "quality losses" refers to both instruction following and output quality.
386
 
387
- Quality differences between quants are larger at the lower levels than between the higher quants.
388
 
389
- The Imatrix process has NO effect on Q8 or F16 quants.
390
 
391
- F16 is full precision, just in GGUF format.
392
 
393
- CPU ONLY CONSIDERATIONS:
394
 
395
- This section DOES NOT apply to most "Macs" because of the difference in O/S Memory, Vram and motherboard VS other frameworks.
396
 
397
- Running quants on CPU will be a lot slower than running them on a video card(s).
398
 
399
- In this special case however it may be preferred to run AS SMALL a quant as possible for token per second generation reasons.
400
 
401
- On a top, high end (and relatively new) CPU expect token per second speeds to be 1/4 (or less) a standard middle of the road video card.
402
 
403
- Older machines/cpus will be a lot slower - but models will STILL run on these as long as you have enough ram.
404
 
405
- Here are some rough comparisons:
406
 
407
- On my video card (Nvidia 16GB 4060TI) I get 160-190 tokens per second with 1B LLama 3.2 Instruct, CPU speeds are 50-60 token per second.
 
408
 
409
- On my much older machine (8 years old)(2 core), token per second speed (same 1B model) is in the 10ish token per second (CPU).
410
 
411
- Roughly 8B-12B models are the limit for CPU-only operation (in terms of "usable" tokens/second) - at the moment.
412
 
413
- This is changing as new cpus come out, designed for AI usage.
414
 
415
- ADDITIONAL QUANT INFORMATION:
416
 
417
- <details>
418
- <summary>Click here for details</summary>
419
 
420
- A great write up with charts showing various performances is provided by Artefact2 [here](https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9)
421
 
422
- The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have.
423
 
424
- If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM.
425
 
426
- If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total.
427
 
428
- Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'.
429
 
430
- If you don't want to think too much, grab one of the K-quants. These are in format 'QX_K_X', like Q5_K_M.
431
 
432
- If you want to get more into the weeds, you can check out this extremely useful feature chart:
433
 
434
- [llama.cpp feature matrix](https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix)
435
 
436
- But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQX_X, like IQ3_M. These are newer and offer better performance for their size.
437
 
438
- These I-quants can also be used on CPU and Apple Metal, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide.
439
 
440
- The I-quants are *not* compatible with Vulkan, which also supports AMD, so if you have an AMD card double check if you're using the rocBLAS build or the Vulkan build. At the time of writing this, LM Studio has a preview with ROCm support, and other inference engines have specific builds for ROCm.
441
 
442
- </details>
443
 
444
- ARM QUANTS / Q4_0_X_X:
445
 
446
- These are new quants that are specifically for computers/devices that can run "ARM" quants. If you try to run these on a "non arm" machine/device, the token per second will be VERY SLOW.
447
 
448
- Q4_0_X_X information
449
 
450
- These are *NOT* for Metal (Apple) or GPU (nvidia/AMD/intel) offloading, only ARM chips (and certain AVX2/AVX512 CPUs).
451
 
452
- If you're using an ARM chip, the Q4_0_X_X quants will have a substantial speedup. Check out Q4_0_4_4 speed comparisons [on the original pull request](https://github.com/ggerganov/llama.cpp/pull/5780#pullrequestreview-21657544660)
453
 
454
- To check which one would work best for your ARM chip, you can check [AArch64 SoC features](https://gpages.juszkiewicz.com.pl/arm-socs-table/arm-socs.html) (thanks EloyOn!).
455
 
456
- If you're using a CPU that supports AVX2 or AVX512 (typically server CPUs and AMD's latest Zen5 CPUs) and are not offloading to a GPU, the Q4_0_8_8 may offer a nice speed as well:
457
 
458
- <details>
459
- <summary>Click to view benchmarks on an AVX2 system (EPYC7702)</summary>
460
 
461
- | model | size | params | backend | threads | test | t/s | % (vs Q4_0) |
462
- | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: |
463
- | qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% |
464
- | qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% |
465
- | qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% |
466
- | qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% |
467
- | qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% |
468
- | qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% |
469
- | qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% |
470
- | qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% |
471
- | qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% |
472
- | qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% |
473
- | qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% |
474
- | qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% |
475
- | qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% |
476
- | qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% |
477
- | qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% |
478
- | qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% |
479
- | qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% |
480
- | qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% |
481
 
482
- Q4_0_8_8 offers a nice bump to prompt processing and a small bump to text generation
483
 
484
- </details>
485
 
486
- <B>NEO Imatrix Quants / Neo Imatrix X Quants</B>
487
 
488
- NEO Imatrix quants use specialized, specifically "themed" datasets to slightly alter the weights in a model. All Imatrix datasets do this to some degree or another; however, NEO Imatrix datasets
489
- are content / theme specific and have been calibrated to have maximum effect on a model (relative to standard Imatrix datasets). Calibration was made possible after testing 50+ standard Imatrix datasets,
490
- and carefully modifying them and testing the resulting changes to determine the exact format and content which has the maximum effect on a model via the Imatrix process.
491
 
492
- Please keep in mind that the Imatrix process (at its strongest) only "tints" a model and/or slightly changes its bias(es).
493
 
494
- Here are some Imatrix Neo Models:
495
 
496
- [ https://huggingface.co/DavidAU/Command-R-01-Ultra-NEO-DARK-HORROR-V1-V2-35B-IMATRIX-GGUF ]
497
 
498
- [ https://huggingface.co/DavidAU/Command-R-01-200xq-Ultra-NEO-V1-35B-IMATRIX-GGUF ]
499
 
500
- [ https://huggingface.co/DavidAU/Command-R-01-200xq-Ultra-NEO-V1-35B-IMATRIX-GGUF ] (this is an X-Quant)
501
 
502
- [ https://huggingface.co/DavidAU/Llama-3.2-1B-Instruct-NEO-SI-FI-GGUF ]
503
 
504
- [ https://huggingface.co/DavidAU/Llama-3.2-1B-Instruct-NEO-WEE-HORROR-GGUF ]
505
 
506
- [ https://huggingface.co/DavidAU/L3-8B-Stheno-v3.2-Ultra-NEO-V1-IMATRIX-GGUF ]
507
 
508
- Suggestions for Imatrix NEO quants:
509
 
510
- - The LOWER the quant the STRONGER the Imatrix effect is, and therefore the stronger the "tint" so to speak
511
- - Due to the unique nature of this project, quants IQ1s to IQ4s are recommended for maximum effect with IQ4_XS the most balanced in terms of power and bits.
512
- - Secondaries are Q2s-Q4s. Imatrix effect is still strong in these quants.
513
- - Effects diminish quickly from Q5s and up.
514
- - Q8/F16 there is no change (as the Imatrix process does not affect this quant), and therefore not included.
515
 
516
  ---
517
 
518
- <h2> Quick Reference Table </h2>
519
 
520
  ---
521
 
@@ -604,52 +585,6 @@ Please see sections below this for advanced usage, more details, settings, notes
604
 
605
  </small>
606
 
607
- ---
608
-
609
- <h2>ADVANCED: HOW TO TEST EACH PARAMETER(s), SAMPLER(s) and ADVANCED SAMPLER(s)</h2>
610
-
611
- ---
612
-
613
- 1 - Set temp to 0 (zero) and set your basic parameters, and use a prompt to get a "default" generation. A creative prompt will work better here.
614
-
615
- 2 - If you want to test basic parameter changes, test ONE at a time, then compare output (answer quality, word choice, sentence size/construction, general output qualities) to your "default" generation.
616
-
617
- 3 - Then start testing TWO parameters at a time, and comparing again. Keep in mind parameters (all) interact with each other.
618
-
619
- 4 - Samplers -> Reset your basic parameters, (temp still at zero) and test each one of these, one at a time. Then adjust settings, test again.
620
-
621
- 5 - Once you have an "idea" of how each affects your "test prompt", now test at "temp" (not zero). It may take five to ten generations to get a rough idea.
622
-
623
- Yes, testing is a lot of work - but once you get all the parameter(s) and/or sampler(s) dialed in - it is worth it.
624
-
625
- IMPORTANT: Use a "fresh chat" PER TEST (you will contaminate the results otherwise). Never use the same chat for multiple tests -> exception: Regens.
626
-
627
- Keep in mind that parameters, samplers and advanced samplers can affect the model on a per token generation basis AND/OR on a multi-token / phrase / sentence / paragraph
628
- and even complete generation basis.
629
-
630
- Everything is cumulative here, regardless of whether the parameter/sampler acts on a per-token or multi-token basis, because of how models "look back" at what was generated in some cases.
631
-
632
- And of course... each model will be different too.
633
-
634
- All that being said, it is a good idea to have specific generation quality "goals" in mind.
635
-
636
- Likewise, at my repo, I post example generations so you can get an idea (but not complete picture) of a model's generation abilities.
637
-
638
- The best way to control generation is STILL with your prompt(s) - including pre-prompts/system role. The latest gen models (and archs) have very strong
639
- instruction following, so better (or simply added!) instructions in your prompts can often make a world of difference.
640
-
641
- Not sure if the model understands your prompt(s)?
642
-
643
- Ask it ->
644
-
645
- "Check my prompt below and tell me how to make it clearer?" (prompt after this line)
646
-
647
- "For my prompt below, explain the steps you wound take to execute it" (prompt after this line)
648
-
649
- This will help the model fine tune your prompt so IT understands it.
650
-
651
- However, sometimes parameters and/or samplers are required to better "wrangle" the model and get it to perform to its maximum potential and/or fine-tune it to your use case(s).
652
-
653
 
654
  ---
655
 
@@ -1123,3 +1058,119 @@ This is also influenced by the parameter size of the model in relation to the qu
1123
 
1124
  IE: a 8B model at Q2K will be far more unstable relative to a 20B model at Q2K, and as a result require stronger settings.
1125
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
30
 
31
 <h3>Maximizing Model Performance for All Quant Types And Full-Precision using Samplers, Advanced Samplers and Parameters Guide</h3>
32
 
33
+ <h3>Maximizing Model Performance for All Quant Types And Full-Precision using Samplers, Advanced Samplers and Parameters Guide</h3>
34
+
35
  This document includes detailed information, references, and notes for general parameters, samplers and
36
 advanced samplers to get the most out of your model's abilities, including notes / settings for the most popular AI/LLM apps in use (LLAMACPP, KoboldCPP, Text-Generation-WebUI, LMStudio, SillyTavern, Ollama and others).
37
 
 
80
 
81
  ---
82
 
83
+ INDEX
84
 
85
  ---
86
 
87
+ How to Use this document:
88
 
89
+ Review the quant(s) information to select quant(s) to download, then review "Class 1,2,3..." for specific information on models, followed by "Source Files...APPS to run LLMs/AIs". "Quick reference" states the best parameter settings for each "Class" of model(s) for the best operation and/or good defaults to get started. The detailed sections about parameters - Sections 1a, 1b, 1c and Section 2 - will help tune the model(s)' operation.
90
 
91
+ The "DETAILED NOTES ON PARAMETERS, SAMPLERS and ADVANCED SAMPLERS" section after this covers and links to more information about "tuning" your model(s). These cover theory, hints, tips and tricks, and observations.
92
 
93
+ All information about parameters, samplers and advanced samplers applies to ALL models, regardless of the repo(s) you download them from.
94
 
95
+ QUANTS:
96
+ - QUANTS Detailed information.
97
+ - IMATRIX Quants
98
+ - ADDITIONAL QUANT INFORMATION
99
+ - ARM QUANTS / Q4_0_X_X
100
+ - NEO Imatrix Quants / Neo Imatrix X Quants
101
+ - CPU ONLY CONSIDERATIONS
102
 
103
+ Class 1, 2, 3 and 4 model critical notes
104
+
105
+ SOURCE FILES for my Models / APPS to Run LLMs / AIs:
106
+ - TEXT-GENERATION-WEBUI
107
+ - KOBOLDCPP
108
+ - SILLYTAVERN
109
+ - OTHER PROGRAMS
110
 
111
+ TESTING / Generation Example PARAMETERS AND SAMPLERS
112
 
113
+ Quick Reference Table - Parameters, Samplers, Advanced Samplers
114
 
115
+ Section 1a : PRIMARY PARAMETERS - ALL APPS
116
+ Section 1b : PENALTY SAMPLERS - ALL APPS
117
+ Section 1c : SECONDARY SAMPLERS / FILTERS - ALL APPS
118
+ Section 2: ADVANCED SAMPLERS
119
 
120
+ DETAILED NOTES ON PARAMETERS, SAMPLERS and ADVANCED SAMPLERS:
121
+ - DETAILS on PARAMETERS / SAMPLERS
122
+ - General Parameters
123
+ - The Local LLM Settings Guide/Rant
124
+ - LLAMACPP-SERVER EXE - usage / parameters / samplers
125
+ - DRY Sampler
126
+ - Samplers
127
+ - Creative Writing
128
+ - Benchmarking-and-Guiding-Adaptive-Sampling-Decoding
129
 
130
+ ADVANCED: HOW TO TEST EACH PARAMETER(s), SAMPLER(s) and ADVANCED SAMPLER(s)
 
 
 
 
131
 
132
  ---
133
 
134
+ <h2>QUANTS:</h2>
135
 
136
  ---
137
 
138
+ Please note that smaller quant(s) IE: Q2K, IQ1s, IQ2s and some IQ3s (especially for models of 8B parameters or less) may require additional adjustment(s). For these quants
139
+ you may need to increase the "penalty" sampler(s) and/or advanced sampler(s) to compensate for the compression damage of the model.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
140
 
141
+ For models of 20B parameters and higher, generally this is not a major concern as the parameters can make up for compression damage at lower quant levels (IE Q2K+, but at least Q3 ; IQ2+, but at least IQ3+).
142
 
143
+ IQ1s: Generally IQ1_S rarely works for models less than 30B parameters. IQ1_M is however almost twice as stable/usable relative to IQ1_S.
144
 
145
+ Generally it is recommended to run the highest quant(s) you can on your machine ; but at least Q4KM/IQ4XS as a minimum for models 20B and lower.
146
 
147
+ The smaller the size of model, the greater the contrast between the smallest quant and largest quant in terms of operation, quality, nuance and general overall function.
148
 
149
+ There is an exception to this , see "Neo Imatrix" below and "all quants" (cpu only operation).
150
 
151
+ IMATRIX:
152
 
153
+ Imatrix quants generally improve all quants, and also allow you to use smaller quants (less memory, more context space) and retain quality of operation.
154
 
155
+ IE: Instead of using a q4KM, you might be able to run an IQ3_M and get close to Q4KM's quality, but at a higher token per second speed and have more VRAM for context.
156
 
 
 
157
 
158
+ <B>Recommended Quants - ALL:</B>
159
 
160
+ This covers both Imatrix and regular quants.
161
 
162
+ Imatrix can be applied to any quant - "Q" or "IQ" - however, IQ1s to IQ3_S REQUIRE an imatrix dataset / imatrixing process before quanting.
163
 
164
+ This chart shows the order in terms of "BPW" for each quant (mapped below with relative "strength" to one another) with "IQ1_S" with the least, and "Q8_0" (F16 is full precision) with the most:
165
 
166
+ <small>
167
+ <PRE>
168
+ IQ1_S | IQ1_M
169
+ IQ2_XXS | IQ2_XS | Q2_K_S | IQ2_S | Q2_K | IQ2_M
170
+ IQ3_XXS | Q3_K_S | IQ3_XS | IQ3_S | IQ3_M | Q3_K_M | Q3_K_L
171
+ Q4_K_S | IQ4_XS | IQ4_NL | Q4_K_M
172
+ Q5_K_S | Q5_K_M
173
+ Q6_K
174
+ Q8_0
175
+ F16
176
+ </pre>
177
+ </small>
178
 
179
+ More BPW means better quality, but higher VRAM requirements (and larger file size) and lower tokens per second.
180
+ The larger the model in terms of parameters, the lower the quant size you can run with fewer quality losses.
181
+ Note that "quality losses" refers to both instruction following and output quality.
182
 
183
+ Differences (quality) between quants at lower levels are larger relative to higher quants differences.
184
 
185
+ The Imatrix process has NO effect on Q8 or F16 quants.
186
 
187
+ F16 is full precision, just in GGUF format.
188
 
189
+ ADDITIONAL QUANT INFORMATION:
190
 
191
+ <details>
192
+ <summary>Click here for details</summary>
193
 
194
+ A great write up with charts showing various performances is provided by Artefact2 [here](https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9)
195
 
196
+ The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have.
197
 
198
+ If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM.
199
 
200
+ If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total.
201
 
202
+ Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'.
203
 
204
+ If you don't want to think too much, grab one of the K-quants. These are in format 'QX_K_X', like Q5_K_M.
205
 
206
+ If you want to get more into the weeds, you can check out this extremely useful feature chart:
207
 
208
+ [llama.cpp feature matrix](https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix)
209
 
210
+ But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQX_X, like IQ3_M. These are newer and offer better performance for their size.
211
 
212
+ These I-quants can also be used on CPU and Apple Metal, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide.
213
 
214
+ The I-quants are *not* compatible with Vulkan, which also supports AMD, so if you have an AMD card double check if you're using the rocBLAS build or the Vulkan build. At the time of writing this, LM Studio has a preview with ROCm support, and other inference engines have specific builds for ROCm.
215
 
216
+ </details>
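To apply the "file size 1-2GB smaller than your VRAM" rule above programmatically, here is a small illustrative helper; the quant names and file sizes in the example are hypothetical, not measured values.

```python
# Illustrative helper for the sizing rule above: choose the largest quant whose
# file still leaves some VRAM headroom. The example sizes are hypothetical
# figures for an 8B-class model, not measurements.
def pick_quant(vram_gb: float, quant_sizes_gb: dict, headroom_gb: float = 1.5):
    fitting = {q: s for q, s in quant_sizes_gb.items() if s + headroom_gb <= vram_gb}
    return max(fitting, key=fitting.get) if fitting else None

example_sizes = {  # hypothetical file sizes (GB) for an 8B-class model
    "IQ3_M": 3.8, "Q4_K_S": 4.4, "Q4_K_M": 4.9,
    "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5,
}

print(pick_quant(8.0, example_sizes))   # -> "Q5_K_M" with these example numbers
print(pick_quant(6.0, example_sizes))   # -> "Q4_K_S"
```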
217
 
218
+ ARM QUANTS / Q4_0_X_X:
219
 
220
+ These are new quants that are specifically for computers/devices that can run "ARM" quants. If you try to run these on a "non arm" machine/device, the token per second will be VERY SLOW.
221
 
222
+ Q4_0_X_X information
223
 
224
+ These are *NOT* for Metal (Apple) or GPU (nvidia/AMD/intel) offloading, only ARM chips (and certain AVX2/AVX512 CPUs).
225
 
226
+ If you're using an ARM chip, the Q4_0_X_X quants will have a substantial speedup. Check out Q4_0_4_4 speed comparisons [on the original pull request](https://github.com/ggerganov/llama.cpp/pull/5780#pullrequestreview-21657544660)
227
 
228
+ To check which one would work best for your ARM chip, you can check [AArch64 SoC features](https://gpages.juszkiewicz.com.pl/arm-socs-table/arm-socs.html) (thanks EloyOn!).
229
 
230
+ If you're using a CPU that supports AVX2 or AVX512 (typically server CPUs and AMD's latest Zen5 CPUs) and are not offloading to a GPU, the Q4_0_8_8 may offer a nice speed as well:
231
 
232
+ <details>
233
+ <summary>Click to view benchmarks on an AVX2 system (EPYC7702)</summary>
234
 
235
+ | model | size | params | backend | threads | test | t/s | % (vs Q4_0) |
236
+ | ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: |
237
+ | qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% |
238
+ | qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% |
239
+ | qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% |
240
+ | qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% |
241
+ | qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% |
242
+ | qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% |
243
+ | qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% |
244
+ | qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% |
245
+ | qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% |
246
+ | qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% |
247
+ | qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% |
248
+ | qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% |
249
+ | qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% |
250
+ | qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% |
251
+ | qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% |
252
+ | qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% |
253
+ | qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% |
254
+ | qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% |
255
 
256
+ Q4_0_8_8 offers a nice bump to prompt processing and a small bump to text generation
 
 
 
257
 
258
+ </details>
259
 
260
+ <B>NEO Imatrix Quants / Neo Imatrix X Quants</B>
261
 
262
+ NEO Imatrix quants use specialized, specifically "themed" datasets to slightly alter the weights in a model. All Imatrix datasets do this to some degree or another; however, NEO Imatrix datasets
263
+ are content / theme specific and have been calibrated to have maximum effect on a model (relative to standard Imatrix datasets). Calibration was made possible after testing 50+ standard Imatrix datasets,
264
+ and carefully modifying them and testing the resulting changes to determine the exact format and content which has the maximum effect on a model via the Imatrix process.
265
 
266
+ Please keep in mind that the Imatrix process (at its strongest) only "tints" a model and/or slightly changes its bias(es).
267
 
268
+ Here are some Imatrix Neo Models:
269
 
270
+ [ https://huggingface.co/DavidAU/Command-R-01-Ultra-NEO-DARK-HORROR-V1-V2-35B-IMATRIX-GGUF ]
271
 
272
+ [ https://huggingface.co/DavidAU/Command-R-01-200xq-Ultra-NEO-V1-35B-IMATRIX-GGUF ]
273
 
274
+ [ https://huggingface.co/DavidAU/Command-R-01-200xq-Ultra-NEO-V1-35B-IMATRIX-GGUF ] (this is an X-Quant)
275
 
276
+ [ https://huggingface.co/DavidAU/Llama-3.2-1B-Instruct-NEO-SI-FI-GGUF ]
277
 
278
+ [ https://huggingface.co/DavidAU/Llama-3.2-1B-Instruct-NEO-WEE-HORROR-GGUF ]
279
 
280
+ [ https://huggingface.co/DavidAU/L3-8B-Stheno-v3.2-Ultra-NEO-V1-IMATRIX-GGUF ]
281
 
282
+ Suggestions for Imatrix NEO quants:
283
 
284
+ - The LOWER the quant the STRONGER the Imatrix effect is, and therefore the stronger the "tint" so to speak
285
+ - Due to the unique nature of this project, quants IQ1s to IQ4s are recommended for maximum effect with IQ4_XS the most balanced in terms of power and bits.
286
+ - Secondaries are Q2s-Q4s. Imatrix effect is still strong in these quants.
287
+ - Effects diminish quickly from Q5s and up.
288
+ - Q8/F16 there is no change (as the Imatrix process does not affect this quant), and therefore not included.
289
 
290
+ CPU ONLY CONSIDERATIONS:
291
 
292
+ This section DOES NOT apply to most "Macs" because of the difference in O/S Memory, Vram and motherboard VS other frameworks.
293
 
294
+ Running quants on CPU will be a lot slower than running them on a video card(s).
295
 
296
+ In this special case however it may be preferred to run AS SMALL a quant as possible for token per second generation reasons.
297
 
298
+ On a top, high end (and relatively new) CPU expect token per second speeds to be 1/4 (or less) a standard middle of the road video card.
299
 
300
+ Older machines/cpus will be a lot slower - but models will STILL run on these as long as you have enough ram.
301
 
302
+ Here are some rough comparisons:
303
 
304
+ On my video card (Nvidia 16GB 4060TI) I get 160-190 tokens per second with 1B LLama 3.2 Instruct, CPU speeds are 50-60 token per second.
305
 
306
+ On my much older machine (8 years old)(2 core), token per second speed (same 1B model) is in the 10ish token per second (CPU).
307
 
308
+ Roughly 8B-12B models are the limit for CPU-only operation (in terms of "usable" tokens/second) - at the moment.
309
 
310
+ This is changing as new cpus come out, designed for AI usage.
311
 
312
  ---
313
 
 
362
 
363
  ---
364
 
365
+ <h2>SOURCE FILES for my Models / APPS to Run LLMs / AIs:</h2>
366
 
367
  ---
368
 
369
+ Source files / Source models of my models are located here (also upper right menu on this page):
 
 
 
 
 
370
 
371
+ [ https://huggingface.co/collections/DavidAU/d-au-source-files-for-gguf-exl2-awq-gptq-hqq-etc-etc-66b55cb8ba25f914cbf210be ]
372
 
373
+ You will need the config files to use "llamacpp_HF" loader ("text-generation-webui") [ https://github.com/oobabooga/text-generation-webui ]
374
 
375
+ You can also use the full source in "text-generation-webui" too.
376
 
377
+ As an alternative you can use GGUFs directly in "KOBOLDCPP" / "SillyTavern" without the "config files" and still use almost all the parameters, samplers and advanced samplers.
378
 
379
+ <B>Parameters, Samplers and Advanced Samplers</B>
380
 
381
+ Sections 1a, 1b, and 1c below cover all the LLAMA_CPP parameters and samplers.
382
 
383
+ I have added notes below each one for adjustment / enhancement(s) for specific use cases.
384
 
385
+ TEXT-GENERATION-WEBUI
386
 
387
+ Section 2 covers additional samplers, which become available when using the "llamacpp_HF" loader in https://github.com/oobabooga/text-generation-webui
388
+ AND/OR https://github.com/LostRuins/koboldcpp ("KOBOLDCPP").
389
 
390
+ The "llamacpp_HF" (for "text-generation-webui") only requires the GGUF you want to use plus a few config files from "source repo" of the model.
391
 
392
+ (this process is automated with this program, just enter the repo(s) urls -> it will fetch everything for you)
393
 
394
+ This allows access to very advanced samplers in addition to all the parameters / samplers here.
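For illustration, here is a hedged sketch of doing that assembly by hand with the huggingface_hub library: one GGUF plus the tokenizer/config files from the model's source repo, placed in a single folder. The repo and file names are placeholders, and the exact list of config files can vary by model.

```python
# Hypothetical sketch of assembling a llamacpp_HF model folder by hand:
# one GGUF plus tokenizer/config files from the model's source repo, placed
# in a single folder under text-generation-webui/models.
# Repo and file names are placeholders; the config file list varies by model.
from pathlib import Path
from huggingface_hub import hf_hub_download

SOURCE_REPO = "some-org/some-model"        # placeholder: full-precision source repo
GGUF_REPO   = "some-user/some-model-GGUF"  # placeholder: GGUF repo
GGUF_FILE   = "some-model.Q4_K_M.gguf"     # placeholder: quant to use

target = Path("text-generation-webui/models/some-model-llamacpp_HF")
target.mkdir(parents=True, exist_ok=True)

for name in ["config.json", "tokenizer_config.json",
             "tokenizer.json", "special_tokens_map.json"]:
    hf_hub_download(repo_id=SOURCE_REPO, filename=name, local_dir=str(target))

hf_hub_download(repo_id=GGUF_REPO, filename=GGUF_FILE, local_dir=str(target))
```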
 
 
 
 
 
 
 
 
 
 
 
395
 
396
+ KOBOLDCPP:
 
 
397
 
398
+ Note that https://github.com/LostRuins/koboldcpp also allows access to all LLAMACPP parameters/samplers, as well as additional advanced samplers.
399
 
400
+ You can use almost all parameters, samplers and advanced samplers using "KOBOLDCPP" without the need to get the source config files (the "llamacpp_HF" step).
401
 
402
+ Note: This program has one of the newest samplers called "Anti-slop" which allows phrase/word banning at the generation level.
403
 
404
+ SILLYTAVERN:
405
 
406
+ Note that https://github.com/SillyTavern/SillyTavern also allows access to all LLAMACPP parameters/samplers, as well as additional advanced samplers.
407
 
408
+ You can use almost all parameters, samplers and advanced samplers using "SILLYTAVERN" without the need to get the source config files (the "llamacpp_HF" step).
409
 
410
+ For CLASS3 and CLASS4 the most important setting is "SMOOTHING FACTOR" (Quadratic Smoothing) ; information is located on this page:
411
 
412
+ https://docs.sillytavern.app/usage/common-settings/
413
 
414
+ Critical Note:
415
 
416
+ Silly Tavern allows you to "connect" (via API) to different AI programs/apps like Koboldcpp, Llamacpp (server), Text Generation Webui, Lmstudio, Ollama ... etc etc.
417
 
418
+ You "load" a model in one of these, then connect Silly Tavern to the App via API. This way you can use any model, and Sillytavern becomes the interface between
419
+ the AI model and you directly. Sillytavern opens an interface in your browser.
420
 
421
+ In Sillytavern you can then adjust parameters, samplers and advanced samplers ; there are also PRESET parameter/samplers too and you can save your favorites too.
422
 
423
+ Currently, at the time of this writing, connecting Silly Tavern via KoboldCPP or Text Generation Webui will provide the most samplers/parameters.
424
 
425
+ However for some, connecting to Lmstudio, LlamaCPP, or Ollama may be preferred.
426
 
427
+ NOTE:
428
 
429
+ It appears that Silly Tavern also supports "DRY" and "XTC" too ; but it is not yet in the documentation at the time of writing.
 
430
 
431
+ You may also want to check out how to connect SillyTavern to local AI "apps" running on your pc here:
432
 
433
+ https://docs.sillytavern.app/usage/api-connections/
434
 
 
435
 
436
+ OTHER PROGRAMS:
437
 
438
+ Other programs like https://www.LMStudio.ai allow access to most of the STANDARD samplers, whereas for others (llamacpp only here) you may need to add to the json file(s) for a model and/or template preset.
439
 
440
+ In most cases all llama_cpp parameters/samplers are available when using API / headless / server mode in "text-generation-webui", "koboldcpp", "SillyTavern", "Ollama", and "LMStudio" (as well as other apps).
441
 
442
+ You can also use llama_cpp directly too. (IE: llama-server.exe) ; see :
443
 
444
+ https://github.com/ggerganov/llama.cpp
445
 
446
+ (scroll down on the main page for more apps/programs to use GGUFs too that connect to / use the LLAMA-CPP package.)
447
 
448
+ Special note:
449
 
450
+ It appears "DRY" / "XTC" samplers has been added to LLAMACPP and SILLYTAVERN.
451
 
452
+ They are available (Llamacpp) via "server.exe / llama-server.exe". Likely these samplers will also become available "downstream" in applications that use LLAMACPP in due time.
453
 
454
+ [ https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md ]
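For reference, a hedged sketch of the DRY / XTC request fields involved is below. The field names follow the llama-server README linked above, but exact availability and defaults depend on the llama.cpp build you are running.

```python
# Hedged sketch: DRY / XTC sampler fields accepted by recent llama-server
# builds in a /completion request. Names follow the server README linked above;
# availability and defaults depend on the llama.cpp version in use.
dry_xtc_fields = {
    "dry_multiplier": 0.8,       # 0 disables DRY
    "dry_base": 1.75,
    "dry_allowed_length": 2,
    "dry_penalty_last_n": -1,    # -1 = consider the full context
    "xtc_probability": 0.5,      # 0 disables XTC
    "xtc_threshold": 0.1,
}
# These can be merged into the /completion payload shown in the
# TESTING section sketch further below, e.g. payload.update(dry_xtc_fields)
```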
455
 
456
+ Operating Systems:
457
 
458
+ Most AI/LLM apps operate on Windows, Mac, and Linux.
459
 
460
+ Mobile devices (and O/S) are in many cases also supported.
461
 
 
462
 
463
+ ---
464
 
465
+ <h2>TESTING / Generation Example PARAMETERS AND SAMPLERS</h2>
466
 
467
+ ---
 
468
 
469
+ Primary Testing Parameters I use, including use for output generation examples at my repo:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
470
 
471
+ <B>Ranged Parameters:</B>
472
 
473
+ temperature: 0 to 5 ("temp")
474
 
475
+ repetition_penalty : 1.02 to 1.15 ("rep pen")
476
 
477
+ <B>Set parameters:</B>
 
 
478
 
479
+ top_k:40
480
 
481
+ min_p:0.05
482
 
483
+ top_p: 0.95
484
 
485
+ repeat-last-n: 64 (also called: "repetition_penalty_range" / "rp range" )
486
 
487
+ I do not set any other settings, parameters or have samplers activated when generating examples.
488
 
489
+ Everything else is "zeroed" / "disabled".
490
 
491
+ These parameters/settings are considered both safe and default and in most cases available to all users in all AI/LLM apps.
492
 
493
+ Note for Class 3/Class 4 models (discussed below) "repeat-last-n" is a CRITICAL setting.
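As a concrete illustration, here is a minimal sketch of sending these settings to a local llama.cpp "llama-server" instance via its /completion endpoint (field names per the server README linked elsewhere in this document). The address, prompt, and the chosen temp / rep pen values within the ranged limits are assumptions for the example.

```python
# Minimal sketch: one generation request to a local llama-server using the
# settings above. Assumes a server already running at the default address;
# the prompt and the chosen temp / rep pen values are just examples.
import json
import urllib.request

payload = {
    "prompt": "Write the opening paragraph of a storm-at-sea scene.",
    "n_predict": 300,          # max new tokens for the example
    "temperature": 0.8,        # "temp" - tested anywhere in the 0 to 5 range
    "repeat_penalty": 1.05,    # "rep pen" - tested 1.02 to 1.15
    "repeat_last_n": 64,       # "rp range" / repetition_penalty_range
    "top_k": 40,
    "top_p": 0.95,
    "min_p": 0.05,
}

req = urllib.request.Request(
    "http://127.0.0.1:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["content"])
```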
494
 
 
495
 
 
 
 
 
 
496
 
497
  ---
498
 
499
+ <h2> Quick Reference Table - Parameters, Samplers, Advanced Samplers </h2>
500
 
501
  ---
502
 
 
585
 
586
  </small>
587
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
588
 
589
  ---
590
 
 
1058
 
1059
  IE: a 8B model at Q2K will be far more unstable relative to a 20B model at Q2K, and as a result require stronger settings.
1060
 
1061
+
1062
+ ---
1063
+
1064
+ <h2>DETAILED NOTES ON PARAMETERS, SAMPLERS and ADVANCED SAMPLERS:</h2>
1065
+
1066
+ ---
1067
+
1068
+ Most AI / LLM apps allow saving a "profile" of parameters and samplers - your "favorite" settings.
1069
+
1070
+ Text Generation Web Ui, Koboldcpp, Silly Tavern all have this feature and also "presets" (parameters/samplers set already) too.
1071
+
1072
+ Other AI/LLM apps also have this feature to varying degrees too.
1073
+
1074
+ DETAILS on PARAMETERS / SAMPLERS:
1075
+
1076
+ For additional details on these sampler settings (including advanced ones) you may also want to check out:
1077
+
1078
+ https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab
1079
+
1080
+ (NOTE: Not all of these "options" are available for GGUFS, including when you use "llamacpp_HF" loader in "text-generation-webui" )
1081
+
1082
+ Additional Links (on parameters, samplers and advanced samplers):
1083
+
1084
+ A Visual Guide of some top parameters / Samplers in action which you can play with and see how they interact:
1085
+
1086
+ https://artefact2.github.io/llm-sampling/index.xhtml
1087
+
1088
+ General Parameters:
1089
+
1090
+ https://arxiv.org/html/2408.13586v1
1091
+
1092
+ https://www.reddit.com/r/LocalLLaMA/comments/17vonjo/your_settings_are_probably_hurting_your_model_why/
1093
+
1094
+ The Local LLM Settings Guide/Rant (covers a lot of parameters/samplers - lots of detail)
1095
+
1096
+ https://rentry.org/llm-settings
1097
+
1098
+ LLAMACPP-SERVER EXE - usage / parameters / samplers:
1099
+
1100
+ https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
1101
+
1102
+ DRY
1103
+ - https://github.com/oobabooga/text-generation-webui/pull/5677
1104
+ - https://www.reddit.com/r/KoboldAI/comments/1e49vpt/dry_sampler_questionsthat_im_sure_most_of_us_are/
1105
+ - https://www.reddit.com/r/KoboldAI/comments/1eo4r6q/dry_settings_questions/
1106
+
1107
+ Samplers:
1108
+
1109
+ https://gist.github.com/kalomaze/4473f3f975ff5e5fade06e632498f73e
1110
+
1111
+ https://huggingface.co/LWDCLS/LLM-Discussions/discussions/2
1112
+
1113
+ https://huggingface.co/Virt-io/SillyTavern-Presets
1114
+
1115
+ Creative Writing :
1116
+
1117
+ https://www.reddit.com/r/LocalLLaMA/comments/1c36ieb/comparing_sampling_techniques_for_creative/
1118
+
1119
+ Benchmarking-and-Guiding-Adaptive-Sampling-Decoding
1120
+
1121
+ https://github.com/ZhouYuxuanYX/Benchmarking-and-Guiding-Adaptive-Sampling-Decoding-for-LLMs
1122
+
1123
+ NOTE:
1124
+
1125
+ I have also added notes too in the sections below for almost all parameters, samplers, and advanced samplers as well.
1126
+
1127
+ OTHER:
1128
+
1129
+ Depending on the AI/LLM "apps" you are using, additional reference material for parameters / samplers may also exist.
1130
+
1131
+ ---
1132
+
1133
+ <h2>ADVANCED: HOW TO TEST EACH PARAMETER(s), SAMPLER(s) and ADVANCED SAMPLER(s)</h2>
1134
+
1135
+ ---
1136
+
1137
+ 1 - Set temp to 0 (zero) and set your basic parameters, and use a prompt to get a "default" generation. A creative prompt will work better here.
1138
+
1139
+ 2 - If you want to test basic parameter changes, test ONE at a time, then compare output (answer quality, word choice, sentence size/construction, general output qualities) to your "default" generation.
1140
+
1141
+ 3 - Then start testing TWO parameters at a time, and comparing again. Keep in mind parameters (all) interact with each other.
1142
+
1143
+ 4 - Samplers -> Reset your basic parameters, (temp still at zero) and test each one of these, one at a time. Then adjust settings, test again.
1144
+
1145
+ 5 - Once you have an "idea" of how each affects your "test prompt", now test at "temp" (not zero). It may take five to ten generations to get a rough idea.
1146
+
1147
+ Yes, testing is a lot of work - but once you get all the parameter(s) and/or sampler(s) dialed in - it is worth it.
1148
+
1149
+ IMPORTANT: Use a "fresh chat" PER TEST (you will contaminate the results otherwise). Never use the same chat for multiple tests -> exception: Regens.
1150
+
1151
+ Keep in mind that parameters, samplers and advanced samplers can affect the model on a per token generation basis AND/OR on a multi-token / phrase / sentence / paragraph
1152
+ and even complete generation basis.
1153
+
1154
+ Everything is cumulative here regardless if the parameter/sampler affects per token or multi-token basis because of how models "look back" to see what was generated in some cases.
1155
+
1156
+ And of course... each model will be different too.
1157
+
1158
+ All that being said, it is a good idea to have specific generation quality "goals" in mind.
1159
+
1160
+ Likewise, at my repo, I post example generations so you can get an idea (but not complete picture) of a model's generation abilities.
1161
+
1162
+ The best way to control generation is STILL with your prompt(s) - including pre-prompts/system role. The latest gen models (and archs) have very strong
1163
+ instruction following, so better (or simply added!) instructions in your prompts can often make a world of difference.
1164
+
1165
+ Not sure if the model understands your prompt(s)?
1166
+
1167
+ Ask it ->
1168
+
1169
+ "Check my prompt below and tell me how to make it clearer?" (prompt after this line)
1170
+
1171
+ "For my prompt below, explain the steps you wound take to execute it" (prompt after this line)
1172
+
1173
+ This will help the model fine tune your prompt so IT understands it.
1174
+
1175
+ However, sometimes parameters and/or samplers are required to better "wrangle" the model and get it to perform to its maximum potential and/or fine-tune it to your use case(s).
1176
+