parameters guide
samplers guide
model generation
role play settings
quant selection
arm quants
iq quants vs q quants
optimal model setting
gibberish fixes
coherence
instruction following
quality generation
chat settings
quality settings
llamacpp server
llamacpp
lmstudio
sillytavern
koboldcpp
backyard
ollama
model generation steering
steering
model generation fixes
text generation webui
ggufs
exl2
full precision
quants
imatrix
neo imatrix

<h3>Maximizing Model Performance for All Quant Types and Full Precision using Samplers, Advanced Samplers and Parameters Guide</h3>

This document includes detailed information, references, and notes for general parameters, samplers and
advanced samplers to get the most out of your model's abilities, including notes / settings for the most popular AI/LLM apps in use (LLAMACPP, KoboldCPP, Text-Generation-WebUI, LMStudio, Sillytavern, Ollama and others).

---

INDEX

---

How to Use this document:

Review the quant(s) information to select which quant(s) to download, then review "Class 1, 2, 3..." for specific information on models, followed by "Source Files... APPS to run LLMs/AIs". "Quick Reference" states the best parameter settings for each "Class" of model to get the best operation and/or good defaults to get started with. The detailed sections about parameters - Section 1 a, b, c and Section 2 - will help tune the model's operation.

The "DETAILED NOTES ON PARAMETERS, SAMPLERS and ADVANCED SAMPLERS" section after this covers, and links to, more information about "tuning" your model(s). These cover theory, hints, tips and tricks, and observations.

All information about parameters, samplers and advanced samplers applies to ALL models, regardless of the repo(s) you download them from.

QUANTS:
- QUANTS Detailed information.
- IMATRIX Quants
- ADDITIONAL QUANT INFORMATION
- ARM QUANTS / Q4_0_X_X
- NEO Imatrix Quants / Neo Imatrix X Quants
- CPU ONLY CONSIDERATIONS

Class 1, 2, 3 and 4 model critical notes

SOURCE FILES for my Models / APPS to Run LLMs / AIs:
- TEXT-GENERATION-WEBUI
- KOBOLDCPP
- SILLYTAVERN
- OTHER PROGRAMS

TESTING / Generation Example PARAMETERS AND SAMPLERS

Quick Reference Table - Parameters, Samplers, Advanced Samplers

Section 1a : PRIMARY PARAMETERS - ALL APPS
Section 1b : PENALTY SAMPLERS - ALL APPS
Section 1c : SECONDARY SAMPLERS / FILTERS - ALL APPS
Section 2: ADVANCED SAMPLERS

DETAILED NOTES ON PARAMETERS, SAMPLERS and ADVANCED SAMPLERS:
- DETAILS on PARAMETERS / SAMPLERS
- General Parameters
- The Local LLM Settings Guide/Rant
- LLAMACPP-SERVER EXE - usage / parameters / samplers
- DRY Sampler
- Samplers
- Creative Writing
- Benchmarking-and-Guiding-Adaptive-Sampling-Decoding

ADVANCED: HOW TO TEST EACH PARAMETER(s), SAMPLER(s) and ADVANCED SAMPLER(s)

---

<h2>QUANTS:</h2>

---

Please note that smaller quant(s) IE: Q2K, IQ1s, IQ2s and some IQ3s (especially those of models 8B parameters or less) may require additional adjustment(s). For these quants
you may need to increase the "penalty" sampler(s) and/or advanced sampler(s) to compensate for the compression damage to the model.

For models of 20B parameters and higher, generally this is not a major concern, as the parameters can make up for compression damage at lower quant levels (IE Q2K+, but at least Q3 ; IQ2+, but at least IQ3+).

IQ1s: Generally IQ1_S rarely works for models of less than 30B parameters. IQ1_M is however almost twice as stable/usable relative to IQ1_S.

Generally it is recommended to run the highest quant(s) you can on your machine ; but at least Q4KM/IQ4XS as a minimum for models 20B and lower.

The smaller the model, the greater the contrast between the smallest quant and the largest quant in terms of operation, quality, nuance and general overall function.

There is an exception to this ; see "Neo Imatrix" below and "all quants" (CPU-only operation).

IMATRIX:

Imatrix quants generally improve all quants, and also allow you to use smaller quants (less memory, more context space) while retaining quality of operation.

IE: Instead of using a Q4KM, you might be able to run an IQ3_M and get close to Q4KM's quality, but at a higher tokens-per-second speed and with more VRAM free for context.

<B>Recommended Quants - ALL:</B>

This covers both Imatrix and regular quants.

Imatrix can be applied to any quant - "Q" or "IQ" - however, IQ1s to IQ3_S REQUIRE an imatrix dataset / imatrixing process before quanting.

This chart shows the order of quants in terms of "BPW" (bits per weight), mapped with relative "strength" to one another, from "IQ1_S" with the least to "Q8_0" with the most (F16 is full precision):

<small>
<PRE>
IQ1_S | IQ1_M
IQ2_XXS | IQ2_XS | Q2_K_S | IQ2_S | Q2_K | IQ2_M
IQ3_XXS | Q3_K_S | IQ3_XS | IQ3_S | IQ3_M | Q3_K_M | Q3_K_L
Q4_K_S | IQ4_XS | IQ4_NL | Q4_K_M
Q5_K_S | Q5_K_M
Q6_K
Q8_0
F16
</pre>
</small>

More BPW means better quality, but higher VRAM requirements (and larger file size) and lower tokens per second.
The larger the model in terms of parameters, the lower you can go in quant size with fewer quality losses.
Note that "quality losses" refers to both instruction following and output quality.

Quality differences between the lower quants are larger than the differences between the higher quants.

The Imatrix process has NO effect on Q8 or F16 quants.

F16 is full precision, just in GGUF format.
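
As a rough rule of thumb, a GGUF's file size can be estimated from the parameter count and the quant's approximate BPW (size in GB ≈ parameters in billions × BPW / 8). Below is a minimal, illustrative Python sketch of that arithmetic; the BPW figures are approximate averages (they vary a little per model/architecture), and the VRAM "headroom" value follows the 1-2GB guidance in the details section below:

```python
# Illustrative only: rough GGUF size estimate and quant selection by available VRAM.
# BPW values are approximate averages; real file sizes vary per model/architecture.

APPROX_BPW = {
    "IQ2_M": 2.7, "IQ3_XXS": 3.05, "Q2_K": 3.35, "IQ3_M": 3.7, "Q3_K_M": 3.9,
    "IQ4_XS": 4.3, "Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5, "F16": 16.0,
}

def est_size_gb(params_billions: float, bpw: float) -> float:
    """Approximate file size in GB: parameters (billions) * bits-per-weight / 8 bits-per-byte."""
    return params_billions * bpw / 8.0

def pick_quant(params_billions: float, vram_gb: float, headroom_gb: float = 1.5) -> str:
    """Largest quant (by BPW) whose estimated file size fits in VRAM minus headroom."""
    fitting = [(bpw, name) for name, bpw in APPROX_BPW.items()
               if est_size_gb(params_billions, bpw) <= vram_gb - headroom_gb]
    return max(fitting)[1] if fitting else "none (offload partially or use a smaller model)"

if __name__ == "__main__":
    for size_b in (8, 20, 35):
        print(f"{size_b}B model, 16GB VRAM -> {pick_quant(size_b, vram_gb=16)}")
```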

ADDITIONAL QUANT INFORMATION:

<details>
<summary>Click here for details</summary>

A great write up with charts showing various performances is provided by Artefact2 [here](https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9)

The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have.

If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM.

If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB smaller than that total.

Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'.

If you don't want to think too much, grab one of the K-quants. These are in format 'QX_K_X', like Q5_K_M.

If you want to get more into the weeds, you can check out this extremely useful feature chart:

[llama.cpp feature matrix](https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix)

But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQX_X, like IQ3_M. These are newer and offer better performance for their size.

These I-quants can also be used on CPU and Apple Metal, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide.

The I-quants are *not* compatible with Vulkan, which is also AMD, so if you have an AMD card double check if you're using the rocBLAS build or the Vulkan build. At the time of writing this, LM Studio has a preview with ROCm support, and other inference engines have specific builds for ROCm.

</details>

ARM QUANTS / Q4_0_X_X:

These are new quants specifically for computers/devices that can run "ARM" quants. If you try to run these on a "non-ARM" machine/device, the tokens per second will be VERY SLOW.

Q4_0_X_X information:

These are *NOT* for Metal (Apple) or GPU (Nvidia/AMD/Intel) offloading, only ARM chips (and certain AVX2/AVX512 CPUs).

If you're using an ARM chip, the Q4_0_X_X quants will have a substantial speedup. Check out Q4_0_4_4 speed comparisons [on the original pull request](https://github.com/ggerganov/llama.cpp/pull/5780#pullrequestreview-21657544660)

To check which one would work best for your ARM chip, you can check [AArch64 SoC features](https://gpages.juszkiewicz.com.pl/arm-socs-table/arm-socs.html) (thanks EloyOn!).

If you're using a CPU that supports AVX2 or AVX512 (typically server CPUs and AMD's latest Zen5 CPUs) and are not offloading to a GPU, Q4_0_8_8 may offer a nice speed boost as well:

<details>
<summary>Click to view benchmarks on an AVX2 system (EPYC7702)</summary>

| model | size | params | backend | threads | test | t/s | % (vs Q4_0) |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% |

Q4_0_8_8 offers a nice bump to prompt processing and a small bump to text generation.

</details>

<B>NEO Imatrix Quants / Neo Imatrix X Quants</B>

NEO Imatrix quants are specialized, specifically "themed" datasets used to slightly alter the weights in a model. All Imatrix datasets do this to some degree or another; however, NEO Imatrix datasets
are content / theme specific and have been calibrated to have maximum effect on a model (relative to standard Imatrix datasets). Calibration was made possible after testing 50+ standard Imatrix datasets,
and carefully modifying them and testing the resulting changes, to determine the exact format and content which has the maximum effect on a model via the Imatrix process.

Please keep in mind that the Imatrix process (at its strongest) only "tints" a model and/or slightly changes its bias(es).

Here are some Imatrix Neo Models:

[ https://huggingface.co/DavidAU/Command-R-01-Ultra-NEO-DARK-HORROR-V1-V2-35B-IMATRIX-GGUF ]

[ https://huggingface.co/DavidAU/Command-R-01-200xq-Ultra-NEO-V1-35B-IMATRIX-GGUF ]

[ https://huggingface.co/DavidAU/Command-R-01-200xq-Ultra-NEO-V1-35B-IMATRIX-GGUF ] (this is an X-Quant)

[ https://huggingface.co/DavidAU/Llama-3.2-1B-Instruct-NEO-SI-FI-GGUF ]

[ https://huggingface.co/DavidAU/Llama-3.2-1B-Instruct-NEO-WEE-HORROR-GGUF ]

[ https://huggingface.co/DavidAU/L3-8B-Stheno-v3.2-Ultra-NEO-V1-IMATRIX-GGUF ]

Suggestions for Imatrix NEO quants:

- The LOWER the quant, the STRONGER the Imatrix effect is, and therefore the stronger the "tint", so to speak.
- Due to the unique nature of this project, quants IQ1s to IQ4s are recommended for maximum effect, with IQ4_XS the most balanced in terms of power and bits.
- Secondaries are Q2s-Q4s. The Imatrix effect is still strong in these quants.
- Effects diminish quickly from Q5s and up.
- Q8/F16: there is no change (the Imatrix process does not affect these quants), and therefore they are not included.

CPU ONLY CONSIDERATIONS:

This section DOES NOT apply to most "Macs" because of the difference in O/S, memory, VRAM and motherboard versus other frameworks.

Running quants on CPU will be a lot slower than running them on a video card(s).

In this special case however it may be preferred to run AS SMALL a quant as possible for tokens-per-second generation reasons.

On a top, high-end (and relatively new) CPU, expect token-per-second speeds to be 1/4 (or less) of a standard, middle-of-the-road video card.

Older machines/CPUs will be a lot slower - but models will STILL run on these as long as you have enough RAM.

Here are some rough comparisons:

On my video card (Nvidia 16GB 4060TI) I get 160-190 tokens per second with 1B LLama 3.2 Instruct; CPU speeds are 50-60 tokens per second.

On my much older machine (8 years old, 2 core), token-per-second speed (same 1B model) is in the 10ish tokens per second range (CPU).

Roughly 8B-12B models are the limit for CPU-only operation (in terms of "usable" tokens/second) - at the moment.

This is changing as new CPUs come out, designed for AI usage.

---

<h2>SOURCE FILES for my Models / APPS to Run LLMs / AIs:</h2>

---

Source files / source models of my models are located here (also upper right menu on this page):

[ https://huggingface.co/collections/DavidAU/d-au-source-files-for-gguf-exl2-awq-gptq-hqq-etc-etc-66b55cb8ba25f914cbf210be ]

You will need the config files to use the "llamacpp_HF" loader ("text-generation-webui") [ https://github.com/oobabooga/text-generation-webui ]

You can also use the full source in "text-generation-webui" too.

As an alternative you can use GGUFs directly in "KOBOLDCPP" / "SillyTavern" without the "config files" and still use almost all the parameters, samplers and advanced samplers.

<B>Parameters, Samplers and Advanced Samplers</B>

In sections 1 a, b, and c below are all the LLAMA_CPP parameters and samplers.

I have added notes below each one for adjustment / enhancement(s) for specific use cases.

TEXT-GENERATION-WEBUI

Section 2 covers additional samplers, which become available when using the "llamacpp_HF" loader in https://github.com/oobabooga/text-generation-webui
AND/OR https://github.com/LostRuins/koboldcpp ("KOBOLDCPP").

The "llamacpp_HF" loader (for "text-generation-webui") only requires the GGUF you want to use plus a few config files from the "source repo" of the model.

(This process is automated with this program - just enter the repo url(s) -> it will fetch everything for you.)

This allows access to very advanced samplers in addition to all the parameters / samplers here.

KOBOLDCPP:

Note that https://github.com/LostRuins/koboldcpp also allows access to all LLAMACPP parameters/samplers, as well as additional advanced samplers.

You can use almost all parameters, samplers and advanced samplers using "KOBOLDCPP" without the need to get the source config files (the "llamacpp_HF" step).

Note: This program has one of the newest samplers, called "Anti-slop", which allows phrase/word banning at the generation level.

SILLYTAVERN:

Note that https://github.com/SillyTavern/SillyTavern also allows access to all LLAMACPP parameters/samplers, as well as additional advanced samplers.

You can use almost all parameters, samplers and advanced samplers using "SILLYTAVERN" without the need to get the source config files (the "llamacpp_HF" step).

For CLASS3 and CLASS4 models the most important setting is "SMOOTHING FACTOR" (Quadratic Smoothing) ; information is located on this page:

https://docs.sillytavern.app/usage/common-settings/

Critical Note:

Silly Tavern allows you to "connect" (via API) to different AI programs/apps like Koboldcpp, Llamacpp (server), Text Generation Webui, Lmstudio, Ollama ... etc etc.

You "load" a model in one of these, then connect Silly Tavern to the app via API. This way you can use any model, and Sillytavern becomes the interface between
the AI model and you directly. Sillytavern opens an interface in your browser.

In Sillytavern you can then adjust parameters, samplers and advanced samplers ; there are also PRESET parameters/samplers, and you can save your favorites too.

Currently, at the time of this writing, connecting Silly Tavern via KoboldCPP or Text Generation Webui will provide the most samplers/parameters.

However for some, connecting to Lmstudio, LlamaCPP, or Ollama may be preferred.

NOTE:

It appears that Silly Tavern also supports "DRY" and "XTC" too ; but it is not yet in the documentation at the time of writing.

You may also want to check out how to connect SillyTavern to local AI "apps" running on your pc here:

https://docs.sillytavern.app/usage/api-connections/

OTHER PROGRAMS:

Other programs like https://www.LMStudio.ai allow access to most of the STANDARD samplers, whereas with others (llamacpp only here) you may need to add to the json file(s) for a model and/or template preset.

In most cases all llama_cpp parameters/samplers are available when using API / headless / server mode in "text-generation-webui", "koboldcpp", "Sillytavern", "Ollama", and "LMStudio" (as well as other apps too).

You can also use llama_cpp directly too (IE: llama-server.exe) ; see:

https://github.com/ggerganov/llama.cpp

(Scroll down on the main page for more apps/programs that use GGUFs and connect to / use the LLAMA-CPP package.)

Special note:

It appears the "DRY" / "XTC" samplers have been added to LLAMACPP and SILLYTAVERN.

They are available (Llamacpp) via "server.exe / llama-server.exe". Likely these samplers will also become available "downstream" in applications that use LLAMACPP in due time.

[ https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md ]
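
To illustrate how these end up being used, below is a minimal, hedged Python sketch of a request to a local "llama-server" instance with DRY and XTC enabled. The field names and starting values are taken from the llama.cpp server documentation and community discussions at the time of writing (dry_multiplier / dry_base / dry_allowed_length, xtc_probability / xtc_threshold); treat them as assumptions to verify against your own build, since older builds may not accept them:

```python
# Hedged sketch: enabling DRY and XTC via the llama-server /completion endpoint.
# Verify the field names against the server README for your llama.cpp build.
import requests

payload = {
    "prompt": "Continue the story without repeating yourself:",
    "temperature": 0.8,
    "n_predict": 300,
    # DRY (multi-token repetition suppression); a multiplier of 0 disables it
    "dry_multiplier": 0.8,
    "dry_base": 1.75,
    "dry_allowed_length": 2,
    # XTC (randomly removes the most probable tokens to boost variety); probability 0 disables it
    "xtc_probability": 0.5,
    "xtc_threshold": 0.1,
}

r = requests.post("http://127.0.0.1:8080/completion", json=payload, timeout=300)
r.raise_for_status()
print(r.json()["content"])
```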

Operating Systems:

Most AI/LLM apps operate on Windows, Mac, and Linux.

Mobile devices (and O/S) are in many cases also supported.

---

<h2>TESTING / Generation Example PARAMETERS AND SAMPLERS</h2>

---

Primary testing parameters I use, including for the output generation examples at my repo:

<B>Ranged Parameters:</B>

temperature: 0 to 5 ("temp")

repetition_penalty : 1.02 to 1.15 ("rep pen")

<B>Set parameters:</B>

top_k: 40

min_p: 0.05

top_p: 0.95

repeat-last-n: 64 (also called: "repetition_penalty_range" / "rp range" )

I do not set any other settings or parameters, or have any samplers activated, when generating examples.

Everything else is "zeroed" / "disabled".

These parameters/settings are considered both safe and default, and in most cases are available to all users in all AI/LLM apps.

Note for Class 3/Class 4 models (discussed below): "repeat-last-n" is a CRITICAL setting.
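
For reference, here is a minimal, illustrative Python sketch of sending these baseline settings to a local llama.cpp "llama-server" instance via its /completion endpoint. The field names follow the llama.cpp server README ("repetition_penalty" maps to "repeat_penalty" there); the prompt, port, and the specific in-range values below are example assumptions - adjust them to your setup:

```python
# Minimal sketch: the baseline parameters above, sent to a local llama-server instance.
# Start the server first, e.g.:  llama-server -m your-model.gguf --port 8080
import requests

payload = {
    "prompt": "Write a short scene set in a lighthouse during a storm.",
    "temperature": 0.8,        # ranged 0 to 5 above; pick per use case
    "repeat_penalty": 1.05,    # "rep pen", ranged 1.02 to 1.15 above
    "top_k": 40,
    "min_p": 0.05,
    "top_p": 0.95,
    "repeat_last_n": 64,       # "repetition_penalty_range" / "rp range"
    "n_predict": 400,          # cap the generation length
}

r = requests.post("http://127.0.0.1:8080/completion", json=payload, timeout=300)
r.raise_for_status()
print(r.json()["content"])
```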

---

<h2> Quick Reference Table - Parameters, Samplers, Advanced Samplers </h2>

---

IE: an 8B model at Q2K will be far more unstable relative to a 20B model at Q2K, and as a result require stronger settings.

---

<h2>DETAILED NOTES ON PARAMETERS, SAMPLERS and ADVANCED SAMPLERS:</h2>

---

Most AI / LLM apps allow saving a "profile" of parameters and samplers - your "favorite" settings.

Text Generation Web Ui, Koboldcpp, and Silly Tavern all have this feature, and also "presets" (parameters/samplers already set) too.

Other AI/LLM apps also have this feature to varying degrees.

DETAILS on PARAMETERS / SAMPLERS:

For additional details on these sampler settings (including advanced ones) you may also want to check out:

https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab

(NOTE: Not all of these "options" are available for GGUFs, including when you use the "llamacpp_HF" loader in "text-generation-webui".)

Additional Links (on parameters, samplers and advanced samplers):

A visual guide to some top parameters / samplers in action, which you can play with and see how they interact:

https://artefact2.github.io/llm-sampling/index.xhtml

General Parameters:

https://arxiv.org/html/2408.13586v1

https://www.reddit.com/r/LocalLLaMA/comments/17vonjo/your_settings_are_probably_hurting_your_model_why/

The Local LLM Settings Guide/Rant (covers a lot of parameters/samplers - lots of detail):

https://rentry.org/llm-settings

LLAMACPP-SERVER EXE - usage / parameters / samplers:

https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md

DRY:
- https://github.com/oobabooga/text-generation-webui/pull/5677
- https://www.reddit.com/r/KoboldAI/comments/1e49vpt/dry_sampler_questionsthat_im_sure_most_of_us_are/
- https://www.reddit.com/r/KoboldAI/comments/1eo4r6q/dry_settings_questions/

Samplers:

https://gist.github.com/kalomaze/4473f3f975ff5e5fade06e632498f73e

https://huggingface.co/LWDCLS/LLM-Discussions/discussions/2

https://huggingface.co/Virt-io/SillyTavern-Presets

Creative Writing:

https://www.reddit.com/r/LocalLLaMA/comments/1c36ieb/comparing_sampling_techniques_for_creative/

Benchmarking-and-Guiding-Adaptive-Sampling-Decoding:

https://github.com/ZhouYuxuanYX/Benchmarking-and-Guiding-Adaptive-Sampling-Decoding-for-LLMs

NOTE:

I have also added notes in the sections below for almost all parameters, samplers, and advanced samplers as well.

OTHER:

Depending on the AI/LLM "apps" you are using, additional reference material for parameters / samplers may also exist.

---

<h2>ADVANCED: HOW TO TEST EACH PARAMETER(s), SAMPLER(s) and ADVANCED SAMPLER(s)</h2>

---

1 - Set temp to 0 (zero), set your basic parameters, and use a prompt to get a "default" generation. A creative prompt will work better here.

2 - If you want to test basic parameter changes, test ONE at a time, then compare the output (answer quality, word choice, sentence size/construction, general output qualities) to your "default" generation.

3 - Then start testing TWO parameters at a time, and compare again. Keep in mind that parameters (all of them) interact with each other.

4 - Samplers -> Reset your basic parameters (temp still at zero) and test each one of these, one at a time. Then adjust settings, and test again.

5 - Once you have an "idea" of how each affects your "test prompt", test at "temp" (not zero). It may take five to ten generations to get a rough idea.

Yes, testing is a lot of work - but once you get all the parameter(s) and/or sampler(s) dialed in, it is worth it.

IMPORTANT: Use a "fresh chat" PER TEST (you will contaminate the results otherwise). Never use the same chat for multiple tests -> exception: Regens.
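
Below is a small, illustrative Python harness for this procedure (an assumed setup, not required tooling): temp at 0, one fresh request per test against a local llama.cpp "llama-server" instance, changing ONE setting at a time and saving each output so it can be compared against the "default" generation. The URL, prompt and the specific overrides are example assumptions:

```python
# Illustrative test harness: one fresh /completion request per test, one change at a time.
import json
import requests

URL = "http://127.0.0.1:8080/completion"   # local llama-server; adjust to your setup
PROMPT = "Describe an abandoned observatory at dusk, in three paragraphs."

BASELINE = {"temperature": 0.0, "top_k": 40, "top_p": 0.95, "min_p": 0.05,
            "repeat_penalty": 1.05, "repeat_last_n": 64, "n_predict": 400}

# One variation per test; each is a separate, "fresh" request (no shared chat history).
VARIATIONS = {
    "default": {},
    "rep_pen_1.10": {"repeat_penalty": 1.10},
    "min_p_0.10": {"min_p": 0.10},
    "top_k_100": {"top_k": 100},
}

results = {}
for name, override in VARIATIONS.items():
    payload = {**BASELINE, **override, "prompt": PROMPT}
    r = requests.post(URL, json=payload, timeout=600)
    r.raise_for_status()
    results[name] = r.json()["content"]

with open("sampler_tests.json", "w") as f:
    json.dump(results, f, indent=2)   # compare word choice, sentence construction, etc. by hand
```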

Keep in mind that parameters, samplers and advanced samplers can affect the model on a per-token generation basis AND/OR on a multi-token / phrase / sentence / paragraph
and even complete generation basis.

Everything is cumulative here, regardless of whether the parameter/sampler works on a per-token or multi-token basis, because of how models "look back" at what was generated in some cases.

And of course... each model will be different too.

All that being said, it is a good idea to have specific generation quality "goals" in mind.

Likewise, at my repo, I post example generations so you can get an idea (but not a complete picture) of a model's generation abilities.

The best way to control generation is STILL with your prompt(s) - including pre-prompts/system role. The latest gen models (and archs) have very strong
instruction following, so many times better (or simply included!) instructions in your prompts can make a world of difference.

Not sure if the model understands your prompt(s)?

Ask it ->

"Check my prompt below and tell me how to make it clearer?" (prompt after this line)

"For my prompt below, explain the steps you would take to execute it" (prompt after this line)

This will help the model fine-tune your prompt so IT understands it.

However, sometimes parameters and/or samplers are required to better "wrangle" the model and get it to perform to its maximum potential and/or fine-tune it to your use case(s).