license: apache-2.0
tags:
- parameters guide
- samplers guide
- model generation
Maximizing Model Performance for All Quant Types and Full-Precision using Samplers, Advanced Samplers and Parameters Guide
This document includes detailed information, references, and notes for general parameters, samplers and advanced samplers to get the most out of your model's abilities.
These settings / suggestions can be applied to all models including GGUF, EXL2, GPTQ, HQQ, AWQ and full source/precision.
It also includes critical settings for Class 3 and Class 4 models at this repo - DavidAU - to enhance and control generation for their specific use case(s) as well as outside use case(s), including role play, chat and other use case(s).
These settings can also fix a number of model issues such as:
- "Gibberish"
- letter, word, phrase, paragraph repeats
- coherence
- creativeness, or lack thereof, or too much of it (purple prose).
Likewise, these settings can also improve model generation and/or general overall "smoothness" / "quality" of model operation.
Even if you are not using my models, you may find this document useful for any model available online.
If you are currently using model(s) that are difficult to "wrangle" then apply "Class 3" or "Class 4" settings to them.
This document will be updated over time too.
Please use the "community tab" for suggestions / edits / improvements.
PARAMETERS AND SAMPLERS
Primary Testing Parameters I use, including use for output generation examples at my repo:
Ranged Parameters:
temperature: 0 to 5 ("temp")
repetition_penalty : 1.02 to 1.15 ("rep pen")
Set parameters:
top_k:40
min_p:0.05
top_p: 0.95
repeat-last-n: 64 (also called: "repetition_penalty_range" / "rp range" )
I do not set any other settings, parameters or have samplers activated when generating examples.
Everything else is "zeroed" / "disabled".
These parameters/settings are considered both safe and default and in most cases available to all users in all apps.
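As a rough illustration, the primary testing parameters above can be collected into a single request body. This is only a sketch: the key names assume the JSON fields accepted by llama.cpp's llama-server "/completion" endpoint, and "build_payload" is a hypothetical helper, not part of any app. Other programs may use different names (for example "repetition_penalty" instead of "repeat_penalty").

```python
import json

# "Set parameters" from the list above.
# Key names are an assumption (llama-server style); check your app's docs.
PRIMARY_TESTING = {
    "top_k": 40,
    "min_p": 0.05,
    "top_p": 0.95,
    "repeat_last_n": 64,
}


def build_payload(prompt, temperature=0.8, repeat_penalty=1.05):
    """Hypothetical helper: merge the ranged and set parameters."""
    if not 0.0 <= temperature <= 5.0:
        raise ValueError("temperature is outside the 0 to 5 testing range")
    if not 1.02 <= repeat_penalty <= 1.15:
        raise ValueError("repeat_penalty is outside the 1.02 to 1.15 testing range")
    payload = dict(PRIMARY_TESTING)
    payload.update(
        prompt=prompt,
        temperature=temperature,
        repeat_penalty=repeat_penalty,
    )
    return payload


print(json.dumps(build_payload("Tell me a story."), indent=2))
```

Everything not listed in the payload is left at the app's defaults here, so to match the "zeroed" setup above you would still need to confirm no other samplers are active in your app.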
Llama CPP Parameters, Samplers and Advanced Samplers
Below are all the LLAMA_CPP parameters and samplers.
I have added notes below each one for adjustment / enhancement(s) for specific use cases.
Following this section will be additional samplers, which become available when using "llamacpp_HF" loader in https://github.com/oobabooga/text-generation-webui .
The "llamacpp_HF" only requires the GGUF you want to use plus a few config files from "source repo" of the model.
(this process is automated with this program, just enter the repo(s) urls -> it will fetch everything for you)
This allows access to very advanced samplers in addition to all the parameters / samplers here.
For additional details on these samplers settings (including advanced ones) you may also want to check out:
https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab
(NOTE: Not all of these "options" are available for GGUFS, including when you use "llamacpp_HF" loader)
Note that https://github.com/LostRuins/koboldcpp also allows access to all LLAMACPP parameters/samplers, as well as additional advanced samplers.
Other programs like https://www.LMStudio.ai allow access to most of the STANDARD samplers, whereas for the others (llamacpp only here) you may need to add them to the json file(s) for a model and/or template preset.
In most cases all llama_cpp settings are available when using API / headless / server mode in "text-generation-webui", "koboldcpp" and "lmstudio" (as well as other apps too).
You can also use llama_cpp directly too. (IE: llama-server.exe) ; see :
https://github.com/ggerganov/llama.cpp
(scroll down on the main page for more apps/programs to use GGUFs too)
CRITICAL NOTES:
Some of the models at my repo are custom designed / limited use case models. For some of these models, specific settings and/or samplers (including advanced) are recommended for best operation.
As a result I have classified the models as class 1, class 2, class 3 and class 4.
Generally all models (mine and other repos) fall under class 1 or class 2 and can be used with just about any sampler(s) / parameter(s) and advanced sampler(s).
Class 3 requires a little more adjustment because these models run closer to the ragged edge of stability. The settings for these will help control them better, especially for chat / role play and/or other use case(s). Generally speaking, this helps them behave better overall.
Class 4 models are balanced on the very edge of stability. These models are generally highly creative, built for very narrow use case(s), and closer to "human prose" than other models. With these models, advanced samplers are used to bring "these bad boys" in line, which is especially important for chat and/or role play type use cases AND/OR use case(s) these models were not designed for.
The goal here is to use parameters to raise/lower the power of the model and samplers to "prune" (or in some cases enhance) operation.
With that being said, generation "examples" (at my repo) are created using the "Primary Testing Parameters" (top of this document) settings regardless of the "class" of the model AND NO advanced settings, or samplers.
QUANTS:
Please note that smaller quant(s) IE: Q2K, IQ1s, IQ2s and some IQ3s (especially those of models size 8B parameters or less) may require additional adjustment(s). For these quants you may need to increase the "penalty" sampler(s) and/or advanced sampler(s) to compensate for the compression damage of the model.
For models of 20B parameters and higher, this is generally not a major concern, as the parameter count can make up for compression damage at lower quant levels (IE: Q2K and up, but preferably at least Q3 ; IQ2 and up, but preferably at least IQ3).
IQ1s: Generally IQ1_S rarely works for models less than 30B parameters. IQ1_M is however almost twice as stable/usable relative to IQ1_S.
Generally it is recommended to run the highest quant(s) you can on your machine ; but at least Q4KM/IQ4XS as a minimum for models 20B and lower.
The smaller the size of model, the greater the contrast between the smallest quant and largest quant in terms of operation, quality, nuance and general overall function.
Imatrix quants generally improve all quants, and also allow you to use smaller quants (less memory, more context space) and retain quality of operation.
IE: Instead of using a Q4KM, you might be able to run an IQ3_M and get close to Q4KM's quality, but at a higher tokens per second speed and with more VRAM left for context.
PRIMARY PARAMETERS:
These parameters will have SIGNIFICANT effect on prose, generation, length and content; with temp being the most powerful.
Keep in mind the biggest parameter / random "unknown" is your prompt. A word change, rephrasing, punctuation, even a comma or semi-colon can drastically alter the output, even at min temp settings. CAPS affect generation too.
temp / temperature
temperature (default: 0.8)
Primary factor to control the randomness of outputs. 0 = deterministic (only the most likely token is used). Higher value = more randomness.
Range 0 to 5. Increment at .1 per change.
Too much temp can affect instruction following in some cases and sometimes not enough = boring generation.
Newer model archs (L3,L3.1,L3.2, Mistral Nemo, Gemma2 etc) many times NEED more temp (1+) to get their best generations.
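To make the effect of "temp" concrete, here is a minimal sketch of temperature scaling on a toy set of logits. The function name and the three example logits are made up for illustration; real apps apply this to the full vocabulary.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities, scaled by temperature."""
    if temperature <= 0:
        # temp 0: deterministic, all probability on the most likely token
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the peak for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]
for temp in (0, 0.8, 1.5, 5):
    probs = softmax_with_temperature(toy_logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

As temp rises, the probabilities flatten out, so less likely tokens get picked more often; at 0 only the top token can ever be chosen.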
top-p
top-p sampling (default: 0.9, 1.0 = disabled)
If not set to 1, select tokens with probabilities adding up to less than this number. Higher value = higher range of possible random results.
I use default of: .95 ;
min-p
min-p sampling (default: 0.1, 0.0 = disabled)
Tokens with probability smaller than (min_p) * (probability of the most likely token) are discarded.
I use default: .05 ;
top-k
top-k sampling (default: 40, 0 = disabled)
Similar to top_p, but select instead only the top_k most likely tokens. Higher value = higher range of possible random results.
Bring this up to 80-120 for a lot more word choice, and below 40 for simpler word choices.
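The three filters above (top-k, top-p, min-p) all cut the list of candidate tokens, just by different rules. The sketch below runs each one on the same toy distribution so the difference is easy to see. The function names and the toy probabilities are illustrative only, and the top-p version keeps the token that crosses the threshold, which is a common way to implement it but may differ slightly from a given app.

```python
def renormalize(kept):
    """Rescale the surviving probabilities so they sum to 1 again."""
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


def top_k_filter(probs, top_k):
    """Keep only the top_k most likely tokens (0 = disabled)."""
    if top_k <= 0:
        return dict(probs)
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return renormalize(dict(ranked[:top_k]))


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their sum reaches top_p (1.0 = disabled)."""
    if top_p >= 1.0:
        return dict(probs)
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, running = {}, 0.0
    for tok, p in ranked:
        kept[tok] = p
        running += p
        if running >= top_p:
            break
    return renormalize(kept)


def min_p_filter(probs, min_p):
    """Drop tokens below min_p * (probability of the top token) (0.0 = disabled)."""
    if min_p <= 0.0:
        return dict(probs)
    cutoff = min_p * max(probs.values())
    return renormalize({tok: p for tok, p in probs.items() if p >= cutoff})


toy = {"the": 0.5, "a": 0.3, "one": 0.15, "zebra": 0.05}
print("top_k=2   ->", sorted(top_k_filter(toy, 2)))
print("top_p=0.9 ->", sorted(top_p_filter(toy, 0.9)))
print("min_p=0.2 ->", sorted(min_p_filter(toy, 0.2)))
```

Note that min-p scales with the model's confidence: when the top token is very likely the cutoff is high and few tokens survive, and when the model is unsure more tokens stay in play.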
NOTES:
For an interesting test, set "temp" to 0 ; this will give you the SAME generation for a given prompt each time. Then adjust a word, phrase, sentence etc - to see the differences. Keep in mind this will show model operation at its LEAST powerful/creative level and should NOT be used to determine if the model works for your use case(s).
Then test "at temp" to see the model in action. (5-10 generations recommended)
You can also use "temp=0" to test different quants of the same model to see generation differences. (roughly "BIAS").
Another option is testing different models (of the same quant) to see how each handles your prompt(s).
Then test "at temp" to see the MODELS in action. (5-10 generations recommended)
PENALTY SAMPLERS:
These samplers "trim" or "prune" output.
PRIMARY:
repeat-last-n
last n tokens to consider for penalize (default: 64, 0 = disabled, -1 = ctx_size) ("repetition_penalty_range" in oobabooga/text-generation-webui , "rp_range" in kobold)
THIS IS CRITICAL. Set too high, you can get all kinds of issues (repeated words, sentences, paragraphs or "gibberish"), especially with class 3 or 4 models.
This setting also works in conjunction with all other "rep pens" below.
This parameter is the "RANGE" of tokens looked at for the samplers directly below.
SECONDARIES:
repeat-penalty
penalize repeat sequence of tokens (default: 1.0, 1.0 = disabled) (commonly called "rep pen")
Generally this is set from 1.0 to 1.15 ; the smallest increments are best IE: 1.01... 1.02 or even 1.001... 1.002.
This affects creativity of the model over all , not just how words are penalized.
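A minimal sketch of how "rep pen" and "repeat-last-n" work together, assuming the common llama.cpp-style rule (positive logits are divided by the penalty, negative logits are multiplied by it). The function name and toy values are illustrative, and "-1" is approximated here by using the whole history.

```python
def apply_repeat_penalty(logits, history, repeat_penalty=1.1, repeat_last_n=64):
    """Penalize every token that appears in the last repeat_last_n tokens."""
    if repeat_penalty == 1.0 or repeat_last_n == 0:
        return dict(logits)  # disabled
    # -1 means "whole context"; this sketch treats the full history as the context
    window = history if repeat_last_n < 0 else history[-repeat_last_n:]
    out = dict(logits)
    for tok in set(window):
        if tok in out:
            if out[tok] > 0:
                out[tok] = out[tok] / repeat_penalty
            else:
                out[tok] = out[tok] * repeat_penalty
    return out


toy_logits = {"the": 2.0, "cat": -1.0, "dog": 1.0}
print(apply_repeat_penalty(toy_logits, ["the", "cat"], repeat_penalty=1.25))
```

Because every token seen inside the window is pushed down, a large window combined with a high penalty starts to punish ordinary words the model needs, which is one way to picture why this setting affects overall creativity and not just repeats.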
presence-penalty
repeat alpha presence penalty (default: 0.0, 0.0 = disabled)
Generally leave this at zero IF repeat-last-n is 256 or less. You may want to use this for higher repeat-last-n settings.
CLASS 3: 0.05 may assist generation BUT SET "repeat-last-n" to 512 or less. Better is 128 or 64.
CLASS 4: 0.1 to 0.25 may assist generation BUT SET "repeat-last-n" to 64
frequency-penalty
repeat alpha frequency penalty (default: 0.0, 0.0 = disabled)
Generally leave this at zero IF repeat-last-n is 512 or less. You may want to use this for higher repeat-last-n settings.
CLASS 3: 0.25 may assist generation BUT SET "repeat-last-n" to 512 or less. Better is 128 or 64.
CLASS 4: 0.7 to 0.8 may assist generation BUT SET "repeat-last-n" to 64.
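For comparison with "rep pen", here is a small sketch of presence and frequency penalties, assuming the usual rule: presence subtracts a flat amount once a token has appeared in the window, frequency subtracts an amount per appearance. The function name and toy values are illustrative only.

```python
from collections import Counter


def apply_presence_frequency(logits, history, presence_penalty=0.0,
                             frequency_penalty=0.0, repeat_last_n=64):
    """Subtract presence/frequency penalties for tokens seen in the window."""
    if repeat_last_n == 0:
        return dict(logits)  # no window, nothing to penalize
    window = history if repeat_last_n < 0 else history[-repeat_last_n:]
    counts = Counter(window)
    out = dict(logits)
    for tok, count in counts.items():
        if tok in out:
            out[tok] -= presence_penalty + frequency_penalty * count
    return out


toy_logits = {"a": 1.0, "b": 1.0, "c": 1.0}
history = ["a", "a", "a", "b"]
print(apply_presence_frequency(toy_logits, history,
                               presence_penalty=0.25, frequency_penalty=0.5))
```

The frequency part grows with every repeat inside the window, so a big "repeat-last-n" lets counts pile up quickly; that fits the advice above to keep the window small (64 to 128) when these penalties are turned on.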
penalize-nl
penalize newline tokens (default: false)
Generally this is not used.
SECONDARY SAMPLERS / FILTERS:
tfs
tail free sampling, parameter z (default: 1.0, 1.0 = disabled)
Tries to detect a tail of low-probability tokens in the distribution and removes those tokens. The closer to 0, the more discarded tokens. ( https://www.trentonbricken.com/Tail-Free-Sampling/ )
typical
locally typical sampling, parameter p (default: 1.0, 1.0 = disabled)
If not set to 1, select only tokens that are at least this much more likely to appear than random tokens, given the prior text.
mirostat
use Mirostat sampling. "Top K", "Nucleus", "Tail Free" (TFS) and "Locally Typical" (TYPICAL) samplers are ignored if used. (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)
mirostat-lr
Mirostat learning rate, parameter eta (default: 0.1) " mirostat_eta "
mirostat-ent
Mirostat target entropy, parameter tau (default: 5.0) " mirostat_tau "
Activates the Mirostat sampling technique. It aims to control perplexity during sampling. See the paper. (https://arxiv.org/abs/2007.14966)
mirostat_tau: 5-8 is a good value.
mirostat_eta: 0.1 is a good value.
This is the big one ; activating this will help with creative generation. It can also help with stability.
This is both a sampler (and pruner) and enhancement all in one.
For Class 3 models it is suggested to use this to assist with generation (min settings).
For Class 4 models it is highly recommended, with Mirostat 1 or 2 + mirostat-ent (mirostat_tau) at 6 to 8 and mirostat-lr (mirostat_eta) at .1 to .5
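To give a feel for what tau and eta do, here is a rough sketch of one Mirostat 2.0 step, based on my reading of the paper linked above. It is not the llama.cpp code: the function name is made up, the toy distribution is tiny, and details such as the starting value of "mu" (often 2 * tau) are assumptions.

```python
import math
import random


def mirostat_v2_step(probs, mu, tau=5.0, eta=0.1, rng=None):
    """One sampling step: prune surprising tokens, sample, then adjust mu."""
    rng = rng or random
    # "surprise" of a token in bits: rare tokens are more surprising
    surprise = {tok: -math.log2(p) for tok, p in probs.items() if p > 0}
    kept = {tok: probs[tok] for tok, s in surprise.items() if s <= mu}
    if not kept:
        # always keep at least the most likely token
        best = max(probs, key=probs.get)
        kept = {best: probs[best]}
    total = sum(kept.values())
    tokens = list(kept)
    weights = [kept[tok] / total for tok in tokens]
    choice = rng.choices(tokens, weights=weights, k=1)[0]
    observed = -math.log2(kept[choice] / total)
    # eta (learning rate) controls how fast mu chases the target tau
    new_mu = mu - eta * (observed - tau)
    return choice, new_mu


toy = {"a": 0.5, "b": 0.25, "c": 0.25}
mu = 2 * 5.0  # assumed starting point: twice the target entropy
for _ in range(3):
    token, mu = mirostat_v2_step(toy, mu, tau=5.0, eta=0.1)
    print(token, round(mu, 3))
```

In this picture a higher tau lets more surprising tokens through (more creative, less stable), and a higher eta makes the sampler correct itself faster after each token.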
dynatemp-range
dynamic temperature range (default: 0.0, 0.0 = disabled)
dynatemp-exp
dynamic temperature exponent (default: 1.0)
In: oobabooga/text-generation-webui (has on/off, and high / low) :
Activates Dynamic Temperature. This modifies temperature to range between "dynatemp_low" (minimum) and "dynatemp_high" (maximum), with an entropy-based scaling. The steepness of the curve is controlled by "dynatemp_exponent".
This allows the model to CHANGE temp during generation. This can greatly affect creativity, dialog, and other contrasts.
For Kobold a converter is available and in oobabooga/text-generation-webui you just enter low/high/exp.
Class 4 only: Suggested this is on, with a low/high of .8 to 1.8 (note the range here of "1" between low and high), with the exponent set to 1 (values below or above 1 work too).
To set manually (IE: Api, lmstudio, etc) using "range" and "exp" ; this is a bit more tricky: (example is to set range from .8 to 1.8)
1 - Set the "temp" to 1.3 (the regular temp parameter)
2 - Set the "range" to .500 (this gives you ".8" to "1.8" with "1.3" as the "base")
3 - Set exp to 1 (or as you want).
This is both an enhancement and in some ways fixes issues in a model when too little temp (or too much/too much of the same) affects generation.
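The low/high to "temp" + "range" conversion in the steps above is just a midpoint and a half-width, which the sketch below works out. The second function shows one plausible form of the entropy-based scaling, assuming temperature moves from low to high as the normalized entropy goes from 0 to 1; both function names are illustrative, and the exact formula in a given app may differ.

```python
def low_high_to_temp_range(low, high):
    """Convert a low/high pair into ("temp", "dynatemp-range")."""
    if high < low:
        raise ValueError("high must be greater than or equal to low")
    temp = (low + high) / 2          # the "base" temperature
    dynatemp_range = (high - low) / 2
    return temp, dynatemp_range


def dynamic_temperature(entropy, max_entropy, temp, dynatemp_range, exponent=1.0):
    """Assumed scaling: low temp when the model is sure, high when unsure."""
    low = max(0.0, temp - dynatemp_range)
    high = temp + dynatemp_range
    if max_entropy <= 0:
        return low
    normalized = entropy / max_entropy   # 0 = certain, 1 = fully uncertain
    return low + (high - low) * normalized ** exponent


temp, rng_width = low_high_to_temp_range(0.8, 1.8)
print("temp:", round(temp, 3), "range:", round(rng_width, 3))
for e in (0.0, 0.5, 1.0):
    print(e, round(dynamic_temperature(e, 1.0, temp, rng_width), 3))
```

With an exponent above 1 the temperature stays near the low end until the model is quite unsure; below 1 it climbs toward the high end sooner.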
xtc-probability
xtc probability (default: 0.0, 0.0 = disabled)
Probability that the removal will actually happen. 0 disables the sampler. 1 makes it always happen.
xtc-threshold
xtc threshold (default: 0.1, 1.0 = disabled)
If 2 or more tokens have probability above this threshold, consider removing all but the last one (the least likely of them).
XTC is a new sampler that adds an interesting twist to generation. I suggest you experiment with this one, with other advanced samplers disabled, to see its effects.
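A small sketch of the XTC idea as described above, on a toy distribution. The function name is made up, and the random draw is passed in as an argument (an assumption made here so the behavior is easy to test); a real sampler draws it internally for each token.

```python
def xtc_filter(probs, xtc_probability, xtc_threshold, draw):
    """Remove the top choices, keeping only the least likely token above the threshold.

    draw is a uniform random number in [0, 1); the removal only happens
    when draw < xtc_probability.
    """
    if xtc_probability <= 0.0 or draw >= xtc_probability:
        return dict(probs)
    above = sorted(
        (tok for tok, p in probs.items() if p >= xtc_threshold),
        key=lambda tok: probs[tok],
        reverse=True,
    )
    if len(above) < 2:
        return dict(probs)  # nothing to exclude
    removed = set(above[:-1])  # all but the last (least likely) one
    kept = {tok: p for tok, p in probs.items() if tok not in removed}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


toy = {"the": 0.5, "a": 0.3, "one": 0.15, "zebra": 0.05}
print(xtc_filter(toy, xtc_probability=1.0, xtc_threshold=0.1, draw=0.0))
```

Unlike top-k or min-p, this cuts from the top of the list instead of the bottom, which is why it pushes the model away from its most predictable word choices.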
-l, --logit-bias TOKEN_ID(+/-)BIAS
modifies the likelihood of token appearing in the completion,
i.e. --logit-bias 15043+1
to increase likelihood of token ' Hello',
or --logit-bias 15043-1
to decrease likelihood of token ' Hello'
This may or may not be available. This requires a bit more work.
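To show what a bias like "15043+1" does, here is a minimal sketch that parses the TOKEN_ID(+/-)BIAS form and adds the bias to a toy logit table. Both function names and the toy logits are hypothetical; the token id 15043 is simply the one used in the example above and depends on the model's tokenizer.

```python
import re


def parse_logit_bias(spec):
    """Parse "TOKEN_ID(+/-)BIAS", e.g. "15043+1" -> (15043, 1.0)."""
    match = re.fullmatch(r"(\d+)([+-](?:\d+\.?\d*|\.\d+))", spec)
    if match is None:
        raise ValueError("expected TOKEN_ID(+/-)BIAS, got: " + spec)
    return int(match.group(1)), float(match.group(2))


def apply_logit_bias(logits, specs):
    """Add each bias to the matching token's logit (unknown ids are ignored)."""
    out = dict(logits)
    for spec in specs:
        token_id, bias = parse_logit_bias(spec)
        if token_id in out:
            out[token_id] += bias
    return out


toy_logits = {15043: 0.5, 3186: 0.2}
print(apply_logit_bias(toy_logits, ["15043+1"]))
print(apply_logit_bias(toy_logits, ["15043-1"]))
```

Banning a token can be thought of as the extreme case of a negative bias: push the logit low enough and the token never gets picked.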
IN "oobabooga/text-generation-webui" there is "TOKEN BANNING":
This is a very powerful pruning method; which can drastically alter output generation.
I suggest you get some "bad outputs" ; get the "tokens" (actual number for the "word" / part word) then use this.
Careful testing is required, as this can have unclear side effects.
OTHER:
-s, --seed SEED
RNG seed (default: -1, use random seed for -1)
samplers SAMPLERS
samplers that will be used for generation in the order, separated by ';' (default: top_k;tfs_z;typ_p;top_p;min_p;xtc;temperature)
sampling-seq SEQUENCE
simplified sequence for samplers that will be used (default: kfypmxt)
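The two defaults above line up one letter per sampler, so a short sequence string can be expanded into the long form. The mapping below is inferred only from matching those two defaults position by position, so treat it as an assumption; the helper name is made up.

```python
# Inferred by lining up "kfypmxt" with
# "top_k;tfs_z;typ_p;top_p;min_p;xtc;temperature" (an assumption).
SAMPLER_LETTERS = {
    "k": "top_k",
    "f": "tfs_z",
    "y": "typ_p",
    "p": "top_p",
    "m": "min_p",
    "x": "xtc",
    "t": "temperature",
}


def expand_sampling_seq(sequence):
    """Expand a letter sequence into the ';'-separated sampler list."""
    unknown = [c for c in sequence if c not in SAMPLER_LETTERS]
    if unknown:
        raise ValueError("unknown sampler letter(s): " + "".join(unknown))
    return ";".join(SAMPLER_LETTERS[c] for c in sequence)


print(expand_sampling_seq("kfypmxt"))
print(expand_sampling_seq("kmt"))
```

Order matters: each sampler only sees what the earlier ones left behind, so in the default order temperature is applied last, to the already-pruned candidates.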
ignore-eos
ignore end of stream token and continue generating (implies --logit-bias EOS-inf)
ADVANCED SAMPLERS:
I am not going to touch on all of them ; just the main ones ; for more info see:
https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab
Keep in mind these parameters/samplers become available (for GGUFs) in "oobabooga/text-generation-webui" when you use the llamacpp_HF loader.
What I will touch on here are special settings for CLASS 3 and CLASS 4 models.
For CLASS 3 you can use one or both.
For CLASS 4 using BOTH is strongly recommended, or at minimum "QUADRATIC SAMPLING".
These samplers (along with "penalty" settings) work in conjunction to "wrangle" the model / control it and get it to settle down, important for Class 3 but critical for Class 4 models.
For other classes of models, these advanced samplers can enhance operation across the board.
For Class 3 and Class 4 the goal is to use the LOWEST settings that keep the model in line rather than "over prune it".
You may therefore want to experiment with lowering the settings (SLOWLY) for Class 3/4 models from the values suggested below.
DRY:
Class 3:
dry_multiplier: .8
dry_allowed_length: 2
dry_base: 1
Class 4:
dry_multiplier: .8 to 1.12+
dry_allowed_length: 2 (or less)
dry_base: 1.15 to 1.5
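To see how these three DRY values interact, here is a sketch of the penalty size for a token that would extend a repeated sequence. It assumes the commonly described DRY formula, multiplier * base ^ (match length - allowed length); the function name is made up, and the Class 4 row uses one value picked from the ranges above.

```python
def dry_penalty(match_length, dry_multiplier, dry_base, dry_allowed_length):
    """Penalty subtracted from the logit of a token that would extend a repeat.

    match_length is how many tokens of the current text already match
    an earlier sequence.
    """
    if dry_multiplier <= 0 or match_length < dry_allowed_length:
        return 0.0  # disabled, or the repeat is still short enough to allow
    return dry_multiplier * dry_base ** (match_length - dry_allowed_length)


for length in (1, 2, 4, 8):
    class3 = dry_penalty(length, 0.8, 1.0, 2)
    class4 = dry_penalty(length, 1.0, 1.5, 2)
    print(length, round(class3, 3), round(class4, 3))
```

Under this formula a dry_base of 1 (the Class 3 suggestion) gives a flat penalty no matter how long the repeat gets, while a base of 1.15 to 1.5 (Class 4) makes the penalty grow quickly with the length of the repeated sequence.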
QUADRATIC SAMPLING:
Class 3:
smoothing_factor: 1 to 3
smoothing_curve: 1
Class 4:
smoothing_factor: 3 to 5 (or higher)
smoothing_curve: 1.5 to 2.
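For intuition about smoothing_factor and smoothing_curve, the sketch below reshapes a toy set of logits around the top token. The formula is my understanding of how quadratic sampling is implemented in text-generation-webui and should be treated as an assumption, not a reference; the function name and toy logits are illustrative.

```python
def quadratic_smooth(logits, smoothing_factor, smoothing_curve=1.0):
    """Reshape logits by their distance from the top logit (assumed formula)."""
    if smoothing_factor <= 0:
        return list(logits)  # disabled
    peak = max(logits)
    k = (3 - smoothing_curve) / 2   # weight of the quadratic term
    s = (smoothing_curve - 1) / 2   # weight of the cubic term (0 when curve = 1)
    out = []
    for x in logits:
        diff = x - peak  # always <= 0
        out.append(peak
                   - k * smoothing_factor * diff ** 2
                   + s * smoothing_factor * diff ** 3)
    return out


toy_logits = [2.0, 1.0, 0.0]
for factor in (1, 3, 5):
    print(factor, quadratic_smooth(toy_logits, factor, smoothing_curve=1.0))
```

The top token's logit never moves; tokens further from it are pushed down harder as smoothing_factor rises, which matches the suggestion above that Class 4 models need higher values to settle down.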
IMPORTANT:
Keep in mind that these settings/samplers work in conjunction with "penalties" ; which is especially important for operation of CLASS 4 models for chat / role play and/or "smoother operation".
For Class 3 models, "QUADRATIC" will have a stronger effect than "DRY" relatively speaking.
If you use Mirostat, keep in mind this will interact with these two advanced samplers too.
Finally:
Smaller quants may require STRONGER settings (all classes of models) due to compression damage, especially for Q2K, and IQ1/IQ2s.
This is also influenced by the parameter size of the model in relation to the quant size.
IE: an 8B model at Q2K will be far more unstable relative to a 20B model at Q2K, and as a result will require stronger settings.