brucethemoose committed
Commit 6873837 · Parent(s): add2ac4
Update README.md

README.md CHANGED

@@ -51,6 +51,68 @@ I am a huge fan of Kalomaze's quadratic sampling (shown as "smoothing factor" wh
Otherwise, I recommend a lower temperature, a MinP of 0.1 or higher, a little repetition penalty, mirostat with a low tau, and no other samplers. See the explanation here: https://github.com/ggerganov/llama.cpp/pull/3841
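
For reference, a minimal sketch of those settings with llama-cpp-python (the GGUF filename and the exact values are placeholders, not a tested preset):

```python
from llama_cpp import Llama

# Placeholder GGUF path; use your own quantization of the model.
llm = Llama(model_path="yi-34b-200k-q4_k_m.gguf", n_ctx=32768)

out = llm(
    "Write the opening scene of a mystery novel.",
    max_tokens=400,
    temperature=0.8,       # a lower temperature
    min_p=0.1,             # MinP of 0.1 or higher
    repeat_penalty=1.05,   # a little repetition penalty
    mirostat_mode=2,       # mirostat v2...
    mirostat_tau=3.0,      # ...with a low tau
    mirostat_eta=0.1,
    top_p=1.0,             # leave the other samplers disabled
    top_k=0,
)
print(out["choices"][0]["text"])
```
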
@MarinaraSpaghetti has extensively tested the model and recommends the following settings, which seem to work quite well:

```
{
    "temp": 1,
    "temperature_last": true,
    "top_p": 1,
    "top_k": 0,
    "top_a": 0,
    "tfs": 1,
    "epsilon_cutoff": 0,
    "eta_cutoff": 0,
    "typical_p": 0.9,
    "min_p": 0,
    "rep_pen": 1.1,
    "rep_pen_range": 19456,
    "no_repeat_ngram_size": 0,
    "penalty_alpha": 0,
    "num_beams": 1,
    "length_penalty": 0,
    "min_length": 0,
    "encoder_rep_pen": 1,
    "freq_pen": 0,
    "presence_pen": 0,
    "do_sample": true,
    "early_stopping": false,
    "dynatemp": false,
    "min_temp": 1,
    "max_temp": 2,
    "dynatemp_exponent": 1,
    "smoothing_factor": 0.33,
    "add_bos_token": false,
    "truncation_length": 2048,
    "ban_eos_token": false,
    "skip_special_tokens": true,
    "streaming": true,
    "mirostat_mode": 0,
    "mirostat_tau": 5,
    "mirostat_eta": 0.1,
    "guidance_scale": 1,
    "negative_prompt": "",
    "grammar_string": "",
    "banned_tokens": "",
    "ignore_eos_token_aphrodite": false,
    "spaces_between_special_tokens_aphrodite": true,
    "sampler_order": [
        6,
        0,
        1,
        3,
        4,
        2,
        5
    ],
    "logit_bias": [],
    "n": 1,
    "rep_pen_size": 0,
    "genamt": 400,
    "max_length": 38912
}
```

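If your frontend does not read this preset format directly, the load-bearing fields map onto common generation parameters. Here is a rough sketch with transformers' `GenerationConfig` (the file name and field mapping are my own assumptions, and backend-specific keys like `smoothing_factor`, `rep_pen_range`, and `sampler_order` need a backend that actually implements them):

```python
import json

from transformers import GenerationConfig

# Assumes the block above was saved as a local JSON file.
with open("marinara_preset.json") as f:
    preset = json.load(f)

gen_config = GenerationConfig(
    do_sample=preset["do_sample"],
    temperature=preset["temp"],
    top_p=preset["top_p"],
    top_k=preset["top_k"],              # 0 disables top-k
    typical_p=preset["typical_p"],
    repetition_penalty=preset["rep_pen"],
    max_new_tokens=preset["genamt"],
)
# Pass this to model.generate(..., generation_config=gen_config).
```
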
24GB GPUs can efficiently run Yi-34B-200K models at **40K-90K context** with exllamav2 and performant UIs like [exui](https://github.com/turboderp/exui). I go into more detail in this [post](https://old.reddit.com/r/LocalLLaMA/comments/1896igc/how_i_run_34b_models_at_75k_context_on_24gb_fast/). An otherwise-empty 16GB GPU can still run the high context with aggressive quantization.

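A rough sketch of the exllamav2 side, based on its example scripts (the model directory, context length, and sampler values are placeholders, and exact class and argument names vary between exllamav2 versions):

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_8bit, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/Yi-34B-200K-exl2-4.0bpw"  # placeholder path to an exl2 quant
config.prepare()
config.max_seq_len = 65536  # ~64K context; size this to your VRAM

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_8bit(model, lazy=True)  # 8-bit KV cache keeps long context affordable
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.9
settings.min_p = 0.1

print(generator.generate_simple("Once upon a time,", settings, 200))
```
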
To load/train this in full-context backends like transformers, you *must* change `max_position_embeddings` in config.json to a value lower than 200,000, otherwise you will OOM! I do not recommend running high context without a context-efficient backend that supports flash attention + an 8-bit KV cache, like exllamav2, litellm, vllm, or unsloth.
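
With transformers you can also override it at load time instead of editing config.json on disk; a minimal sketch (the 32K value and model path are placeholders, size them to your hardware):

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "path/to/this-model"  # placeholder; point at your local copy or the repo name

config = AutoConfig.from_pretrained(model_id)
config.max_position_embeddings = 32768  # well below 200,000 to avoid OOM

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2",  # only if flash-attn is installed
)
```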