Commit ac4b397
Parent(s): 51a7be4
Update README.md
README.md CHANGED

@@ -16,8 +16,29 @@ A merge of [**Nous-Capybara-34B**](https://huggingface.co/NousResearch/Nous-Capy
 
 > https://github.com/cg123/mergekit/tree/dare
 
+***
+## Prompt template: Orca-Vicuna
+```
+SYSTEM: {system_message}
+USER: {prompt}
+ASSISTANT:
+```
+It may also recognize ChatML, or possibly the Llama-chat format used by Airoboros.
+
+Sometimes the model "spells out" the stop token as `</s>` like Capybara, so you may need to add `</s>` as an additional stopping condition.
+***
+## Running
+As this is a Yi model, try a lower temperature with 0.05-0.1 MinP, a little repetition penalty, and no other samplers. Yi tends to run "hot" by default.
+
+24GB GPUs can run Yi-34B-200K models at **45K-75K context** with exllamav2, and performant UIs like [exui](https://github.com/turboderp/exui). I go into more detail in this [post](https://old.reddit.com/r/LocalLLaMA/comments/1896igc/how_i_run_34b_models_at_75k_context_on_24gb_fast/).
+
+I recommend exl2 quantizations profiled on data similar to the desired task. The model is especially sensitive to the quantization data at low bpw.
+
+To load this in full-context backends like transformers and vllm, you *must* change `max_position_embeddings` in config.json to a value lower than 200,000, otherwise you will OOM!
+***
+## Testing Notes
 
-Merged
+Merged in mergekit with the following config, using the tokenizer from chargoddard's Yi-Llama:
 
 ```
 models:
@@ -65,27 +86,6 @@ parameters:
 int8_mask: true
 dtype: bfloat16
 ```
-***
-## Prompt template: Orca-Vicuna
-```
-SYSTEM: {system_message}
-USER: {prompt}
-ASSISTANT:
-```
-It may also recognize ChatML, or possibly the Llama-chat format used by Airoboros.
-
-Sometimes the model "spells out" the stop token as `</s>` like Capybara, so you may need to add `</s>` as an additional stopping condition.
-***
-## Running
-As this is a Yi model, try a lower temperature with 0.05-0.1 MinP, a little repetition penalty, and no other samplers. Yi tends to run "hot" by default.
-
-24GB GPUs can run Yi-34B-200K models at **45K-75K context** with exllamav2, and performant UIs like [exui](https://github.com/turboderp/exui). I go into more detail in this [post](https://old.reddit.com/r/LocalLLaMA/comments/1896igc/how_i_run_34b_models_at_75k_context_on_24gb_fast/).
-
-I recommend exl2 quantizations profiled on data similar to the desired task. The model is especially sensitive to the quantization data at low bpw.
-
-To load this in full-context backends like transformers and vllm, you *must* change `max_position_embeddings` in config.json to a value lower than 200,000, otherwise you will OOM!
-***
-## Testing Notes
 
 Various densities were tested with perplexity measurements and long-context prompts. Relatively high densities seem to perform better, contrary to the findings of the Super Mario paper.
 
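***

The Orca-Vicuna template and the `</s>` note above translate directly into a prompt builder plus an extra stop string. A minimal sketch in plain Python; the helper name and the single-newline layout between turns are assumptions, not something the README pins down:

```python
# Build an Orca-Vicuna prompt and collect stop conditions: per the README, the
# model sometimes emits the literal string "</s>" instead of the EOS token, so
# "</s>" is added as an extra stop string alongside the backend's real EOS.

def format_orca_vicuna(system_message: str, prompt: str) -> str:
    """Assemble a single-turn Orca-Vicuna prompt (newline layout assumed)."""
    return f"SYSTEM: {system_message}\nUSER: {prompt}\nASSISTANT:"

# Extra stop strings to pass to whatever backend or UI runs the generation.
EXTRA_STOP_STRINGS = ["</s>"]

if __name__ == "__main__":
    text = format_orca_vicuna(
        "You are a concise assistant.",
        "Summarize this merge in one sentence.",
    )
    print(text)
```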
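For the sampler advice under "Running", a hedged preset using Hugging Face `GenerationConfig`; the exact numbers are illustrative picks inside the ranges the README suggests, and `min_p` support assumes a reasonably recent transformers release:

```python
from transformers import GenerationConfig

# Illustrative "low temperature + MinP + light repetition penalty" preset.
# The values are assumptions within the README's guidance, not tuned settings.
gen_config = GenerationConfig(
    do_sample=True,
    temperature=0.8,          # run cooler than default; Yi tends to run "hot"
    min_p=0.08,               # inside the suggested 0.05-0.1 MinP range
    repetition_penalty=1.05,  # "a little repetition penalty"
    top_p=1.0,                # leave nucleus sampling effectively off
    top_k=0,                  # disable top-k, i.e. "no other samplers"
)

# Then: model.generate(**inputs, generation_config=gen_config)
```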
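For the 24GB / 45K-75K context point, a sketch of loading an exl2 quant with exllamav2 at a reduced window. The calls are written from memory of exllamav2's example scripts and the model path is hypothetical, so treat this as a starting point rather than a verified recipe:

```python
from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Cache_8bit,  # 8-bit KV cache stretches context on 24GB cards
    ExLlamaV2Config,
    ExLlamaV2Tokenizer,
)
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/models/yi-34b-200k-dare-exl2"  # hypothetical local path
config.prepare()
config.max_seq_len = 49152  # assumed value inside the 45K-75K band

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_8bit(model, lazy=True)
model.load_autosplit(cache)  # split weights across available GPU memory
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8   # same illustrative sampler choices as above
settings.min_p = 0.08

prompt = "SYSTEM: You are a concise assistant.\nUSER: Hello.\nASSISTANT:"
print(generator.generate_simple(prompt, settings, 200))
```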
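The `max_position_embeddings` warning amounts to a one-line change to the model's `config.json`. A minimal sketch with the standard `json` module; the path and the 32768 target are arbitrary examples, not values from the README:

```python
import json
from pathlib import Path

cfg_path = Path("/models/yi-34b-200k-dare/config.json")  # hypothetical path

cfg = json.loads(cfg_path.read_text())
cfg["max_position_embeddings"] = 32768  # was 200000; pick what fits your VRAM
cfg_path.write_text(json.dumps(cfg, indent=2))
```

With vLLM, capping the context at launch (for example via its `--max-model-len` option) should achieve the same effect without editing the file.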