---
license: llama3
base_model: meta-llama/Meta-Llama-3-8B-Instruct
---

# 32K GGUF config of LLAMA3 8B with imatrix custom edge-quants included

> [!TIP]
> Set the context size with ***-c 32000*** in llama.cpp to take advantage of the 32K window when you run it.

## Run the model in interactive mode with a long prompt inside a text file with -f

```bash
./main -m llama3ins-8b-32k-q4ns.gguf --temp 0.3 --color -f ../prompt19k.txt -ngl 33 -n 2000 -i -c 32000
```

## Prompt format

```verilog
<|im_start|>system{You are a hyperintelligent hilarious raccoon that solves everything via first-principles based reasoning.}<|im_end|>
<|im_start|>user{How to build a city on Mars via Aldrin cycler orbits DUMP THE BIG LONG PROMPT HERE.}
<|im_end|>assistant
```

## Perplexity Benchmarks

```verilog
./perplexity -m ../llama3ins-8b-32k-f16.gguf -ngl 99 -f wiki.test.raw --chunks 16
perplexity: 2.10 seconds per pass - ETA 0.13 minutes
[1]6.1736,[2]6.8769,[3]7.4226,[4]8.0199,[5]8.4531,[6]8.7808,[7]9.3213,[8]10.0461,[9]10.7468,[10]11.0909,[11]11.2691,[12]11.4318,[13]11.9160,[14]11.4038,[15]11.2641,[16]10.9073,
Final estimate: PPL = 10.9073 +/- 0.50026

./perplexity -m ../llama3ins-8b-32k-q8.gguf -ngl 99 -f wiki.test.raw --chunks 16   YES, 8-BIT IS BETTER THAN THE BF16 -> F16 CONVERSION
perplexity: 2.38 seconds per pass - ETA 0.15 minutes
[1]6.1454,[2]6.8672,[3]7.4109,[4]8.0148,[5]8.4472,[6]8.7771,[7]9.3182,[8]10.0466,[9]10.7509,[10]11.0836,[11]11.2563,[12]11.4218,[13]11.9095,[14]11.4000,[15]11.2587,[16]10.9028,
Final estimate: PPL = 10.9028 +/- 0.49958

./perplexity -m ../llama3ins-8b-32k-q6.gguf -ngl 99 -f wiki.test.raw --chunks 16
perplexity: 2.36 seconds per pass - ETA 0.15 minutes
[1]6.0654,[2]6.7806,[3]7.3319,[4]7.9600,[5]8.3961,[6]8.7512,[7]9.2932,[8]10.0314,[9]10.7402,[10]11.0786,[11]11.2597,[12]11.4410,[13]11.9342,[14]11.4223,[15]11.2818,[16]10.9354,
Final estimate: PPL = 10.9354 +/- 0.50190

./perplexity -m ../llama3ins-8b-32k-q5km.gguf -ngl 99 -f wiki.test.raw --chunks 16
perplexity: 2.40 seconds per pass - ETA 0.15 minutes
[1]6.0044,[2]6.8263,[3]7.3989,[4]8.0044,[5]8.4508,[6]8.7716,[7]9.3220,[8]10.0606,[9]10.7709,[10]11.1098,[11]11.2956,[12]11.4743,[13]11.9661,[14]11.4569,[15]11.3028,[16]10.9474,
Final estimate: PPL = 10.9474 +/- 0.50185

./perplexity -m ../llama3ins-8b-32k-q4ns.gguf -ngl 99 -f wiki.test.raw --chunks 16
perplexity: 2.40 seconds per pass - ETA 0.15 minutes
[1]6.5618,[2]7.1233,[3]7.5647,[4]8.1198,[5]8.5365,[6]8.8386,[7]9.4233,[8]10.1359,[9]10.8601,[10]11.1981,[11]11.3705,[12]11.5619,[13]12.0492,[14]11.5287,[15]11.3823,[16]11.0269,
Final estimate: PPL = 11.0269 +/- 0.50623

IQ4_XS - NON-IMATRIX, FOR REFERENCE - quite a bit worse than my imatrix quant
perplexity: 7.41 seconds per pass - ETA 0.48 minutes
[1]6.9103,[2]7.4907,[3]7.9577,[4]8.3949,[5]8.8029,[6]9.0275,[7]9.6252,[8]10.2914,[9]10.9833,[10]11.3498,[11]11.5059,[12]11.7275,[13]12.1804,[14]11.6848,[15]11.5226,[16]11.1761,
Final estimate: PPL = 11.1761 +/- 0.51803

./perplexity -m ../llama3ins-8b-32k-q3ns.gguf -ngl 99 -f wiki.test.raw --chunks 16
perplexity: 2.43 seconds per pass - ETA 0.15 minutes
[1]6.6955,[2]7.2732,[3]7.9483,[4]8.5310,[5]9.0020,[6]9.3664,[7]9.9324,[8]10.7019,[9]11.4163,[10]11.6981,[11]11.8420,[12]12.1191,[13]12.6709,[14]12.1222,[15]11.9778,[16]11.5624,
Final estimate: PPL = 11.5624 +/- 0.53444

./perplexity -m ../llama3ins-8b-32k-q2ns.gguf -ngl 99 -f wiki.test.raw --chunks 16   SURPRISINGLY USABLE
perplexity: 2.48 seconds per pass - ETA 0.15 minutes
[1]7.0861,[2]7.8057,[3]8.5360,[4]9.1910,[5]9.6240,[6]10.0848,[7]10.7928,[8]11.4729,[9]12.3032,[10]12.5115,[11]12.7422,[12]13.1224,[13]13.7716,[14]13.1772,[15]13.0020,[16]12.5578,
Final estimate: PPL = 12.5578 +/- 0.57323

./perplexity -m ../llama3ins-8b-32k-q1ns.gguf -ngl 99 -f wiki.test.raw --chunks 16   ONE BIT TURNS TO JUNK
perplexity: 2.41 seconds per pass - ETA 0.15 minutes
[1]15.1640,[2]16.2585,[3]17.8912,[4]18.2226,[5]18.4974,[6]19.2407,[7]20.0085,[8]21.6465,[9]22.7656,[10]22.7903,[11]23.2208,[12]24.2318,[13]25.7172,[14]24.5111,[15]23.8096,[16]22.7933,
Final estimate: PPL = 22.7933 +/- 1.05192
```
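
For quick comparison, here are the final estimates from the runs above collected into one table (same wiki.test.raw, 16 chunks):

| Quant | Final PPL estimate |
| --- | --- |
| f16 | 10.9073 +/- 0.50026 |
| q8_0 | 10.9028 +/- 0.49958 |
| q6 | 10.9354 +/- 0.50190 |
| q5km | 10.9474 +/- 0.50185 |
| q4ns (imatrix) | 11.0269 +/- 0.50623 |
| IQ4_XS (no imatrix) | 11.1761 +/- 0.51803 |
| q3ns | 11.5624 +/- 0.53444 |
| q2ns | 12.5578 +/- 0.57323 |
| q1ns (1.75-bit) | 22.7933 +/- 1.05192 |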

> [!TIP]
> Yes, 8-bit q8_0 is slightly better than f16 because the bf16 -> f16 conversion is lossy (f16 cannot cover bf16's full exponent range).
> The ns quants are custom nisten quants and work well down to 2-bit.
> The 1.75-bit quant is included for reference, but its perplexity tanks and the output is incoherent.
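
A minimal PyTorch sketch of why that bf16 -> f16 step is lossy (just for illustration, assuming PyTorch is available): f16 actually has more mantissa bits than bf16, but a much narrower exponent range, so very large or very small weights get clipped.

```python
import torch

# bf16 keeps fp32's exponent range (roughly 1e-38 .. 3e38);
# f16 only covers roughly 6e-8 .. 65504.
w = torch.tensor([3.0e5, 1.0e-8, 0.1], dtype=torch.bfloat16)

print(w.to(torch.float16))  # the large value overflows to inf, the tiny one flushes to 0
print(w)                    # bf16 represents both without clipping
```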