nisten committed e89f76b (verified, parent 430cf77): Update README.md

---
license: llama3
base_model: meta-llama/Meta-Llama-3-8B-Instruct
---

# 32K GGUF config of LLAMA3 8B with imatrix custom edge-quants included

> [!TIP]
> You have to set the context with ***-c 32000*** in llama.cpp to take advantage of this when you run it.

## Run the model in interactive mode with a long prompt read from a textfile via -f
```bash
./main -m llama3ins-8b-32k-q4ns.gguf --temp 0.3 --color -f ../prompt19k.txt -ngl 33 -n 2000 -i -c 32000
```

## Prompt format

```verilog
<|im_start|>system{You are a hyperintelligent hilarious raccoon that solves everything via first-principles based reasoning.}<|im_end|>
<|im_start|>user{How to build a city on mars via aldrin cycler orbits DUMP THE BIG LONG PROMPT HERE.}
<|im_end|>assistant
```
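
For long inputs, a prompt in this format can be dumped into a textfile and passed with `-f` as shown above; a minimal sketch (the filename and prompt contents here are placeholders, not part of the original recipe):

```bash
# Write a chat-formatted prompt to a textfile (format as above; contents are placeholders).
cat > prompt.txt <<'EOF'
<|im_start|>system{You are a hyperintelligent hilarious raccoon that solves everything via first-principles based reasoning.}<|im_end|>
<|im_start|>user{How to build a city on mars via aldrin cycler orbits}
<|im_end|>assistant
EOF
# Then run interactively against it, e.g.:
#   ./main -m llama3ins-8b-32k-q4ns.gguf --temp 0.3 --color -f prompt.txt -ngl 33 -n 2000 -i -c 32000
```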

## Perplexity Benchmarks

```verilog
./perplexity -m ../llama3ins-8b-32k-f16.gguf -ngl 99 -f wiki.test.raw --chunks 16
perplexity: 2.10 seconds per pass - ETA 0.13 minutes
[1]6.1736,[2]6.8769,[3]7.4226,[4]8.0199,[5]8.4531,[6]8.7808,[7]9.3213,[8]10.0461,[9]10.7468,[10]11.0909,[11]11.2691,[12]11.4318,[13]11.9160,[14]11.4038,[15]11.2641,[16]10.9073,
Final estimate: PPL = 10.9073 +/- 0.50026

./perplexity -m ../llama3ins-8b-32k-q8.gguf -ngl 99 -f wiki.test.raw --chunks 16 YES 8BIT IS BETTER THAN BF16 - F16 conversion
perplexity: 2.38 seconds per pass - ETA 0.15 minutes
[1]6.1454,[2]6.8672,[3]7.4109,[4]8.0148,[5]8.4472,[6]8.7771,[7]9.3182,[8]10.0466,[9]10.7509,[10]11.0836,[11]11.2563,[12]11.4218,[13]11.9095,[14]11.4000,[15]11.2587,[16]10.9028,
Final estimate: PPL = 10.9028 +/- 0.49958

./perplexity -m ../llama3ins-8b-32k-q6.gguf -ngl 99 -f wiki.test.raw --chunks 16
perplexity: 2.36 seconds per pass - ETA 0.15 minutes
[1]6.0654,[2]6.7806,[3]7.3319,[4]7.9600,[5]8.3961,[6]8.7512,[7]9.2932,[8]10.0314,[9]10.7402,[10]11.0786,[11]11.2597,[12]11.4410,[13]11.9342,[14]11.4223,[15]11.2818,[16]10.9354,
Final estimate: PPL = 10.9354 +/- 0.50190

./perplexity -m ../llama3ins-8b-32k-q5km.gguf -ngl 99 -f wiki.test.raw --chunks 16
perplexity: 2.40 seconds per pass - ETA 0.15 minutes
[1]6.0044,[2]6.8263,[3]7.3989,[4]8.0044,[5]8.4508,[6]8.7716,[7]9.3220,[8]10.0606,[9]10.7709,[10]11.1098,[11]11.2956,[12]11.4743,[13]11.9661,[14]11.4569,[15]11.3028,[16]10.9474,
Final estimate: PPL = 10.9474 +/- 0.50185

./perplexity -m ../llama3ins-8b-32k-q4ns.gguf -ngl 99 -f wiki.test.raw --chunks 16
perplexity: 2.40 seconds per pass - ETA 0.15 minutes
[1]6.5618,[2]7.1233,[3]7.5647,[4]8.1198,[5]8.5365,[6]8.8386,[7]9.4233,[8]10.1359,[9]10.8601,[10]11.1981,[11]11.3705,[12]11.5619,[13]12.0492,[14]11.5287,[15]11.3823,[16]11.0269,
Final estimate: PPL = 11.0269 +/- 0.50623

IQ4_XS - NON IMATRIX FOR REFERENCE is quite a bit worse than my imat one
perplexity: 7.41 seconds per pass - ETA 0.48 minutes
[1]6.9103,[2]7.4907,[3]7.9577,[4]8.3949,[5]8.8029,[6]9.0275,[7]9.6252,[8]10.2914,[9]10.9833,[10]11.3498,[11]11.5059,[12]11.7275,[13]12.1804,[14]11.6848,[15]11.5226,[16]11.1761,
Final estimate: PPL = 11.1761 +/- 0.51803

./perplexity -m ../llama3ins-8b-32k-q3ns.gguf -ngl 99 -f wiki.test.raw --chunks 16
perplexity: 2.43 seconds per pass - ETA 0.15 minutes
[1]6.6955,[2]7.2732,[3]7.9483,[4]8.5310,[5]9.0020,[6]9.3664,[7]9.9324,[8]10.7019,[9]11.4163,[10]11.6981,[11]11.8420,[12]12.1191,[13]12.6709,[14]12.1222,[15]11.9778,[16]11.5624,
Final estimate: PPL = 11.5624 +/- 0.53444

./perplexity -m ../llama3ins-8b-32k-q2ns.gguf -ngl 99 -f wiki.test.raw --chunks 16 SURPRISINGLY USABLE
perplexity: 2.48 seconds per pass - ETA 0.15 minutes
[1]7.0861,[2]7.8057,[3]8.5360,[4]9.1910,[5]9.6240,[6]10.0848,[7]10.7928,[8]11.4729,[9]12.3032,[10]12.5115,[11]12.7422,[12]13.1224,[13]13.7716,[14]13.1772,[15]13.0020,[16]12.5578,
Final estimate: PPL = 12.5578 +/- 0.57323

./perplexity -m ../llama3ins-8b-32k-q1ns.gguf -ngl 99 -f wiki.test.raw --chunks 16 ONE BIT TURNS TO JUNK
perplexity: 2.41 seconds per pass - ETA 0.15 minutes
[1]15.1640,[2]16.2585,[3]17.8912,[4]18.2226,[5]18.4974,[6]19.2407,[7]20.0085,[8]21.6465,[9]22.7656,[10]22.7903,[11]23.2208,[12]24.2318,[13]25.7172,[14]24.5111,[15]23.8096,[16]22.7933,
Final estimate: PPL = 22.7933 +/- 1.05192
```
> [!TIP]
> Yes, 8-bit q8_0 is slightly better than f16 because the bf16 to f16 conversion is lossy (f16 cannot represent bf16's full range, so precision is lost on the way down).
> The ns quants are custom nisten quants and work well down to 2-bit.
> The 1.75-bit quant is included for reference; however, perplexity tanks and output is incoherent.
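
For reference, the general llama.cpp workflow for imatrix-based quants looks like the sketch below; the calibration file name and the IQ4_NL target type are illustrative assumptions, not the exact recipe used for the ns quants here:

```bash
# Build an importance matrix from calibration text, then quantize with it.
# calibration.txt and IQ4_NL are assumptions for illustration.
./imatrix -m llama3ins-8b-32k-f16.gguf -f calibration.txt -o imatrix.dat -ngl 99
./quantize --imatrix imatrix.dat llama3ins-8b-32k-f16.gguf llama3ins-8b-32k-q4ns.gguf IQ4_NL
```

Non-imatrix quants of the same type (like the plain IQ4_XS run in the benchmarks above) skip the first step, which is where the perplexity gap comes from.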