---
license: llama3
base_model: meta-llama/Meta-Llama-3-8B-Instruct
---

# 32K GGUF config of LLAMA3 8B with imatrix custom edge-quants included

> [!TIP]
> Set the context size with ***-c 32000*** in llama.cpp to take advantage of the 32K window when you run it.

## Run the model in interactive mode with a long prompt inside a text file with -f

```bash
./main -m llama3ins-8b-32k-q4ns.gguf --temp 0.3 --color -f ../prompt19k.txt -ngl 33 -n 2000 -i -c 32000
```

## Prompt format

```verilog
<|im_start|>system{You are a hyperintelligent hilarious raccoon that solves everything via first-principles based reasoning.}<|im_end|>
<|im_start|>user{How to build a city on Mars via Aldrin cycler orbits DUMP THE BIG LONG PROMPT HERE.}
<|im_end|>assistant
```

## Perplexity Benchmarks

```verilog
./perplexity -m ../llama3ins-8b-32k-f16.gguf -ngl 99 -f wiki.test.raw --chunks 16
perplexity: 2.10 seconds per pass - ETA 0.13 minutes
[1]6.1736,[2]6.8769,[3]7.4226,[4]8.0199,[5]8.4531,[6]8.7808,[7]9.3213,[8]10.0461,[9]10.7468,[10]11.0909,[11]11.2691,[12]11.4318,[13]11.9160,[14]11.4038,[15]11.2641,[16]10.9073,
Final estimate: PPL = 10.9073 +/- 0.50026

./perplexity -m ../llama3ins-8b-32k-q8.gguf -ngl 99 -f wiki.test.raw --chunks 16   YES, 8-BIT IS BETTER THAN THE BF16 -> F16 CONVERSION
perplexity: 2.38 seconds per pass - ETA 0.15 minutes
[1]6.1454,[2]6.8672,[3]7.4109,[4]8.0148,[5]8.4472,[6]8.7771,[7]9.3182,[8]10.0466,[9]10.7509,[10]11.0836,[11]11.2563,[12]11.4218,[13]11.9095,[14]11.4000,[15]11.2587,[16]10.9028,
Final estimate: PPL = 10.9028 +/- 0.49958

./perplexity -m ../llama3ins-8b-32k-q6.gguf -ngl 99 -f wiki.test.raw --chunks 16
perplexity: 2.36 seconds per pass - ETA 0.15 minutes
[1]6.0654,[2]6.7806,[3]7.3319,[4]7.9600,[5]8.3961,[6]8.7512,[7]9.2932,[8]10.0314,[9]10.7402,[10]11.0786,[11]11.2597,[12]11.4410,[13]11.9342,[14]11.4223,[15]11.2818,[16]10.9354,
Final estimate: PPL = 10.9354 +/- 0.50190

./perplexity -m ../llama3ins-8b-32k-q5km.gguf -ngl 99 -f wiki.test.raw --chunks 16
perplexity: 2.40 seconds per pass - ETA 0.15 minutes
[1]6.0044,[2]6.8263,[3]7.3989,[4]8.0044,[5]8.4508,[6]8.7716,[7]9.3220,[8]10.0606,[9]10.7709,[10]11.1098,[11]11.2956,[12]11.4743,[13]11.9661,[14]11.4569,[15]11.3028,[16]10.9474,
Final estimate: PPL = 10.9474 +/- 0.50185

./perplexity -m ../llama3ins-8b-32k-q4ns.gguf -ngl 99 -f wiki.test.raw --chunks 16
perplexity: 2.40 seconds per pass - ETA 0.15 minutes
[1]6.5618,[2]7.1233,[3]7.5647,[4]8.1198,[5]8.5365,[6]8.8386,[7]9.4233,[8]10.1359,[9]10.8601,[10]11.1981,[11]11.3705,[12]11.5619,[13]12.0492,[14]11.5287,[15]11.3823,[16]11.0269,
Final estimate: PPL = 11.0269 +/- 0.50623

IQ4_XS - NON-IMATRIX, FOR REFERENCE - quite a bit worse than my imatrix quant
perplexity: 7.41 seconds per pass - ETA 0.48 minutes
[1]6.9103,[2]7.4907,[3]7.9577,[4]8.3949,[5]8.8029,[6]9.0275,[7]9.6252,[8]10.2914,[9]10.9833,[10]11.3498,[11]11.5059,[12]11.7275,[13]12.1804,[14]11.6848,[15]11.5226,[16]11.1761,
Final estimate: PPL = 11.1761 +/- 0.51803

./perplexity -m ../llama3ins-8b-32k-q3ns.gguf -ngl 99 -f wiki.test.raw --chunks 16
perplexity: 2.43 seconds per pass - ETA 0.15 minutes
[1]6.6955,[2]7.2732,[3]7.9483,[4]8.5310,[5]9.0020,[6]9.3664,[7]9.9324,[8]10.7019,[9]11.4163,[10]11.6981,[11]11.8420,[12]12.1191,[13]12.6709,[14]12.1222,[15]11.9778,[16]11.5624,
Final estimate: PPL = 11.5624 +/- 0.53444

./perplexity -m ../llama3ins-8b-32k-q2ns.gguf -ngl 99 -f wiki.test.raw --chunks 16   SURPRISINGLY USABLE
perplexity: 2.48 seconds per pass - ETA 0.15 minutes
[1]7.0861,[2]7.8057,[3]8.5360,[4]9.1910,[5]9.6240,[6]10.0848,[7]10.7928,[8]11.4729,[9]12.3032,[10]12.5115,[11]12.7422,[12]13.1224,[13]13.7716,[14]13.1772,[15]13.0020,[16]12.5578,
Final estimate: PPL = 12.5578 +/- 0.57323

./perplexity -m ../llama3ins-8b-32k-q1ns.gguf -ngl 99 -f wiki.test.raw --chunks 16   ONE BIT TURNS TO JUNK
perplexity: 2.41 seconds per pass - ETA 0.15 minutes
[1]15.1640,[2]16.2585,[3]17.8912,[4]18.2226,[5]18.4974,[6]19.2407,[7]20.0085,[8]21.6465,[9]22.7656,[10]22.7903,[11]23.2208,[12]24.2318,[13]25.7172,[14]24.5111,[15]23.8096,[16]22.7933,
Final estimate: PPL = 22.7933 +/- 1.05192
```
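
For quick comparison, here are the final estimates from the runs above collected into one table (same wiki.test.raw, 16 chunks):

| Quant | Final PPL estimate |
| --- | --- |
| f16 | 10.9073 +/- 0.50026 |
| q8_0 | 10.9028 +/- 0.49958 |
| q6 | 10.9354 +/- 0.50190 |
| q5km | 10.9474 +/- 0.50185 |
| q4ns (imatrix) | 11.0269 +/- 0.50623 |
| IQ4_XS (no imatrix) | 11.1761 +/- 0.51803 |
| q3ns | 11.5624 +/- 0.53444 |
| q2ns | 12.5578 +/- 0.57323 |
| q1ns (1.75-bit) | 22.7933 +/- 1.05192 |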

> [!TIP]
> Yes, 8-bit q8_0 is slightly better than f16 because the bf16 -> f16 conversion is lossy (f16 cannot cover bf16's full exponent range).
> The ns quants are custom nisten quants and work well down to 2-bit.
> The 1.75-bit quant is included for reference, but its perplexity tanks and the output is incoherent.
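
A minimal PyTorch sketch of why that bf16 -> f16 step is lossy (just for illustration, assuming PyTorch is available): f16 actually has more mantissa bits than bf16, but a much narrower exponent range, so very large or very small weights get clipped.

```python
import torch

# bf16 keeps fp32's exponent range (roughly 1e-38 .. 3e38);
# f16 only covers roughly 6e-8 .. 65504.
w = torch.tensor([3.0e5, 1.0e-8, 0.1], dtype=torch.bfloat16)

print(w.to(torch.float16))  # the large value overflows to inf, the tiny one flushes to 0
print(w)                    # bf16 represents both without clipping
```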