---
license: apache-2.0
tags:
- parameters guide
- samplers guide
- model generation
- role play settings
- quant selection
- arm quants
- iq quants vs q quants
- optimal model setting
- gibberish fixes
- coherence
- instructing following
- quality generation
- chat settings
- quality settings
- llamacpp server
- llamacpp
- lmstudio
- sillytavern
- koboldcpp
- backyard
- ollama
- model generation steering
- steering
- model generation fixes
- text generation webui
- ggufs
- exl2
- full precision
- quants
- imatrix
- neo imatrix
---
<h3>Maximizing Model Performance for All Quant Types And Full-Precision using Samplers, Advanced Samplers and Parameters Guide</h3>
(Updated: "INDEX", and added "Generation Steering" section ; notes on Roleplay/Simulation added, Screenshots of parameters/samplers added in quick reference section.)
This document includes detailed information, references, and notes for general parameters, samplers and
advanced samplers to get the most out of your model's abilities, including notes / settings for the most popular AI/LLM apps in use (LLAMACPP, KoboldCPP, Text-Generation-WebUI, LMStudio, Sillytavern, Ollama and others).
These settings / suggestions can be applied to all models including GGUF, EXL2, GPTQ, HQQ, AWQ and full source/precision.
It also includes critical settings for Class 3 and Class 4 models at this repo - DavidAU - to enhance and control generation
for specific as well as outside use case(s), including role play, chat and other use case(s).
The settings discussed in this document can also fix a number of model issues (<B>any model, any repo</B>) such as:
- "Gibberish"
- Generation length (including out of control generation)
- Chat quality / Multi-Turn convos.
- Multi-turn / COT / and other multi prompt/answer generation
- Letter, word, phrase, paragraph repeats
- Coherence
- Instruction following
- Creativity: too little, or too much (purple prose).
- Low quant (ie q2k, iq1s, iq2s) issues.
- General output quality.
- Role play related issues.
Likewise, ALL the settings (parameters, samplers and advanced samplers) below can also improve model generation and/or general overall "smoothness" / "quality" of model operation:
- all parameters and samplers available via LLAMACPP (and most apps that run / use LLAMACPP - including Lmstudio, Ollama, Sillytavern and others.)
- all parameters (including some not in Llamacpp), samplers and advanced samplers ("Dry", "Quadratic", "Mirostat") in oobabooga/text-generation-webui including llamacpp_HF loader (allowing a lot more samplers)
- all parameters (including some not in Llamacpp), samplers and advanced samplers ("Dry", "Quadratic", "Mirostat") in SillyTavern / KoboldCPP (including Anti-slop filters)
Even if you are not using my models, you may find this document <u>useful for any model (any quant / full source / any repo) available online.</u>
If you are currently using model(s) - from my repo and/or others - that are difficult to "wrangle" then you can apply "Class 3" or "Class 4" settings to them.
This document will be updated over time too and is subject to change without notice.
Please use the "community tab" for suggestions / edits / improvements.
IMPORTANT:
Every parameter, sampler and advanced sampler here affects per token generation and overall generation quality.
This effect is cumulative especially with long output generation and/or multi-turn (chat, role play, COT).
Likewise, because of how modern AIs/LLMs operate, the quality of the previously generated tokens affects the next tokens generated too.
You will get higher quality operation overall - stronger prose, better answers, and a higher quality adventure.
PS: Running a 70B model?
You may want to see this document:
https://huggingface.co/DavidAU/Llama-3.3-70B-Instruct-How-To-Run-on-Low-BPW-IQ1_S-IQ1_M-at-maximum-speed-quality
---
<H2>INDEX</H2>
---
<B>How to Use this document:</B>
Review quant(s) information to select quant(s) to download, then review "Class 1,2,3..." for specific information on models followed by "Source Files...APPS to run LLMs/AIs".
"TESTING / Default / Generation Example PARAMETERS AND SAMPLERS" are the basic defaults for parameters, and samplers - the bare minimums. You should always set these first.
The optional section "Generational Control And Steering of a Model / Fixing Model Issues on the Fly" covers methods to manually steer / edit / modify generation (as well as fixes) for any model.
"Quick reference" will state the best parameter settings for each "Class" of model(s) to get the best operation and/or good defaults to use to get started. If you came to this page from a repo card on my repo -DavidAU- the "class" of the model would have been stated just before you came to this page.
The detailed sections about parameters - Section 1 a,b,c and section 2 will help tune the model(s) operation.
The "DETAILED NOTES ON PARAMETERS, SAMPLERS and ADVANCED SAMPLERS" section after this covers and links to more information about "tuning" your model(s). These cover theory, hints, tips and tricks, and observations
and how to fine control CLASS 3/4 models directly.
All information about parameters, samplers and advanced samplers applies to ALL models, regardless of repo(s) you download them from.
<small>
<PRE>
QUANTS:
- QUANTS Detailed information.
- IMATRIX Quants
- QUANTS GENERATIONAL DIFFERENCES:
- ADDITIONAL QUANT INFORMATION
- ARM QUANTS / Q4_0_X_X
- NEO Imatrix Quants / Neo Imatrix X Quants
- CPU ONLY CONSIDERATIONS
Class 1, 2, 3 and 4 model critical notes
SOURCE FILES for my Models / APPS to Run LLMs / AIs:
- TEXT-GENERATION-WEBUI
- KOBOLDCPP
- SILLYTAVERN
- Lmstudio, Ollama, Llamacpp, Backyard, and OTHER PROGRAMS
- Roleplay and Simulation Programs/Notes on models.
TESTING / Default / Generation Example PARAMETERS AND SAMPLERS
- Basic settings suggested for general model operation.
Generational Control And Steering of a Model / Fixing Model Issues on the Fly
- Multiple Methods to Steer Generation on the fly
- On the fly Class 3/4 Steering / Generational Issues and Fixes (also for any model/type)
- Advanced Steering / Fixing Issues (any model, any type) and "sequenced" parameter/sampler change(s)
- "Cold" Editing/Generation
Quick Reference Table / Parameters, Samplers, Advanced Samplers
- Quick setup for all model classes for automated control / smooth operation.
- Screenshots for multiple LLM/AI apps of parameters/samplers
- Section 1a : PRIMARY PARAMETERS - ALL APPS
- Section 1b : PENALTY SAMPLERS - ALL APPS
- Section 1c : SECONDARY SAMPLERS / FILTERS - ALL APPS
- Section 2: ADVANCED SAMPLERS
DETAILED NOTES ON PARAMETERS, SAMPLERS and ADVANCED SAMPLERS:
- DETAILS on PARAMETERS / SAMPLERS
- General Parameters
- The Local LLM Settings Guide/Rant
- LLAMACPP-SERVER EXE - usage / parameters / samplers
- DRY Sampler
- Samplers
- Creative Writing
- Benchmarking-and-Guiding-Adaptive-Sampling-Decoding
ADVANCED: HOW TO TEST EACH PARAMETER(s), SAMPLER(s) and ADVANCED SAMPLER(s)
</pre>
</small>
---
<h2>QUANTS:</h2>
---
Please note that smaller quant(s), IE: Q2K, IQ1s, IQ2s and some IQ3s (especially for models of 8B parameters or less), may require additional adjustment(s). For these quants
you may need to increase the "penalty" sampler(s) and/or advanced sampler(s) to compensate for the compression damage to the model.
For models of 20B parameters and higher, generally this is not a major concern as the parameters can make up for compression damage at lower quant levels (IE Q2K+, but at least Q3 ; IQ2+, but at least IQ3+).
IQ1s: Generally IQ1_S rarely works for models less than 30B parameters. IQ1_M is however almost twice as stable/usable relative to IQ1_S.
Generally it is recommended to run the highest quant(s) you can on your machine ; but at least Q4KM/IQ4XS as a minimum for models 20B and lower.
The smaller the size of model, the greater the contrast between the smallest quant and largest quant in terms of operation, quality, nuance and general overall function.
There are exceptions to this: see "NEO Imatrix Quants" and "CPU ONLY CONSIDERATIONS" below.
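The rules of thumb above can be sketched as a small lookup. This is only an illustration: the function names are hypothetical, the text says both "20B and lower" and "20B and higher" so treating exactly 20B as the smaller bracket is an assumption, and reading "at least Q3 / IQ3" as the floor for larger models is one interpretation of the note above.

```python
def minimum_recommended_quants(params_billions: float) -> dict:
    """Return the suggested quant floor for a model size (rule of thumb only)."""
    if params_billions <= 20:  # assumption: exactly 20B falls in the smaller bracket
        return {"k_quant": "Q4_K_M", "i_quant": "IQ4_XS"}
    # Larger models tolerate more compression; Q3 / IQ3 is the suggested floor.
    return {"k_quant": "Q3", "i_quant": "IQ3"}


def iq1_s_likely_usable(params_billions: float) -> bool:
    """IQ1_S rarely works below roughly 30B parameters."""
    return params_billions >= 30


print(minimum_recommended_quants(8))   # floor for an 8B model
print(minimum_recommended_quants(70))  # floor for a 70B model
print(iq1_s_likely_usable(8))
```

These are floors, not targets: the general advice is still to run the highest quant your machine can handle.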
IMATRIX:
Imatrix quants generally improve all quants, and also allow you to use smaller quants (less memory, more context space) and retain quality of operation.
IE: Instead of using a Q4KM, you might be able to run an IQ3_M and get close to Q4KM's quality, but at a higher tokens per second speed and with more VRAM left for context.
<B>Recommended Quants - ALL:</B>
This covers both Imatrix and regular quants.
Imatrix can be applied to any quant - "Q" or "IQ" - however, IQ1s to IQ3_S REQUIRE an imatrix dataset / imatrixing process before quanting.
This chart shows the order in terms of "BPW" for each quant (mapped below with relative "strength" to one another) with "IQ1_S" with the least, and "Q8_0" (F16 is full precision) with the most:
<small>
<PRE>
IQ1_S | IQ1_M
IQ2_XXS | IQ2_XS | Q2_K_S | IQ2_S | Q2_K | IQ2_M
IQ3_XXS | Q3_K_S | IQ3_XS | IQ3_S | IQ3_M | Q3_K_M | Q3_K_L
Q4_K_S | IQ4_XS | IQ4_NL | Q4_K_M
Q5_K_S | Q5_K_M
Q6_K
Q8_0
F16
</pre>
</small>
More BPW means better quality, but higher VRAM requirements (and larger file size) and lower tokens per second.
The larger the model in terms of parameters the lower the size of quant you can run with less quality losses.
Note that "quality losses" refers to both instruction following and output quality.
Quality differences between neighbouring quants are larger at the low end of the chart than at the high end.
The Imatrix process has NO effect on Q8 or F16 quants.
F16 is full precision, just in GGUF format.
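As a rough illustration of the chart and the BPW / file size relationship, here is a minimal sketch. The ordering is copied from the chart above; the helper names are hypothetical, and the size formula (parameters times bits per weight, divided by 8) is only an approximation that ignores file metadata and mixed-precision tensors. You supply the BPW value yourself; only F16 (16 bits per weight) is assumed here.

```python
# Order copied from the chart above, lowest BPW first.
QUANT_ORDER = [
    "IQ1_S", "IQ1_M",
    "IQ2_XXS", "IQ2_XS", "Q2_K_S", "IQ2_S", "Q2_K", "IQ2_M",
    "IQ3_XXS", "Q3_K_S", "IQ3_XS", "IQ3_S", "IQ3_M", "Q3_K_M", "Q3_K_L",
    "Q4_K_S", "IQ4_XS", "IQ4_NL", "Q4_K_M",
    "Q5_K_S", "Q5_K_M",
    "Q6_K",
    "Q8_0",
    "F16",
]


def is_higher_quant(a: str, b: str) -> bool:
    """True if quant `a` sits above quant `b` in the chart."""
    return QUANT_ORDER.index(a) > QUANT_ORDER.index(b)


def estimate_file_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size: (billions of params) * (bits per weight) / (8 bits per byte)."""
    return params_billions * bits_per_weight / 8


print(is_higher_quant("Q4_K_M", "IQ3_M"))
print(estimate_file_size_gb(8, 16))  # an 8B model at F16
```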
QUANTS GENERATIONAL DIFFERENCES:
Higher quants will have more detail, nuance and in some cases stronger "emotional" levels. Characters will also be
more "fleshed out" too. Sense of "there" will also increase.
Likewise, for any use case, nuance (in both instruction following AND output generation) will be higher with higher quants.
"Nuance" is critical for both understanding, as well as the quality of the output generation.
To put this another way, "nuance" is lost as the full precision model is more and more compressed (lower and lower quants).
Some of this can be counteracted by parameters and/or Imatrix (as noted earlier).
IQ4XS / IQ4NL quants:
Due to the unusual nature of these quants (mixture/processing), generations from them will be different than other quants.
These quants can also be "quanted" with or without an Imatrix.
You may want to try it / compare it to other quant(s) output.
Special note on Q2k/Q3 quants:
You may need to use temp 2 or lower with these quants (1 or lower for q2k). There is simply too much compression at this level, which damages the model.
IQ quants (and Imatrix versions of q2k/q3) perform better at these "BPW" levels.
Rep pen adjustments may also be required to get the most out of a model at this/these quant level(s).
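The temperature note above could be applied automatically with something like the following sketch. The function names are hypothetical and the matching on quant names is deliberately simple; it only encodes the two ceilings stated above (1 for Q2K, 2 for Q3) and leaves every other quant untouched.

```python
from typing import Optional


def suggested_max_temp(quant: str) -> Optional[float]:
    """Return the temperature ceiling suggested above, or None if no ceiling applies."""
    q = quant.upper().replace("_", "")
    if q.startswith("Q2K"):
        return 1.0
    if q.startswith("Q3"):
        return 2.0
    return None  # IQ quants and Q4+ are not covered by this note


def clamp_temperature(temp: float, quant: str) -> float:
    """Lower `temp` to the suggested ceiling for this quant, if there is one."""
    ceiling = suggested_max_temp(quant)
    return temp if ceiling is None else min(temp, ceiling)


print(clamp_temperature(1.5, "Q2_K"))    # lowered to the Q2K ceiling
print(clamp_temperature(1.5, "Q3_K_M"))  # already under the Q3 ceiling
print(clamp_temperature(1.5, "Q4_K_M"))  # unchanged
```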
ADDITIONAL QUANT INFORMATION:
<details>
<summary>Click here for details</summary>
A great write up with charts showing various performances is provided by Artefact2 [here](https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9)
The first thing to figure out is how big a model you can run. To do this, you'll need to figure out how much RAM and/or VRAM you have.
If you want your model running as FAST as possible, you'll want to fit the whole thing on your GPU's VRAM. Aim for a quant with a file size 1-2GB smaller than your GPU's total VRAM.
If you want the absolute maximum quality, add both your system RAM and your GPU's VRAM together, then similarly grab a quant with a file size 1-2GB Smaller than that total.
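The sizing rule in the two paragraphs above can be sketched as follows. The function name and the file sizes in the example are hypothetical (check the actual file sizes on the model's download page), and the default headroom of 2 GB is the conservative end of the "1-2GB" range mentioned above.

```python
from typing import Dict, Optional


def pick_largest_fitting_quant(
    file_sizes_gb: Dict[str, float],
    vram_gb: float,
    ram_gb: float = 0.0,
    headroom_gb: float = 2.0,
) -> Optional[str]:
    """Return the largest quant whose file fits in memory minus headroom.

    Pass ram_gb=0 for the "as fast as possible" (GPU only) case, or your
    system RAM for the "maximum quality" (RAM + VRAM) case.
    """
    budget = vram_gb + ram_gb - headroom_gb
    fitting = {name: size for name, size in file_sizes_gb.items() if size <= budget}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


# Hypothetical file sizes for one model, for illustration only.
sizes = {"Q4_K_M": 4.9, "Q6_K": 6.6, "Q8_0": 8.5}
print(pick_largest_fitting_quant(sizes, vram_gb=8))             # GPU only
print(pick_largest_fitting_quant(sizes, vram_gb=8, ram_gb=16))  # GPU + system RAM
```

Remember that context also needs memory, so a quant that only just fits may leave little room for longer context sizes.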
Next, you'll need to decide if you want to use an 'I-quant' or a 'K-quant'.
If you don't want to think too much, grab one of the K-quants. These are in format 'QX_K_X', like Q5_K_M.
If you want to get more into the weeds, you can check out this extremely useful feature chart:
[llama.cpp feature matrix](https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix)
But basically, if you're aiming for below Q4, and you're running cuBLAS (Nvidia) or rocBLAS (AMD), you should look towards the I-quants. These are in format IQX_X, like IQ3_M. These are newer and offer better performance for their size.
These I-quants can also be used on CPU and Apple Metal, but will be slower than their K-quant equivalent, so speed vs performance is a tradeoff you'll have to decide.
The I-quants are *not* compatible with Vulkan, which is also AMD, so if you have an AMD card double check if you're using the rocBLAS build or the Vulkan build. At the time of writing this, LM Studio has a preview with ROCm support, and other inference engines have specific builds for ROCm.
</details>
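The I-quant vs K-quant guidance in the details above can be summarized as a small decision helper. This is a sketch only: the function name and backend strings are hypothetical, and the rules simply restate the notes above as they stood at the time of writing (backend support changes over time, so check the llama.cpp feature matrix for your build).

```python
def suggest_quant_family(target_bits: int, backend: str) -> str:
    """Suggest 'I-quant' or 'K-quant' from the target bit level and backend."""
    backend = backend.lower()
    if backend == "vulkan":
        # Per the note above, I-quants were not compatible with Vulkan builds.
        return "K-quant"
    if target_bits < 4 and backend in {"cublas", "rocblas"}:
        # Below Q4 on Nvidia/AMD GPU builds, I-quants offer better quality for their size.
        return "I-quant"
    # On CPU and Apple Metal, I-quants work but run slower than the K-quant
    # equivalent, so K-quants are the simple default here.
    return "K-quant"


print(suggest_quant_family(3, "cuBLAS"))
print(suggest_quant_family(3, "Vulkan"))
print(suggest_quant_family(5, "rocBLAS"))
```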
ARM QUANTS / Q4_0_X_X:
These are new quants that are specifically for computers/devices that can run "ARM" quants. If you try to run these on a "non arm" machine/device, the tokens per second will be VERY SLOW.
Q4_0_X_X information
These are *NOT* for Metal (Apple) or GPU (nvidia/AMD/intel) offloading, only ARM chips (and certain AVX2/AVX512 CPUs).
If you're using an ARM chip, the Q4_0_X_X quants will have a substantial speedup. Check out Q4_0_4_4 speed comparisons [on the original pull request](https://github.com/ggerganov/llama.cpp/pull/5780#pullrequestreview-21657544660)
To check which one would work best for your ARM chip, you can check [AArch64 SoC features](https://gpages.juszkiewicz.com.pl/arm-socs-table/arm-socs.html) (thanks EloyOn!).
If you're using a CPU that supports AVX2 or AVX512 (typically server CPUs and AMD's latest Zen5 CPUs) and are not offloading to a GPU, the Q4_0_8_8 may offer a nice speed as well:
<details>
<summary>Click to view benchmarks on an AVX2 system (EPYC7702)</summary>
| model | size | params | backend | threads | test | t/s | % (vs Q4_0) |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |-------------: |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp512 | 204.03 ± 1.03 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp1024 | 282.92 ± 0.19 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | pp2048 | 259.49 ± 0.44 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg128 | 39.12 ± 0.27 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg256 | 39.31 ± 0.69 | 100% |
| qwen2 3B Q4_0 | 1.70 GiB | 3.09 B | CPU | 64 | tg512 | 40.52 ± 0.03 | 100% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp512 | 301.02 ± 1.74 | 147% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp1024 | 287.23 ± 0.20 | 101% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | pp2048 | 262.77 ± 1.81 | 101% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg128 | 18.80 ± 0.99 | 48% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg256 | 24.46 ± 3.04 | 83% |
| qwen2 3B Q4_K_M | 1.79 GiB | 3.09 B | CPU | 64 | tg512 | 36.32 ± 3.59 | 90% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp512 | 271.71 ± 3.53 | 133% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp1024 | 279.86 ± 45.63 | 100% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | pp2048 | 320.77 ± 5.00 | 124% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg128 | 43.51 ± 0.05 | 111% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg256 | 43.35 ± 0.09 | 110% |
| qwen2 3B Q4_0_8_8 | 1.69 GiB | 3.09 B | CPU | 64 | tg512 | 42.60 ± 0.31 | 105% |
Q4_0_8_8 offers a nice bump to prompt processing and a small bump to text generation.
</details>
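For reference, the "% (vs Q4_0)" column in the benchmark table is the tokens per second of a row divided by the Q4_0 result for the same test. A minimal sketch (the function name is hypothetical; the numbers are taken from the Q4_0 and Q4_0_8_8 rows of the table):

```python
def percent_vs_baseline(tokens_per_second: float, baseline_tokens_per_second: float) -> int:
    """Speed relative to the baseline quant, as a whole-number percentage."""
    return round(100 * tokens_per_second / baseline_tokens_per_second)


# Q4_0_8_8 vs Q4_0, text generation (tg128): 43.51 t/s vs 39.12 t/s
print(percent_vs_baseline(43.51, 39.12))    # 111
# Q4_0_8_8 vs Q4_0, prompt processing (pp512): 271.71 t/s vs 204.03 t/s
print(percent_vs_baseline(271.71, 204.03))  # 133
```

You can use the same calculation to compare quants on your own hardware once you have tokens per second figures for each.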
<B>NEO Imatrix Quants / Neo Imatrix X Quants</B>
NEO Imatrix quants are specialized and specifically "themed" datasets used to slightly alter the weights in a model. All Imatrix datasets do this to some degree or another, however NEO Imatrix datasets
are content / theme specific and have been calibrated to have maximum effect on a model (relative to standard Imatrix datasets). Calibration was made possible after testing 50+ standard Imatrix datasets,
and carefully modifying them and testing the resulting changes to determine the exact format and content which has the maximum effect on a model via the Imatrix process.
Please keep in mind that the Imatrix process (at its strongest) only "tints" a model and/or slightly changes its bias(es).
Here are some Imatrix Neo Models:
[ https://huggingface.co/DavidAU/Command-R-01-Ultra-NEO-DARK-HORROR-V1-V2-35B-IMATRIX-GGUF ]
[ https://huggingface.co/DavidAU/Command-R-01-200xq-Ultra-NEO-V1-35B-IMATRIX-GGUF ]
[ https://huggingface.co/DavidAU/Command-R-01-200xq-Ultra-NEO-V1-35B-IMATRIX-GGUF ] (this is an X-Quant)
[ https://huggingface.co/DavidAU/Llama-3.2-1B-Instruct-NEO-SI-FI-GGUF ]
[ https://huggingface.co/DavidAU/Llama-3.2-1B-Instruct-NEO-WEE-HORROR-GGUF ]
[ https://huggingface.co/DavidAU/L3-8B-Stheno-v3.2-Ultra-NEO-V1-IMATRIX-GGUF ]
Suggestions for Imatrix NEO quants:
- The LOWER the quant the STRONGER the Imatrix effect is, and therefore the stronger the "tint" so to speak
- Due to the unique nature of this project, quants IQ1s to IQ4s are recommended for maximum effect with IQ4_XS the most balanced in terms of power and bits.
- Secondaries are Q2s-Q4s. Imatrix effect is still strong in these quants.
- Effects diminish quickly from Q5s and up.
- For Q8/F16 there is no change (the Imatrix process does not affect these quants), so they are not included.
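The suggestions above can be expressed as a simple classifier. This is a sketch only: the function name and tier labels are hypothetical, and the tiers just restate the list above (IQ1 to IQ4 for maximum effect, Q2 to Q4 as secondaries, diminishing effect from Q5 up, none for Q8/F16).

```python
import re


def neo_imatrix_effect_tier(quant: str) -> str:
    """Classify how strongly the Imatrix 'tint' shows up in a given quant."""
    q = quant.upper()
    if q in {"Q8_0", "F16"}:
        return "none"
    match = re.match(r"(I?Q)(\d)", q)
    if match is None:
        raise ValueError(f"unrecognized quant name: {quant}")
    family, bits = match.group(1), int(match.group(2))
    if family == "IQ" and 1 <= bits <= 4:
        return "primary"      # recommended for maximum effect
    if family == "Q" and 2 <= bits <= 4:
        return "secondary"    # effect still strong
    return "diminishing"      # Q5 and up


for name in ["IQ4_XS", "Q3_K_M", "Q6_K", "Q8_0"]:
    print(name, neo_imatrix_effect_tier(name))
```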
CPU ONLY CONSIDERATIONS:
This section DOES NOT apply to most "Macs" because of the difference in O/S Memory, Vram and motherboard VS other frameworks.
Running quants on CPU will be a lot slower than running them on a video card(s).
In this special case however it may be preferred to run AS SMALL a quant as possible for token per second generation reasons.
On a top, high end (and relatively new) CPU, expect tokens per second speeds to be 1/4 (or less) of those of a standard middle of the road video card.
Older machines/cpus will be a lot slower - but models will STILL run on these as long as you have enough ram.
Here are some rough comparisons:
On my video card (Nvidia 16GB 4060TI) I get 160-190 tokens per second with 1B Llama 3.2 Instruct; CPU speeds are 50-60 tokens per second.
On my much older machine (8 years old, 2 core), speed with the same 1B model is around 10 tokens per second (CPU).
Roughly 8B-12B models are the limit for CPU only operation (in terms of "usable" tokens/second) - at the moment.
This is changing as new cpus come out, designed for AI usage.
---
<h2>Class 1, 2, 3 and 4 model critical notes:</h2>
---
Some of the models at my repo are custom designed / limited use case models. For some of these models, specific settings and/or samplers (including advanced) are recommended for best operation.
As a result I have classified the models as class 1, class 2, class 3 and class 4.
Each model is "classed" on the model card itself for each model.
Generally all models (mine and other repos) fall under class 1 or class 2 and can be used with just about any sampler(s) / parameter(s) and advanced sampler(s).
Class 3 requires a little more adjustment because these models run closer to the ragged edge of stability. The settings for these will help control them better, especially
for chat / role play and/or other use case(s). Generally speaking, this helps them behave better overall.
Class 4 are balanced on the very edge of stability. These models are generally highly creative, for very narrow use case(s), and closer to "human prose" than other models and/or
operate in ways no other model(s) operate offering unique generational abilities. With these models, advanced samplers are used to "bring these bad boys" inline which is especially important for chat and/or role play type use cases AND/OR use case(s) these models were not designed for.
For reference here are some Class 3/4 models:
[ https://huggingface.co/DavidAU/L3-Stheno-Maid-Blackroot-Grand-HORROR-16B-GGUF ]
(note Grand Horror Series contain class 2,3 and 4 models)
[ https://huggingface.co/DavidAU/L3-DARKEST-PLANET-16.5B-GGUF ]
(note Dark Planet Series contains Class 1, 2 and Class 3/4 models)
[ https://huggingface.co/DavidAU/MN-DARKEST-UNIVERSE-29B-GGUF ]
(this model has exceptional prose abilities in all areas)
[ https://huggingface.co/DavidAU/MN-GRAND-Gutenberg-Lyra4-Lyra-23.5B-GGUF ]
(note Grand Gutenberg Madness/Darkness (12B) are class 1 models, but compressed versions of 23.5B)
Although Class 3 and Class 4 models will work when used within their specific use case(s) with the standard parameters and settings on the model card, I recognize that users want a smoother experience
and/or want to use these models for use case(s) other than the intended ones; that is in part why I created this document.
The goal here is to use parameters to raise/lower the power of the model and samplers to "prune" (and/or in some cases enhance) operation.
With that being said, generation "examples" (at my repo) are created using the "Primary Testing Parameters" (top of this document) settings regardless of the "class" of the model and no advanced settings, parameters, or samplers.
However, for ANY model regardless of "class" or if it is at my repo, you can now take performance to the next level with the information contained in this document.
Side note:
There are no "Class 5" models published... yet.
---
<h2>SOURCE FILES for my Models / APPS to Run LLMs / AIs:</h2>
---
Source files / Source models of my models are located here (also upper right menu on this page):
[ https://huggingface.co/collections/DavidAU/d-au-source-files-for-gguf-exl2-awq-gptq-hqq-etc-etc-66b55cb8ba25f914cbf210be ]
You will need the config files to use "llamacpp_HF" loader ("text-generation-webui") [ https://github.com/oobabooga/text-generation-webui ]
You can also use the full source in "text-generation-webui" too.
As an alternative you can use GGUFs directly in "KOBOLDCPP" / "SillyTavern" without the "config files" and still use almost all the parameters, samplers and advanced samplers.
<B>Parameters, Samplers and Advanced Samplers</B>
Sections 1a, 1b and 1c below list all the LLAMA_CPP parameters and samplers.
I have added notes below each one for adjustment / enhancement(s) for specific use cases.
<B>TEXT-GENERATION-WEBUI</B>
Section 2 covers additional samplers, which become available when using the "llamacpp_HF" loader in https://github.com/oobabooga/text-generation-webui
AND/OR https://github.com/LostRuins/koboldcpp ("KOBOLDCPP").
The "llamacpp_HF" (for "text-generation-webui") only requires the GGUF you want to use plus a few config files from "source repo" of the model.
(this process is automated with this program, just enter the repo(s) urls -> it will fetch everything for you)
This allows access to very advanced samplers in addition to all the parameters / samplers here.
<B>KOBOLDCPP:</B>
Note that https://github.com/LostRuins/koboldcpp also allows access to all LLAMACPP parameters/samplers, as well as additional advanced samplers.
You can use almost all parameters, samplers and advanced samplers using "KOBOLDCPP" without the need to get the source config files (the "llamacpp_HF" step).
Note: This program has one of the newest samplers called "Anti-slop" which allows phrase/word banning at the generation level.
<B>SILLYTAVERN:</B>
Note that https://github.com/SillyTavern/SillyTavern also allows access to all LLAMACPP parameters/samplers, as well as additional advanced samplers.
You can use almost all parameters, samplers and advanced samplers using "SILLYTAVERN" without the need to get the source config files (the "llamacpp_HF" step).
For CLASS3 and CLASS4 the most important setting is "SMOOTHING FACTOR" (Quadratic Smoothing) ; information is located on this page:
https://docs.sillytavern.app/usage/common-settings/
Critical Note:
Silly Tavern allows you to "connect" (via API) to different AI programs/apps like Koboldcpp, Llamacpp (server), Text Generation Webui, Lmstudio, Ollama ... etc etc.
You "load" a model in one of these, then connect Silly Tavern to the App via API. This way you can use any model, and Sillytavern becomes the interface between
the AI model and you directly. Sillytavern opens an interface in your browser.
In Sillytavern you can then adjust parameters, samplers and advanced samplers ; there are also PRESET parameter/samplers too and you can save your favorites too.
Currently, at time of this writing, connecting Silly Tavern via KoboldCPP or Text Generation Webui will provide the most samplers/parameters.
However for some, connecting to Lmstudio, LlamaCPP, or Ollama may be preferred.
You may also want to check out how to connect SillyTavern to local AI "apps" running on your pc here:
https://docs.sillytavern.app/usage/api-connections/
<B>Lmstudio, Ollama, Llamacpp, and OTHER PROGRAMS</B>
Other programs like https://www.LMStudio.ai allow access to most of the STANDARD samplers, whereas for others (llamacpp only here) you may need to add them to the json file(s) for a model and/or template preset.
In most cases all llama_cpp parameters/samplers are available when using API / headless / server mode in "text-generation-webui", "koboldcpp", "Sillytavern", "Ollama", and "LMStudio" (as well as other apps too).
You can also use llama_cpp directly too. (IE: llama-server.exe) ; see :
https://github.com/ggerganov/llama.cpp
(scroll down on the main page for more apps/programs to use GGUFs too that connect to / use the LLAMA-CPP package.)
Special note:
It appears the "DRY" / "XTC" samplers have been added to LLAMACPP and SILLYTAVERN.
They are available (Llamacpp) via "server.exe / llama-server.exe". Likely these samplers will also become available "downstream" in applications that use LLAMACPP in due time.
[ https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md ]
Operating Systems:
Most AI/LLM apps operate on Windows, Mac, and Linux.
Mobile devices (and O/S) are in many cases also supported.
<B>Roleplay and Simulation Programs/Notes on models.</B>
Text Generation Webui, KoboldCPP, and Silly Tavern (and AI/LLM apps connected via Silly Tavern) can all do roleplay / simulation AS WELL as "chat" and other creative activities.
LMStudio (the app here directly), Ollama and other LLM/AI apps are for general usage, however they can be connected to Silly Tavern via API too.
Backyard ( https://backyard.ai/ ) is software that is dedicated primarily to Roleplay / Simulation; however, at the time of this writing it can not be connected via API to Silly Tavern.
If you are using the Backyard app, see the special notes for "roleplay / simulation" and, where applicable, "BACKYARD APP" for specific notes on using this app.
Models that are Class 3/4 :
Some of my models that are rated Class 3 or 4 may be a little more challenging to operate with roleplay, especially if you can not access / control certain samplers.
How to handle this issue is addressed in "Generational Steering" section (you control it) as well as Quick Reference, and Detailed Parameters, Samplers and Advanced Samplers Sections (automated control).
Also, some of my models are available in multiple "classes", IE Dark Planet, and Grand Gutenberg.
In these cases, Dark Planet 8B versions and Grand Gutenberg 12B ("Darkness" / "Madness") are class 1 - any use case, including role play and simulation.
Likewise Darkest Planet 16.5B and Grand Gutenberg 23/23.5B are class 3 - great at roleplay/simulation, but need a bit more steering and/or parameter/samplers adjustments to work flawlessly for this use case.
Note: Dark Planet 8B (class 1) is also a compressed version of Grand Horror 16B (a full on class 4)
---
<h2>TESTING / Generation Example PARAMETERS AND SAMPLERS</h2>
---
Primary Testing Parameters I use, including use for output generation examples at my repo:
<B>Ranged Parameters:</B>
temperature: 0 to 5 ("temp")
repetition_penalty : 1.02 to 1.15 ("rep pen")
<B>Set parameters:</B>
top_k:40
min_p:0.05
top_p: 0.95
repeat-last-n: 64 (also called: "repetition_penalty_range" / "rp range" )
I do not set any other settings, parameters or have samplers activated when generating examples.
Everything else is "zeroed" / "disabled".
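As a rough sketch only, here is how these settings could be packed into a request body for an API call. The field names follow the llama.cpp server "/completion" endpoint as I understand it; other apps use different names (for example "repetition_penalty" or "rp_range"), so treat the names as assumptions to check against your app. The function name and the default "temperature" / "repeat_penalty" values are picked for illustration from inside the ranges above.

```python
import json

def primary_testing_payload(prompt, temperature=0.8, repeat_penalty=1.05):
    """Build a request body using the Primary Testing Parameters (illustrative helper)."""
    if not 0.0 <= temperature <= 5.0:
        raise ValueError("temperature should be in the 0 to 5 range")
    if not 1.02 <= repeat_penalty <= 1.15:
        raise ValueError("repeat_penalty should be in the 1.02 to 1.15 range")
    return {
        "prompt": prompt,
        # Ranged parameters
        "temperature": temperature,
        "repeat_penalty": repeat_penalty,
        # Set parameters
        "top_k": 40,
        "min_p": 0.05,
        "top_p": 0.95,
        "repeat_last_n": 64,
        # Other samplers left "zeroed" / "disabled"
        "presence_penalty": 0.0,
        "frequency_penalty": 0.0,
        "mirostat": 0,
    }

payload = primary_testing_payload("Write a short scene.", temperature=1.0)
print(json.dumps(payload, indent=2))
```

The payload is only built and printed here; sending it to a running server is left to whichever client you use.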
IMPORTANT:
These parameters/settings are considered both safe and default and in most cases available to all users in all AI/LLM apps.
You should set these as noted first. I would say these are the minimum settings to use to get good model operation.
Note for Class 3/Class 4 models settings/samplers (discussed below) "repeat-last-n" is a CRITICAL setting.
BACKYARD APP:
In "Backyard" app, "repetition_penalty_range" is called "Repeat Penalty Tokens" (set on the "character card").
For class 3/4 models (if using with Backyard app), set this to 64 OR LESS.
---
<H2>Generational Control And Steering of a Model / Fixing Model Issues on the Fly</h2>
---
<B>Multiple Methods to Steer Generation on the fly</B>
Now that you have the basic parameters and samplers from the previous section, I will cover Generational Control and Steering.
This section is optional and covers how to manually STEER generation(s) - ANY MODEL, ANY TYPE.
This section (in part) will also cover how to deal with Class 3/4 model issues directly, as well as general issues that can happen with any "class" of model during generation, IF you want to control them manually.
The "Quick Reference" and/or "Detailed Parameters, Samplers, and Advanced Samplers" sections cover how to deal with any generation issue(s) automatically.
There is a very important concept that must be covered first:
The output/generation/answer to your prompt/instructions BECOMES part of your "prompt" after you click STOP, and then click on "CONTINUE".
The same is true in multi-turn chat, role play, or in a "chat window" so to speak.
Your prompts AND the model's "answers"/"generation" all become part of the "ROADMAP" for the model to use in whatever journey you are on.
When you hit "REGEN" this nullifies only the last "generation" - not the prompt before it, nor the prompt(s)/generation(s) in the same chat.
The part I will cover here is once a generation has started, from a single prompt (no other prompts/generations in the chat).
So let's start with a prompt (NOTE: this prompt has no "steering" in the instructions):
Start a 1000 word scene (vivid horror, 1st person, include thoughts) with: The sky scraper swayed, as she watched the window in front of her on the 21 floor explode...
Generation starts ... and then ends.
It could be 500 words, to 4000+...
Then you hit regen however many times to get a "good" generation.
There is a better way.
Generation starts... 200 words in you think... this is not going in the right direction.
Do you hit stop? Then regen?
There are a lot more options:
1 - Hit Stop.
2 - Select "EDIT" -> Edit out the part(s) you don't want AND/OR add in STEERING "text" (statement, phrase, paragraph, even a single word) (anywhere in the "generation" text).
3 - Hit Continue.
Once you hit "continue" the change(s) you made will now steer the model's choices.
The LAST edit (bottom of the generation) will have the most impact. However ALL EDITS will affect generation as these become part of the generational "ROADMAP".
You can repeat this process at will.
Eventually the model will come to a "natural" stopping point.
If you want the model to continue past this point, delete a few lines AND "steer" it.
These methods apply to all generation types - not just a "scene" or "story", but "programming code", "article", "conclusions", "analytics", ... you name it.
Notes:
- For Text Generation Webui, you can transfer your "chat" to "notebook" for easy Stop/Edit/Continue function.
- For KoboldCPP -> This is built in.
- For Silly Tavern -> This is built in.
- For LMStudio -> This is built in.
- For API (direct control) you have to send the "chat" elements back to the "server" with the "edits" (send the whole "revised" chat as a json payload).
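For the API case, the Stop / Edit / Continue loop can be sketched as plain text handling. This is a minimal sketch, assuming a raw text-completion style API where the prompt plus the edited generation is sent back as the new prompt; the function name "steer_generation" and its arguments are hypothetical, and a chat-style API would need the same edit applied to the last message in the json payload instead.

```python
def steer_generation(prompt, partial, drop_last_lines=0, steering=""):
    """Return the text to send back for a 'Continue' after an edit (illustrative)."""
    lines = partial.rstrip("\n").split("\n")
    if drop_last_lines > 0:
        # Edit out the part you do not want (here: the last N lines).
        lines = lines[:-drop_last_lines]
    kept = "\n".join(lines)
    if steering:
        # The LAST edit has the most impact, so steering text goes at the bottom.
        kept = f"{kept}\n{steering}" if kept else steering
    return f"{prompt}\n\n{kept}"

prompt = "Start a scene (vivid horror, 1st person)."
partial = "Glass rained down.\nI thought about lunch.\nI thought about lunch again."
revised = steer_generation(prompt, partial, drop_last_lines=2,
                           steering="The floor tilted, and")
print(revised)
```

The model then continues from "The floor tilted, and", with the unwanted lines no longer part of the "ROADMAP".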
<B>On the fly Class 3/4 Steering / Generational Issues and Fixes (also for any model/type):</B>
Generational issues can occur such as letter(s), word(s), phrase(s), paragraph repeat(s), "rants" etc etc which can occur at any point during generation.
This can happen to ANY model, any type ; however with Class 3/4 models there is a higher chance this will occur because of how these models operate.
The "Quick Reference" and Detailed Parameters, Samplers and Advanced Samplers (below) cover how to set the model "controls" to do this automatically.
However, sometimes these settings MAY trim too much (ie creativity, "madness", nuance, emotion, even the "right answer(s)" etc etc), so I will show you how to address these issues directly.
If you have a letter(s) and/or word(s) repeat:
- Stop generation, edit this out, and go back ONE OR TWO lines (delete them)
- Hit continue.
- Better: Do these steps, and add "steering" (last line -> word, phrase, sentence)
If you have single or multiple paragraph repeat(s):
- Stop generation, edit out all the repeated paragraph(s), and go back ONE OR TWO lines OR the last NON repeating paragraph (delete these too)
- Hit continue.
- Better: Do these steps, and add "steering" (last line -> word, phrase, sentence or paragraph)
In each case we are BREAKING the "condition(s)" that led to (or led into) the repeat(s).
If you have "rants" and/or "model has lost its mind":
- Stop generation, edit out all the paragraph(s), going back AS FAR as possible to where it appears the rant/mind loss occurred (delete ALL), and delete one additional paragraph / 2 or more sentences.
- Hit continue.
- Better: Do these steps, and add "steering" (last line -> word, phrase, sentence or paragraph).
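If you are driving a model from a script, the paragraph-repeat fix can be roughly automated. The sketch below is an assumption-heavy illustration: it only catches paragraphs that repeat EXACTLY, it treats a blank line as the paragraph separator, and the name "trim_repeated_paragraphs" and the "extra_back" argument are hypothetical. It cuts at the first repeated paragraph and then backs up a little further to break the condition(s) leading into the repeat.

```python
def trim_repeated_paragraphs(text, extra_back=1):
    """Cut at the first exactly-repeated paragraph, then back up extra_back more."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    seen = set()
    first_repeat = None
    for index, paragraph in enumerate(paragraphs):
        if paragraph in seen:
            first_repeat = index
            break
        seen.add(paragraph)
    if first_repeat is None:
        # Nothing repeated: leave the generation alone.
        return "\n\n".join(paragraphs)
    cut = max(first_repeat - extra_back, 0)
    return "\n\n".join(paragraphs[:cut])

generation = "A.\n\nB.\n\nC.\n\nB.\n\nC."
print(trim_repeated_paragraphs(generation))
```

Add your "steering" text to the end of the trimmed result before continuing, as in the "Better" steps above.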
Class 3/4 model additional note:
With these classes of model, you MAY need to "edit" / "revise" further back than one or two lines / one paragraph - they sometimes need just a little more editing.
Another option is using "Cold" Editing/Generation explained below.
<B>Advanced Steering / Fixing Issues (any model, any type) and "sequenced" parameter/sampler change(s)</B>
This will drastically (depending on changes you make) change up "Continue(d)" generation(s):
- Do the edits above (steering and/or "steering fixes"), but before you click "Continue" (after your "Edit(s)"), adjust the parameter(s), sampler(s) and advanced sampler(s) settings.
- Once you do this BEFORE hitting "Continue" your new settings will be applied to all generation from your new "Continue" point.
- You can repeat this process at will.
- You can also hit "stop", make NO EDIT(S), adjust the parameter(s), sampler(s) and advanced sampler(s) settings and hit "Continue" and the new settings will take effect from the "stop point" going forward.
<B>"Cold" Editing/Generation</B>
Let's say you have a generation, but you want to edit it later IN A NEW CHAT.
Sometimes you can just copy/paste the generation and the model MAY get the "IDEA" and continue the generation without a prompt or direction.
However this does not always work.
So you need something along these lines (adjust accordingly):
Instructions: Continue this scene, using vivid and graphic details.
SCENE:
(previous generation)
Note the structure, layout and spacing.
If it was programming code:
Instructions: Continue this javascript, [critical instructions here for "code" goals]
JAVASCRIPT:
(previous generation)
You may want to include the ENTIRE prior prompt (with some modifications) used in the first generation:
Instructions: Continue the scene below (vivid horror, 1st person, include thoughts) with: The sky scraper swayed, as she watched the window in front of her on the 21 floor explode...
SCENE:
(previous generation)
NOTE:
You may want to modify the instructions to provide a "steering" continue point and/or "goal" for the generation so the model has some idea how to proceed.
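The "cold" prompt layouts above can be generated with a small helper. This is a sketch only: the function name "cold_continue_prompt" is hypothetical, and the blank lines between the three parts are my assumption about the spacing, so adjust them to match what works for your model.

```python
def cold_continue_prompt(previous, instructions, label="SCENE"):
    """Wrap a previous generation in an 'Instructions / LABEL / text' layout."""
    return f"Instructions: {instructions}\n\n{label.upper()}:\n\n{previous.strip()}\n"

cold_prompt = cold_continue_prompt(
    previous="The glass burst inward and the floor dropped away beneath me.",
    instructions="Continue this scene, using vivid and graphic details.",
)
print(cold_prompt)

code_prompt = cold_continue_prompt(
    previous="function sum(a, b) {",
    instructions="Continue this javascript, keep it dependency free.",
    label="javascript",
)
print(code_prompt)
```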
---
<h2> Quick Reference Table - Parameters, Samplers, Advanced Samplers </h2>
---
Compiled by: "EnragedAntelope" ( https://huggingface.co/EnragedAntelope || https://github.com/EnragedAntelope )
This section will get you started - especially with class 3 and 4 models - and the detail section will cover settings / control in more depth below.
Please see sections below this for advanced usage, more details, settings, notes etc etc.
IMPORTANT NOTES:
Not all parameters, samplers and advanced samplers are listed in this quick reference section. Scroll down to see all of them in following sections.
Likewise there may be some "name variation(s)" - in other LLM/AI apps - this is addressed in the detailed sections.
Screenshots of settings for Class 1-2, Class 3 and Class 4 are below this chart for Koboldcpp, SillyTavern and Text Gen Webui.
<small>
# LLM Parameters Reference Table
| Parameter | Description |
|----------- |-------------|
| **Primary Parameters** |
| temperature | Controls randomness of outputs (0 = deterministic, higher = more random). Range: 0-5 |
| top-p | Selects tokens with probabilities adding up to this number. Higher = more random results. Default: 0.9 |
| min-p | Discards tokens with probability smaller than this value × probability of most likely token. Default: 0.1 |
| top-k | Selects only top K most likely tokens. Higher = more possible results. Default: 40 |
| **Penalty Samplers** |
| repeat-last-n | Number of tokens to consider for penalties. Critical for preventing repetition. Default: 64 (Class 3/4 - but see notes) |
| repeat-penalty | Penalizes repeated token sequences. Range: 1.0-1.15. Default: 1.0 |
| presence-penalty | Penalizes token presence in previous text. Range: 0-0.2 for Class 3, 0.1-0.35 for Class 4 |
| frequency-penalty | Penalizes token frequency in previous text. Range: 0-0.25 for Class 3, 0.4-0.8 for Class 4 |
| penalize-nl | Penalizes newline tokens. Generally unused. Default: false |
| **Secondary Samplers** |
| mirostat | Controls perplexity during sampling. Modes: 0 (off), 1, or 2 |
| mirostat-lr | Mirostat learning rate. Default: 0.1 |
| mirostat-ent | Mirostat target entropy. Default: 5.0 |
| dynatemp-range | Range for dynamic temperature adjustment. Default: 0.0 |
| dynatemp-exp | Exponent for dynamic temperature scaling. Default: 1.0 |
| tfs | Tail free sampling - removes low-probability tokens. Default: 1.0 |
| typical | Selects tokens more likely than random given prior text. Default: 1.0 |
| xtc-probability | Probability of token removal. Range: 0-1 |
| xtc-threshold | Threshold for considering token removal. Default: 0.1 |
| **Advanced Samplers** |
| dry_multiplier | Controls DRY (Don't Repeat Yourself) intensity. Range: 0.8-1.12+ Class 3 (Class 4 is higher) |
| dry_allowed_length | Allowed length for repeated sequences in DRY. Default: 2 |
| dry_base | Base value for DRY calculations. Range: 1.15-1.75+ for Class 4 |
| smoothing_factor | Quadratic sampling intensity. Range: 1-3 for Class 3, 3-5+ for Class 4 |
| smoothing_curve | Quadratic sampling curve. Range: 1 for Class 3, 1.5-2 for Class 4 |
## Notes
- For Class 3 and 4 models, using both DRY and Quadratic sampling is recommended (see advanced/detailed samplers below on how to control the model here directly)
- Lower quants (Q2K, IQ1s, IQ2s) may require stronger settings due to compression damage
- Parameters interact with each other, so test changes one at a time
- Always test with temperature at 0 first to establish a baseline
</small>
SCREENSHOTS (right click-> open in new window) of Class 1-2, Class 3, and Class 4 for KoboldCPP, Silly Tavern and Text Gen Webui.
<a href="https://huggingface.co/DavidAU/Maximizing-Model-Performance-All-Quants-Types-And-Full-Precision-by-Samplers_Parameters/blob/main/class1-2-Silly-Tavern.jpg">class1-2-Silly-Tavern</a>
<a href="https://huggingface.co/DavidAU/Maximizing-Model-Performance-All-Quants-Types-And-Full-Precision-by-Samplers_Parameters/blob/main/class1-2-WebUI.jpg">class1-2-WebUI</a>
<a href="https://huggingface.co/DavidAU/Maximizing-Model-Performance-All-Quants-Types-And-Full-Precision-by-Samplers_Parameters/blob/main/class1-2-kcpp.jpg">class1-2-kcpp</a>
<a href="https://huggingface.co/DavidAU/Maximizing-Model-Performance-All-Quants-Types-And-Full-Precision-by-Samplers_Parameters/blob/main/class3-Silly-Tavern.jpg">class3-Silly-Tavern</a>
<a href="https://huggingface.co/DavidAU/Maximizing-Model-Performance-All-Quants-Types-And-Full-Precision-by-Samplers_Parameters/blob/main/class3-WebUI.jpg">class3-WebUI</a>
<a href="https://huggingface.co/DavidAU/Maximizing-Model-Performance-All-Quants-Types-And-Full-Precision-by-Samplers_Parameters/blob/main/class3-kcpp.jpg">class3-kcpp</a>
<a href="https://huggingface.co/DavidAU/Maximizing-Model-Performance-All-Quants-Types-And-Full-Precision-by-Samplers_Parameters/blob/main/class4-Silly-Tavern.jpg">class4-Silly-Tavern</a>
<a href="https://huggingface.co/DavidAU/Maximizing-Model-Performance-All-Quants-Types-And-Full-Precision-by-Samplers_Parameters/blob/main/class4-WebUI.jpg">class4-WebUI</a>
<a href="https://huggingface.co/DavidAU/Maximizing-Model-Performance-All-Quants-Types-And-Full-Precision-by-Samplers_Parameters/blob/main/class4-kcpp.jpg">class4-kcpp</a>
NOTES:
These cover basic/default settings PER CLASS. See "quick range" of settings above, and full range with details on how to "fine tune" model operation below.
This is especially important for fine control of Class 3 and Class 4 models ; sometimes you can use class 2 or 3 settings for class 3 and even class 4 models.
It is your use CASE(s) / smooth operation requirements that determine which settings will work best.
You should not apply class 3 or class 4 settings on a class 1 or class 2 model - this might limit model operation and usually class 1/2 models do not require this level of control.
CLASS 3/4 MODELS:
If you are using a class 3 or class 4 model for use case(s) such as role play, multi-turn, chat etc etc, activating / setting all samplers is suggested for class 3 models and may be required for class 4 models.
Likewise, fine control of a class 3/4 model via the "DRY" and "Quadratic" samplers is detailed below. These allow you to dial up or dial down the model's raw power directly.
ROLEPLAY / SIMULATION NOTES:
If you are using a model (regardless of "class") for these use cases, you may need to LOWER "temp" to get better instruction following.
Instruction following issues can cascade over the "adventure" if the temp is set too high for the specific model(s) you are using.
Likewise you may want to set MAXIMUM output tokens (a hard limit how much the model can output) to much lower values such as 128 to 300.
(This will assist with steering, and stop the model from endlessly "yapping")
MIROSTAT Sampler - IMPORTANT:
Make sure to review the MIROSTAT sampler settings below, due to the behaviour of this specific sampler and its effect on parameters/other samplers, which also varies from app to app.
---
<h2>Section 1a : PRIMARY PARAMETERS - ALL APPS:</h2>
---
These parameters will have SIGNIFICANT effect on prose, generation, length and content; with temp being the most powerful.
Keep in mind the biggest parameter / random "unknown" is your prompt.
A word change, rephrasing, punctuation, even a comma or semi-colon, can drastically alter the output, even at min temp settings. CAPS affect generation too.
Likewise the size, and complexity of your prompt impacts generation too ; especially clarity and direction.
Special note:
Pre-prompts / system role are not discussed here. Many of the model repo cards (at my repo) have an optional pre-prompt you can use to aid generation (and can impact instruction following too).
Some of my newer models repo cards use a limited form of this called a "prose control" (discussed and shown by example).
Roughly a pre-prompt / system role is embedded during each prompt and can act as a guide and/or set of directives for processing the prompt and/or containing generation instructions.
A prose control is a simplified version of this, which precedes the main prompt(s) - but the idea / effect is relatively the same (pre-prompt/system role does have a slightly higher priority however).
I strongly suggest you research these online, as they are a powerful addition to your generation toolbox.
They are especially potent with newer model archs, because newer model types have stronger instruction following abilities AND increased context.
---
<B>PRIMARY PARAMETERS:</B>
---
<B>temp / temperature</B>
temperature (default: 0.8)
Primary factor to control the randomness of outputs. 0 = deterministic (only the most likely token is used). Higher value = more randomness.
Range 0 to 5. Increment at .1 per change.
Too much temp can affect instruction following in some cases, while not enough temp can mean boring generation.
Newer model archs (L3, L3.1, L3.2, Mistral Nemo, Gemma2 etc) often NEED more temp (1+) to get their best generations.
ROLEPLAY / SIMULATION NOTE:
If you are using a model (regardless of "class") for these use cases, you may need to LOWER temp to get better instruction following.
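To see why temp has this effect, here is a small sketch of the standard "divide the logits by temperature, then softmax" calculation. The logit values are made up for illustration, and real apps do this inside the sampler chain, so this is not any specific app's code.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities at a given temperature."""
    if temperature <= 0:
        # temp 0 = deterministic: all probability goes to the most likely token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for temp in (0, 0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temp)
    print(temp, [round(p, 3) for p in probs])
```

Lower temp piles probability onto the top token (safer, better instruction following); higher temp flattens the distribution so less likely tokens get picked more often.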
<B>top-p</B>
top-p sampling (default: 0.9, 1.0 = disabled)
If not set to 1, select tokens with probabilities adding up to less than this number. Higher value = higher range of possible random results.
Dropping this can simplify word choices but this works in conjunction with "top-k"
I use default of: .95 ;
<B>min-p</B>
min-p sampling (default: 0.1, 0.0 = disabled)
Tokens with probability smaller than (min_p) * (probability of the most likely token) are discarded.
I use default: .05 ;
Careful adjustment of this parameter can result in more "wordy" or "less wordy" generation but this works in conjunction with "top-k".
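The min-p rule above is simple enough to sketch directly. The token probabilities below are invented for illustration and the function name "min_p_filter" is hypothetical.

```python
def min_p_filter(probs, min_p):
    """Discard tokens whose probability is below min_p * (top token probability)."""
    if min_p <= 0:
        return dict(probs)  # 0.0 = disabled
    threshold = min_p * max(probs.values())
    kept = {token: p for token, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    # Renormalize so the surviving probabilities sum to 1 again.
    return {token: p / total for token, p in kept.items()}

probs = {"the": 0.60, "a": 0.25, "storm": 0.10, "zebra": 0.04, "qux": 0.01}
print(sorted(min_p_filter(probs, 0.05)))  # threshold is 0.05 * 0.60 = 0.03
print(sorted(min_p_filter(probs, 0.10)))  # threshold is 0.10 * 0.60 = 0.06
```

Because the threshold scales with the top token, min-p prunes hard when the model is confident and prunes lightly when many tokens are close in probability.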
<B>top-k</B>
top-k sampling (default: 40, 0 = disabled)
Similar to top_p, but select instead only the top_k most likely tokens. Higher value = higher range of possible random results.
Bring this up to 80-120 for a lot more word choice, and below 40 for simpler word choices.
As this parameter operates in conjunction with "top-p" and "min-p" all three should be carefully adjusted one at a time.
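A short sketch of how top-k and top-p each cut the candidate list. The probabilities are invented, the function names are hypothetical, and this version of top-p keeps the token that crosses the threshold, which is my understanding of how common implementations behave; the exact order in which an app applies these filters is also app specific.

```python
def renormalize(probs):
    """Rescale probabilities so they sum to 1."""
    total = sum(probs.values())
    return {token: p / total for token, p in probs.items()}

def top_k_filter(probs, top_k):
    """Keep only the top_k most likely tokens (0 = disabled)."""
    if top_k <= 0:
        return dict(probs)
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return renormalize(dict(ranked[:top_k]))

def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their probabilities add up to top_p."""
    if top_p >= 1.0:
        return dict(probs)  # 1.0 = disabled
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    return renormalize(kept)

probs = {"the": 0.60, "a": 0.25, "storm": 0.10, "zebra": 0.04, "qux": 0.01}
print(sorted(top_k_filter(probs, 3)))
print(sorted(top_p_filter(probs, 0.8)))
```

Since both filters shrink the same candidate list, a change to one can hide or expose the effect of the other, which is why adjusting them one at a time matters.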
<B>NOTE - "CORE" Testing with "TEMP":</B>
For an interesting test, set "temp" to 0 ; this will give you the SAME generation for a given prompt each time.
Then adjust a word, phrase, sentence etc in your prompt, and generate again to see the differences.
(you should use a "fresh" chat for each generation)
Keep in mind this will show model operation at its LEAST powerful/creative level and should NOT be used to determine if the model works for your use case(s).
Then test your prompt(s) "at temp" to see the model in action. (5-10 generations recommended)
You can also use "temp=0" to test different quants of the same model to see generation differences. (roughly, minor "BIAS" changes which reflect math changes due to compression/mixture differences between quants).
Another option is testing different models (at temp=0 AND of the same quant) to see how each handles your prompt(s).
Then test "at temp" with your prompt(s) to see the MODELS in action. (5-10 generations recommended)
---
<h2>Section 1b : PENALTY SAMPLERS - ALL APPS:</h2>
---
These samplers "trim" or "prune" output in real time.
The longer the generation, the stronger overall effect but that all depends on "repeat-last-n" setting.
For creative use cases, these samplers can alter prose generation in interesting ways.
Penalty parameters affect both per token and part of OR entire generation (depending on settings / output length).
CLASS 4: For these models it is important to activate / set all samplers as noted for maximum quality and control.
<B>PRIMARY:</B>
<B>repeat-last-n</B>
last n tokens to consider for penalize (default: 64, 0 = disabled, -1 = ctx_size)
("repetition_penalty_range" in oobabooga/text-generation-webui , "rp_range" in kobold)
THIS IS CRITICAL.
If set too high, you can get all kinds of issues (repeat words, sentences, paragraphs or "gibberish"), especially with class 3 or 4 models.
Likewise if you change this parameter it will drastically alter the output.
This setting also works in conjunction with all other "rep pens" below.
This parameter is the "RANGE" of tokens looked at for the samplers directly below.
BACKYARD APP:
In "Backyard" app, "repetition_penalty_range" is called "Repeat Penalty Tokens" (set on the "character card").
For class 3/4 models (if using with Backyard app), set this to 64 OR LESS.
<B>SECONDARIES:</B>
<B>repeat-penalty</B>
penalize repeat sequence of tokens (default: 1.0, 1.0 = disabled)
(commonly called "rep pen")
Generally this is set from 1.0 to 1.15 ; smallest increments are best IE: 1.01... 1.02 or even 1.001... 1.002.
This affects creativity of the model over all, not just how words are penalized.
<B>presence-penalty</B>
repeat alpha presence penalty (default: 0.0, 0.0 = disabled)
Generally leave this at zero IF repeat-last-n is 512-1024 or less. You may want to use this for higher repeat-last-n settings.
CLASS 3: 0.05 to .2 may assist generation BUT SET "repeat-last-n" to 512 or less. Better is 128 or 64.
CLASS 4: 0.1 to 0.35 may assist generation BUT SET "repeat-last-n" to 64.
<B>frequency-penalty</B>
repeat alpha frequency penalty (default: 0.0, 0.0 = disabled)
Generally leave this at zero IF repeat-last-n is 512 or less. You may want to use this for higher repeat-last-n settings.
CLASS 3: 0.25 may assist generation BUT SET "repeat-last-n" to 512 or less. Better is 128 or 64.
CLASS 4: 0.4 to 0.8 may assist generation BUT SET "repeat-last-n" to 64.
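To show how "repeat-last-n" acts as the RANGE for the three penalties, here is a sketch modelled on my understanding of how llama.cpp applies them: repeat-penalty divides positive logits (and multiplies negative ones), while frequency and presence penalties are subtracted. Treat the exact formula as an assumption, since other apps may differ; the logits, tokens and the function name "apply_penalties" are invented for illustration.

```python
from collections import Counter

def apply_penalties(logits, history, repeat_last_n=64, repeat_penalty=1.0,
                    presence_penalty=0.0, frequency_penalty=0.0):
    """Penalize tokens that appear in the last repeat_last_n tokens of history."""
    if repeat_last_n == 0:
        return dict(logits)  # 0 = disabled
    # A negative value is treated here as "look at the whole history".
    window = history if repeat_last_n < 0 else history[-repeat_last_n:]
    counts = Counter(window)
    adjusted = dict(logits)
    for token, count in counts.items():
        if token not in adjusted:
            continue
        value = adjusted[token]
        # repeat-penalty: push the logit toward "less likely" in both sign cases.
        value = value / repeat_penalty if value > 0 else value * repeat_penalty
        # frequency scales with the count; presence is a flat one-time cost.
        value -= count * frequency_penalty + presence_penalty
        adjusted[token] = value
    return adjusted

logits = {"dark": 3.0, "night": 2.0, "sky": -1.0}
history = ["dark", "dark", "sky"]
adjusted = apply_penalties(logits, history, repeat_last_n=64, repeat_penalty=1.1,
                           presence_penalty=0.1, frequency_penalty=0.2)
print(adjusted)
```

A token that falls outside the window is not penalized at all, which is why changing "repeat-last-n" changes the behaviour of every penalty at once.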
<B>penalize-nl </B>
penalize newline tokens (default: false)
Generally this is not used.
---
<h2>Section 1c : SECONDARY SAMPLERS / FILTERS - ALL APPS:</h2>
---
In some AI/LLM apps, these may only be available via JSON file modification and/or API.
For "text-gen-webui", "Koboldcpp" these are directly accessible ; other programs/app this varies.
Sillytavern:
If the app that Silly Tavern is connected to (via API) supports these parameters/samplers, then you can access them via Silly Tavern's parameter/sampler panel. So if you are using Text-Gen-Webui, Koboldcpp, LMStudio, Llamacpp, Ollama (etc) you can set/change/access all or most of these.
<B>i) OVERALL GENERATION CHANGES (affect per token as well as over all generation):</B>
<B>mirostat</B>
Use Mirostat sampling. "Top K", "Nucleus", "Tail Free" (TFS) and "Locally Typical" (TYPICAL) samplers are ignored if used. (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)
Paper: https://arxiv.org/abs/2007.14966
CRITICAL:
If you activate Mirostat when using "LLAMAcpp SERVER" and/or some LLAMA_CPP based apps this will VOID/DISABLE all parameters (excluding "penalties", "logit_bias" ) AND all other SAMPLERS except "temp" parameter plus the following:
V1: n_vocab(model) (this is set internally by llamacpp), seed, mirostat_tau, mirostat_eta
V2: seed, mirostat_tau, mirostat_eta
For Koboldcpp:
"DRY" sampling is NOT blocked, and a version of "top_k" (3000) is used. Mirostat does NOT block "Anti-Slop", BUT it does block the "penalties" parameters (unlike Llamacpp, which does not).
For Text Generation UI:
No blocking occurs. Note that ONLY Mirostat 2 is available. (other parameters/samplers should work without issue)
Note this is subject to change by LLAMAcpp, Koboldcpp, Text Generation UI and other AI/LLM app makers at any time.
("seed" is usually a random value. (default) ; this parameter can be set in some AI/LLM apps to control Mirostat output more closely.)
"mirostat-lr"
Mirostat learning rate, parameter eta (default: 0.1) " mirostat_eta "
mirostat_eta: 0.1 is a good value.
"mirostat-ent"
Mirostat target entropy, parameter tau (default: 5.0) " mirostat_tau "
mirostat_tau: 5-8 is a good value.
Activates the Mirostat sampling technique. It aims to control perplexity during sampling. See the paper. ( https://arxiv.org/abs/2007.14966 )
This is the big one ; activating this will help with creative generation. It can also help with stability. Also note which
samplers are disabled/ignored here, and that "mirostat_eta" is a learning rate.
This is both a sampler (and pruner) and enhancement all in one.
It also has two modes of generation "1" and "2" - test both with 5-10 generations of the same prompt. Make adjustments, and repeat.
CLASS 3 models: it is suggested to use this to assist with generation (min settings).
CLASS 4 models: it is highly recommended to use Mirostat 1 or 2 with mirostat_tau at 6 to 8 and mirostat_eta at .1 to .5
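To make the roles of "mirostat_tau" (target) and "mirostat_eta" (learning rate) more concrete, here is a minimal sketch of one Mirostat 2.0 step as I understand it from the paper. The function name and the dict-based token table are made up for illustration; real apps work on the full vocabulary and may differ in details.

```python
import math
import random

def mirostat_v2_step(probs, mu, tau=5.0, eta=0.1, rng=random):
    """One Mirostat 2.0 style step over a token -> probability dict.

    Returns (chosen_token, updated_mu). Illustrative sketch only.
    """
    # Surprise of a token is -log2(p); drop tokens more surprising than mu.
    kept = {t: p for t, p in probs.items() if p > 0 and -math.log2(p) <= mu}
    if not kept:
        # Always keep at least the most likely token.
        best = max(probs, key=probs.get)
        kept = {best: probs[best]}

    # Renormalise the survivors and sample one of them.
    total = sum(kept.values())
    tokens = list(kept)
    weights = [kept[t] / total for t in tokens]
    choice = rng.choices(tokens, weights=weights, k=1)[0]

    # Feedback: nudge mu toward the target surprise tau at learning rate eta.
    observed = -math.log2(kept[choice] / total)
    mu = mu - eta * (observed - tau)
    return choice, mu
```

A higher "mirostat_tau" lets more surprising tokens through (more creative), while "mirostat_eta" controls how fast the cutoff reacts. A common starting point for mu is 2 x tau, but treat that as an assumption and check your app.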
<b>Dynamic Temperature</b>
"dynatemp-range "
dynamic temperature range (default: 0.0, 0.0 = disabled)
"dynatemp-exp"
dynamic temperature exponent (default: 1.0)
In: oobabooga/text-generation-webui (has on/off, and high / low) :
Activates Dynamic Temperature. This modifies temperature to range between "dynatemp_low" (minimum) and "dynatemp_high" (maximum), with an entropy-based scaling. The steepness of the curve is controlled by "dynatemp_exponent".
This allows the model to CHANGE temp during generation. This can greatly affect creativity, dialog, and other contrasts.
For Koboldcpp a converter is available and in oobabooga/text-generation-webui you just enter low/high/exp.
CLASS 4 only: It is suggested that this is on, with a low/high of .8 to 1.8 (note the range here of "1" between high and low), with the exponent set to 1 (however, values below or above this work too).
To set this manually (IE: API, LMStudio, Llamacpp, etc.) using "range" and "exp" is a bit more tricky (the example sets the range from .8 to 1.8):
1 - Set the "temp" to 1.3 (the regular temp parameter)
2 - Set the "range" to .500 (this gives you ".8" to "1.8" with "1.3" as the "base")
3 - Set exp to 1 (or as you want).
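The three steps above boil down to simple arithmetic. A small sketch (the function names are mine, not from any app) that converts a low/high pair into "temp" + "range", plus an assumed version of the entropy-based scaling so you can see what "exp" does:

```python
def dynatemp_from_low_high(low, high):
    """Convert a low/high temperature pair into (temp, dynatemp_range)."""
    if high < low:
        raise ValueError("high must be >= low")
    temp = (low + high) / 2            # base temperature sits in the middle
    dynatemp_range = (high - low) / 2  # reach on either side of the base
    return temp, dynatemp_range

def dynamic_temperature(norm_entropy, temp, dynatemp_range, exponent=1.0):
    """Assumed scaling: map normalised entropy (0..1) into the low..high span."""
    low = max(0.0, temp - dynatemp_range)
    high = temp + dynatemp_range
    return low + (high - low) * (norm_entropy ** exponent)

temp, dyn_range = dynatemp_from_low_high(0.8, 1.8)
print(temp, dyn_range)
print(dynamic_temperature(0.5, temp, dyn_range, exponent=1.0))
```

With an exponent of 1 the temperature moves linearly between low and high; an exponent above 1 keeps the temperature lower until the model is quite uncertain, and below 1 does the opposite.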
This is both an enhancement and in some ways fixes issues in a model when too little temp (or too much/too much of the same) affects generation.
<B> ii) PER TOKEN CHANGES:</B>
<B>tfs</B>
Tail free sampling, parameter z (default: 1.0, 1.0 = disabled)
Tries to detect a tail of low-probability tokens in the distribution and removes those tokens. The closer to 0, the more discarded tokens.
( https://www.trentonbricken.com/Tail-Free-Sampling/ )
<B>typical</B>
Locally typical sampling, parameter p (default: 1.0, 1.0 = disabled)
If not set to 1, select only tokens that are at least this much more likely to appear than random tokens, given the prior text.
<B> XTC</B>
"xtc-probability"
xtc probability (default: 0.0, 0.0 = disabled)
Probability that the removal will actually happen. 0 disables the sampler. 1 makes it always happen.
"xtc-threshold"
xtc threshold (default: 0.1, 1.0 = disabled)
If 2 or more tokens have probability above this threshold, consider removing all but the last one.
XTC is a new sampler that adds an interesting twist to generation.
I suggest you experiment with this one with other advanced samplers disabled, to see its effects.
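To show what "probability" and "threshold" do together, here is a rough sketch of an XTC-style pass over a small token table. The helper name and the example tokens are hypothetical, and whether the comparison is "above" or "at or above" the threshold is an assumption on my part.

```python
import random

def xtc_filter(probs, threshold=0.1, probability=0.5, rng=random):
    """Return the token -> probability dict after an XTC-style pass."""
    # With chance (1 - probability) the sampler does nothing on this step.
    if rng.random() >= probability:
        return dict(probs)
    # Tokens at or above the threshold, most likely first (assumed comparison).
    above = sorted((t for t, p in probs.items() if p >= threshold),
                   key=probs.get, reverse=True)
    if len(above) < 2:
        return dict(probs)
    # Remove every above-threshold token except the least likely one.
    removed = set(above[:-1])
    kept = {t: p for t, p in probs.items() if t not in removed}
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}

example = {"the": 0.5, "a": 0.3, "an": 0.15, "zebra": 0.05}
print(xtc_filter(example, threshold=0.1, probability=1.0))
```

Note that the "top choices" get cut, not the tail, which is why this sampler pushes the model toward less obvious word choices.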
<B>-l, --logit-bias TOKEN_ID(+/-)BIAS</B>
modifies the likelihood of token appearing in the completion,
i.e. `--logit-bias 15043+1` to increase likelihood of token ' Hello', or `--logit-bias 15043-1` to decrease likelihood of token ' Hello'
This may or may not be available. This requires a bit more work.
Note: the +/- bias range is 0 to 100.
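Under the hood a logit bias is just a number added to a token's raw score before the probabilities are computed. A minimal sketch of that idea (token ID 15043 is the ' Hello' example from above; the other IDs and all logit values are made up):

```python
import math

def softmax(logits):
    """Turn a token_id -> logit dict into token_id -> probability."""
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}

def apply_logit_bias(logits, bias):
    """Add a per-token bias to raw logits; tokens not listed are untouched."""
    return {t: v + bias.get(t, 0.0) for t, v in logits.items()}

logits = {15043: 2.0, 3186: 2.0, 29871: 1.0}  # illustrative values only
before = softmax(logits)
after = softmax(apply_logit_bias(logits, {15043: 1.0}))
print(before[15043], after[15043])
```

Because the bias is applied before softmax, a small value like +1 or -1 is a nudge, while a very large negative value effectively bans the token.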
IN "oobabooga/text-generation-webui" there is "TOKEN BANNING":
This is a very powerful pruning method; which can drastically alter output generation.
I suggest you collect some "bad outputs", look up the "tokens" (the actual ID number for the "word" / part word), then use this.
Careful testing is required, as this can have unclear side effects.
---
<h2>SECTION 2: ADVANCED SAMPLERS - "text-generation-webui" / "KOBOLDCPP" / "SillyTavern" (see note 1 below): </h2>
<B>Additional Parameters / Samplers, including "DRY", "QUADRATIC" and "ANTI-SLOP".</B>
---
Note #1 :
You can use these samplers via Sillytavern IF you connect Sillytavern to the API of an app that supports them (Koboldcpp, Text Generation Webui, or another app with support).
Other Notes:
Hopefully ALL these samplers / controls will be added to LLAMACPP and become available to all users via AI/LLM apps soon.
"DRY" sampler has been added to Llamacpp as of the time of this writing (and available via SERVER/LLAMA-SERVER.EXE) and MAY appear in other "downstream" apps that use Llamacpp.
INFORMATION ON THESE SAMPLERS:
For more info on what they do / how they affect generation see:
https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab
(also see the "Additional Links" section for more info on the parameters/samplers)
ADVANCED SAMPLERS - PART 1:
Keep in mind these parameters/samplers become available (for GGUFs) in "oobabooga/text-generation-webui" when you use the llamacpp_HF loader.
Most of these are also available in KOBOLDCPP too (via settings -> samplers) after start up (no "llamacpp_HF loader" step required).
I am not going to touch on all of the samplers / parameters, just the main ones at the moment.
However, you should also check / test operation of (these are in Text Generation WebUI, and may be available via API / In Sillytavern (when connected to Text Generation Webui)):
a] Affects per token generation:
- top_a
- epsilon_cutoff - see note #4
- eta_cutoff - see note #4
- no_repeat_ngram_size - see note #1.
b] Affects generation including phrase, sentence, paragraph and entire generation:
- no_repeat_ngram_size - see note #1.
- encoder_repetition_penalty "Hallucinations filter" - see note #2.
- guidance_scale (with "Negative prompt" ) => this is like a pre-prompt/system role prompt - see note #3.
- Disabling the BOS TOKEN: this can make the replies more creative.
- Custom stopping strings
Note 1:
"no_repeat_ngram_size" appears in both because it can impact per token OR per phrase depending on settings. This can also drastically affect sentence,
paragraph and general flow of the output.
Note 2:
This parameter, if set to LESS than 1, causes the model to "jump" around a lot more, whereas above 1 causes the model to focus more on the immediate surroundings.
If the model is crafting a "scene", a setting of less than 1 causes the model to jump around the room, outside, etc.; if more than 1 then it focuses the model more on
the moment, the immediate surroundings, the POV character and details in the setting.
Note 3:
This is a powerful method to send instructions / directives to the model on how to process your prompt(s) each time. See [ https://arxiv.org/pdf/2306.17806 ]
Note 4:
These control selection of tokens, in some cases providing more relevant and/or more options. See [ https://arxiv.org/pdf/2210.15191 ]
<B>MAIN ADVANCED SAMPLERS PART 2 (affects per token AND overall generation): </B>
What I will touch on here are special settings for CLASS 3 and CLASS 4 models (for the first TWO samplers).
For CLASS 3 you can use one or both.
For CLASS 4 using BOTH is strongly recommended, or at minimum "QUADRATIC SAMPLING".
These samplers (along with "penalty" settings) work in conjunction to "wrangle" the model / control it and get it to settle down, important for Class 3 but critical for Class 4 models.
For other classes of models, these advanced samplers can enhance operation across the board.
For Class 3 and Class 4 the goal is to use the LOWEST settings that keep the model in line rather than "over prune it".
You may therefore want to experiment with dropping the settings (SLOWLY) for Class 3/4 models from those suggested below.
<B>DRY:</B>
Dry ("Don't Repeat Yourself") affects repetition (and repeat "penalty") at the word, phrase, sentence and even paragraph level. Read more about "DRY" in the "Additional Links" section.
Class 3:
dry_multiplier: .8
dry_allowed_length: 2
dry_base: 1
Class 4:
dry_multiplier: .8 to 1.12+
dry_allowed_length: 2 (or less)
dry_base: 1.15 to 1.75+
Dial the "dry_multiplier" up or down to "rein in" or "release the madness" so to speak from the core model.
For Class 4 models this is used to control some of the model's bad habit(s).
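To see how the three DRY settings interact, here is a simplified sketch based on my reading of the DRY pull request linked below: a token that would extend a repeat gets a penalty of multiplier * base ^ (match length - allowed length). The function name is hypothetical, and real implementations also handle "sequence breakers", which are left out here.

```python
def dry_penalties(context, multiplier=0.8, base=1.75, allowed_length=2):
    """Map candidate next tokens to the penalty subtracted from their logits."""
    penalties = {}
    n = len(context)
    for i in range(n - 1):
        # How many tokens ending at position i match the end of the context?
        match = 0
        while match <= i and context[i - match] == context[n - 1 - match]:
            match += 1
        if match >= allowed_length:
            candidate = context[i + 1]  # token that followed the earlier copy
            penalty = multiplier * base ** (match - allowed_length)
            penalties[candidate] = max(penalties.get(candidate, 0.0), penalty)
    return penalties

# "a b c x a b": generating "c" next would repeat the earlier "a b c".
print(dry_penalties(["a", "b", "c", "x", "a", "b"]))
```

Because the penalty grows exponentially with the length of the repeat, "dry_base" controls how hard long repeats get hit, while "dry_multiplier" scales everything. Note that with "dry_base" at 1 the penalty is the same no matter how long the repeat is.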
For more information on "DRY":
https://github.com/oobabooga/text-generation-webui/pull/5677
https://www.reddit.com/r/KoboldAI/comments/1e49vpt/dry_sampler_questionsthat_im_sure_most_of_us_are/
https://www.reddit.com/r/KoboldAI/comments/1eo4r6q/dry_settings_questions/
<B>QUADRATIC SAMPLING: AKA "Smoothing"</B>
This sampler alters the "score" of ALL TOKENS at the time of generation and as a result affects the entire generation of the model. See the "Additional Links" section for more information.
Class 3:
smoothing_factor: 1 to 3
smoothing_curve: 1
Class 4:
smoothing_factor: 3 to 5 (or higher)
smoothing_curve: 1.5 to 2.
Dial the "smoothing factor" up or down to "rein in" or "release the madness" so to speak.
In Class 3 models, this has the effect of modifying the prose closer to "normal" with as much or little (or a lot!) touch of "madness" from the root model.
In Class 4 models, this has the effect of modifying the prose closer to "normal" with as much or little (or a lot!) touch of "madness" from the root model AND wrangling in some of the core model's bad habits.
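For a feel of what "smoothing_factor" does to the token scores, here is a bare-bones sketch of the quadratic transform as I understand it from the write-up linked below. It only covers the plain quadratic case (I am assuming "smoothing_curve" at 1 behaves this way), and the function name is my own.

```python
def quadratic_smoothing(logits, smoothing_factor=1.0):
    """Bend logits around the top logit with a quadratic curve."""
    top = max(logits)
    # Each score drops by factor * (distance from the top score) squared.
    return [top - smoothing_factor * (x - top) ** 2 for x in logits]

scores = [2.0, 1.5, 1.0, 0.0]
print(quadratic_smoothing(scores, smoothing_factor=1.0))
print(quadratic_smoothing(scores, smoothing_factor=3.0))
```

The top token never moves. Tokens close to the top are pulled closer (at low factors), while tokens far from the top are pushed down hard, and raising the factor tightens the "leash" on the model.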
For more information on Quadratic Sampling:
https://gist.github.com/kalomaze/4473f3f975ff5e5fade06e632498f73e
<B>ANTI-SLOP - Koboldcpp only</B>
Hopefully this powerful sampler will soon appear in all LLM/AI apps.
You can access this in the KoboldCPP app, under "context" -> "tokens" on the main page of the app after start up.
You can also access it in SillyTavern if you use KoboldCPP as your "API" connected app.
This sampler allows banning words and phrases DURING generation, forcing the model to "make another choice".
This is a game changer in custom real time control of the model.
For more information on ANTI SLOP project (owner runs EQBench):
https://github.com/sam-paech/antislop-sampler
FINAL NOTES:
Keep in mind that these settings/samplers work in conjunction with "penalties" ; which is especially important
for operation of CLASS 4 models for chat / role play and/or "smoother operation".
For Class 3 models, "QUADRATIC" will have a slightly stronger effect than "DRY" relatively speaking.
If you use Mirostat sampler, keep in mind this will interact with these two advanced samplers too.
And...
Smaller quants may require STRONGER settings (all classes of models) due to compression damage, especially for Q2K, and IQ1/IQ2s.
This is also influenced by the parameter size of the model in relation to the quant size.
IE: an 8B model at Q2K will be far more unstable relative to a 20B model at Q2K, and as a result will require stronger settings.
---
<h2>DETAILED NOTES ON PARAMETERS, SAMPLERS and ADVANCED SAMPLERS:</h2>
---
Most AI / LLM apps allow saving a "profile" of parameters and samplers - your "favorite" settings.
Text Generation Web Ui, Koboldcpp and Silly Tavern all have this feature, and also "presets" (parameters/samplers already set).
Other AI/LLM apps also have this feature to varying degrees.
DETAILS on PARAMETERS / SAMPLERS:
For additional details on these samplers settings (including advanced ones) you may also want to check out:
https://github.com/oobabooga/text-generation-webui/wiki/03-%E2%80%90-Parameters-Tab
(NOTE: Not all of these "options" are available for GGUFS, including when you use "llamacpp_HF" loader in "text-generation-webui" )
Additional Links (on parameters, samplers and advanced samplers):
A Visual Guide of some top parameters / Samplers in action which you can play with and see how they interact:
https://artefact2.github.io/llm-sampling/index.xhtml
General Parameters:
https://arxiv.org/html/2408.13586v1
https://www.reddit.com/r/LocalLLaMA/comments/17vonjo/your_settings_are_probably_hurting_your_model_why/
The Local LLM Settings Guide/Rant (covers a lot of parameters/samplers - lots of detail)
https://rentry.org/llm-settings
LLAMACPP-SERVER EXE - usage / parameters / samplers:
https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
DRY
- https://github.com/oobabooga/text-generation-webui/pull/5677
- https://www.reddit.com/r/KoboldAI/comments/1e49vpt/dry_sampler_questionsthat_im_sure_most_of_us_are/
- https://www.reddit.com/r/KoboldAI/comments/1eo4r6q/dry_settings_questions/
Samplers:
https://gist.github.com/kalomaze/4473f3f975ff5e5fade06e632498f73e
https://huggingface.co/LWDCLS/LLM-Discussions/discussions/2
https://huggingface.co/Virt-io/SillyTavern-Presets
Creative Writing :
https://www.reddit.com/r/LocalLLaMA/comments/1c36ieb/comparing_sampling_techniques_for_creative/
Benchmarking-and-Guiding-Adaptive-Sampling-Decoding
https://github.com/ZhouYuxuanYX/Benchmarking-and-Guiding-Adaptive-Sampling-Decoding-for-LLMs
NOTE:
I have also added notes in the sections below for almost all parameters, samplers, and advanced samplers.
OTHER:
Depending on the AI/LLM "apps" you are using, additional reference material for parameters / samplers may also exist.
---
<h2>ADVANCED: HOW TO TEST EACH PARAMETER(s), SAMPLER(s) and ADVANCED SAMPLER(s)</h2>
---
1 - Set temp to 0 (zero) and set your basic parameters, and use a prompt to get a "default" generation. A creative prompt will work better here.
2 - If you want to test basic parameter changes, test ONE at a time, then compare output (answer quality, word choice, sentence size/construction, general output qualities) to your "default" generation.
3 - Then start testing TWO parameters at a time, and comparing again. Keep in mind parameters (all) interact with each other.
4 - Samplers -> Reset your basic parameters, (temp still at zero) and test each one of these, one at a time. Then adjust settings, test again.
5 - Once you have an "idea" of how each affects your "test prompt", now test at "temp" (not zero). It may take five to ten generations to get a rough idea.
Yes, testing is a lot of work - but once you get all the parameter(s) and/or sampler(s) dialed in - it is worth it.
IMPORTANT: Use a "fresh chat" PER TEST (you will contaminate the results otherwise). Never use the same chat for multiple tests -> exception: Regens.
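If you want to keep the testing steps above organized, a short script can lay out the runs for you. This is only a planning sketch: the function name is hypothetical, the parameter values are examples, and you still run each entry yourself in a fresh chat in your app of choice.

```python
from itertools import combinations

def build_test_plan(baseline, candidates):
    """Return a list of (label, settings) runs: baseline, singles, then pairs."""
    plan = [("baseline", dict(baseline))]
    # Step 2: change ONE parameter at a time against the baseline.
    for name, value in candidates.items():
        plan.append((name, {**baseline, name: value}))
    # Step 3: then change TWO parameters at a time.
    for (n1, v1), (n2, v2) in combinations(candidates.items(), 2):
        plan.append((f"{n1}+{n2}", {**baseline, n1: v1, n2: v2}))
    return plan

# Example values only; temp stays at 0 so runs are repeatable.
baseline = {"temp": 0.0, "top_k": 40, "top_p": 0.95, "repeat_penalty": 1.0}
candidates = {"top_k": 100, "top_p": 0.8, "repeat_penalty": 1.1}
for label, settings in build_test_plan(baseline, candidates):
    print(label, settings)
```

Three candidate changes already give seven runs (one baseline, three singles, three pairs), which is why it pays to test only a few parameters at a time.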
Keep in mind that parameters, samplers and advanced samplers can affect the model on a per token generation basis AND/OR on a multi-token / phrase / sentence / paragraph
and even complete generation basis.
Everything is cumulative here, regardless of whether the parameter/sampler works on a per token or multi-token basis, because of how models "look back" to see what was generated in some cases.
And of course... each model will be different too.
All that being said, it is a good idea to have specific generation quality "goals" in mind.
Likewise, at my repo, I post example generations so you can get an idea (but not complete picture) of a model's generation abilities.
The best way to control generation is STILL with your prompt(s) - including pre-prompts/system role. The latest gen models (and archs) have very strong
instruction following so many times better (or just included!) instructions in your prompts can make a world of difference.
Not sure if the model understands your prompt(s)?
Ask it ->
"Check my prompt below and tell me how to make it clearer." (prompt after this line)
"For my prompt below, explain the steps you would take to execute it." (prompt after this line)
This will help the model fine tune your prompt so IT understands it.
However, sometimes parameters and/or samplers are required to better "wrangle" the model and get it to perform to its maximum potential and/or fine tune it to your use case(s).