Very smart and creative in my test

#7
by tm679 - opened

I found a scenario where many large models somehow fail, including the mighty gpt-4o.
The scenario is actually pretty classic/tropey, a simplified summary:
AI's character (silly tavern default card Seraphina) is staying in an abandoned house with her fugitive friend (played by user) for the night. AI's scream from a nightmare draws a knight (also user). The fugitive friend barely has time to hide under the bed before the knight charges in. And now the knight is concerned about Seraphina's safety and asks to stand guard in the house.
Basically, the AI needs to persuade the knight to leave with a passable excuse, or distract him in a way that lets her friend escape without being seen.

I did at least 3 swipes at every important moment for every model. This model gave some responses that don't make sense, but it also gave me multiple ways to resolve this somewhat reasonably, and it mostly doesn't let anything slip (many models fail this).

Some interesting observations/feedback:

It insists that the character is sleeping without any clothes, even after I specifically edited in "she puts on her dress" in earlier messages. I did 2 swipes for this before editing its response because I didn't want it to affect later decisions and actions.

This is the only model that pulls the classic nsfw trick of distraction. I followed up to see if it stayed in character. It spared a few words to try but overall it was like a switch was flipped and the character changed dramatically, which is surprising because it didn't do this even in a full-on somewhat dark nsfw scenario earlier.

I did 6 swipes at the critical moment because it's a small and fast model. All 6 responses are good except the 1st one, which is the nsfw one, and it covered all the passable excuses I've seen from a dozen or so larger models, plus a funny one where she just throws a pillow at the knight and shouts get out.

It has an impersonation problem: in the funny response, for example, it just writes for me, including the whole input format, and says the knight is stunned and pushed out the door. The details are convincing and entertaining so I considered this passable.

Model page says Alpaca gives longer output than Mistral but it's the opposite for me. I tested with default Alpaca in silly tavern, [Alpaca-Instruct]Roleplay-v1.9.json and [Mistral-Instruct]Roleplay-v1.9.json from Virt-io.

I tried the class3-Silly-Tavern.jpg settings here but rep pen needs to be 1.09, plus smoothing factor, especially in nsfw.

Overall, incredible model. I didn't expect a 23B model would perform this well in such a semi-intelligence test.

Thanks - that is incredible feedback.

Curious if you have tried the 23.5B version?
or the 12B compacted versions of "Madness" and "Darkness" ?

Model will "lose power" with the 12Bs ; but the 23.5B is very potent (may need adjustments for rep pen + smooth - a bit higher than 23B).
Likewise, people have remarked the 12Bs are much more role play friendly / behave better and all around more stable (less rep pen / smoothing - if at all)

For a very different model: Darkest Planet 16.5B - it has unusual regen / creativity levels ; but it will have a darker bias than the other models noted above.

Thank you again for the detailed feedback ; this will help with future model creation.

I haven't tried the 12B or 23.5B. Initially I wasn't planning on testing small models due to the context being a little large and mixed with different styles/genre, but this 23B was a total surprise.
If my understanding is right, this is a tamer version of the 23.5B so they should perform similarly in my scenario?
As for Darkest Planet 16.5B, should I try that or MN-DARKEST-UNIVERSE-29B? Assuming q4 or q6 for both, my understanding is they're similar but the larger one should perform better?
Can't bring myself to run the same scenario for the 12th time right now but I'll try as many of the models you mentioned as I can in a few days.

Some more info:

The exact file and settings I used:
MN-GRAND-Gutenburg-Lyra4-Lyra-23B-V2-D_AU-Q6_k.gguf
Context Template/story string: [Alpaca]Roleplay-v1.9.json from Virt-io
Instruct Template: sillytavern's Alpaca because it mostly fits the description on Model page and I failed to write a proper json for importing into sillytavern. This may have led to the "impersonation" problem, details below.
Sys prompt: [Alpaca]Roleplay-v1.9.json from Virt-io
Class 3 settings except:
Unlocked 32k context, Response(tokens) at 800 instead of 1000, rep pen 1.09, freq pen 0.3, presence pen 0.25, a bunch of Sequence Breakers for DRY (it shouldn't matter because DRY seems to be working great, no literal repetition longer than 2 words),
I also disable "Always add character's name to prompt" under Context Template/story string, if it matters,
The test scenario starts at 15k context. 11k chat history (~250 total messages), the rest is character card, world info and 1.5k summary for earlier part of the campaign that I've excluded.

I just found that I can more reliably make it not mention any nakedness if I put "she's dressed" in my prompt instead of editing AI's previous response.
Maybe the model values facts in "Instructions" far more than "Response".

I realized that it wasn't really impersonating me, but impersonating a system message, as it used the sys prefix "Input".
This may be related to the Instruct Template I used as the model page's Alpaca template doesn't have a "### Input".

On repetition, at class 3 settings (1.05 rep pen), one of the worse examples:
within 5 responses, 2 "warring with X and horror" and 1 "warring with her mind" plus 3 "traitorous" including 2 "traitorous body".
There're other words it loves to use like "brokenly" but they're more spread out.
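
The repetition described above can be counted instead of eyeballed. A minimal sketch, assuming the swipes are available as plain strings; `repeated_ngrams` and the sample texts are hypothetical illustrations, not part of SillyTavern or any backend:

```python
import re
from collections import Counter


def repeated_ngrams(responses, n=2, min_count=2):
    """Count word n-grams across all responses; keep those seen at least min_count times."""
    counts = Counter()
    for text in responses:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Made-up sample responses, for illustration only
swipes = [
    "Her traitorous body froze, desire warring with horror.",
    "She cursed her traitorous body and looked away.",
    "Fear kept warring with her mind.",
]
repeats = repeated_ngrams(swipes)
print(repeats)
```

Raising `n` to 3 or 4 narrows the result to longer stock phrases, which is closer to what DRY is meant to catch.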

This model's take on characters is generally different / more extreme than most models, e.g. in all 3 swipes it threatened to harm a grieving old soldier in order to protect her friend while most models tried hard (maybe too hard) to reason with him. The character card says compassionate and gentle but also protective and dedicated, also the context and mental state is complicated, so it's not necessarily out of character. I probably need to test more later.

First ; excellent information - for settings / performance - going to add these to my toolbox.

To clarify:

  • The 12Bs are compressed versions of 23.5B ; you lose some creativity/nuance and "character", but gain some on instruction following.
  • 23.5B VS 23B => 23B is the tamer version, 23.5B has more character / creativity.
  • Darkest Planet 16.5B has many unusual traits (relative to not only my models, but other models too), and highly creative.
  • Darkest Universe 29B is a stronger version of 23.5B, but unique in both construction and qualities - an upscale in nuance / creativity too.

RE: Quants
Sometimes the most "interesting" quants are IQ4XS, Q5_KM and Q4_KM. ("K_S" tend to be more balanced, but also "flatter")

I'm inclined to agree that q4 seems more interesting or creative, though I only tested once for a model comparing q4 and q6.

I have some questions about DARKEST UNIVERSE from a quick and dirty 1-message test I did for darker bias.
The scenario is about asking for consent: 23k context, 5 swipes, going ahead without asking counts as 1 point (but either can be seen as in character).
Gutenberg 23B V2 q6:
2.5/5, function as expected, could be interpreted as an ideal score depending on personal preference
DARKEST UNIVERSE q4:
3.5/5, same settings as 23B except rep pen 1.1 per model page. Unlike 23B, it's weirdly unstable and often fails to follow the format of previous messages, like putting actions between *. Mistral & ChatML are over 50% incoherent, so I had to test with Alpaca, which sometimes needed me to spam continue for a reasonable-length response. Is this because of context size? I've heard Mistral Nemo, which this model is based on, isn't good at handling 16K+ context even though it claims 128K. Or is it b/c of q4? I only saw the model page recommend q5/6/8 after I'd already downloaded q4.
For reference, 2 models that I consider as mostly neutral or tolerably positive
Euryale1.2 q4: 1/3 (large model, only did 3 swipes)
EVA32b q4: 1/5

Note for Darkest Universe ; it is a class3/4 model.

Additional settings are discussed here:
https://huggingface.co/DavidAU/Maximizing-Model-Performance-All-Quants-Types-And-Full-Precision-by-Samplers_Parameters

Settings (Class3/4) for DU will drastically alter / curtail a number of issues as DU is a creative first model. These Class3/4 settings will allow use of it for RP, chat etc.

Likewise 23.5B and 23B are Class 3 models ; with additional specific RP/multi-turn settings on the same page.
These settings optimize performance for different use cases.

Darkest Universe
Class 4 settings except unlocked 32k context, Context Shift disabled
ChatML
improved but Alpaca is still better, response is long but contains logical errors like remembering things wrong or mixing past and current events, in one response it impersonates and fills the entire screen repeating 1 sentence

Alpaca
1/5, 1 impersonation, surprisingly mild result so I tried again
Alpaca 2nd try
3/5 (I guess this shows the test has some RNG lol, but really any non-extreme result is OK), still 1 impersonation, though 3 of all responses seem a little incoherent as if some actions are rewound,

Alpaca but enable auto-continue (aiming for ~100 tokens), enable "Always add character's name to prompt" (I suspect this helps with impersonation)
3?/5, 1 response used a wrong name, 1 response got a serious logical error after I manually continued it past ~50 tokens.
I think it's just the longer its response the more likely it gets confused, at least in this test. Overall still less stable than 23B v2 IMO, plus some format following problem.
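
On the RNG point above: with only 5 swipes, the spread from sampling alone is large. A back-of-envelope sketch, under the simplifying assumption that each swipe is an independent pass/fail trial (half points are ignored); `swipe_score_spread` is a hypothetical helper name:

```python
import math


def swipe_score_spread(pass_rate, swipes):
    """Standard deviation of the number of passing swipes under a binomial model."""
    return math.sqrt(swipes * pass_rate * (1 - pass_rate))


# If the true pass rate were 50%, 5 swipes would average 2.5 passes
spread = swipe_score_spread(0.5, 5)
print(f"expected 2.5/5, one standard deviation is about {spread:.2f} swipes")
```

Under that assumption, a 1/5 and a 3/5 from the same model and settings are both unsurprising, so small score gaps between models in these tests should be read loosely.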

I quickly grabbed 12B Madness q4km and 23B v2 q4km for a sanity check:

12B Madness q4km class1-2 settings:
3/5, longer response than 23B v2 q4km, correctly recalling past events, prose also looks good but it's a 1 message test so I can't say, overall very impressive.

12B Madness q4km class1-2 settings plus 1.5 smoothing factor as model page seems to suggest for RP:
0.5/5, could be RNG but 4 of 5 swipes feel similar (the other 1 is unique among all these models tho), maybe >1.5 smoothing is too aggressive for a class1-2 model? most models use 0.2-3.

23B v2 q4km same setting as q6:
3/5, 1 response added a "PART 2:" at the end, not that surprising considering it's geared towards writing.

23B v2 q6:
3.5/5, longer response than q4km, also got a "Response Part 2:" in 1 of the swipes, glad to see there's some reproducibility in my test.

Regarding settings, take 12B madness for example: the gguf model page has a section "Settings: CHAT / ROLEPLAY etc" recommending 1.5-2.5 smoothing or 1.1-1.15 rep pen, but below that, a section "Highest Quality Settings etc" links to the class1-2 settings, which have no smoothing and only 1.05 rep pen. According to this very limited test, simply using the class#.jpg settings and ignoring the model page RP suggestion seems the way to go. Is my understanding correct?

Yes; the document with Class 1-4 was very recently published, and generally the settings noted in this doc replace previous "general" advice on the model repo card.
Still in process of updating model cards at this time to reflect this.

I finally tested the 23B, 23.5B and both 12B models more. DARKEST UNIVERSE is a bit too unstable for me. I'm posting results for all these models here for convenience.

Response tokens max are all set to 250 but it seems to have little to no effect on actual response length, and I manually continue the response if it's cut short.
12B use class1-2 settings with 32k unlocked max context, 23B and 23.5B use the same settings as in OP.

test 1.
the scenario in OP

Gutenburg 23.5B q4
gave a common excuse pretty early on, though it didn't give a variety of excuses like 23B v2 q6.

12b Madness q4
failed to give an excuse but didn't let anything slip in most swipes, which is an OK/average result compared to most 22-72B models.

12b Darkness q4
same as above

test 2.
AI and user are ambushed and surrounded. Both prompt and earlier context show the user can't ride their horse alone. This scenario tests if the model can avoid the "ask user to escape alone" trope when it's obviously not feasible. 6 swipes to see if AI gives at least one passable solution. context 17k

23B v2 q4
0/6

23.5B q4
0/6

12B Mad q4
2.5/6
slight impersonation problem

12B Dark q4
0.9/6
almost got it right but said something illogical and contradicting its own next sentence. Another swipe offered a solution but it's too convenient and kinda contradicts earlier context (which is probably too difficult to realize for any LLM as it's a common problem for human writers too)

test 3.
AI was hugged from behind by someone covered in blood and now a monster is drawn towards the AI. This is a logic/intelligence test to see if AI realizes the reason and offers a solution (like checking if the blood is only on her dress and discarding it). 6 swipes at the critical moment. 20k context.

Gutenberg 23b v2 q4
0.5/6
need a continue, small memory error, but afterwards, apparently forgot what she was doing and ran towards the wrong direction.

Gutenberg 23b v2 q6
0/6
also seems more unstable than q4 and takes some hilariously dumb actions.

Gutenberg 23.5b q4
0/6
mentions blood several times but all from wrong sources/context

12b madness q4
0.5/6
got close in one swipe which somehow decides the horse is covered in blood (which makes sense) but not herself.

12b darkness q4
1/6
the difference between this and 12b madness could be rng but all its responses feel a bit more coherent

test 4.
AI previously lied and told the knight that the fugitive they were hunting was already dead. At the time, she also told the knights the name of her fugitive friend. Now she's asking the knight for help rescuing her friend and accidentally said her friend's real name. The knight then questions why the name of her friend that needs saving is the same as the fugitive she claimed dead. The information related to this situation is spread out in summary which is inserted after story string and before chat log. 22k context.

Gutenberg 23b v2 q4
0/6
One response shows this model knows all the info critical to this situation, but for some reason it just won't lie in this test?

Gutenberg 23b v2 q6
0/6
doesn't even realize the problem

Gutenberg 23.5b q4
0/6
same as above

12b madness q4
1/6
one response said they just share the name and provided a passable excuse in followup questions. But other responses are almost all very confusing.

12b darkness q4
0.5?/6
offered one excuse but it's even worse than the one above. overall more coherent than 12b madness though.

due to the unexpected honesty tendency, I did a sanity check with maidyuzu&eva32b
eva 32b q6
1/6 the same name excuse
maidyuzu v8 q4
reliably pulls the same name excuse

My personal conclusion is that the 12B models seem indeed more suitable for sfw RPG, especially 12B darkness which remains coherent in pretty much every swipe (maybe madness could be improved in this regard if I fiddle with sampling settings some more but darkness performs similarly anyway). The overall intelligence shown in RP scenarios is very impressive.

Thank you for your detailed response, notes and feedback.
Today I am working on Mistral Nemo MOES ; and noticed a critical issue with Mistral Nemos in general.

Normally a "MOE" model can operate at very low quant levels without issue.
Not a Mistral Nemo it seems.

Likewise for a reg 12B mistral Nemo ;

Below Q4 they are barely usable and can break.
Q2k, broken/breaks ... no matter what.
For Imatrix: Iq3XXS seems to be the minimum.

This finding was a little shocking, as the (old) "mistral 7b" models (32k max context) can operate at IQ1_S/M .
Same for the MOEs of this mistral type - 7b. (built one a few days ago)
NOTE: Newer quants only ; some older ones will work here too (once re-quanted, they work => gets newest LLAMAcpp enhancements).

Therefore the minimum quant I would suggest is IQ4XS (imat or non imat) ; with Q5 or Q6 a better choice - FOR ANY 12B Mistral Nemo.
Imatrix suggested too ; will be a little stronger.
I sense the really long context of this arch type is the issue ...;

I will be revising the model cards for all Mistral Nemo models @my repo based on these findings.

Regarding quants, I did 2 quick tests for 12B darkness q6 and 12B madness Q8_0 for comparison. btw all the q4 tested here are Q4_K_M (the Q4KS, Q4KL and IQ4XS you suggest should all be very similar, at least in these types of tasks imo), all q6 are Q6_K.

test 3

12B darkness q6
1/6
not sure if coherence improved or regressed compared to q4

12b madness q8
0.5/10
Compared to q4, intelligence is either the same or slightly improved. did several additional swipes because it was often very close but always got the source of the blood wrong. 0.5 is for a response that didn't identify the exact problem but guessed it must be scent and offered an OK solution to cover it up.

test 4

12b madness q8
1/6
maybe slightly more coherent than q4

12b darkness q6
0/10
maybe rng but honestly feels a little worse than q4

12b darkness q4 retest
1/6
same name excuse

12b madness q4 retest
0.5?/6
the same unlikely excuse that 12b darkness q4 gave.

First, I think I can safely conclude most differences between 12b madness and darkness are just RNG in these scenarios.

q4 q6 q8:
Very difficult to say if q6/8 is an improvement over q4, especially q6 which sometimes feels worse, and I had a similar observation for eva 32b q4 & q6. For q8 at least I didn't feel any regression. TBH in a blind test I won't be able to tell apart 12B q4km and q8. Tho there might be a more noticeable difference in other scenarios like if context is blown beyond 32/64K.

Working on a hack(s) for Mistral Nemo models right now.
Maybe putting in a ticket at LLAMACPP, depending on results / other findings.

During an investigation into different model archs / quant sizes I noticed MN models were under performing, but I was not aware until yesterday how big an issue it is (relative to older Mistrals / Llamas (all), and Gemmas).

Q2k should work (for MN), albeit with some limits - yet it completely breaks.
It is possible all quants need either a refresh (IE older than 60-90 days) and/or issues need to be addressed at llamacpp and in the "quant mixtures" for MNs.

I'm not familiar with how quants are made. Can you clarify something for me?

  1. Does the problem you said only apply to quants you published or ones others made too?
    For example, https://huggingface.co/mradermacher/MN-GRAND-Gutenberg-Lyra4-Lyra-12B-DARKNESS-GGUF
    It was published around the same time as your own, 2 months ago
  2. I assumed from your previous post that this issue affects only quants below q4, but now you said all quants may need a refresh. Does that mean current q4-8 also have reduced quality?

RE: 1;
So far this seems to affect all 12B quants , mine and others.

RE: Quants , I have traced it to 60-90 days old+ - all quants of all models (all archs), regardless of the repo.
IE: All Llamas, Mistrals, Mistral Nemo, and so on.
For Gemma 2 models the increase seems minor, as their structure is unusually strong - even at IQ1_S.

Although mid / high range quants will work, (as well as lower) a quant refresh will increase performance across all quants.

This is highly elevated especially in imatrix quants at the lower end.
IE: Quants of a model at IQ1, and to some degree IQ2, which were broken/barely operating now operate / are viable for use when re-quanted.

MOE quants - a lot of which were broken or had some broken quants - now operate perfectly, at truly powerful levels.

70B models at IQ1_S also function, and function well for this quant level too.

Hi David,

Just found this thread and noticed that you are looking for feedback.

One common thing I found with the dark and horror models is that they will eventually crash and they crash pretty fast.

This is not a problem for me at all, and as I’ve stated before I think your models are excellent and clearly at the top of quality pile on this site, just giving feedback.

They can crash in several ways:

They start adding stuff unrelated to the subject.

They start adding stuff unrelated to the subject and keep adding the same thing multiple times (they repeat the same extra thing, sometimes more than a dozen of times if you don’t stop it)

They just answer complete gibberish

There are 2 ways I noticed they will crash.

One is simply by talking to them for a while, maybe after around 15 min you are entering the crash zone, they might or might not crash with the next reply

The other one is to stop it with the stop button while it’s answering.

If you stop it 3 or more times you increase the likelihood of a crash a lot.

Even if it won’t crash with the stops there will be a decrease in answer quality and creativity if it gets stopped.

Last but not least, do you do anything in particular when running imatrix quants, like running different settings than standard ?

I find imatrix duller and not faster than standard quants, and as far as size goes, they seem to be the same as standard.

There are ways to address the "horror" model issues here:

See "class 3/4" settings.
Also see "generational steering" which addresses this issue (for any model) and how to fix/steer it when it goes off into the bushes.

https://huggingface.co/DavidAU/Maximizing-Model-Performance-All-Quants-Types-And-Full-Precision-by-Samplers_Parameters

This doc will also cover some of the quant issues.

As an aside:
Recent updates with LLAMAcpp have resulted in better quants ; but the quants need to be rebuilt and re-uploaded.
This applies to ALL GGUF quants / all repos / models. Roughly, quants older than 60 days benefit from this.
I found this out by accident working on another project.

Example: L3-Stheno-Maid-Blackroot-Grand-HORROR-16B now has two sets of quants - ORG and Improved.
https://huggingface.co/DavidAU/L3-Stheno-Maid-Blackroot-Grand-HORROR-16B-GGUF

These (new quants) improvements include instruction and stability assistance, often just enough to correct some of the "bad behavior".

For Imatrix quants the results are even stronger -> low end quants that would crash and burn, now operate - many at IQ1S/IQ1M.
(7B and up) ...

Also: These improvements improve ALL quants - q2k to q8 , IQ1 to Q6 Imatrix (there is no Q8 imat)

I am looking into other methods to improve the quants too, before starting the refreshes.

I think it’s great that you are keeping the old models as well and not simply replacing them.

I noticed in the discussion above that you mentioned that lower quants can be better than higher ones sometimes, or at least have their own quirks.

I have noticed this as well, and wanted to do an experiment, which I haven’t gotten around to yet, but was wondering if you tried something similar.

I was wondering for example if a Q4 is better than a Q8 and the Q8 does not fit into VRAM, and you have to put the rest into DDR RAM, the Q4 stands out because it’s fully into VRAM or because it’s simply better.

In other words would not being able to fully load a model into VRAM and only loading it partially, would this make it slightly duller / dumber ?

Last but not least, when you’re running imatrix quants, do you run it just the same as standard quants or you use different settings ?

I have found many times Q8 generation is "flat" ; however here are some quants that differ because of how they are mixed:
IQ4XS , IQ4NL, Q4KS, Q4KM

All "look" the same, but they are not - this is because of how they are mixed internally, and how this affects the "math per token".
Slight "math errors" / "math differences" are cumulative.

The best way to test this : Set temp=0 ; then using the same prompt -> test each quant.
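
A minimal sketch of that temp=0 comparison. It assumes each quant is wrapped in a callable that sends the prompt to whatever backend is in use with temperature 0; `compare_quants` and the stand-in generators are hypothetical and do not call a real model:

```python
def compare_quants(generators, prompt):
    """Run the same prompt through each quant and group quant labels by identical output.

    generators: dict mapping a quant label to a callable(prompt) -> str.
    Each callable is assumed to run the backend at temperature 0.
    """
    groups = {}
    for label, generate in generators.items():
        output = generate(prompt)
        groups.setdefault(output, []).append(label)
    return groups


# Stand-in generators so the sketch runs without a model; replace with real backend calls.
fake_backends = {
    "Q4_K_S": lambda p: p + " The door creaked open.",
    "Q4_K_M": lambda p: p + " The door groaned open.",
    "Q6_K": lambda p: p + " The door creaked open.",
}
groups = compare_quants(fake_backends, "Continue the scene:")
for text, labels in groups.items():
    print(labels, "->", text)
```

Quants that land in the same group produced identical text for that prompt; the point where two outputs first diverge is usually the interesting part to read.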

Some quants are also "flat" -> math/mixing is equal (weights/tensors):
q3_k_s,q4_k_s,q5_k_s,Q6,Q8.

SIDE NOTE:
There are ways to re-config the quants themselves too, which result in different generation. Same reason: Changes up the math.

For IMAT quants:

These will vary too, in two ways:
1 - The quant selected.
2 - The imatrix dataset used.

Number "2" will have a tinting effect, like a fresh coat of paint on a house, or new windows.
The quant is interesting - especially as you drop further down in the quants.
The most interesting is the LOWEST quant that will function... IE: Iq1_S,IQ1_M, IQ2_XXS

This is because the math and instruction following are affected, resulting in truly strange(r) results.

RE: VRAM / GPU / CPU offload.
There is a difference between CPU and GPU math.
So when you offload (because you have to or want to) you will get slightly different generation results.
Cpu math can also be more accurate, resulting in slightly better instruction following and output generation.
Apply the same testing method - "temp=0" , same quants, same prompts to see this.
So you could say using part on CPU / full on CPU makes the quant actually "smarter".

RE: Parameters, samplers.
Yes. Lower quants need a bit more "wrangling" to address low bit conditions - bit more temp, bit more rep pen/DRY.
Dynamic temp is also strongly suggested.

The other one is to stop it with the stop button while it’s answering.

If you stop it 3 or more times you increase the likelihood of a crash a lot.

Even if it won’t crash with the stops there will be a decrease in answer quality and creativity if it gets stopped.

This could be a koboldcpp bug just fixed in the latest release that caused context corruption when aborting a generation while halfway processing a prompt

There have also been updates to Llamacpp core, which are helping gen too.

I really love this kind of information, any other detail like this you’d like to share I’d love to read it.

First off,

Happy New Year ! (belatedly)

What about Q5_KM, that was not mentioned is that more in the flat camp or the interesting camp ?

What is the smartness case with quants when you’re running a multi GPU set-up, I noticed it’s faster than running CPU or CPU/GPU but not as fast as running a single GPU that fits the whole quant.

I also noticed that if you run an LLM CPU only it seems it does not matter too much how powerful the CPU is or how much RAM there is, it all seems to run at the amazing speed of 1 token per second, do you find that to be true as well ?

There also seems to be no load on the CPU so how is this working, RAM only with very little CPU ?

If this is the case the GPU does not matter only the VRAM ?

The exception to the above is if you run Llamafile on quants that are smaller than 8GB, that is actually impressive.

Do you use any RAG’s at all, I only tried the one that comes with GPT4All, while it’s pretty cool, I found it pretty limiting.

It takes pretty long to index the documents, and at the end of the session it does not save it, if you want to ask it later on the same document(s) you have to do the indexing all over again.

It also gives only a very limited 2-3 short answers and if you ask more it keeps giving you the same answers.

Are there any RAGs that you’re aware of that are GUI based that save the indexing they’ve done and they can give more than one answer ?

Do you use Stable Diffusion ?

I will cover a few parts here:

RE: Q5km ; this (relative to Q6/Q5ks) is a better choice in my book for creative use cases.

RE: CPU/RAM vs GPU/VRAM:

Okay here are the issues in order:
1 - Motherboard speed, ram speed and cpu speed. ; any mismatch = lower T/S. ; all of these have to be high speed to get more T/S using cpu/ram.
2 - Model/Quant size -> Smaller quants run faster. q2k will operate at 2x-4x the speed of Q8.
3 - Layers in the model -> more layers -> slower t/s speed.
4 - Cpu math is slightly better, higher precision.
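
A rough way to see point 2, assuming generation is limited by how fast the weights can be read from memory (a common simplification, not a measurement): tokens per second is then capped at about memory bandwidth divided by model file size. The bandwidth and file sizes below are illustrative assumptions, and `max_tokens_per_second` is a hypothetical helper:

```python
def max_tokens_per_second(model_size_gb, memory_bandwidth_gb_s):
    """Rough ceiling on generation speed if every token requires reading all weights once."""
    return memory_bandwidth_gb_s / model_size_gb


# Illustrative, assumed numbers: ~50 GB/s system RAM, approximate 12B file sizes
bandwidth = 50.0
q8_speed = max_tokens_per_second(13.0, bandwidth)
q2k_speed = max_tokens_per_second(4.8, bandwidth)
print(f"Q8_0: ~{q8_speed:.1f} t/s, Q2_K: ~{q2k_speed:.1f} t/s, ratio {q2k_speed / q8_speed:.1f}x")
```

With these assumed sizes the ratio comes out inside the 2x-4x range mentioned in point 2; real numbers will be lower than the ceiling.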

Can't comment on the rest, current use cases don't use/test RAG.

Do you know why models behave so differently depending on what interface you are running them on ?

Some models are completely broken on a GUI, but work fine on others, or work on both, but performance is very different

@SzilviaB

Yes.
The internal programming structure and methods to connect/interface with the LLM are slightly to very different.
Likewise minor changes in order of parameters sent (or not sent/available... or order), System Role and other factors directly affect gen too.

The only way I have found - other than analyzing the programming code - is to test the back ends with Silly Tavern.
(IE: Llama-Server , Koboldcpp, LMstudio, Ollama, Text Gen Web ui).

However, Silly Tavern will - however slight - still introduce some differences too.
LLMs/AIs are very sensitive to even minor changes like a different word, phrase, a comma... even an extra carriage return.
This is amplified on smaller models and/or some arch types.
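
One way to check for such differences is to capture the final prompt each front end sends and diff them with whitespace made visible. A small sketch using Python's standard `difflib`; the two prompts are hypothetical examples, not actual output from any front end:

```python
import difflib


def prompt_diff(prompt_a, prompt_b):
    """Return changed lines between two prompts, using repr() so whitespace differences show."""
    a = [repr(line) for line in prompt_a.splitlines(keepends=True)]
    b = [repr(line) for line in prompt_b.splitlines(keepends=True)]
    return [d for d in difflib.ndiff(a, b) if d[:2] in ("- ", "+ ")]


# Hypothetical prompts as two front ends might build them; the second has an extra blank line
from_frontend_a = "### Instruction:\nStay in character.\n### Response:\n"
from_frontend_b = "### Instruction:\nStay in character.\n\n### Response:\n"
changes = prompt_diff(from_frontend_a, from_frontend_b)
for line in changes:
    print(line)
```

An empty result means the two prompts are identical, so any remaining difference in output would have to come from samplers, parameter order or the backend itself.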

Thanks for this guys, interesting discussion and it hadn't occurred to me that the back end would likely affect the way it runs. I was planning to build myself some benchmarks soon to try and judge models against each other, would be interesting to run the same model on multiple backends too and see if we can quantify a difference.
