Repetition and token leaking

#3
by Varkoyote - opened

Hey! I've been testing this model a bit, but it seems to quickly fall into repetition and incoherence for me (repeating an action that happened 4 or 5 messages ago). It also seems to occasionally include "user", "ai" or "assistant" as an end token...

Latitude org

That's quite odd! You're certain you're using ChatML to communicate with the model?

Yeah I'm sure :( It's my default one

Latitude org

Hrm. Alright, what software are you using to load it, and which GGUF did you end up downloading?

KoboldCPP 1.82 using q8 from this repo

Latitude org

Well, that's as perfect as can be when it comes to GGUF quality! I'm honestly stumped as to what might be the issue here, but I'm also the first to admit it's been forever since I last used KoboldCPP.

I'll keep the discussion open for now in case someone else is willing to make a more educated guess.

Hey @Gryphe

So, out of curiosity, I gave the model a go, and yeah, I encountered similar issues. But I think I know why it flew under the radar for some users but not others.

I'm using Q8 with KCPP 1.82 as a backend (FlashAttention, no KV cache compression, no context shift trickery) and my own front-end, in ChatML. I tried a few different samplers (yours, and the much more conservative ones I use for Mistral models).

On some characters I was near-consistently hit with issues like infinite text generation (well, until we reach max reply length), the bot talking for me or for people who don't exist in the scenario, random out-of-place system and assistant messages, and all the hallmarks of a model whose instruction formatting hasn't been properly enforced. But in other situations, with simpler system prompts and bios, the model does a lot better. Nothing to write home about, but it stays in its role (mostly) and doesn't go absolutely crazy.

The longer the system prompt, and the more system messages there are in the chat itself, the worse it gets, in my (admittedly short) experience. Across 8 chatlogs / characters, I regenerated a ton of messages from the bot: 2 characters (with very long, detailed system prompts) were unusable, 5 worked okay, 1 was mixed.

Hope that helps.

Edit: I used the model in a 1v1 chat setting, which may be why it's reacting so poorly, as it seems to be designed for a storyteller / GM role.

Latitude answered this on the community tab of the non-GGUF model (https://huggingface.co/LatitudeGames/Wayfarer-12B/discussions/4): it's because the training data is generated from the AI Dungeon system.

"The model does tend to underuse the eos token since we typically rely on max token length instead"

This means the model gets pushed into generations longer than its intended usage, where the training data is thin and performance drops, because of the way a message from an LLM is actually stopped. There's a special token, kind of like a message, that tells the system "hey, this is the end of the message". If that token is for whatever reason less frequent in the training data, this is a likely result.
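To make the mechanism concrete, here's a minimal sketch of that stopping rule using the generic Hugging Face transformers interface (this is just an illustration, not how Latitude or any backend actually implements it; greedy decoding is used purely for clarity, and loading a 12B this way needs plenty of RAM/VRAM):

```python
# Minimal sketch of the stopping rule: decoding ends either when the model
# emits the EOS token or when the max-new-tokens budget runs out. If EOS is
# rare in the training data, the hard limit ends up doing the stopping and
# the text just runs on.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LatitudeGames/Wayfarer-12B")
model = AutoModelForCausalLM.from_pretrained("LatitudeGames/Wayfarer-12B")

@torch.no_grad()
def generate(prompt: str, max_new_tokens: int = 512) -> str:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        next_id = model(ids).logits[:, -1, :].argmax(dim=-1, keepdim=True)
        if next_id.item() == tokenizer.eos_token_id:
            break  # the model chose to end its message here
        ids = torch.cat([ids, next_id], dim=-1)
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```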

HOWEVER, it does actually have a kind of neat potential emergent behavior, imo: continuing a message this way seems disproportionately likely to extend whatever idea it's on. A description, a monologue, etc. can be extended past what we see in the average response that a model trained for general chatbot use likes to spit out. This is potentially at the expense of performance, but within the context of creative writing, I'm finding a few 'rerolls' (not real rerolls, as that would just create a new message; more like deleting and trying to continue a half-written response) actually return interesting results.

But yes, I think with this version of Wayfarer you're kinda just... gonna have to be the stop token and clean up the end of the message yourself in a lot of situations. I honestly have been letting it ramble until it does something off the wall, devolves into nonsense, or just makes a choice I don't want. I keep the relevant generation and continue the message until I'd like some sort of response, or just kind of loosely when I want to transition to the next idea. While it's annoying, sure, its usefulness has me returning to it. Your mileage may vary. (A rough sketch of scripting that continue loop is below.)

edit: my testing is mostly with Q4_K_M, mostly using KCPP
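If you'd rather script that "be the stop token yourself" loop than do it by hand, a rough sketch against KoboldCPP's standard generate endpoint might look like this (the prompt, chunk size, and loop count are placeholders; in practice you'd inspect and trim between passes):

```python
# Rough sketch of the manual-continue trick: generate a chunk, keep what you
# like, then feed the kept text back in as part of the prompt and continue.
import requests

API = "http://localhost:5001/api/v1/generate"  # KoboldCPP's default address

def continue_from(prompt: str, kept_text: str, chunk: int = 256) -> str:
    """Ask the model to keep writing from a half-finished response."""
    payload = {"prompt": prompt + kept_text, "max_length": chunk}
    result = requests.post(API, json=payload).json()
    return kept_text + result["results"][0]["text"]

prompt = ("<|im_start|>user\nDescribe the ruined keep.<|im_end|>\n"
          "<|im_start|>assistant\n")
story = ""
for _ in range(3):  # three manual 'continues'
    story = continue_from(prompt, story)
    # In practice you'd eyeball the output here and delete anything off the
    # wall before the next pass; this loop just extends blindly.
print(story)
```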

I mean "under-using the eos token" is a very nice way to write "we didn't tune the model for its stated instruction set strongly enough". If I wanted the model to continue no matter what, I can already tell the KCPP to ignore the EOS token and rely on max_reply_length alone.

Too bad, I really liked the overall idea behind the model.

The model is trained for performance on their system; use outside of AI Dungeon is a non-intended use case.

Using an instruction set during training doesn't mean the data won't contain any other instructions, special tokens, or whatever else. So, by the same logic, using ChatML in your frontend doesn't automatically mean you're using the model the way it was trained to be used; there are fundamental formatting and other differences between their service and, say, something like SillyTavern.

If they had for some reason trained a small model specifically for at-home personal use (burning money that will likely never be recouped, even released under a nice permissive license like Wayfarer's apache-2.0), they would actually get WORSE performance on their own system. So what you're saying isn't super factually accurate, just from my perspective. They've trained it enough for its actual use case. Here's a part of my novel from the community tab of the non-quantized model.

"I recommend noting the format that is used for generating data for training runs in the future, provided those tokens aren’t cleaned before training. Mistral models have their own instruct/formatting quirks, but it might even be worth creating officially recommended inference templates that correspond with the trained model, kind of like a function calling outline. Even creating simple Python scripts for internal use meant to update the dataset with new changes to your system’s formatting when they occur, ensuring the dataset is consistent and up to date. Really hard to guess at what the entire workflow looks like for you folks so I’m sure a lot of this is redundant or non applicable." - moi

Also, I actually wonder about ChatML on top of the Tekken instruct here, without the rubric of their own system to reinforce their 'instruction set' (using that as an umbrella term now). I do realize they've told you to use ChatML, but if you wanna get wild, maybe give the Mistral V3-Tekken stuff a try? Here's an example, assuming you're using ST, given the artifacts you're mentioning. Shot in the dark.

'[INST]{{#if system}}{{system}}
{{/if}}{{#if wiBefore}}{{wiBefore}}
{{/if}}{{#if description}}{{description}}
{{/if}}{{#if personality}}{{personality}}
{{/if}}{{#if scenario}}{{scenario}}
{{/if}}{{#if wiAfter}}{{wiAfter}}
{{/if}}{{#if persona}}{{persona}}
{{/if}}{{trim}}[/INST]Understood.'

Also worth noting: if you're using KCPP and having problems with it spitting out stuff you don't like, consider using something like Kobold's anti-slop while testing the model:

From the release notes of an earlier release on their GitHub repo:

"koboldcpp-1.76

shivers down your spine edition

NEW: Added Anti-Slop Sampling (Phrase Banning) - You can now provide a specified list of words or phrases prevented from being generated, by backtracking and regenerating when they appear. This capability has been merged into the existing token banning feature. It's now also aliased into the banned_strings field.

Note: When using Anti-Slop phrase banning, streaming outputs are slightly delayed - this is to allow space for the AI to backtrack a response if necessary. This delay is proportional to the length of the longest banned slop phrase.
Up to 48 phrase banning sequences can be used, they are not case sensitive."

On top of that, frontends often allow you to add custom stop strings. This means that if the model is trying to use an unconventional token or string as a stop, you can just tell whatever you're using that "Assistant" (or whatever it turns out to be) should be treated as a stop token.
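For example, going straight at KoboldCPP's HTTP API, something like this (stop_sequence is the API's standard stop-string field; banned_strings is the field named in the release notes above; the actual phrases here are just illustrations):

```python
# Sketch of custom stop strings plus anti-slop phrase banning via KoboldCPP's
# HTTP API. Leaked role labels are treated as end-of-message markers.
import requests

payload = {
    "prompt": ("<|im_start|>user\nI draw my sword.<|im_end|>\n"
               "<|im_start|>assistant\n"),
    "max_length": 512,
    # Stop generation the moment one of these strings shows up.
    "stop_sequence": ["<|im_end|>", "\nuser", "\nassistant"],
    # Backtrack and regenerate if a banned phrase appears (anti-slop).
    "banned_strings": ["shivers down your spine"],
}
response = requests.post("http://localhost:5001/api/v1/generate", json=payload)
print(response.json()["results"][0]["text"])
```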

Lastly, Latitude actually says this on the main model's page:

__

"We thought they would like it, but since releasing it on AI Dungeon, players have reacted even more positively than we expected.

Because they loved it so much, we’ve decided to open-source the model so anyone can experience unforgivingly brutal AI adventures! Anyone can download the model to run locally.

Or if you want to easily try this model for free, you can do so at https://aidungeon.com.

We plan to continue improving and open-sourcing similar models, so please share any and all feedback on how we can improve model behavior. Below we share more details on how Wayfarer was created."

__

^ That last line is kind of the important part. Let's just work together to incentivize that decision, further contributing to the pool of knowledge within open-source ML, which is critical.

I don't like big corpos, and probably hate them more than the next person, but like... make sure you point your fingers at the right people. This is another step towards normalizing open-source releases. If we bark because they trained it for their own service but still released it, despite potentially suboptimal performance in self-hosted workflows, we're just showing companies that they shouldn't release their work as open source, because they'll get backlash.

As a disclaimer, I am not affiliated with Latitude or AI Dungeon in any way other than having paid for half a year of subscription to their service a couple of years ago.

cheers

Dude, I know how models work. I've been toying with AI for a long long time.

Wayfarer uses ChatML, not Mistral-Tekken. It's derived from the base model of Mistral-Nemo, not the instruction-tuned one, and Latitude (thankfully) picked ChatML instead of the abomination invented by Mistral (I love Mistral, but damn, they fucked up royally with their format). You technically can use whatever you want, even Alpaca, and if your chat history is long enough, the model will likely try to comply with your formatting, because that's literally what language models do. But that's akin to intentionally dumbing down the model (and wasting many tokens in the process).
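For reference, ChatML is dead simple to assemble by hand; a generic sketch (roles and contents here are placeholder examples):

```python
# Generic sketch of ChatML assembly: each turn is wrapped in <|im_start|> /
# <|im_end|> special tokens, and the assistant turn is left open so the model
# completes it and (ideally) closes it with its EOS.
def build_chatml_prompt(messages: list[dict]) -> str:
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>"
        for m in messages
    ]
    parts.append("<|im_start|>assistant\n")  # leave the last turn open
    return "\n".join(parts)

prompt = build_chatml_prompt([
    {"role": "system", "content": "You are a brutal dungeon master."},
    {"role": "user", "content": "I open the door."},
])
print(prompt)
```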

That was just one of my points, and the last one is much more important. The only reason I suggested it is that I'm not having any of the problems mentioned, and those were the settings I used, since I'm lazy & I like Nemo. ML is funny sometimes, just like wider development, which you definitely understand as a game dev (After the Collapse looks really cool!). I've seen weirder things! Hope that helps explain why I suggested it.

Latitude org

Alright, folks, just a little clarification here - I'm the person who trained this model, albeit under Latitude's name. I'm not exactly new to training models, and this one is most definitely set up correctly to handle EOS tokens; it will generally respond with suitable lengths. I did an extra verification of this last night by loading the 6_K GGUF into my local environment using llama-server.
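(For anyone who wants to repeat that kind of check themselves, something along these lines against llama-server's HTTP API should do it; the stopped_eos / stopped_limit fields are what llama.cpp's server reports back, and the prompt is a placeholder:)

```python
# Ask llama-server whether a generation ended because the model emitted EOS
# or because it ran into the token budget. Assumes a server already running,
# e.g. started with: llama-server -m Wayfarer-12B.Q6_K.gguf --port 8080
import requests

payload = {
    "prompt": ("<|im_start|>user\nYou stand at the dungeon gate.<|im_end|>\n"
               "<|im_start|>assistant\n"),
    "n_predict": 512,
}
result = requests.post("http://localhost:8080/completion", json=payload).json()
print("stopped on EOS:", result.get("stopped_eos"))
print("stopped on token limit:", result.get("stopped_limit"))
```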

Due to the "wild west, living on the edge" style of software we all use on a daily basis, some of you might be hitting snags with Nemo's tokenizer, which, if I remember correctly, took a while to get fixed back when the model had only just released. Sometimes these bugs resurface due to recent code changes, but that's about the full extent of my knowledge, since there's only so much time in a day.

@CyborgPaloma What Kolby stated about EOS tokens wasn't untrue, but it was somewhat outdated information; this is apparently how the older models used to work. In this case we're dealing with a simple ChatML-formatted model that will cut off its responses properly. Provided there are no bugs messing things up, of course.

Hope this helps.

I will say that I've used Wayfarer-12B-Q4_K_M.gguf with Koboldcpp in conjunction with SillyTavern today, out of curiosity, after seeing a post about this model on Reddit.

Like a nerd, of course, I looked at the comments here out of curiosity and wondered if I'd have the same issue. I can happily report that I didn't seem to have repetition or stopping issues. This was tested outside of ChatML as well.

I have 4 currently active RPs that I tested it on: 1) slice of life, 2) medieval/fantasy/dungeon/adventure (been using it since August), 3) spy thriller, 4) sci-fi/space exploration/spaceship playthrough.

I also tested it with a zombie apocalypse RP (a fresh RP using this model from the start) and it all functioned well. If anything, I'm actually very pleased with the responses generated, and shocked the Q4_K_M was as good as it is.

I also briefly tested it in the Koboldcpp browser chat interface, summarizing a very long playthrough. That seemed to work well too.

I will continue to use it, as I find it a breath of fresh air in regard to its vocabulary compared to other models, and in the way it describes scenes/responses. If for some reason I have any issues, I'll report back. But so far I've really been intrigued by this model.

Thanks for training it, and to Latitude for making it available. Also, thank you for being responsive to people's posts on here and giving us more insight.
Have a great rest of your day.

Edit: also, for those interested, my medieval/fantasy/dungeon/adventure RP is somewhere around 70k context with 17 different character cards in a group chat. It's really easy for models to jank up there, so it was very cool to see how well it managed to function.

Okay, so I tested the model with a fresh RP session. New scenario, new characters, etc. The model was really good, and I really liked the whole death-consequences thing I'd been reading about. I had a character get so mad they pistol-whipped my character and put the barrel to my character's head just for insulting them. It was a tense scene where, honestly, I thought the character was going to end mine.

Now to the bad. If you change the response length after a long session, it doesn't know when to stop; it breaks stop tokens somehow. Other models seem more likely to adapt to this, especially if your own responses get shorter as well. So I generally run a 512-token response length. I occasionally boost it to 1024 or even 2048 towards the end of my RP session, or during major events where my character takes a back seat to a situation (like being pistol-whipped and knocked the hell out), letting the AI character take the wheel for a bit while I steer through narration to some degree.

At this point, changing the response length in an already-established RP means the model takes way longer to adapt, regardless of me modifying or shortening my own responses.

Then we can also talk about regenerating message replies. I didn't have a problem for the majority of my session, but the longer it went on, or the closer I got to the context limit, the wonkier it got. You often get very similar replies, and it takes direction very poorly. You can be very specific in guiding the model toward the type of response you're looking for, and it will ignore half or more of what you request. It just begins to do its own thing, and each regeneration is a mild variation of the previous one, if that.

If you've been using LLMs for RP for a while, these issues are pretty common in long-context RP, or towards the end of the context limit, and not completely unexpected.

That being said, I wouldn't say it's unusable. I just think the learning curve of understanding how it behaves gives me more knowledge going forward to adjust for these issues, and then they won't be a problem.

The prose and manner of speaking in long-context replies is a bit much sometimes. It will often ramble in circles before reaching a conclusion or moving the scene forward. So I wouldn't recommend anything too crazy like 1024/2048-token responses unless you've steered it in a certain way. I'd also suggest prompting it, in SillyTavern or otherwise, to use more colloquial language if that fits your needs.

I like its consistency of characters, though. Oftentimes I'll do something that drastically alters a character's personality, and they go from being an asshole to being really nice, ignoring their personality traits. That also happened to me with this model, but shockingly it would rebound and fall back into its negative character traits not long after, like it was only a momentary change in personality adapting to the change in scene/dialogue. So that's pretty cool.

I'm also an English speaker, but my RPs often have characters from different cultural backgrounds. It's really hard sometimes to get a character to accurately portray someone of another culture in a way that feels believable through text. I'm happy to say one of my characters is German, and interestingly, there was a 50% chance that their response would also be in German along with an English translation, which was actually pretty amazing. Imagine having subtitles for a book. That really added something to my RP I didn't know I wanted or needed. Not every regeneration followed this consistently, but it was, as I said, a 50/50 chance, which is still pretty decent.

So overall I did see what some other users were talking about in regard to stop tokens and repetition, but I also think these are things that can be avoided depending on how you steer the AI. The positives in my opinion far outweigh the negatives (which can be avoided).

My overall opinion is that I really like this model for RP and will most definitely be using it going forward. I also appreciate its restraint from wanting to jump my bones at the mere possibility of sexual situations. It's like, come on, I just met you 5 minutes ago and you're already trying to have sex with me? Please. This model is more restrained in that manner but not unwilling to explore those things if you want. I appreciate and enjoy the spontaneity of things that happen during the RP. I don't like orchestrating a world where I know the outcome of every situation. This model adheres to the scenario while throwing curveballs that move the story forward in ways I did not expect. I love that.

I apologize for any issues with my response. I'm on my second cup of coffee after waking up, but I wanted to report back on my experience and give my thanks for sharing this model, especially the GGUF variation.
