Change AI response limitations.

#3
by Arnon2 - opened

Hello @dvruette . I wasn't sure how to get your attention, so I've chosen this method. First of all, I would like to say that these trained "concepts" are indeed very successful. As an amateur writer with a lot of ideas but not a very extensive vocabulary, I find this space perfect for adding colour to my initial drafts. However, there seems to be one detail that is probably unnecessary. An AI response is limited to 512 "tokens", but I have noticed that the AI actually does the calculations before it starts writing, so it just stops in the middle of a sentence when it has apparently already calculated the rest of the sentence. So this limitation is unnecessary, as the AI cannot write to infinity even without it, since it is already limited by 60 seconds of "thinking" time. I think it would be better to remove this line or change it to a larger value. Of course, I could be wrong, but if I am, I'd rather know than be completely ignored. Thanks in advance, and best of luck with other projects. Oh, and it seems the length penalty does nothing, but that could just be my selective testing.

Hi Arnon, thanks for opening this PR! I agree that 512 tokens can be quite limiting. I originally added the limit because for very large guidance strengths the model can sometimes just go off the rails and ramble on forever. I’ll have to double check whether there is a way to manually interrupt it if that happens. If there is, I see no problem in increasing the limit to 1024 or 2048, although I would keep some limit just to be safe.

Either way, one workaround for this is to just tell the model to continue. In my experience, this works decently well, although granted it’s not the same as letting the model finish its “thought”.

Thanks for your response! Right, I have noticed that the AI tends to ramble with too large guidance. I think it has to do with the llama database itself - from what I read about it, it cannot handle more than 3000 tokens at once, or something like that. Since I assume that concepts add tokens to the overall AI prompt, they naturally overflow it faster. Even without guidance, the AI will start repeating the same word or even just pulling a string of zeros after about 10 responses or so (maybe there is some merit to limiting how far back the AI traces, or just some other restriction to prevent overflow).
I myself use "5" as a guidance scale, which I think is perfect for adding said colour to text without changing it altogether, which is what I'm after. Of course, it's just what suits me. Oh, and I had some thoughts on Fabric before, maybe you could look at that too. (I'm less certain about that one, as I found myself completely lost as to how to properly get the image I want. It's much easier here, having experience with text writing and all, but I just don't have a proper perception of images, I guess.)

The prompt is actually independent of guidance; it's the model activations that are guided towards/away from any particular concept.
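To make the distinction concrete, here is a minimal sketch of guiding activations while leaving the prompt alone. The function name `guide_activations`, the toy vectors, and the simple additive update are all assumptions for illustration; the space's actual guidance method may differ.

```python
# Illustrative only: the names and the additive update rule are assumptions,
# not the space's actual implementation.

def guide_activations(hidden, concept_direction, guidance_scale):
    """Shift a hidden-state vector along a concept direction.

    A positive scale pushes towards the concept, a negative one away from it.
    """
    if len(hidden) != len(concept_direction):
        raise ValueError("hidden state and concept direction must have the same size")
    return [h + guidance_scale * c for h, c in zip(hidden, concept_direction)]


prompt_tokens = ["Describe", "a", "forest"]  # the prompt is never modified
hidden_state = [0.5, -1.0, 2.0]              # toy activation vector
concept = [1.0, 0.0, -1.0]                   # toy concept direction

guided = guide_activations(hidden_state, concept, guidance_scale=2.0)
print(guided)
```

Note that no tokens are added to the prompt at any point; only the internal vector changes.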

I also just tested interrupting responses: While the frontend allows stopping the current request, the backend still keeps processing it until the model stops or the token limit is reached, so unfortunately I think it's best to keep the limit at 512 (it already takes ~1 min to generate that many tokens) and to use the workaround of telling the model to "continue" if it reaches the token limit mid-sentence.
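As a rough sketch of why the output stops mid-sentence and why "continue" helps: text is produced one token at a time until the model emits an end marker or the limit is hit. The toy model below (`toy_next_token`, `STORY`) is a hypothetical stand-in, not the space's backend.

```python
def generate(next_token, context, max_new_tokens=512, eos="<eos>"):
    """Produce tokens one by one until EOS or the token limit is hit."""
    output = []
    for _ in range(max_new_tokens):
        token = next_token(context + output)
        if token == eos:
            return output, True   # the model finished on its own
        output.append(token)
    return output, False          # cut off by the limit, possibly mid-sentence


# Toy stand-in for a model: replays a fixed story, then emits EOS.
STORY = "the quick brown fox jumps over the lazy dog".split()

def toy_next_token(tokens_so_far):
    position = len(tokens_so_far)
    return STORY[position] if position < len(STORY) else "<eos>"


# First request hits the (tiny) limit before the story is done.
first, finished = generate(toy_next_token, [], max_new_tokens=5)

# "continue" workaround: feed the partial answer back in as context.
rest, finished_after_continue = generate(toy_next_token, first, max_new_tokens=5)
```

Nothing beyond the limit is computed ahead of time, so raising the limit directly raises the worst-case generation time.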

In the future, I might look into having a more sophisticated interruption mechanism, but for now, this is the best we can do, sorry :/

Oh, I thought it was the other way around (the AI generating the entire bulk of text and then being cut off by the limit, rather than generating text on the go until the limit is reached). I see, then it is indeed best to limit it this way. I guess I can just use smaller text batches; it's kinda cumbersome, but manageable. By the way, will it take too long to generate text if it's set on a "free" engine? The Zero engine kinda annoys me, as it subtracts seconds if you pull a request back (and also in cases of internet disconnections; I'm on mobile, and the connection is not very stable here at times). I can't test it, since Llama's owners think that their database is too precious to just give to random people.

Yes, I agree that the Zero engine is quite cumbersome, but I'm grateful that we get to run this on a GPU for free at all! If you're going to run this on a free engine, I think that generation is going to be exceedingly slow if it doesn't crash altogether due to memory requirements, so it's unfortunately not an option :/

I was afraid you'd say that. Well, if it just takes long, like 10-20 minutes, I myself would be okay with that, as that is how long I have to wait when my time expires due to Internet spikes (although now that I think about it, if the internet lags while it generates like this, I'd be even more upset). Anyway, thanks for your consideration.

Oh yeah, I also had that question. What exactly does "Length Penalty" do? I tried the same text with values of 0 and 2 and there was no difference.

Yes it seems you're right about the fact that it doesn't actually do anything! It sounded like it would help prevent the model from going off the rails, but it is actually only used when doing beam search, not when sampling.
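For context, here is a small sketch of where a length penalty enters beam search. It assumes the commonly documented form in which a hypothesis's summed log-probability is divided by its length raised to the penalty; the hypotheses and numbers are made up for illustration. Plain sampling never ranks whole hypotheses like this, which would explain why changing the value had no visible effect.

```python
def beam_score(sum_logprobs, length, length_penalty):
    """Length-normalised score of a finished beam hypothesis."""
    return sum_logprobs / (length ** length_penalty)


# Made-up hypotheses: (total log-probability, length in tokens)
hypotheses = {"short": (-4.0, 4), "long": (-6.0, 12)}

def best(length_penalty):
    """Name of the hypothesis beam search would prefer under this penalty."""
    return max(
        hypotheses,
        key=lambda name: beam_score(*hypotheses[name], length_penalty),
    )


print(best(0.0))  # no normalisation: the raw log-probability decides
print(best(2.0))  # strong normalisation: longer hypotheses are favoured
```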

Also, if you don't have access to Llama, maybe you can clone the space and switch the model to Mistral. It's not the same model, but AFAIK Mistral is openly available.

Ah, I'm sorry to bother, but I'm kinda new at this. Mistral gave their OK, but the space still asks for a token. How do I do this properly?

Ah yes, you're right. You'll have to add it under Settings > Variables and secrets > New secret and enter the name HF_TOKEN and paste your token into the value field.

To create a token, you'll have to go to your account settings > Access tokens > Create new token > select "read" permissions and give it a name.
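On the code side, a secret added this way is exposed to the running space as an environment variable. A minimal sketch of reading it follows; the helper name `get_hf_token` is hypothetical, and the commented-out usage line is an assumption about how the app might pass the token on, not the space's actual code.

```python
import os


def get_hf_token(env=os.environ):
    """Return the token stored in the HF_TOKEN secret, or raise a clear error."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; add it under Settings > Variables and secrets"
        )
    return token


# Hypothetical usage inside the app (assumption, not run here):
#   token = get_hf_token()
#   tokenizer = AutoTokenizer.from_pretrained(model_id, token=token)
```

Failing early with a readable message makes a missing or empty secret easier to spot than a download error later on.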

Hope that helps!

Thank you very much, I'll try. Also, Fabric works well (at least from my first try), but now the like/dislike images... tab(?) looks very weird. It holds one image per line, and they're gigantic. Also, now I'm curious what would happen if the normal max. feedback weight were to be increased, but maybe the result would be too weird.

Really sorry to bother, but it ended up throwing a runtime error. It says: Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 40 column 3. Did I do the tokens wrong, or did it just crash under its own weight?

Hmm sorry, I've never seen that error before. Could be a version mismatch compared to what I'm running here, or could be a Mistral-specific issue.
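If it is a version mismatch, one way to check is to print the installed package versions in the cloned space and compare them with the original. This is only a diagnostic sketch; the package list is a guess at what is relevant here.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("transformers", "tokenizers")):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


print(installed_versions())
```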

Hello @dvruette . I really don't want to seem nagging, but I'm not sure if you're seeing my messages or not, and I don't know how else to contact you. Fabric space has crashed with a runtime error and I would really like you to look into this. Thanks in advance.
