Feedback and questions: Model underfitting and poor format adaptation?
First of all, I just want to say that I'm very grateful for the much-needed work you've been doing on instruction-tuned LMs for Norwegian, and for releasing it under a truly open-source license.
I'm making this post to give my feedback on the model as encouraged in the model description, along with a couple questions :)
I have an internal dataset/task that I wanted to try Norwegian instruction-tuned models on, but when I did, this model exhibited responses/behaviors that more closely resembled a foundation model that is not tuned for instructions. I had to double-check that the model posted here was not the same as norallm/normistral-7b-warm, as they both respond similarly to the example prompt in the model card.
My experience with the model so far is that it does generally perform better than the base model, but the difference in my limited, manual testing was marginal.
Example prompt:
<|im_start|> user
Hva er hovedstaden i Norge?<|im_end|>
<|im_start|> assistant
Hovedstaden i Norge er Oslo. Denne byen ligger i den sørøstlige delen av landet, ved Oslofjorden. Oslo er en av de raskest voksende byene i Europa, og den er kjent for sin rike historie, kultur og moderne arkitektur. Noen populære turistattraksjoner i Oslo inkluderer Vigelandsparken, som viser mer enn 200 skulpturer laget av den berømte norske skulptøren Gustav Vigeland, og det kongelige slott, som er den offisielle residensen til Norges kongefamilie. Oslo er også hjemsted for mange museer, gallerier og teatre, samt mange restauranter og barer som tilbyr et bredt utvalg av kulinariske og kulturelle opplevelser.<|im_end|>
<|im_start|> user
Gi meg en liste over de beste stedene å besøke i hovedstaden<|im_end|>
<|im_start|> assistant
Response from normistral-7b-warm-instruct:
1. Vigelandsparken: Denne parken viser mer enn 200 skulpturer laget av den berømte norske skulptøren Gustav Vigeland. Parken er et flott sted å tilbringe en solrik dag, og den tilbyr en unik innsikt i Norges kunst- og kulturhistorie.
2. Det kongelige slott: Dette er den offisielle residensen til Norges kongefamilie. Slottet er et vakkert eksempel på norsk arkitektur og er åpent for publikum om sommeren. Du kan ta en guidet omvisning i slottet og lære om historien og den kongelige familien.
3. Holmenkollen: Dette er et av de mest populære turistmålene i Oslo. Det tilbyr fantastisk utsikt over byen og er hjemsted for det berømte Holmenkollbakken, som er vertskap for det årlige FIS Nordic World Ski Championships.
4. Frammuseet: Dette museet viser fram skipet som ble brukt av norske oppdagelsesreisende Fridtjof Nansen og Roald Amundsen. Museet tilbyr en fascinerende innsikt i Norges maritime historie og polarutforskning.
Response from normistral-7b-warm:
De beste stedene å besøke i hovedstaden er:
1. Vigelandsparken
2. Det kongelige slott
3. Munch-museet
4. Vikingskipshuset
Furthermore, the model does not seem very well adapted to the prompt format used, which raises the question of whether the model has been trained with the same format for every dataset used. If the model were appropriately adapted to the format, I would expect it to complete parts of the format under greedy decoding. For instance, if I send <|im_start|> as input, I would expect either _user or _assistant to follow, but with this model I get (using greedy decoding):
<|im_start|>”,” sa han. ”Jeg har ikke sett noen av dem på en stund.”
”Jeg har ikke sett noen av dem på en stund,” sa jeg.
...
Here, _user is only the 29,247th most probable next token, indicating that the model has not yet learned the prompt format.
Similarly, the model doesn't seem to generate <|im_end|> either unless it is given some "few-shot" chat turns as preliminary context. But then again, if it relies on that to use the prompt format correctly, it is not yet much different from a base/foundation model.
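For anyone who wants to reproduce this kind of rank check, something along the following lines should work with Hugging Face transformers (a rough sketch; the exact model id and the tokenization of the leading-space user token are assumptions on my part):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "norallm/normistral-7b-warm-instruct"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Feed only <|im_start|> and look at the distribution over the next token
inputs = tokenizer("<|im_start|>", return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]

# Rank of the first token of " user" among all next-token candidates
user_id = tokenizer(" user", add_special_tokens=False).input_ids[0]
ranking = next_token_logits.argsort(descending=True)
rank = (ranking == user_id).nonzero().item() + 1
print(f"'_user' is the {rank}th most probable next token")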
So I'm left with a couple questions:
- Is the model trained on the given prompt format?
- Are there any other formats it has been trained on?
- The model is a work in progress, but is the training complete, or is this an early checkpoint of a model that is still undergoing training?
- How many instruction documents is the current model trained on, and was it trained for multiple passes? If so, was it trained until convergence of some validation loss?
I do acknowledge that the model is a work in progress; I'm just not sure if this is a mid-training checkpoint or a first iteration of a complete model.
Hi, thanks for the interesting questions!
I had to double-check that the model posted here was not the same as norallm/normistral-7b-warm, as they both respond similarly to the example prompt in the model card.
Well, it is actually the same model; the instruct model has just been briefly finetuned on instruction data, so it's expected that it "reasons" similarly. I'm just a bit surprised that the base model follows the instruction format at all :)
If the model were appropriately adapted to the format, I would expect it to complete parts of the format under greedy decoding. For instance, if I send <|im_start|> as input, I would expect either _user or _assistant.
The model is not trained to do that; it's only trained to give a response (because, well, that's what it's supposed to do). So, for example, if the training sample looks like this:
<|im_start|> user
Hva er hovedstaden i Norge?<|im_end|>
<|im_start|> assistant
Hovedstaden i Norge er Oslo. Denne byen ligger i den sørøstlige delen av landet, ved Oslofjorden. Oslo er en av de raskest voksende byene i Europa, og den er kjent for sin rike historie, kultur og moderne arkitektur. Noen populære turistattraksjoner i Oslo inkluderer Vigelandsparken, som viser mer enn 200 skulpturer laget av den berømte norske skulptøren Gustav Vigeland, og det kongelige slott, som er den offisielle residensen til Norges kongefamilie. Oslo er også hjemsted for mange museer, gallerier og teatre, samt mange restauranter og barer som tilbyr et bredt utvalg av kulinariske og kulturelle opplevelser.<|im_end|>
The loss will only be calculated on these tokens: "Hovedstaden i Norge er... opplevelser.<|im_end|>". Thus, the model will have no idea what should follow after <|im_start|>, for example.
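To make the masking concrete, here is a simplified sketch in transformers conventions (not our actual training code, and the model id is just the obvious guess); labels set to -100 are ignored by the cross-entropy loss, so only the response tokens contribute:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("norallm/normistral-7b-warm-instruct")  # assumed id

prompt = (
    "<|im_start|> user\n"
    "Hva er hovedstaden i Norge?<|im_end|>\n"
    "<|im_start|> assistant\n"
)
response = "Hovedstaden i Norge er Oslo. [...]<|im_end|>"

prompt_ids = tokenizer(prompt, add_special_tokens=False).input_ids
response_ids = tokenizer(response, add_special_tokens=False).input_ids

# The prompt/template part is masked out with -100; the loss is only
# computed on the assistant response and the closing <|im_end|>.
input_ids = prompt_ids + response_ids
labels = [-100] * len(prompt_ids) + response_ids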
Similarly, the model doesn't seem to generate <|im_end|> either
I haven't observed this behavior in hundreds of conversations with this model; the responses are definitely not always perfect, but they always end with <|im_end|>. Please make sure you let the model generate enough tokens (it tends to be quite verbose) and that you don't use a repetition penalty.
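For example, a generation setup along these lines should behave well (a rough sketch, not the exact settings we use; values that are not mentioned in this thread are placeholders):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "norallm/normistral-7b-warm-instruct"  # assumed model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = "<|im_start|> user\nHva er hovedstaden i Norge?<|im_end|>\n<|im_start|> assistant\n"
inputs = tokenizer(prompt, return_tensors="pt")

output = model.generate(
    **inputs,
    max_new_tokens=1024,      # give the (verbose) model enough room to finish
    do_sample=False,          # or sample with a moderate temperature
    repetition_penalty=1.0,   # 1.0 = repetition penalty turned off
    eos_token_id=tokenizer.convert_tokens_to_ids("<|im_end|>"),  # assumes <|im_end|> is a single token
)
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:]))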
Is the model trained on the given prompt format? Are there any other formats it has been trained on?
Yes and no.
The model is a work in progress, but is the training complete, or is this an early checkpoint of a model that is still undergoing training?
This checkpoint is fully trained. The next versions are going to be trained in the same way but we will update the training data to hopefully mitigate some of the current issues.
How many instruction documents is the current model trained on, and was it trained for multiple passes? If so, was it trained until convergence of some validation loss?
It's trained for 2 epochs on 60K conversations. And no, training/validation loss is, in our experience, not a great indicator of convergence in this case: the goal is not to fit the data but to teach the model how to respond to user prompts. See for example https://arxiv.org/abs/2403.04652 for more details.
Piggybacking off this: you say
(...) and that you don't use a repetition penalty.
Why is that an issue? We're currently testing, but results so far look promising; the parameters we use are:
PARAMETER num_ctx 4096
PARAMETER temperature 0.3
PARAMETER repeat_penalty 1.0
PARAMETER stop <|im_end|>
Like I said, we're still testing, but results so far look promising with this. It's even less verbose: even though the initial response goes on a bit (and echoes some parts of the system prompt we don't want it to spit out), the following responses tend to be single sentences unless we prompt it for more information.
For reference, we're using the Q5_K_M quantized version of the model and running it through ollama (although I don't expect that matters much here).
That's a great question, I wish I had a good answer :) NorMistral seems to be more sensitive to repetition penalty than other LLMs and I don't really know what's causing it.
Repetition penalty lowers the probability of any token that has already appeared in the previous context (in the prompts as well as in the responses); mathematically, it multiplies the temperature by the repetition penalty value for all such tokens. The bad thing is that it also penalizes tokens that should repeat, like stopwords, punctuation or the special <|im_end|> token, which can cause the outputs to be very random/chaotic.
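To make that concrete, here is a rough sketch of the common CTRL-style implementation of repetition penalty (an illustration of the general mechanism, not any particular library's exact code):

import numpy as np

def apply_repetition_penalty(logits: np.ndarray, seen_token_ids, penalty: float = 1.3) -> np.ndarray:
    # Logits of tokens that already appeared in the context are divided by
    # the penalty when positive and multiplied when negative, which is
    # roughly equivalent to raising the temperature for just those tokens.
    # Note that special tokens like <|im_end|> get penalized too once they
    # have appeared earlier in the conversation.
    logits = logits.copy()
    for tok in set(seen_token_ids):
        if logits[tok] > 0:
            logits[tok] /= penalty
        else:
            logits[tok] *= penalty
    return logits

# With penalty=1.0 the logits are unchanged, i.e. the penalty is effectively off.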
NorMistral probably outputs a more even distribution across tokens than other LLMs, and that's why it suffers more from the drawbacks of repetition penalty (as you can see, it's a rather dumb hack to prevent repetitions), but I have no idea why the output distribution would be different.
Hmm, alright. I’ll keep testing and keep you updated, but like I said so far it’s acting quite well under the parameters I’ve set. It is quite brief in its answers though, which may be due to the repetition penalty. I’ll try tuning this a bit, since the original non-penalized version tended to go on for too long.
Just to be clear, setting repetition penalty to 1.0 is equivalent to turning the repetition penalty off (as in the parameters you sent). Only a higher value than 1.0 starts penalizing repeated tokens.
Do you use a special system prompt to get short replies? The responses I get tend to be a bit too long and verbose :)
Oh, god. You’re right, my mistake😅
Yeah, we have a few paragraphs of instructions; it's a RAG application that should be fact-finding and not make things up. I think that might motivate it to keep things short whenever it "ventures off" outside the knowledge base. We might tweak that a bit to see what we get, but it's been quite to the point without being restrictive. Just a little no-nonsense, but that's what we want, I guess.