mlabonne posted an update Jul 13
Large models are surprisingly bad storytellers.

I asked 8 LLMs to "Tell me a bedtime story about bears and waffles."

Claude 3.5 Sonnet and GPT-4o gave me the worst stories: no conflict, no moral, zero creativity.

In contrast, smaller models were quite creative and wrote stories involving talking waffle trees and bears ostracized for their love of waffles.

Here you can see a comparison between Claude 3.5 Sonnet and NeuralDaredevil-8B-abliterated. They both start with a family of bears but quickly diverge in terms of personality, conflict, etc.

I mapped it to the hero's journey to have some kind of framework. Prompt engineering can definitely help here, but it's still disappointing that the larger models don't create better stories right off the bat.
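If you want to reproduce this kind of side-by-side comparison, here is a minimal sketch of sending the same prompt to several chat models via the Hugging Face Inference API; the model IDs and generation settings are placeholders, not the exact setup used for these stories.

```python
# Minimal sketch: send the same prompt to several chat models and collect the stories.
# Model IDs below are illustrative; swap in whichever models/endpoints you have access to.
from huggingface_hub import InferenceClient

PROMPT = "Tell me a bedtime story about bears and waffles."

models = [
    "meta-llama/Meta-Llama-3-8B-Instruct",      # smaller open model (placeholder)
    "mlabonne/NeuralDaredevil-8B-abliterated",  # the model compared in the post
]

for model_id in models:
    client = InferenceClient(model=model_id)
    response = client.chat_completion(
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=800,
        temperature=0.8,
    )
    story = response.choices[0].message.content
    print(f"\n===== {model_id} =====\n{story}")
```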

Do you know why smaller models outperform the frontier models here?

My guess is that the massive datasets used to prevent undertraining are the reason for this. The smaller OSS models have been finetuned on much smaller datasets, which allows more creativity. Sorry if I'm totally incorrect, just an opinion ;)

·

It could be a good explanation. Unfortunately, we know so little about how these large models were trained... :(

I think it's less likely to be model size alone, and more likely to be a function of the relationship between alignment and/or instruction tuning, and model size. That is, smaller models have weaker representations, and implement less-rigorous state machines that virtually guarantee transitions in the latent space which can't appear in the larger models. These transitions appear as 'poor performance' on standardised tests, but in the context of storytelling, manifest as 'creativity'. Larger models, by comparison, learn a broader variety of more discretized features - but cannot utilise these in the way the smaller models do, because they have sufficient capacity to satisfy alignment and instruction-following to a degree that globally eliminates these 'unwanted' state transitions.

·

That's an interesting take. I'm surprised it's also worse in terms of structure (hero's journey here).

Bad prompting. Try this: copy and paste a few stories like the ones you want, then ask the model to follow the same style and to add a moral like in the stories you gave it, and you will see the difference.
If you give me a few story examples, I will try a few models and get back to you.
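In case it is useful, here is a minimal sketch of that few-shot approach; the example stories, the model ID, and the client call are placeholders rather than the exact setup discussed here.

```python
# Minimal sketch of the few-shot idea above: paste example stories, then ask the
# model to match their style and add a moral. The example stories are placeholders.
from huggingface_hub import InferenceClient

example_stories = [
    "Once upon a time, a small fox lost her lantern... (full example story here)",
    "Deep in the pine woods, a beaver built a door that only opened for kindness... (full example story here)",
]

examples_block = "\n\n".join(
    f"Example story {i + 1}:\n{story}" for i, story in enumerate(example_stories)
)

prompt = (
    f"{examples_block}\n\n"
    "Now write a new bedtime story about bears and waffles. "
    "Follow the same style, structure, and length as the examples, "
    "and end with a clear moral like they do."
)

client = InferenceClient(model="meta-llama/Meta-Llama-3-8B-Instruct")  # any chat model
reply = client.chat_completion(
    messages=[{"role": "user", "content": prompt}],
    max_tokens=800,
)
print(reply.choices[0].message.content)
```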

·

I don't think it's bad prompting, my friend.
For local models, if it is a problem, then yes!
But for public models it's 100% guardrails.
I experimented with RAG/wrapper guardrailing of a model and found these restrictions inhibited the model's ability to retrieve data as well as to enrich the response with supplied content, since even the supplied content gets filtered; hence it sometimes does not mention important facts in the data!

Today people are generating apps. This will also need guardrailing, because the examples used were often stolen from GitHub, so commercial developers will also require the model NOT to make a version of Windows, or a Waddingtons-style Monopoly game. So apps previously created for commercial use will not be able to be generated!

But the future is this: we will not need to buy apps (the first loss to publishers), as a good developer will be able to generate apps per task. Hence the requirement for apps will be gone in the future, because actions in an app are a series of functions, i.e. a SEQUENCE. Given an instruction to create a REPL to execute a ..., or an editor to edit ... (picture, sound, text), or to generate a website (these are actually all generic), even apps are generic, so commercial developers are at risk.

This is also the same issue for music producers (Stable Audio) and images (artists and advertisers; we can generate better designs and adverts), even music!
So they will also require that the models not generate already existing music and art!

All of this will be handled with guardrailing, and the commercial providers are obligated to mitigate this risk of copyright fraud, hence the requirement to understand open-source licence requirements, and to always have a licence or you are not protected!

Example:
Prompt:
tell me a bedtime story like "this one" but different. a story with a moral and involving animals.

(image attachment: generated story)

·

And another:

(image attachment: another generated story)

I am personally of the opinion that the larger models, especially technically proficient ones like Claude or 4o, have been intentionally 'broken' away from storytelling, as they have become much more helpful and critical in their role as co-engineers. I have personally conscripted Claude for some testing, and it's given me about 1/3 of an AI model that I basically only had to design and fix instead of considering every detail without knowing the interactions. This lack of hallucination and skill for deterministic writing likely detracts from any creative elements present. Picture a highly autistic person with savant-level programming and logic skills. This person would be a genius at code, but likely poor at creative writing unless instructed. The same would be true of a synthetic mind given only factual and grounded data for much of its training, as Anthropic seems to be doing for (obvious) safety reasons.

·

Hence experts, then a mixture of experts to create a model which can focus on specific areas of weakness. So what we are doing here is identifying the weakness.
The result is to train the boss programmer, the boss researcher and planner, a summary writer, a creative storyteller, and a general knowledge bot with pure content-driven, corpus-dumped information.
So the models, after conferring, can create an output that is truly fit for purpose.
It is true that we seek a single model which can perform all tasks, but it seems that role-play models and creative models cannot do math or use tools well!

so:

The 1x4 models are much more capable of being multi-agent / multi-expert systems. There was a similar model which used "talking heads": the model would generate multiple responses and discuss them first before giving an output; each head's output could be shown or muted as required, and heads could be added or removed. Perhaps a less memory-intensive version of mixture-of-agents, which actually combines networks, whereas the Quiet-STaR models only have one network between heads. (A mixture of Quiet-STaR models might turn the model crazy, as too many heads to change a lightbulb means it never gets done!)
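For what it's worth, here is a rough sketch of that "several heads draft, then confer" pattern as a generic mixture-of-agents-style loop; the model ID, prompts, and head count are placeholders, and this is not the exact talking-heads or Quiet-STaR setup described above.

```python
# Rough sketch: sample several candidate answers ("heads"), then ask the model to
# confer over them and merge the best parts into a single final output.
from huggingface_hub import InferenceClient

client = InferenceClient(model="meta-llama/Meta-Llama-3-8B-Instruct")  # placeholder model
question = "Tell me a bedtime story about bears and waffles."
n_heads = 3

# Step 1: each "head" drafts its own answer at a fairly high temperature.
drafts = []
for _ in range(n_heads):
    out = client.chat_completion(
        messages=[{"role": "user", "content": question}],
        max_tokens=600,
        temperature=1.0,
    )
    drafts.append(out.choices[0].message.content)

# Step 2: a final pass reads all drafts and writes one combined answer.
joined = "\n\n".join(f"Draft {i + 1}:\n{d}" for i, d in enumerate(drafts))
merge_prompt = (
    f"Here are {n_heads} draft answers to the same request:\n\n{joined}\n\n"
    "Discuss their strengths briefly, then write one final story that keeps "
    "the most creative ideas and fixes any weaknesses."
)
final = client.chat_completion(
    messages=[{"role": "user", "content": merge_prompt}],
    max_tokens=800,
)
print(final.choices[0].message.content)
```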

I think it's that the larger models have less choice, because they are more sure of what they do. The smaller models take a much bigger step, so the content may not be 100% cliché...

This is EXACTLY what my upcoming model will be about: highly creative story writing!

You are 100% right, and not only are large API models less creative, the GPT-isms would make you cringe as well 🙃

(see the example at https://huggingface.co/SicariusSicariiStuff/LLAMA-3_8B_Unaligned#current-status)

Here's my hypothesis for the dullness of large API models; it's probably mainly these 2 things:
1:
Their overtuned prompt adherence (if you wrote "the bear is tall and sad", they will be compelled to insert a line such as "The tall and sad bear", instead of using it as a general guideline)

2:
The result of averaging all the text they are trained on, as most of it is 'dull' instructions (instruction data is supposed to be dull haha)

·

I recently saw something on schemas:

They were giving the model various schemas to teach it how to choose solution pathways for solving complex equations or math problems which contain complex steps. They gave it a schema for how to calculate volumes for objects, so given a circle it will go down a pathway with all the correct formulas for circles, circumference, and diameter, giving it the components required for producing such a calculation, for each shape type! They had a high-level and a very detailed level. So basically all maths is methodologies and mental arithmetic, so the model should be able to complete this task, but it gets lost in sub-calculations etc.
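To make the schema idea concrete, here is a toy sketch; the shape formulas are the standard ones, but the routing structure itself is just an illustration of how a "pathway per shape type" might be encoded.

```python
# Toy sketch of the schema idea: each shape type gets a "pathway" of named steps,
# so the model (or a tool it calls) always has the right formulas to hand.
import math

SCHEMAS = {
    "circle": {
        "inputs": ["radius"],
        "steps": {
            "diameter": lambda radius: 2 * radius,
            "circumference": lambda radius: 2 * math.pi * radius,
            "area": lambda radius: math.pi * radius ** 2,
        },
    },
    "sphere": {
        "inputs": ["radius"],
        "steps": {
            "surface_area": lambda radius: 4 * math.pi * radius ** 2,
            "volume": lambda radius: (4 / 3) * math.pi * radius ** 3,
        },
    },
}

def solve(shape: str, **values):
    """Follow the schema for a shape and return every intermediate result."""
    schema = SCHEMAS[shape]
    missing = [name for name in schema["inputs"] if name not in values]
    if missing:
        raise ValueError(f"missing inputs for {shape}: {missing}")
    return {step: fn(**values) for step, fn in schema["steps"].items()}

print(solve("circle", radius=3.0))
# e.g. {'diameter': 6.0, 'circumference': 18.85, 'area': 28.27}
```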

I have seen some research tools which produce case studies, reports, essays, or various types of specific written articles, and this is achieved by teaching the model the schema for these projects, or by hard-coding a chain, so it can be used to produce the correct data. For an essay we have a format, and a dissertation is a more complex project with findings (which has subsections, sub-projects, and many considerations). So if you can teach the model how to make components, such as titles, summaries, short prose, paragraphs, transposed sentences, etc., then these smaller tasks would form the backend for the model's abilities. Then, when you train the model to create stories, that model will have to receive a set of prompts and produce a set of original outputs, good and bad, and these outputs would be used as synthetic data. You would also go to all the GPTs and ask them to produce various types of projects and short texts, i.e. an example chapter on ..., an example paragraph on ..., follow on from this story to the end of the chapter (given a summarized backstory), etc. Teach the model how to extract information from text, such as outlining the characters in the text, identifying the motives of a character in the text, or asking the model how the protagonist should have done A or B.
Tell the model to create storyboards etc. (use other GPTs to produce the synthetic data)... then the model can plan a story by first creating a storyboard, then use the board to write the story: correct planning.
These can also be found on W3Schools... so for such an answer it would need W3Schools access etc., to learn how to write a novel, for instance, as a research task! So a highly connected model given a writing task can plan and research, create a model, then write!
So it's no easy task to train a creative writing model that needs no RAG (where everything is fine-tuned).
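Here is a minimal two-step sketch of that "storyboard first, then write" idea; the prompts and model ID are placeholders, not a tested recipe.

```python
# Minimal sketch of "plan first, then write": step 1 asks for a storyboard,
# step 2 writes the story from that storyboard. Model ID is a placeholder.
from huggingface_hub import InferenceClient

client = InferenceClient(model="meta-llama/Meta-Llama-3-8B-Instruct")
premise = "a bedtime story about bears and waffles, with a clear moral"

# Step 1: storyboard / outline.
plan = client.chat_completion(
    messages=[{
        "role": "user",
        "content": f"Create a 6-beat storyboard for {premise}. "
                   "List the characters, the conflict, and the moral.",
    }],
    max_tokens=400,
).choices[0].message.content

# Step 2: write the story following the storyboard.
story = client.chat_completion(
    messages=[{
        "role": "user",
        "content": f"Storyboard:\n{plan}\n\nWrite the full bedtime story, "
                   "following the storyboard beat by beat.",
    }],
    max_tokens=900,
).choices[0].message.content

print(story)
```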

The question is: what does a model do with a corpus of raw text? How should it handle it, and how do you want to retrieve it? Data coming in should probably have entities identified and topics identified (some grammatical task) to enable the model to identify the useful components, so later, when dropping data in, the model will use these tactics internally to craft the responses, segment the sequential data, and associate it with some task!
Pretraining is very RAW (and has none of these nuances), so when you fine-tune a model for this purpose it is suggested that it be a specialized model!
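As one possible illustration of that entity/topic tagging step, here is a small sketch using spaCy (requires `python -m spacy download en_core_web_sm`); the labels it assigns and the crude noun-chunk "topics" are just examples of what such preprocessing might attach to incoming text before indexing.

```python
# Sketch: tag raw corpus text with entities and rough topics before indexing,
# so later retrieval can key on them.
import spacy

nlp = spacy.load("en_core_web_sm")

def annotate(passage: str) -> dict:
    doc = nlp(passage)
    return {
        "text": passage,
        "entities": [(ent.text, ent.label_) for ent in doc.ents],
        # crude "topics": the distinct noun chunks in the passage
        "topics": sorted({chunk.text.lower() for chunk in doc.noun_chunks}),
    }

record = annotate("Papa Bear opened a waffle stand by the river in Honeywood Forest.")
print(record["entities"])  # e.g. [('Papa Bear', 'PERSON'), ('Honeywood Forest', 'LOC')]
print(record["topics"])
```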

Small models are bed storytellers 🛏️

Hello, Mr. Maxime Labonne. May I ask if the models in your paper "Spam-T5: Benchmarking Large Language Models for Few Shot Email Spam Detection" can be reduced and run again? I want to use your paper, but the running time is too long, so I would like to ask if it is possible to reduce some of the models and run them again

·

Hey @bjhi, yes that makes sense because these models are fully fine-tuned when you run the repo. I cannot share the checkpoints as they belong to JPMorgan.

Okay, I'll give it a try. Thank you very much for replying to me.

Have you tested this on short-form content, e.g. professional content for corporate websites?

Your BigQwen2.5-125B-Instruct self-merge solved a writing problem on the first shot. Technically it wasn't a problem... just a regular prompt that required the model to output content with certain HTML tags.

ChatGPT and all other open source models (interestingly enough, including regular Qwen 2.5 72B Q8) failed no matter how many attempts they were given.

·

I haven't. That's nice, thanks for your feedback. Do you mind sharing the prompt and answer if possible? I'd like to understand what it's good at.

You talked about stories from 3 models and posted only 2.
My guess is you did not post the Daredevil.

From the stories you posted, the one on the left, which is annotated in white only and has Baby Bree in it, is not bad at all.

This is feedback, not a personal attack, but I’d say it’s your expectation and your training that made you conclude that this is a bad story.

First: You are expecting a predetermined blueprint for the story, and if the AI does not deliver that, you strike it down. Sorry, but that is a very limited and unfair way of looking at things.

Second: You are a scientist, so it's quite likely artistic things are not your forte, and that's why you need a blueprint to compare something to in order to decide if it is good or not.

I’ve seen this discrepancy for a long time: AI people do not ask non-AI people for feedback, and their worldview, even if highly intelligent, is very narrow and very specialized, so it’s missing a lot.

It’s the same with crypto companies, but that’s another topic

Back to the story on the left with Baby Bree: I’m going to guess that is ChatGPT, and if you look at it, it’s a really good story for kids aged 2 to 5.

You gave it a very broad request; your prompt should simply have been more detailed if you wanted tension, for example.

A story does not have to have tension, it’s not a must.

You actually asked for a bedtime story, which can mean something vanilla with little to no tension or violence, so it could be said the AIs followed your request to the letter.

Bears and waffles is something quite vanilla as well, so the prompt could be considered a double vanilla prompt.

That being said, I noticed as well that in a lot of instances smaller models can be more creative than big models, but big models can also be creative, and when that happens the result is usually better than what the smaller models produce.

There’s probably too much variation between the models in how they are built and tuned to compare them fairly, so everyone has to find the models that suit whatever they are doing.

Last but not least your Daredevil is a very good model, one of my favorites.