Demo example produces bad results
Thanks for making this effort. It is great and will push the limits of open-source models further.
However, I'm a bit surprised that the provided example with Idefix (the dog) does not lead to good results in my tests. Usually the examples a model's creators give are cherry-picked and work better than the model does on average. Yet when I ask the 9B model about the image, it hallucinates:
"In this picture from Asterix and Obelix, we can see the dog, Rufus, running. Rufus is a dog who appears in the Asterix comics. He is the pet of the Roman legionary, Dogmatix."
If I ask the 80B model, it is correct but vague: "In this picture from Asterix and Obelix, we can see a dog running."
Since the context in the prompt already hints at Asterix and Obelix, I thought it would be no problem for the model to identify the dog as Idefix.
In further tests with other images the trend continues. The Caesar image you also provide as an example is not recognized by the 9B model (it answers Hercules), but the 80B model answers correctly.
Given a photo of Brad Pitt, the 80B model correctly says that it is Brad Pitt if the prompt reads "The name of the man on this picture is", but it hallucinates if you replace "man" with "woman", which I did by accident because I had tested a photo of Julia Roberts before. Any thoughts on that?
Hi,
Thanks for your remarks. Indeed, the model still hallucinates occasionally, and the 9B more so than the 80B. You should get fewer hallucinations with greedy decoding, but some cases remain.
Regarding the Idefix example, running the provided example with greedy decoding should yield this with the 80B version: "Yes, the characters in the image are Asterix, Obelix, and Dogmatix. Their French names are Astérix, Obélix, and Idéfix."
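For reference, greedy decoding just means sampling is disabled at generation time (`do_sample=False` in `generate()`). Here is a minimal sketch using the transformers library; the image URL and prompt text are placeholders for your own test image, and the processor call follows the API from the original Idefics release:

```python
import torch
from transformers import IdeficsForVisionText2Text, AutoProcessor

# 9B instruct checkpoint; swap in "HuggingFaceM4/idefics-80b-instruct"
# if you have the hardware for the larger model.
checkpoint = "HuggingFaceM4/idefics-9b-instruct"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = IdeficsForVisionText2Text.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16
).to(device)
processor = AutoProcessor.from_pretrained(checkpoint)

# Prompts interleave text and images; the URL below is hypothetical.
prompts = [
    [
        "User: Which characters from Asterix and Obelix are in this picture?",
        "https://example.com/idefix.jpg",  # replace with your own image
        "<end_of_utterance>",
        "\nAssistant:",
    ]
]
inputs = processor(prompts, return_tensors="pt").to(device)

# Greedy decoding: do_sample=False makes generate() pick the most likely
# token at every step, which tends to reduce (but not eliminate) hallucinations.
generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=64)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```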
The examples were picked using the 80B version, not the 9B one, so the 9B is not guaranteed to perform well on all of them.
Hallucinations of facts are a known problem for LLMs, so it is not surprising to see some in multimodal models as well. You can see on the SEED leaderboard (https://huggingface.co/spaces/AILab-CVC/SEED-Bench_Leaderboard) that the "instance" tasks, which are the ones most related to hallucinations, are far from solved.
Hallucinations are due to multiple factors, including the size of the images we use (only 224x224 here), the quality of the vision encoder, and the quality of the finetuning. Unfortunately, the image-text datasets available for finetuning are still immature compared to the text-only ones.
Hopefully future multimodal models will keep improving on those fronts, but Idefics 80B should be quite accurate in general!