Evaluating In-Context Learning Ability
Hello,
First of all, thank you for this amazing project. I am evaluating Chameleon's in-context learning ability, but I think I am missing something about the inference process. In the zero-shot setting, the model's outputs are normal. In the few-shot setting, however, the responses are awkward: the model sometimes avoids answering and occasionally outputs irrelevant characters. Below is the code I used.
def load_model(self, args) -> None:
    """
    Load the Chameleon model and processor.

    Parameters:
    - args: The arguments to load the model.

    Returns:
    None
    """
    import torch
    from transformers import ChameleonForConditionalGeneration, ChameleonProcessor

    print('Loading Chameleon!!!')
    # device_map already places the model on cuda:0, so no extra .to() call is needed.
    self.model = ChameleonForConditionalGeneration.from_pretrained(
        args.hf_path,
        device_map="cuda:0",
        torch_dtype=torch.bfloat16,
    ).eval()
    self.processor = ChameleonProcessor.from_pretrained(args.hf_path)
    self.generation_cfg = {
        'do_sample': True,
        'temperature': 0.7,
        'top_p': 0.9,
        'repetition_penalty': 1.2,
    }
    # Longer generation budget when CoT-style prompting is active.
    if args.is_zero_cot_active or args.is_few_cot_active:
        self.generation_cfg['max_new_tokens'] = 512
    else:
        self.generation_cfg['max_new_tokens'] = 50
    print('Chameleon loaded!!!')
def calculate_generated_text(self, prompt, vision_x):
    """
    Calculate generated text given a prompt and vision data.

    Parameters:
    - prompt (str): The input prompt.
    - vision_x (list[PIL.Image]): List of PIL Images containing vision data.

    Returns:
    Tuple[str, str]: Tuple containing the raw and stripped ("salt") answer text.

    Example prompts:
    - Zero-shot: "<image> <Question> <Options> Answer: "
    - Few-shot:  "<image> <Question> <Options> Answer: <Answer> <image> <Question> <Options> Answer: "
    """
    if self.model is None or self.processor is None:
        raise AttributeError('Model or processor is not initialized. Call load_model first!')
    inputs = self.processor(
        prompt, images=vision_x, padding=True, return_tensors="pt"
    ).to(device=self.model.device, dtype=torch.bfloat16)
    out = self.model.generate(**inputs, **self.generation_cfg)
    generated_text = self.processor.decode(out[0], skip_special_tokens=True)
    # Remove the prompt (without the "<image>" placeholders, which do not appear
    # in the decoded text) to keep only the newly generated answer.
    salt_prompt = prompt.replace("<image>", "")
    salt_answer = generated_text[len(salt_prompt):]
    return generated_text, salt_answer
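For completeness, here is roughly how I assemble the few-shot prompt and the matching image list before calling calculate_generated_text. The build_few_shot_prompt helper, the support_examples/query structures, and their field names are simplified placeholders for this issue, not the exact benchmark code:

def build_few_shot_prompt(support_examples, query):
    """
    Build an interleaved few-shot prompt of the form
    "<image> <Question> <Options> Answer: <Answer> ... <image> <Question> <Options> Answer: "
    together with the list of PIL images, ordered like the "<image>" placeholders.
    """
    prompt_parts, vision_x = [], []
    for ex in support_examples:
        # Each support example contributes one image and a fully answered block.
        prompt_parts.append(f"<image> {ex['question']} {ex['options']} Answer: {ex['answer']}")
        vision_x.append(ex['image'])
    # The query comes last, with the answer left blank for the model to complete.
    prompt_parts.append(f"<image> {query['question']} {query['options']} Answer: ")
    vision_x.append(query['image'])
    return " ".join(prompt_parts), vision_x

# Usage (placeholder data):
# prompt, vision_x = build_few_shot_prompt(support_examples, query)
# generated_text, salt_answer = self.calculate_generated_text(prompt, vision_x)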
Hi @mustafaa. I'm highly interested in trying this model, but there are no clear instructions yet, so I tried your code. I'm wondering how you deal with the prompt length? I got a ValueError when executing inputs = processor(...), and the error persists even after setting generation_cfg['max_length'] and generation_cfg['max_new_tokens'] to 2048. My prompt is "<image> Briefly describe the image. ". The image alone accounts for more than 1000 tokens, so I can't really reduce the input length.
ValueError: Input length of input_ids is 1029, but max_length is set to 20. This can lead to unexpected behavior. You should consider increasing max_length or, better yet, setting max_new_tokens.
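In case it helps with reproducing this, here is a minimal sketch of what I am running. The checkpoint path, image path, and sampling settings are placeholders, and the loading code mirrors the snippet above; my expectation was that passing max_new_tokens directly to generate would override the default max_length of 20 from the error:

import torch
from PIL import Image
from transformers import ChameleonForConditionalGeneration, ChameleonProcessor

model_id = "facebook/chameleon-7b"  # placeholder checkpoint path
model = ChameleonForConditionalGeneration.from_pretrained(
    model_id, device_map="cuda:0", torch_dtype=torch.bfloat16
).eval()
processor = ChameleonProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")  # placeholder image path
prompt = "<image> Briefly describe the image. "

inputs = processor(prompt, images=[image], return_tensors="pt").to(
    device=model.device, dtype=torch.bfloat16
)
# max_new_tokens is passed directly to generate, so it should take precedence
# over the default max_length=20 mentioned in the ValueError.
out = model.generate(**inputs, max_new_tokens=2048, do_sample=True, temperature=0.7, top_p=0.9)
print(processor.decode(out[0], skip_special_tokens=True))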