The performance difference between llava-hf/llava-1.5-7b-hf and liuhaotian/llava-v1.5-7b on the MME benchmark.
#44
by NaForAll - opened
I've found a performance difference between the hf version and liuhaotian's original version. The results are quite low when I test llava-1.5 hf on the MME benchmark. liuhaotian's llava-v1.5 scores over 1500 on MME in his paper (https://arxiv.org/abs/2310.03744), while in my repeated runs llava-1.5 hf scores around 1000 across the MME perception and cognition tasks. (For scale, MME perception has a maximum of 2000 points over its ten subtasks and cognition a maximum of 800 over its four.) This performance gap is pretty confusing.
My evaluation code is as follows:
import os

import torch
from PIL import Image
from tqdm import tqdm
from transformers import AutoProcessor, LlavaForConditionalGeneration

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = args.model_path  # args comes from the surrounding argparse setup (not shown)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    trust_remote_code=True,
).to(device)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

mme_folder = "MME"  # root of the unpacked MME benchmark data
mme_type_dict = {
    "Perception": ["existence", "count", "position", "color", "posters", "celebrity", "scene", "landmark", "artwork", "OCR"],
    "Cognition": ["commonsense_reasoning", "numerical_calculation", "text_translation", "code_reasoning"],
}

answer_dir = os.path.join(mme_folder, "gen_answers")
os.makedirs(answer_dir, exist_ok=True)

for task_type, task_list in mme_type_dict.items():
    for task in task_list:
        answer_file = open(os.path.join(answer_dir, "{}.txt".format(task)), "w")
        gt_file = open(os.path.join(mme_folder, "eval_tool/Your_Results/examples/{}.txt".format(task)), "r", encoding="utf-8")
        gt_lines = gt_file.readlines()
        for gt_line in tqdm(gt_lines, desc=task):
            # Each ground-truth line is: image_id \t question \t answer
            img_id, qs, gt_answer = gt_line.split("\t")[:3]
            raw_image = Image.open(os.path.join(mme_folder, task, img_id))
            question = "<image> " + qs
            inputs = processor(images=raw_image, text=question, return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
            output = model.generate(
                **inputs,
                do_sample=False,
                max_new_tokens=32,
                use_cache=True,
            )
            # Decode only the newly generated tokens, skipping the prompt.
            gen_answer = processor.decode(output[0][inputs.input_ids.size(1):], skip_special_tokens=True).strip().replace("\n", "")
            answer_file.write("{}\t{}\t{}\t{}\n".format(img_id, qs, gt_answer.replace("\n", ""), gen_answer))
        answer_file.close()
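For what it's worth, one likely contributor to a gap this large is the prompt format. The llava-hf checkpoints expect a vicuna-style chat prompt ("USER: <image>\n<question> ASSISTANT:" per the llava-hf/llava-1.5-7b-hf model card), whereas the code above feeds the bare question right after the <image> token, so the model is answering out of distribution. A minimal sketch of the change, reusing the variables from the loop above and untested on my side:

# Hedged sketch: wrap each MME question in the chat template documented on the
# llava-hf model card instead of passing "<image> " + qs directly.
prompt = "USER: <image>\n{} ASSISTANT:".format(qs.strip())
inputs = processor(images=raw_image, text=prompt, return_tensors="pt").to(device)
output = model.generate(**inputs, do_sample=False, max_new_tokens=32, use_cache=True)
# Decode only the generated continuation, as before.
gen_answer = processor.decode(output[0][inputs.input_ids.size(1):], skip_special_tokens=True).strip()

If the template is the issue, the greedy outputs should shift toward the short "Yes"/"No" answers that MME's eval tool parses; I haven't verified the exact scores myself.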
NaForAll changed discussion status to closed
NaForAll changed discussion status to open
NaForAll changed discussion status to closed
I encountered the same problem.