The performance difference between llava-hf/llava-1.5-7b-hf and liuhaotian/llava-v1.5-7b on the MME benchmark.
#44
by NaForAll - opened
I've found a performance difference between the hf version and liuhaotian's original version. The results are quite low when I test llava-1.5 hf on the MME benchmark. liuhaotian's llava-v1.5 scores over 1500 on MME in his paper (https://arxiv.org/abs/2310.03744), while in my repeated runs llava-1.5 hf scores around 1000 across the MME perception and cognition tasks. (For scale, MME perception has a maximum of 2000 points over its ten subtasks and cognition a maximum of 800 over its four.) This performance gap is pretty confusing.
My evaluation code is as follows:
import os

import torch
from PIL import Image
from tqdm import tqdm
from transformers import AutoProcessor, LlavaForConditionalGeneration

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = args.model_path  # args comes from the surrounding argparse setup (not shown)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    trust_remote_code=True,
).to(device)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

mme_folder = "MME"  # root of the unpacked MME benchmark data
mme_type_dict = {
    "Perception": ["existence", "count", "position", "color", "posters", "celebrity", "scene", "landmark", "artwork", "OCR"],
    "Cognition": ["commonsense_reasoning", "numerical_calculation", "text_translation", "code_reasoning"],
}

answer_dir = os.path.join(mme_folder, "gen_answers")
os.makedirs(answer_dir, exist_ok=True)

for task_type, task_list in mme_type_dict.items():
    for task in task_list:
        answer_file = open(os.path.join(answer_dir, "{}.txt".format(task)), "w")
        gt_file = open(os.path.join(mme_folder, "eval_tool/Your_Results/examples/{}.txt".format(task)), "r", encoding="utf-8")
        gt_lines = gt_file.readlines()
        for gt_line in tqdm(gt_lines, desc=task):
            # Each ground-truth line is: image_id \t question \t answer
            img_id, qs, gt_answer = gt_line.split("\t")[:3]
            raw_image = Image.open(os.path.join(mme_folder, task, img_id))
            question = "<image> " + qs
            inputs = processor(images=raw_image, text=question, return_tensors="pt", padding=True, truncation=True, max_length=512).to(device)
            output = model.generate(
                **inputs,
                do_sample=False,
                max_new_tokens=32,
                use_cache=True,
            )
            # Decode only the newly generated tokens, skipping the prompt.
            gen_answer = processor.decode(output[0][inputs.input_ids.size(1):], skip_special_tokens=True).strip().replace("\n", "")
            answer_file.write("{}\t{}\t{}\t{}\n".format(img_id, qs, gt_answer.replace("\n", ""), gen_answer))
        answer_file.close()
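For what it's worth, one likely contributor to a gap this large is the prompt format. The llava-hf checkpoints expect a vicuna-style chat prompt ("USER: <image>\n<question> ASSISTANT:" per the llava-hf/llava-1.5-7b-hf model card), whereas the code above feeds the bare question right after the <image> token, so the model is answering out of distribution. A minimal sketch of the change, reusing the variables from the loop above and untested on my side:

# Hedged sketch: wrap each MME question in the chat template documented on the
# llava-hf model card instead of passing "<image> " + qs directly.
prompt = "USER: <image>\n{} ASSISTANT:".format(qs.strip())
inputs = processor(images=raw_image, text=prompt, return_tensors="pt").to(device)
output = model.generate(**inputs, do_sample=False, max_new_tokens=32, use_cache=True)
# Decode only the generated continuation, as before.
gen_answer = processor.decode(output[0][inputs.input_ids.size(1):], skip_special_tokens=True).strip()

If the template is the issue, the greedy outputs should shift toward the short "Yes"/"No" answers that MME's eval tool parses; I haven't verified the exact scores myself.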
NaForAll changed discussion status to closed
NaForAll changed discussion status to open
NaForAll changed discussion status to closed
I encountered the same problem.