How did you evaluate the Qwen chat models on MMLU (or any other dataset)?
#49
by omers66 · opened
Hey, I'm curious how the evaluation process worked for any of the chat models on question-answering datasets.
Take MMLU, for example: I know the prompts are usually prefixed with:
'The following are multiple choice questions (with answers) about ...'
But then the chat model doesn't necessarily answer with 'A', 'B', 'C', or 'D'; it can start chatting about the questions, the answers, etc.
How did you deal with that?
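For context on what's commonly done: evaluation harnesses such as EleutherAI's lm-evaluation-harness often sidestep the chattiness entirely by not sampling a free-form reply at all; they read the model's next-token logits after "Answer:" and compare only the four option letters. Here is a minimal sketch with transformers (the model name, sample question, and the single-token treatment of each letter are illustrative assumptions, not the actual Qwen eval setup):

```python
# Minimal sketch: score an MMLU item by comparing next-token logits over
# the option letters instead of letting the chat model generate freely.
# The model name and sample question below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen-7B-Chat"  # illustrative choice of chat model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
model.eval()

prompt = (
    "The following are multiple choice questions (with answers) about "
    "astronomy.\n\n"
    "What is the closest star to the Sun?\n"
    "A. Sirius\nB. Proxima Centauri\nC. Betelgeuse\nD. Vega\n"
    "Answer:"
)

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]  # logits for the next token

# Compare only the four letters, so a chatty continuation such as
# "Sure, let's work through this..." can never win the argmax.
option_ids = [
    tokenizer.encode(f" {letter}", add_special_tokens=False)[-1]
    for letter in "ABCD"
]
prediction = "ABCD"[next_token_logits[option_ids].argmax().item()]
print(prediction)  # compared against the gold letter to compute accuracy
```

A common variant instead generates a short reply and regex-extracts the first 'A'-'D' from it; the logit comparison above simply avoids that parsing step.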
Hi, please see the following for an example:
omers66 changed discussion status to closed