
How did you evaluate the Qwen chat models on MMLU (or any other datasets)

#49
by omers66 - opened

Hey, I'm curious how the evaluation was carried out for any of the chat models on question-answering datasets.
Take MMLU, for example: I know the prompts are usually prefixed with:
'The following are multiple choice questions (with answers) about ...'
But a chat model doesn't necessarily answer with 'A', 'B', 'C', or 'D'; it may start chatting about the questions, the answers, etc.
How did you deal with that?
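For context, a common workaround used by evaluation harnesses (not necessarily what Qwen did) is to skip free-form generation entirely and instead compare the model's next-token scores for each candidate answer letter, picking the highest. Here is a minimal sketch of that scoring step; the logits vector and token ids are toy stand-ins for what a real forward pass and tokenizer would produce:

```python
import numpy as np

def pick_choice(next_token_logits, choice_token_ids, labels=("A", "B", "C", "D")):
    """Return the answer letter whose token has the highest logit.

    `next_token_logits` is the model's score vector for the token
    following the prompt; `choice_token_ids` maps each answer letter
    to its vocabulary id (hypothetical ids below).
    """
    scores = [next_token_logits[tid] for tid in choice_token_ids]
    return labels[int(np.argmax(scores))]

# Toy vocabulary: pretend ids 10..13 encode "A".."D".
logits = np.zeros(100)
logits[12] = 5.0  # the model strongly prefers "C"
print(pick_choice(logits, [10, 11, 12, 13]))  # prints "C"
```

Because only the candidate letters are scored, the model can never "chat" its way out of the multiple-choice format.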

Hi, please see the following for an example:

omers66 changed discussion status to closed