How did you evaluate the Qwen chat models on MMLU (or any other dataset)?
#49
by omers66 · opened
Hey, I'm curious how the evaluation process worked for any of the chat models on question-answering datasets.
Take MMLU, for example: I know the prompts are usually prefixed with:
'The following are multiple choice questions (with answers) about ...'
But then the chat model doesn't necessarily answer with 'A', 'B', 'C', or 'D'; it can start chatting about the questions, the answers, etc.
How did you deal with that?
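For context on what's commonly done: evaluation harnesses such as EleutherAI's lm-evaluation-harness often sidestep the chattiness entirely by not sampling a free-form reply at all; they read the model's next-token logits after "Answer:" and compare only the four option letters. Here is a minimal sketch with transformers (the model name, sample question, and the single-token treatment of each letter are illustrative assumptions, not the actual Qwen eval setup):

```python
# Minimal sketch: score an MMLU item by comparing next-token logits over
# the option letters instead of letting the chat model generate freely.
# The model name and sample question below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen-7B-Chat"  # illustrative choice of chat model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
model.eval()

prompt = (
    "The following are multiple choice questions (with answers) about "
    "astronomy.\n\n"
    "What is the closest star to the Sun?\n"
    "A. Sirius\nB. Proxima Centauri\nC. Betelgeuse\nD. Vega\n"
    "Answer:"
)

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]  # logits for the next token

# Compare only the four letters, so a chatty continuation such as
# "Sure, let's work through this..." can never win the argmax.
option_ids = [
    tokenizer.encode(f" {letter}", add_special_tokens=False)[-1]
    for letter in "ABCD"
]
prediction = "ABCD"[next_token_logits[option_ids].argmax().item()]
print(prediction)  # compared against the gold letter to compute accuracy
```

A common variant instead generates a short reply and regex-extracts the first 'A'-'D' from it; the logit comparison above simply avoids that parsing step.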
Hi, please see the following for an example:
omers66 changed discussion status to closed