Which models do you want to see on here?
We started with the following models, as we've seen them most commonly used in eval pipelines:
- OpenAI (GPT-4o, GPT-4 Turbo, GPT-3.5 Turbo)
- Anthropic (Claude 3.5 Sonnet / Haiku, Claude 3 Opus / Sonnet / Haiku)
- Meta (Llama 3.1 Instruct Turbo 405B / 70B / 8B)
- Alibaba (Qwen 2.5 Instruct Turbo 7B / 72B, Qwen 2 Instruct 72B)
- Google (Gemma 2 9B / 27B)
- Mistral (Instruct v0.3 7B, Instruct v0.1 7B)
What models would you be curious to see on here next?
What about these models:
- Microsoft (Phi-3-medium-4k-instruct 14B)
- Alibaba (Qwen 2.5 32B, 14B); they have EQbench scores closer to Qwen 2.5 72B than 7B
- Upstage (solar-pro-preview-instruct 22B)
- Mistral (Mistral-Large-Instruct-2407 123B)
(As a reference for which models to choose,) besides the common benchmarks, here's one [benchmark] that is related to judging.
But how are the judging scores extracted? By number, words, or something else? (see https://arxiv.org/abs/2305.14975)
Gemini models.
What about https://www.flow-ai.com/judge? I believe it fits the criteria and it seems like an interesting smaller competitor based on their pitch in the release blog.
hey! great initiative :) Would love to see a small model like Flow-Judge-v0.1 here! Happy to support with the integration if needed.
Good shouts! I'm curious to see how those Qwen models would do given that the 2.5 7B is doing pretty well. And those benchmarks are very interesting; evaluating writing quality is a seriously tough task...
The judge score and critique are extracted from a JSON output, {"feedback": "...", "result": ...}, similar to the Lynx paper.
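For anyone curious what that extraction might look like in practice, here's a minimal sketch of parsing that kind of JSON judge output. The function name and the regex fallback are my own assumptions for illustration, not the Arena's actual code:

```python
import json
import re


def parse_judge_output(raw: str):
    """Parse a judge completion of the form {"feedback": "...", "result": <score>}.

    Returns (feedback, score) or None if no valid JSON object is found.
    Hypothetical helper; the actual Judge Arena extraction may differ.
    """
    # The model may wrap the JSON in extra text or a code fence,
    # so grab the first {...} block rather than parsing the whole string.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        return None
    try:
        data = json.loads(match.group(0))
        feedback = str(data.get("feedback", ""))
        score = int(data["result"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None
    return feedback, score


# Example:
# parse_judge_output('{"feedback": "Clear and correct.", "result": 5}')
# -> ("Clear and correct.", 5)
```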
👀 Will add Flow Judge in our next update; I'm super excited to see how a dedicated 3.8B model does.
Add Command-R and Command-R+, both old and new. They were the least positively biased in my experience.
Great work! We are looking forward to such an arena for judge models!
How about adding the CompassJudger series (https://github.com/open-compass/CompassJudger), which reached top performance among generative models on
RewardBench (https://huggingface.co/spaces/allenai/reward-bench),
JudgerBench (https://huggingface.co/spaces/opencompass/judgerbench_leaderboard), and
JudgeBench (https://huggingface.co/spaces/ScalerLab/JudgeBench)?
It can also be applied to many subjective evaluation datasets as a judge model, for example in Arena-Hard: https://github.com/lmarena/arena-hard-auto/issues/49
New models live on Judge Arena!
Prometheus-7b-v2, Command-R, and Command-R+ are now in the race 🚀
We’ll get other specialised judge models on here soon.
++
@bittersweet do you have an email address I can reach out to? I’ve tried to get in touch re: getting CompassJudger on here.
Just use this: [email protected]
- o1
- Hermes3-405B (I think Lambda is still offering this for free)
- Athene-v2
- Deepseek-2.5
- the OG GPT-4-0314
- Grok-2