The evaluator was a painful (and long) lesson in how poor LLM agents can be
#6, opened by hardknee
- Way too strict in requiring answers to match the reference verbatim. Incapable of recognising that code using different naming etc. is also a correct answer. Many examples, but for instance:
#2 insists on creating a new visit_webpage tool when importing VisitWebPage is just as correct.
#5 rejects an answer for referencing anthropic/claude-3.5-sonnet instead of anthropic/claude-3-sonnet.
- Consistently inconsistent: previously accepted answers are later rejected.
- Reference files are incorrect: #3 only accepts an answer that is incorrect and inconsistent with the smolagents documentation. There is no smolagents import named E2BSandbox. It rejected the correct example given in both the smolagents and E2B documentation.
Same problem. I don't understand what E2BSandbox is. #3 seems unsolvable.
Same here, the matching should allow for more flexibility.
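To make "more flexibility" concrete, here is a minimal sketch of a comparison that ignores naming differences. It is not how the course evaluator works; `same_structure` and `_Renamer` are hypothetical names made up for illustration. It assumes both answers are plain Python and uses only the standard `ast` module: every identifier is replaced by a placeholder in order of first appearance, then the two syntax trees are compared.

```python
import ast


class _Renamer(ast.NodeTransformer):
    """Hypothetical helper: rename identifiers to v0, v1, ... in order of first use."""

    def __init__(self):
        self.names = {}

    def _canon(self, name):
        # The same original name always maps to the same placeholder.
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_Name(self, node):
        node.id = self._canon(node.id)
        return node

    def visit_arg(self, node):
        node.arg = self._canon(node.arg)
        self.generic_visit(node)
        return node

    def visit_FunctionDef(self, node):
        node.name = self._canon(node.name)
        self.generic_visit(node)
        return node


def same_structure(answer: str, reference: str) -> bool:
    """True if the two snippets differ only in the names they use."""

    def normalise(source):
        tree = _Renamer().visit(ast.parse(source))
        # ast.dump omits line/column attributes by default, so layout is ignored.
        return ast.dump(tree)

    return normalise(answer) == normalise(reference)


reference = "def visit_webpage(url):\n    page = get(url)\n    return page\n"
answer = "def fetch(address):\n    html = get(address)\n    return html\n"
print(same_structure(answer, reference))
```

This is deliberately lenient: it also renames called functions, so it can accept answers that call a different helper. It does not touch string constants, so it would not help with the model ID mismatch in #5, and it cannot tell that importing an existing tool is equivalent to writing a new one as in #2.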
There are errors given to the candidate like: "too strict requirement for tooling" in question #4. I don't even understand...
The E2BSandbox question seems unsolvable: judging by the HF blog and the smolagents source code, the expected answer seems a bit hard to satisfy...