The evaluator was a painful (and long) lesson in how poor LLM agents can be

#6
by hardknee - opened
  • Far too strict in requiring answers to match the reference verbatim. Incapable of recognising that code using different naming etc. is also a correct answer. Many examples, but to name two:

#2 insists on creating a new visit_webpage tool when importing VisitWebPage is just as correct.
#5 rejects an answer for referencing anthropic/claude-3.5-sonnet instead of anthropic/claude-3-sonnet.

  • Consistently inconsistent: previously accepted answers are later rejected.
  • Reference files are incorrect: #3 only accepts an answer that is incorrect and inconsistent with the smolagents documentation. There is no smolagents import named E2BSandbox. It rejected the correct example given in both the smolagents and E2B documentation.

Same problem. I don't understand what E2BSandbox is. #3 seems unsolvable.

Same here, the matching should allow for more flexibility.
The candidate is given errors like "too strict requirement for tooling" in question #4. I don't even understand...
E2BSandbox seems unsolvable going by the HF blog and the smolagents source code, which makes it a bit hard to satisfy...
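
To illustrate the kind of flexibility being asked for, here is a minimal sketch of a name-insensitive comparison. It is not how the actual evaluator works (that appears to be an LLM judge), and `same_structure` and `_Canonicalize` are hypothetical names invented for this example. It only uses Python's standard `ast` module: every identifier is renamed to a placeholder in order of first appearance, so two answers that differ only in naming compare equal.

```python
import ast


class _Canonicalize(ast.NodeTransformer):
    """Rename identifiers to v0, v1, ... in order of first appearance."""

    def __init__(self):
        self.names = {}

    def _canon(self, name):
        # Reuse the placeholder if this name was already seen.
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_Name(self, node):
        node.id = self._canon(node.id)
        return node

    def visit_arg(self, node):
        node.arg = self._canon(node.arg)
        self.generic_visit(node)
        return node

    def visit_FunctionDef(self, node):
        node.name = self._canon(node.name)
        self.generic_visit(node)
        return node


def same_structure(answer: str, reference: str) -> bool:
    """True if the two snippets are identical up to identifier naming."""
    def dump(src):
        return ast.dump(_Canonicalize().visit(ast.parse(src)))
    return dump(answer) == dump(reference)


reference = "def fetch(url):\n    page = get(url)\n    return page\n"
answer = "def visit(link):\n    html = get(link)\n    return html\n"
print(same_structure(answer, reference))
```

This is deliberately crude: it ignores all names, including the names of called functions, and it does nothing for string constants such as a model id, so it would not by itself settle the #5 case. It only shows that "different naming" can be checked mechanically instead of requiring a verbatim match.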
