Rethinking FID: Towards a Better Evaluation Metric for Image Generation Paper • 2401.09603 • Published Nov 30, 2023 • 16
LLM Comparator: Visual Analytics for Side-by-Side Evaluation of Large Language Models Paper • 2402.10524 • Published Feb 16 • 21
Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming Paper • 2402.14261 • Published Feb 22 • 10
RewardBench: Evaluating Reward Models for Language Modeling Paper • 2403.13787 • Published Mar 20 • 21