CBT-Bench: Evaluating Large Language Models on Assisting Cognitive Behavior Therapy Paper • 2410.13218 • Published Oct 17 • 4
Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale Paper • 2409.08264 • Published Sep 12 • 43
Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers Paper • 2409.04109 • Published Sep 6 • 43
Learning to Refuse: Towards Mitigating Privacy Risks in LLMs Paper • 2407.10058 • Published Jul 14 • 29
UNcommonsense Reasoning: Abductive Reasoning about Uncommon Situations Paper • 2311.08469 • Published Nov 14, 2023 • 10