SMMILE: An Expert-Driven Benchmark for Multimodal Medical In-Context Learning • Paper • 2506.21355 • Published Jun 26, 2025
MARBLE: A Hard Benchmark for Multimodal Spatial Reasoning and Planning • Paper • 2506.22992 • Published Jun 28, 2025
Reverse Image Retrieval Cues Parametric Memory in Multimodal LLMs • Paper • 2405.18740 • Published May 29, 2024
Almanac Copilot: Towards Autonomous Electronic Health Record Navigation • Paper • 2405.07896 • Published Apr 30, 2024
AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments • Paper • 2405.07960 • Published May 13, 2024
MIRIAD: Augmenting LLMs with millions of medical query-response pairs • Paper • 2506.06091 • Published Jun 6, 2025
Med-PRM: Medical Reasoning Models with Stepwise, Guideline-verified Process Rewards • Paper • 2506.11474 • Published Jun 13, 2025
reWordBench: Benchmarking and Improving the Robustness of Reward Models with Transformed Inputs • Paper • 2503.11751 • Published Mar 14, 2025
Discover and Cure: Concept-aware Mitigation of Spurious Correlation • Paper • 2305.00650 • Published May 1, 2023
STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases • Paper • 2404.13207 • Published Apr 19, 2024
AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning • Paper • 2406.11200 • Published Jun 17, 2024
Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task • Paper • 1809.08887 • Published Sep 24, 2018
ScisummNet: A Large Annotated Corpus and Content-Impact Models for Scientific Paper Summarization with Citation Networks • Paper • 1909.01716 • Published Sep 4, 2019
Beyond Positive Scaling: How Negation Impacts Scaling Trends of Language Models • Paper • 2305.17311 • Published May 27, 2023
WILDS: A Benchmark of in-the-Wild Distribution Shifts • Paper • 2012.07421 • Published Dec 14, 2020
LM-Critic: Language Models for Unsupervised Grammatical Error Correction • Paper • 2109.06822 • Published Sep 14, 2021