Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming • Paper • 2501.18837 • Published 5 days ago • 7
Trading Inference-Time Compute for Adversarial Robustness • Paper • 2501.18841 • Published 5 days ago • 3
DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning • Paper • 2411.04983 • Published Nov 7, 2024 • 9
PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding • Paper • 2501.16411 • Published 8 days ago • 17
People who frequently use ChatGPT for writing tasks are accurate and robust detectors of AI-generated text • Paper • 2501.15654 • Published 9 days ago • 9
Virus: Harmful Fine-tuning Attack for Large Language Models Bypassing Guardrail Moderation • Paper • 2501.17433 • Published 6 days ago • 7
Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate • Paper • 2501.17703 • Published 6 days ago • 45
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training • Paper • 2501.17161 • Published 7 days ago • 93
DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation • Paper • 2501.16764 • Published 7 days ago • 20
Are Vision Language Models Texture or Shape Biased and Can We Steer Them? • Paper • 2403.09193 • Published Mar 14, 2024 • 9