Representation Engineering: A Top-Down Approach to AI Transparency Paper • 2310.01405 • Published Oct 2, 2023
Score-CAM: Score-Weighted Visual Explanations for Convolutional Neural Networks Paper • 1910.01279 • Published Oct 3, 2019
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal Paper • 2402.04249 • Published Feb 6, 2024
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning Paper • 2403.03218 • Published Mar 5, 2024
Out-of-Distribution Detection & Applications With Ablated Learned Temperature Energy Paper • 2401.12129 • Published Jan 22, 2024
Representation Learning in Continuous-Time Dynamic Signed Networks Paper • 2207.03408 • Published Jul 7, 2022
A Careful Examination of Large Language Model Performance on Grade School Arithmetic Paper • 2405.00332 • Published May 1, 2024
Planning In Natural Language Improves LLM Search For Code Generation Paper • 2409.03733 • Published Sep 5, 2024
Learning Goal-Conditioned Representations for Language Reward Models Paper • 2407.13887 • Published Jul 18, 2024
A Baseline Analysis of Reward Models' Ability To Accurately Analyze Foundation Models Under Distribution Shift Paper • 2311.14743 • Published Nov 21, 2023
Federated Reconnaissance: Efficient, Distributed, Class-Incremental Learning Paper • 2109.00150 • Published Sep 1, 2021
Pre-Training Multimodal Hallucination Detectors with Corrupted Grounding Data Paper • 2409.00238 • Published Aug 30, 2024