- Post Hoc Explanations of Language Models Can Improve Language Models (arXiv:2305.11426, May 19, 2023)
- BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation (arXiv:2101.11718, Jan 27, 2021)
- Black-Box Access is Insufficient for Rigorous AI Audits (arXiv:2401.14446, Jan 25, 2024)
- Mitigating Gender Bias in Distilled Language Models via Counterfactual Role Reversal (arXiv:2203.12574, Mar 23, 2022)
- TalkToModel: Explaining Machine Learning Models with Interactive Natural Language Conversations (arXiv:2207.04154, Jul 8, 2022)
- Towards Bridging the Gaps between the Right to Explanation and the Right to be Forgotten (arXiv:2302.04288, Feb 8, 2023)
- Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs) (arXiv:2407.14937, Jul 20, 2024)
- Measuring Fairness of Text Classifiers via Prediction Sensitivity (arXiv:2203.08670, Mar 16, 2022)
- Insights from the Inverse: Reconstructing LLM Training Goals Through Inverse RL (arXiv:2410.12491, Oct 16, 2024)