matlok's Collections
Papers - Observability and Interpretability
JoMA: Demystifying Multilayer Transformers via Joint Dynamics of MLP and Attention (arXiv:2310.00535)
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small (arXiv:2211.00593)
Rethinking Interpretability in the Era of Large Language Models (arXiv:2402.01761)
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla (arXiv:2307.09458)
Sparse Autoencoders Find Highly Interpretable Features in Language Models (arXiv:2309.08600)
Interpretability at Scale: Identifying Causal Mechanisms in Alpaca (arXiv:2305.08809)
Natural Language Decomposition and Interpretation of Complex Utterances (arXiv:2305.08677)
Information Flow Routes: Automatically Interpreting Language Models at Scale (arXiv:2403.00824)
Structural Similarities Between Language Models and Neural Response Measurements (arXiv:2306.01930)
The Impact of Depth and Width on Transformer Language Model Generalization (arXiv:2310.19956)
The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI (arXiv:2310.16787)
A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity (arXiv:2305.13169)
A Watermark for Large Language Models (arXiv:2301.10226)
Universal and Transferable Adversarial Attacks on Aligned Language Models (arXiv:2307.15043)
Vision Transformers Need Registers (arXiv:2309.16588)
Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks (arXiv:2309.17410)
On the Origin of LLMs: An Evolutionary Tree and Graph for 15,821 Large Language Models (arXiv:2307.09793)
Tools for Verifying Neural Models' Training Data (arXiv:2307.00682)
Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models (arXiv:2310.02949)
Chain-of-Thought Reasoning Without Prompting (arXiv:2402.10200)
Building and Interpreting Deep Similarity Models (arXiv:2003.05431)
Long-form factuality in large language models (arXiv:2403.18802)
Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models (arXiv:2403.20331)
Locating and Editing Factual Associations in Mamba (arXiv:2404.03646)
BERT Rediscovers the Classical NLP Pipeline (arXiv:1905.05950)
Prompt-to-Prompt Image Editing with Cross Attention Control (arXiv:2208.01626)
LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models (arXiv:2404.03118)