
Open Concept Steering
Open Concept Steering is an open-source library for discovering and manipulating interpretable features in large language models using Sparse Autoencoders (SAEs). Inspired by Anthropic's work on Scaling Monosemanticity and Golden Gate Claude, this project aims to make concept steering accessible to the broader research community.
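For orientation, here is a minimal sketch of the SAE recipe described in Anthropic's Towards Monosemanticity: a linear encoder with a ReLU that produces sparse, non-negative feature activations, a linear decoder that reconstructs the original activation, and an L1 penalty that encourages sparsity. All names and the penalty coefficient are illustrative, not this library's API.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """ReLU encoder -> sparse features -> linear decoder, per Towards Monosemanticity."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # sparse, non-negative feature activations
        x_hat = self.decoder(f)           # reconstruction of the input activation
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 5.0):
    # Mean squared reconstruction error plus an L1 penalty that pushes most
    # feature activations to zero; l1_coeff (illustrative value) trades
    # reconstruction quality against sparsity.
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity
```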
Features
Coming soon:
- Universal Model Support: Train SAEs on any Hugging Face transformer model
- Feature Discovery: Find interpretable features representing specific concepts
- Concept Steering: Amplify or suppress discovered features to influence model behavior (see the sketch after this list)
- Interactive Chat: Chat with models while manipulating their internal features
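Concept steering boils down to adding a scaled feature direction into the residual stream at one layer during generation. Below is a hedged sketch using plain transformers forward hooks; the model name, layer index, steering strength, and the random stand-in direction are all illustrative assumptions (in practice the direction would be a column of a trained SAE's decoder).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "allenai/OLMo-2-1124-7B"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

LAYER, ALPHA = 16, 8.0  # illustrative layer index and steering strength
# In practice the direction is a column of a trained SAE decoder, e.g.
# sae.decoder.weight[:, feature_index]; a random unit vector keeps this
# sketch self-contained.
direction = torch.randn(model.config.hidden_size)
direction = direction / direction.norm()

def steer(module, inputs, output):
    # Add the scaled feature direction to this layer's residual stream output.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * direction.to(hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steer)
inputs = tokenizer("Tell me about yourself.", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(out[0], skip_special_tokens=True))
handle.remove()  # restore unsteered behavior
```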
Pre-trained Models
In the spirit of fully open-source models, we have started training SAEs on OLMo 2 7B.
We will provide pre-trained SAEs and discovered features for popular models on Hugging Face.
Each model repository will include:
- Trained SAE weights
- Catalog of discovered interpretable features
- Example steering configurations
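Once these repositories are published, loading the weights could look roughly like the following; the repo id and filename are placeholders, and SparseAutoencoder refers to the class sketched near the top of this page.

```python
import torch
from huggingface_hub import hf_hub_download

# Placeholder repo id and filename, not an actual published repository.
path = hf_hub_download(
    repo_id="open-concept-steering/olmo2-7b-sae",
    filename="sae_weights.pt",
)
# The SAE width (d_features) of a published checkpoint would be read from
# its model card; 65536 is an illustrative guess.
sae = SparseAutoencoder(d_model=4096, d_features=65536)
sae.load_state_dict(torch.load(path, map_location="cpu"))
sae.eval()
```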
Datasets
The dataset of activations from OLMo 2 7B's middle layer is here; it contains about 600 million residual stream vectors.
More to come!
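For a rough picture of how such a dataset is assembled, here is a hedged sketch that records the residual stream at one layer with a forward hook. The model name, layer index, and sample texts are illustrative, and a real run would loop over a large corpus, mask out padding positions, and stream batches to disk rather than keep them in memory.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "allenai/OLMo-2-1124-7B"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

LAYER = 16  # "middle layer" of OLMo 2 7B's 32 decoder layers
captured = []

def capture(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    # Flatten (batch, seq, d_model) into individual residual stream vectors.
    captured.append(hidden.detach().reshape(-1, hidden.shape[-1]).cpu())

handle = model.model.layers[LAYER].register_forward_hook(capture)
texts = ["The quick brown fox jumps over the lazy dog.", "Hello, world!"]
with torch.no_grad():
    model(**tokenizer(texts, return_tensors="pt", padding=True))
handle.remove()

vectors = torch.cat(captured)  # each row is one residual stream vector
```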
Quick Start
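Until the library's own quick start lands, here is a hedged sketch of the core loop, feature discovery, stringing together the pieces from the sketches above (the loaded model and tokenizer, the capture hook, and a trained SparseAutoencoder sae); this is plain PyTorch, not this library's actual API.

```python
import torch

# Run concept-related text through the model and capture the residual stream.
captured.clear()
handle = model.model.layers[LAYER].register_forward_hook(capture)
text = "The Golden Gate Bridge spans the strait between San Francisco and Marin."
with torch.no_grad():
    model(**tokenizer(text, return_tensors="pt"))
handle.remove()

# Encode the captured activations with the trained SAE.
acts = torch.cat(captured).float()
with torch.no_grad():
    _, features = sae(acts)  # (n_tokens, d_features) sparse feature activations

# The strongest-firing features are candidates for concept steering, e.g. by
# adding their decoder directions to the residual stream as sketched under Features.
top_values, top_indices = features.max(dim=0).values.topk(5)
print("candidate feature indices:", top_indices.tolist())
```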
Examples
Check out the steered OLMo 2 7B model!
License
This project is licensed under the MIT License.
Citation
If you feel compelled to cite this library in your work, feel free to do so however you please.
Acknowledgments
This project builds upon the work described in Towards Monosemanticity: Decomposing Language Models With Dictionary Learning, Update on how we train SAEs, and Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet by Anthropic; it absolutely would not have been possible without that work.