# Open Concept Steering

Open Concept Steering is an open-source library for discovering and manipulating interpretable features in large language models using Sparse Autoencoders (SAEs). Inspired by Anthropic's work on [Scaling Monosemanticity](https://transformer-circuits.pub/2024/scaling-monosemanticity/) and [Golden Gate Claude](https://www.anthropic.com/news/golden-gate-claude), this project aims to make concept steering accessible to the broader research community.

## Features

- **Universal Model Support**: Train SAEs on any Hugging Face transformer model
- **Feature Discovery**: Find interpretable features representing specific concepts
- **Concept Steering**: Amplify or suppress discovered features to influence model behavior
- **Interactive Chat**: Chat with models while manipulating their internal features
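
For readers new to SAEs, the toy sketch below illustrates the idea behind the features listed above: a model activation is encoded into a wider, non-negative feature vector and decoded back, and training trades reconstruction error against a sparsity penalty. `TinySAE`, its methods, and the `l1_coeff` value are hypothetical names chosen for illustration, not part of the `open_concept_steering` API. The code is dependency-free pure Python with random weights, so it shows shapes and loss terms only.

```python
import random


def relu(values):
    """Zero out negative entries."""
    return [max(0.0, v) for v in values]


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


class TinySAE:
    """Toy sparse autoencoder: activation -> sparse features -> reconstruction."""

    def __init__(self, d_model, n_features, seed=0):
        rng = random.Random(seed)
        # Encoder is n_features x d_model, decoder is d_model x n_features.
        self.w_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)]
                      for _ in range(n_features)]
        self.b_enc = [0.0] * n_features
        self.w_dec = [[rng.gauss(0, 0.1) for _ in range(n_features)]
                      for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, activation):
        pre = matvec(self.w_enc, activation)
        return relu([p + b for p, b in zip(pre, self.b_enc)])

    def decode(self, features):
        out = matvec(self.w_dec, features)
        return [o + b for o, b in zip(out, self.b_dec)]

    def loss(self, activation, l1_coeff=1e-3):
        """Mean squared reconstruction error plus an L1 sparsity penalty."""
        features = self.encode(activation)
        recon = self.decode(features)
        mse = sum((a - r) ** 2 for a, r in zip(activation, recon)) / len(activation)
        # Features are non-negative after ReLU, so their sum is the L1 norm.
        return mse + l1_coeff * sum(features)


sae = TinySAE(d_model=4, n_features=16)
activation = [0.5, -1.0, 0.25, 2.0]  # stand-in for one residual-stream vector
features = sae.encode(activation)
print(len(features), sae.loss(activation))
```

In a real setup the activations would be collected from a transformer layer and the weights learned by gradient descent on this kind of loss.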

## Pre-trained Models

We provide pre-trained SAEs and discovered features for popular models on Hugging Face.

Each model repository includes:
- Trained SAE weights
- Catalog of discovered interpretable features
- Example steering configurations
- Performance benchmarks

## Quick Start

```python
from open_concept_steering import ModelLoader, SAETrainer, FeatureScanner, ChatInterface

# Load a model and train SAE
model = ModelLoader.load("llama2")
sae = SAETrainer.train(model)

# Find features for a concept
scanner = FeatureScanner(sae)
feature = scanner.find_feature("golden gate bridge")

# Chat with the steered model
chat = ChatInterface(model)
chat.run(amplified_features=[feature])
```
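
As a rough intuition for what the feature-finding and steering steps might do internally, the sketch below picks the feature whose average activation rises most on concept-related text, then nudges an activation vector along that feature's decoder direction. `find_feature_by_contrast`, `steer`, and all the numbers are hypothetical illustrations, not the library's API or its actual method. The Scaling Monosemanticity work describes clamping feature activations; adding a scaled direction is a simpler stand-in used here.

```python
def find_feature_by_contrast(concept_features, baseline_features):
    """Return the index of the feature whose mean activation rises most on concept text."""
    n_features = len(concept_features[0])

    def mean(rows, i):
        return sum(row[i] for row in rows) / len(rows)

    gaps = [mean(concept_features, i) - mean(baseline_features, i)
            for i in range(n_features)]
    return max(range(n_features), key=gaps.__getitem__)


def steer(activation, decoder_direction, strength):
    """Add `strength` times a feature's decoder direction to an activation vector.

    Positive strength amplifies the feature; negative strength suppresses it.
    """
    return [a + strength * d for a, d in zip(activation, decoder_direction)]


# Hypothetical SAE feature activations (rows = token positions, columns = features).
concept = [[0.0, 2.0, 0.1], [0.0, 3.0, 0.0]]
baseline = [[0.5, 0.1, 0.1], [0.4, 0.0, 0.2]]
idx = find_feature_by_contrast(concept, baseline)

# Hypothetical decoder direction for that feature in a 4-dimensional activation space.
direction = [0.0, 1.0, 0.0, -1.0]
steered = steer([0.5, 0.5, 0.5, 0.5], direction, strength=2.0)
print(idx, steered)
```

During generation, a change like `steer` would be applied to the chosen layer's activations on every forward pass, which is why the effect persists across a chat session.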

## Examples

See the `examples/` directory for detailed notebooks demonstrating:
- Training SAEs on different models
- Finding and analyzing features
- Steering model behavior
- Interactive chat sessions

## License

This project is licensed under the MIT License.

## Citation

If you feel compelled to cite this library in your work, feel free to do so however you please.

## Acknowledgments

This project builds upon the work described in ["Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet"](https://transformer-circuits.pub/2024/scaling-monosemanticity/) by Anthropic.