Commit 2381570 · hbfreed · verified · 1 Parent(s): 162815e

Update README.md

Files changed (1): README.md (+59 -10)
@@ -1,10 +1,59 @@
- ---
- title: README
- emoji: 🏢
- colorFrom: indigo
- colorTo: indigo
- sdk: static
- pinned: false
- ---
-
- Edit this `README.md` markdown file to author your organization card.
+ # Open Concept Steering
+
+ Open Concept Steering is an open-source library for discovering and manipulating interpretable features in large language models using Sparse Autoencoders (SAEs). Inspired by Anthropic's work on [Scaling Monosemanticity](https://transformer-circuits.pub/2024/scaling-monosemanticity/) and [Golden Gate Claude](https://www.anthropic.com/news/golden-gate-claude), this project aims to make concept steering accessible to the broader research community.
+
+ ## Features
+
+ - **Universal Model Support**: Train SAEs on any HuggingFace transformer model
+ - **Feature Discovery**: Find interpretable features representing specific concepts
+ - **Concept Steering**: Amplify or suppress discovered features to influence model behavior
+ - **Interactive Chat**: Chat with models while manipulating their internal features
+
+ ## Pre-trained Models
+
+ We provide pre-trained SAEs and discovered features for popular models on HuggingFace:
+
+ Each model repository includes:
+ - Trained SAE weights
+ - Catalog of discovered interpretable features
+ - Example steering configurations
+ - Performance benchmarks
+
+
+ ## Quick Start
+
+ ```python
+ from open_concept_steering import ModelLoader, SAETrainer, FeatureScanner, ChatInterface
+
+ # Load a model and train SAE
+ model = ModelLoader.load("llama2")
+ sae = SAETrainer.train(model)
+
+ # Find features for a concept
+ scanner = FeatureScanner(sae)
+ feature = scanner.find_feature("golden gate bridge")
+
+ # Chat with the steered model
+ chat = ChatInterface(model)
+ chat.run(amplified_features=[feature])
+ ```
+
+ ## Examples
+
+ See the `examples/` directory for detailed notebooks demonstrating:
+ - Training SAEs on different models
+ - Finding and analyzing features
+ - Steering model behavior
+ - Interactive chat sessions
+
+ ## License
+
+ This project is licensed under the MIT License.
+
+ ## Citation
+
+ If you feel compelled to cite this library in your work, feel free to do so however you please.
+
+ ## Acknowledgments
+
+ This project builds upon the work described in ["Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet"](https://transformer-circuits.pub/2024/scaling-monosemanticity/) by Anthropic.
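
Note on the "Concept Steering" bullet in the added README: the diff does not show how amplifying or suppressing a feature is implemented, so the following is only a minimal sketch of one common approach, loosely following the feature clamping idea described in Scaling Monosemanticity. It assumes a toy SAE with hand-picked weights and pure-Python vectors. Every name here (`encode`, `decode`, `steer`, `W_ENC`, and so on) is hypothetical and is not the `open_concept_steering` API.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]

# Toy SAE (assumed values): 2-dimensional activations, 3 features.
W_ENC: Matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one row per feature
B_ENC: Vector = [0.0, 0.0, -1.0]
W_DEC: Matrix = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # one direction per feature
B_DEC: Vector = [0.0, 0.0]


def encode(x: Vector) -> Vector:
    """SAE encoder: ReLU(W_ENC @ x + B_ENC), one activation per feature."""
    pre = [sum(w * xi for w, xi in zip(row, x)) for row in W_ENC]
    return [max(0.0, p + b) for p, b in zip(pre, B_ENC)]


def decode(f: Vector) -> Vector:
    """SAE decoder: B_DEC plus each feature's direction scaled by its activation."""
    out = list(B_DEC)
    for act, direction in zip(f, W_DEC):
        for j, d in enumerate(direction):
            out[j] += act * d
    return out


def steer(x: Vector, feature_idx: int, value: float) -> Vector:
    """Clamp one feature to `value` and keep the SAE's reconstruction error."""
    f = encode(x)
    recon = decode(f)
    error = [xi - ri for xi, ri in zip(x, recon)]  # what the SAE fails to explain
    f[feature_idx] = value                         # amplify (large) or suppress (0.0)
    steered = decode(f)
    return [s + e for s, e in zip(steered, error)]


x = [0.5, 0.25]
print(encode(x))         # [0.5, 0.25, 0.0]
print(steer(x, 2, 4.0))  # [2.5, 2.25]
```

In a real model, `x` would be a hidden activation at one layer and token position, and the steered vector would be written back before the forward pass continues. Keeping the error term means that clamping a feature to the value it already has leaves the activation unchanged.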