---
language: en
tags:
- clip
- vision
- transformers
- interpretability
- sparse autoencoder
- sae
- mechanistic interpretability
license: apache-2.0
library_name: torch
pipeline_tag: feature-extraction
metrics:
- type: explained_variance
  value: 88.2
  pretty_name: Explained Variance %
  range:
    min: 0
    max: 100
- type: l0
  value: 930.023
  pretty_name: L0
---

# CLIP-B-32 Sparse Autoencoder x64 vanilla - L1:5e-05

[Figure: Explained Variance]

[Figure: Sparsity]

### Training Details

- Base Model: CLIP-ViT-B-32 (LAION DataComp.XL-s13B-b90K)
- Layer: 5
- Component: hook_mlp_out

### Model Architecture

- Input Dimension: 768
- SAE Dimension: 49,152
- Expansion Factor: x64 (vanilla architecture)
- Activation Function: ReLU
- Initialization: encoder_transpose_decoder
- Context Size: 50 tokens

### Performance Metrics

- L1 Coefficient: 5e-05
- L0 Sparsity: 930.0232
- Explained Variance: 0.8821 (88.21%)

### Training Configuration

- Learning Rate: 0.0004
- LR Scheduler: Cosine Annealing with Warmup (200 steps)
- Epochs: 10
- Gradient Clipping: 1.0
- Device: NVIDIA Quadro RTX 8000

**Experiment Tracking:**

- Weights & Biases Run ID: pve5lo8t
- Full experiment details: https://wandb.ai/perceptual-alignment/clip/runs/pve5lo8t/overview
- Git Commit: e22dd02726b74a054a779a4805b96059d83244aa

## Citation

```bibtex
@misc{2024josephsparseautoencoders,
  title={Sparse Autoencoders for CLIP-ViT-B-32},
  author={Joseph, Sonia},
  year={2024},
  publisher={Prisma-Multimodal},
  url={https://huggingface.co/Prisma-Multimodal},
  note={Layer 5, hook_mlp_out, Run ID: pve5lo8t}
}
```
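
## Cross-Checking the Reported Figures

The numbers under Model Architecture and Performance Metrics can be related to one another with a little arithmetic. The sketch below is illustrative only and is not part of the training code. It assumes 224x224 input images split into 32x32 patches with one CLS token, which is the usual ViT-B/32 setup but is not stated in this card. The variable names are hypothetical.

```python
# Values copied from the model card.
d_in = 768                   # input dimension
expansion_factor = 64        # x64 expansion
d_sae = 49_152               # SAE dimension
l0 = 930.0232                # reported L0 sparsity
explained_variance = 0.8821  # reported explained variance

# The SAE dimension is the input dimension times the expansion factor.
assert d_in * expansion_factor == d_sae

# Assumption (not stated in the card): 224x224 images, 32x32 patches, plus one CLS token.
image_size, patch_size = 224, 32
context_size = (image_size // patch_size) ** 2 + 1

# Share of the SAE latents that are active at once, taking L0 as the mean active count.
active_fraction = l0 / d_sae

# Fraction of activation variance the reconstruction does not capture.
unexplained_fraction = 1.0 - explained_variance

print(f"context size: {context_size} tokens")
print(f"active latents: {active_fraction:.2%} of {d_sae}")
print(f"unexplained variance: {unexplained_fraction:.2%}")
```

Under these assumptions, the context size matches the 50 tokens listed above. An L0 of about 930 means that just under 2% of the 49,152 latents are active at once.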