Model details #1
by Gaeros - opened
Hey, really appreciate this model! It's been tough finding a vision model that prioritizes style while filtering out content semantics.
Did you train it yourself? I'm planning to build my own embedding database using the weights, but I'd love to understand more about the training process. Could you share some details?
Glad you like it! It's a model I trained to add style control to image generation. Incidentally, I've been reworking it recently since I didn't properly log everything last time, but here's what I can recall and dig up:
- Dataset: danbooru-artists-10k (artists train split)
- Model: ViT-S/14 (weights initialized from DINOv2)
- Training objective: Supervised contrastive loss (see the sketch after this list)
- Training resolution: Varied (aspect-ratio-preserving resize to ~518×518 pixels)
- Augmentation: Random Resized Crop (scale=(0.7,1.0))
- Training duration: ~60k batches of 48 images (16 artists * 3 imgs/artist)
- Optimizer: Prodigy, lr=1
- Training Precision: bfloat16 mixed
- Training cost: <15 hours on an RTX 4060 Ti
- Validation MAP@R at ~518×518: 42.3%
- Evaluating at ~1.3× the training resolution (≈673×673 pixels) slightly improves the metrics
- The settings were not tweaked much and the model is undertrained; the training loss is nowhere near convergence
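
For reference, here's a minimal sketch of a standard supervised contrastive loss (Khosla et al., 2020), which treats images by the same artist as positives. The temperature value is a placeholder, not something I can confirm from my logs:

```python
import torch
import torch.nn.functional as F

def supcon_loss(embeddings: torch.Tensor, labels: torch.Tensor,
                temperature: float = 0.1) -> torch.Tensor:
    """Supervised contrastive loss over one batch.

    embeddings: (B, D) features from the ViT; labels: (B,) artist ids.
    Images sharing an artist id are treated as positives.
    """
    z = F.normalize(embeddings, dim=1)               # cosine similarity space
    sim = z @ z.T / temperature                      # (B, B) scaled similarities
    self_mask = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))  # exclude self-pairs
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    # Average log-probability over each anchor's positives, then over anchors.
    mean_log_prob_pos = (log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)
                         / pos_mask.sum(dim=1).clamp(min=1))
    return -mean_log_prob_pos.mean()
```

With the batch layout above (16 artists × 3 images = 48), each anchor has exactly two positives.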
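
And for the embedding database you mentioned, a minimal extraction sketch, assuming the weights load into the stock DINOv2 ViT-S/14 backbone from torch.hub. The checkpoint filename is hypothetical, and resize + center-crop is a simplification of the aspect-ratio-preserving resize used in training:

```python
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms

# Stock DINOv2 ViT-S/14 backbone; the fine-tuned style weights are loaded
# on top ("style_vits14.pt" is a hypothetical filename).
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.load_state_dict(torch.load("style_vits14.pt", map_location="cpu"),
                      strict=False)
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(518),           # shorter side to 518
    transforms.CenterCrop(518),       # simplification of the training resize
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406),  # DINOv2 / ImageNet stats
                         std=(0.229, 0.224, 0.225)),
])

@torch.no_grad()
def embed(path: str) -> torch.Tensor:
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    z = model(x)                      # (1, 384) CLS embedding for ViT-S
    return F.normalize(z, dim=1)      # L2-normalize for cosine search
```

L2-normalized embeddings let you use plain inner-product search in whatever vector index you pick.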