---
license: mit
tags:
- vision-transformer
- ViT
- classification
- cifar10
- computer-vision
- deep-learning
- machine-learning
---

# ViT-Classification-CIFAR10

## Model Description

This model is a Vision Transformer (ViT) trained from scratch on the CIFAR-10 dataset for image classification, without pre-training on a larger dataset.

**Metrics:**

* Test accuracy: 82.04%
* Test loss: 0.5560

## Training Configuration

**Hardware:** NVIDIA RTX 3090

**Training parameters:**

* Epochs: 200
* Batch size: 2048
* Input size: 3x32x32
* Patch size: 4
* Sequence length: 64 (8x8 patches)
* Embedding size: 128
* Number of layers: 12
* Number of heads: 4
* Forward multiplier: 2
* Dropout: 0.1
* Optimizer: AdamW

An unofficial PyTorch sketch of this configuration is provided at the end of this card.

## Intended Uses & Limitations

This model is intended for practice and for exploring ViT architectures on the CIFAR-10 dataset. It can be used for image classification tasks on similar datasets.

**Limitations:**

* The model is trained on a relatively small dataset (CIFAR-10) and might not generalize well to unseen data.
* The model is trained from scratch without pre-training or fine-tuning, which may limit its performance compared to a fine-tuned model.
* Training is performed on a single RTX 3090.

## Training Data

The model is trained on the CIFAR-10 dataset, which contains 60,000 32x32 color images in 10 classes.

* Training set: 50,000 images
* Test set: 10,000 images

**Data Source:** [https://paperswithcode.com/dataset/cifar-10](https://paperswithcode.com/dataset/cifar-10)

A data-loading sketch is also included at the end of this card.

## Documentation

* GitHub Repository: [ViT-Classification-CIFAR10](https://github.com/nick8592/ViT-Classification-CIFAR10.git)
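
## Code Sketches (Unofficial)

The following is a minimal, unofficial PyTorch sketch of a ViT using the hyperparameters listed under Training Configuration. The class name `ViTSketch` is illustrative, and `torch.nn.TransformerEncoder` is used for brevity; the actual implementation lives in the linked GitHub repository and may differ in structure and initialization.

```python
# Minimal ViT sketch matching the hyperparameters listed above.
# Names here are illustrative, not the repository's own code.
import torch
import torch.nn as nn


class ViTSketch(nn.Module):
    def __init__(
        self,
        image_size=32,    # CIFAR-10 images are 3x32x32
        patch_size=4,     # -> (32/4)^2 = 64 patches (8x8 grid)
        in_channels=3,
        embed_dim=128,
        depth=12,
        num_heads=4,
        mlp_ratio=2,      # "forward multiplier"
        dropout=0.1,
        num_classes=10,
    ):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2

        # Patch embedding as a strided convolution.
        self.patch_embed = nn.Conv2d(
            in_channels, embed_dim, kernel_size=patch_size, stride=patch_size
        )
        # Zero init for simplicity; real implementations often use truncated-normal.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        self.dropout = nn.Dropout(dropout)

        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim,
            nhead=num_heads,
            dim_feedforward=embed_dim * mlp_ratio,
            dropout=dropout,
            batch_first=True,
            norm_first=True,
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        # x: (B, 3, 32, 32) -> patches: (B, 64, 128)
        x = self.patch_embed(x).flatten(2).transpose(1, 2)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.dropout(x)
        x = self.encoder(x)
        # Classify from the [CLS] token.
        return self.head(x[:, 0])


if __name__ == "__main__":
    model = ViTSketch()
    logits = model(torch.randn(2, 3, 32, 32))
    print(logits.shape)  # torch.Size([2, 10])
```

With a patch size of 4 on 32x32 inputs, the patch grid is 8x8, giving the sequence length of 64 listed above (plus one [CLS] token).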
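
For reference, CIFAR-10 can be loaded with `torchvision` as sketched below. The normalization statistics and the absence of augmentation are common defaults and are not necessarily those used for the reported 82.04% accuracy.

```python
# Loading CIFAR-10 with torchvision; statistics below are assumed defaults,
# not necessarily the original training setup.
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2470, 0.2435, 0.2616)),
])

train_set = datasets.CIFAR10(root="./data", train=True, download=True, transform=transform)
test_set = datasets.CIFAR10(root="./data", train=False, download=True, transform=transform)

# Batch size 2048 matches the training configuration listed above.
train_loader = torch.utils.data.DataLoader(train_set, batch_size=2048, shuffle=True, num_workers=4)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=2048, shuffle=False, num_workers=4)

print(len(train_set), len(test_set))  # 50000 10000
```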