# Model Card for envisage

This is the official model card for envisage, a Vision Transformer (ViT) model fine-tuned for image classification. The model was fine-tuned from the google/vit-base-patch16-224-in21k base model on the cifar10 dataset, which consists of 60,000 32x32 color images in 10 distinct classes.
## Model Description

- Base Model: google/vit-base-patch16-224-in21k
- Dataset: cifar10
- Task: Image Classification
- Framework: PyTorch, Transformers
- Classes (10): airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck (the label mapping can also be read from the model config, as sketched after this list)
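As a quick sanity check, the class names above can be read directly from the hosted configuration. A minimal sketch, assuming id2label was populated during fine-tuning (as the standard Hub export does):

```python
from transformers import AutoConfig

# Fetch the model configuration from the Hub
config = AutoConfig.from_pretrained("louijiec/envisage")

# id2label maps class indices to the CIFAR-10 class names listed above
for idx, label in sorted(config.id2label.items()):
    print(idx, label)
```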
## How to Use

The easiest way to use this model for inference is with the pipeline API from the transformers library.

First, ensure you have the necessary libraries installed:

```bash
pip install transformers torch pillow
```

Then, you can use the following Python snippet to classify an image:
```python
from transformers import pipeline
from PIL import Image
import requests

# Load the classification pipeline with your model
pipe = pipeline("image-classification", model="louijiec/envisage")

# Load an image from a URL (e.g., a cat)
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cat-tree.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

# Get the predictions
predictions = pipe(image)

print("Predictions:")
for p in predictions:
    print(f"- {p['label']}: {p['score']:.4f}")

# Expected output will show the model's confidence for the top predicted
# classes, with 'cat' likely having the highest score.
```
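If you prefer explicit control over preprocessing and the forward pass, the model can also be used without the pipeline. This is a minimal sketch; it assumes the repository stores an image processor configuration alongside the weights, which is typical for ViT fine-tunes:

```python
import torch
import requests
from PIL import Image
from transformers import ViTForImageClassification, ViTImageProcessor

# The processor resizes and normalizes inputs to the 224x224 format ViT expects
processor = ViTImageProcessor.from_pretrained("louijiec/envisage")
model = ViTForImageClassification.from_pretrained("louijiec/envisage").eval()

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cat-tree.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

# Preprocess and run a forward pass without tracking gradients
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Map the highest-scoring logit back to a class name
predicted_id = logits.argmax(-1).item()
print(model.config.id2label[predicted_id])
```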
## Training Procedure

The model was trained in a Google Colab environment using the transformers Trainer API; a sketch of how the settings below map onto that API follows the list.

### Hyperparameters

- Learning Rate: 5e-5
- Training Epochs: 3
- Batch Size: 16 per device
- Gradient Accumulation Steps: 4 (effective batch size of 64)
- Optimizer: AdamW with a linear learning rate schedule
- Warmup Ratio: 0.1
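The original training notebook is not reproduced here; the following is a minimal sketch of how the listed hyperparameters map onto TrainingArguments and Trainer. The dataset preprocessing and collation details are illustrative assumptions, not the exact training script:

```python
import torch
from datasets import load_dataset
from transformers import (
    Trainer,
    TrainingArguments,
    ViTForImageClassification,
    ViTImageProcessor,
)

# Load CIFAR-10 and the base checkpoint named in this card
dataset = load_dataset("cifar10")
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
labels = dataset["train"].features["label"].names
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={name: i for i, name in enumerate(labels)},
)

def transform(batch):
    # Resize/normalize the 32x32 images to the 224x224 inputs ViT expects
    inputs = processor([img.convert("RGB") for img in batch["img"]], return_tensors="pt")
    inputs["labels"] = batch["label"]
    return inputs

dataset = dataset.with_transform(transform)

def collate_fn(examples):
    return {
        "pixel_values": torch.stack([ex["pixel_values"] for ex in examples]),
        "labels": torch.tensor([ex["labels"] for ex in examples]),
    }

# Hyperparameters as listed above; AdamW and the linear schedule are the
# Trainer defaults, shown explicitly for clarity
training_args = TrainingArguments(
    output_dir="envisage",
    learning_rate=5e-5,
    num_train_epochs=3,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,   # effective batch size 16 * 4 = 64
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    remove_unused_columns=False,     # keep the raw "img" column for the transform
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    data_collator=collate_fn,
)
trainer.train()
```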
Evaluation
The model was evaluated on the cifar10
test split, which contains 10,000 images.
- Final Accuracy on Test Set: [TODO: Add final accuracy from the
trainer.evaluate()
step here. For example: 0.965]
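As a rough recipe for reproducing the test-set number, accuracy can be computed by running the published model over the cifar10 test split. A minimal sketch, assuming the model's label indices follow the dataset's class order:

```python
import numpy as np
import torch
from datasets import load_dataset
from transformers import ViTForImageClassification, ViTImageProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = ViTImageProcessor.from_pretrained("louijiec/envisage")
model = ViTForImageClassification.from_pretrained("louijiec/envisage").to(device).eval()

test_ds = load_dataset("cifar10", split="test")  # 10,000 images

correct = 0
for start in range(0, len(test_ds), 64):  # simple fixed-size batching
    batch = test_ds[start : start + 64]
    inputs = processor(
        [img.convert("RGB") for img in batch["img"]], return_tensors="pt"
    ).to(device)
    with torch.no_grad():
        preds = model(**inputs).logits.argmax(-1).cpu().numpy()
    correct += int((preds == np.array(batch["label"])).sum())

print(f"Test accuracy: {correct / len(test_ds):.4f}")
```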
## Intended Use & Limitations

This model is intended for educational purposes and as a demonstration of fine-tuning a Vision Transformer on a common benchmark dataset. It performs well on images similar to those in the cifar10 dataset (small, low-resolution images of the 10 specified classes).

Limitations:

- The model will likely perform poorly on images that differ significantly from the cifar10 data (e.g., high-resolution photos, medical images, or classes not seen during training).
- The training data may reflect biases present in the original cifar10 dataset.