Model Card for envisage

This is the official model card for envisage, a Vision Transformer (ViT) model fine-tuned for image classification.

This model was fine-tuned from the google/vit-base-patch16-224-in21k base model on the cifar10 dataset, which consists of 60,000 32x32 color images in 10 distinct classes.

Model Description

  • Base Model: google/vit-base-patch16-224-in21k
  • Dataset: cifar10
  • Task: Image Classification
  • Framework: PyTorch, Transformers
  • Classes (10): airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck
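
For reference, the ten class names above can be written as the id2label / label2id mappings that Transformers model configs conventionally carry. The sketch below assumes the standard CIFAR-10 label order (index 0 is airplane, index 9 is truck); the mapping actually stored in this model's config has not been checked here. It also works out the token count implied by the base model's name (16x16 patches on a 224x224 input).

```python
# Class names in the standard CIFAR-10 index order
# (assumed here, not read from the model's config).
CLASSES = ["airplane", "automobile", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck"]

id2label = {i: name for i, name in enumerate(CLASSES)}
label2id = {name: i for i, name in id2label.items()}

# Token count implied by "patch16-224": a 224x224 image cut into 16x16 patches.
IMAGE_SIZE = 224
PATCH_SIZE = 16
patches_per_side = IMAGE_SIZE // PATCH_SIZE   # 14
num_patches = patches_per_side ** 2           # 196
sequence_length = num_patches + 1             # 197, including the [CLS] token

print(id2label[3], label2id["truck"], sequence_length)
```

Because the base model expects 224x224 inputs, 32x32 cifar10 images have to be upscaled before they reach the model, which is part of why the token count is independent of the original image size.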

How to Use

The easiest way to use this model for inference is with the pipeline API from the transformers library.

First, ensure you have the necessary libraries installed:

pip install transformers torch pillow requests

Then, you can use the following Python snippet to classify an image:

from transformers import pipeline
from PIL import Image
import requests

# Load the image-classification pipeline with the fine-tuned model
pipe = pipeline("image-classification", model="louijiec/envisage")

# Load an image from a URL (e.g., a cat)
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cat-tree.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

# Get the predictions
predictions = pipe(image)

print("Predictions:")
for p in predictions:
    print(f"- {p['label']}: {p['score']:.4f}")

# By default the pipeline returns the five highest-scoring labels with their scores;
# for this image, 'cat' is expected to rank first.
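
The pipeline returns a list of dictionaries, each with a label and a score. A small helper can reduce that list to a single answer and decline to answer when the model is unsure. This is only a sketch: top_prediction and the 0.5 threshold are hypothetical, and the sample values are made up for illustration rather than taken from this model.

```python
from typing import Optional


def top_prediction(predictions: list[dict], min_score: float = 0.5) -> Optional[str]:
    """Return the highest-scoring label, or None if its score is below min_score."""
    if not predictions:
        return None
    best = max(predictions, key=lambda p: p["score"])
    return best["label"] if best["score"] >= min_score else None


# Illustrative values only, NOT real output from this model.
sample = [
    {"label": "cat", "score": 0.91},
    {"label": "dog", "score": 0.05},
    {"label": "deer", "score": 0.02},
]

print(top_prediction(sample))                  # cat
print(top_prediction(sample, min_score=0.95))  # None
```

With the real pipeline this would be called as top_prediction(pipe(image)). Passing top_k=10 to the pipeline call returns scores for all ten classes instead of the default five.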

Training Procedure

The model was trained in a Google Colab environment using the transformers Trainer API.

Hyperparameters

  • Learning Rate: 5e-5
  • Training Epochs: 3
  • Batch Size: 16 per device
  • Gradient Accumulation Steps: 4 (Effective batch size of 64)
  • Optimizer: AdamW with a linear learning rate schedule
  • Warmup Ratio: 0.1
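
To make the batch-size and schedule settings concrete, the sketch below works through the arithmetic in plain Python. It assumes a single device and the full 50,000-image cifar10 training split, neither of which is stated above. The step counts are approximate, because the exact values depend on how the installed Trainer version rounds partial accumulation steps. linear_lr is a hypothetical helper that mirrors the usual shape of linear warmup followed by linear decay; it is not the Trainer's own code.

```python
import math

LEARNING_RATE = 5e-5
EPOCHS = 3
PER_DEVICE_BATCH = 16
GRAD_ACCUM_STEPS = 4
WARMUP_RATIO = 0.1
NUM_DEVICES = 1           # assumption: a single Colab GPU
TRAIN_EXAMPLES = 50_000   # assumption: the full cifar10 training split

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_DEVICES  # 64
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / effective_batch)        # about 782
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def linear_lr(step: int) -> float:
    """Linear warmup from 0 to LEARNING_RATE, then linear decay back to 0."""
    if step < warmup_steps:
        return LEARNING_RATE * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return LEARNING_RATE * remaining / max(1, total_steps - warmup_steps)


print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
print(linear_lr(0), linear_lr(warmup_steps), linear_lr(total_steps))
```

Under these assumptions the learning rate peaks at 5e-5 once warmup ends, roughly a tenth of the way through training, and reaches zero at the final optimizer step.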

Evaluation

The model was evaluated on the cifar10 test split, which contains 10,000 images.

  • Final Accuracy on Test Set: not yet reported (to be filled in from the trainer.evaluate() output)
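
Accuracy here means the fraction of test images whose highest-scoring class matches the true label. A minimal sketch of that calculation follows, using a hypothetical accuracy helper and made-up logits rather than real model outputs:

```python
def accuracy(logits: list[list[float]], labels: list[int]) -> float:
    """Fraction of rows whose argmax equals the reference label."""
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = 0
    for row, label in zip(logits, labels):
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax over classes
        correct += int(predicted == label)
    return correct / len(labels)


# Toy example with 3 classes; values are illustrative, not model outputs.
toy_logits = [[2.0, 0.1, -1.0], [0.3, 1.5, 0.2], [0.0, 0.4, 0.1]]
toy_labels = [0, 1, 2]

print(accuracy(toy_logits, toy_labels))  # 2 of 3 rows are correct
```

On the 10,000-image test split, each additional correctly classified image moves this figure by 0.0001.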

Intended Use & Limitations

This model is intended for educational purposes and as a demonstration of fine-tuning a Vision Transformer on a common benchmark dataset. It is expected to work best on images similar to those in the cifar10 dataset (small, low-resolution images of the 10 specified classes).

Limitations:

  • The model will likely perform poorly on images that are significantly different from the cifar10 data (e.g., high-resolution photos, medical images, or classes not seen during training).
  • The training data may reflect biases present in the original cifar10 dataset.