---
license: apache-2.0
base_model: google/vit-base-patch16-224-in21k
tags:
- generated_from_trainer
metrics:
- accuracy
model-index:
- name: vit-finetuned-food101
  results: []
datasets:
- ethz/food101
pipeline_tag: image-classification
---


# Model Card: ViT Fine-tuned on Food-101

## Model Overview

This model is a fine-tuned version of [google/vit-base-patch16-224-in21k](https://huggingface.co/google/vit-base-patch16-224-in21k) on the Food-101 dataset. It uses the Vision Transformer (ViT) architecture for image classification, specifically for recognizing and categorizing food items.

### Model Details
- **Model Type**: Vision Transformer (ViT)
- **Base Model**: [google/vit-base-patch16-224-in21k](https://huggingface.co/google/vit-base-patch16-224-in21k)
- **Fine-tuning Dataset**: Food-101
- **Number of Labels**: 101 (corresponding to different food categories)

## Performance

The model achieves the following results on the evaluation set:
- **Loss**: 1.6262
- **Accuracy**: 89.6%
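
A loss of 1.6262 next to 89.6% accuracy can look contradictory, so a quick calculation helps put the two numbers on the same scale. The sketch below assumes the reported loss is the mean cross-entropy in natural-log units with no label smoothing; the card does not state this explicitly.

```python
import math

eval_loss = 1.6262   # final evaluation loss reported above
num_classes = 101    # number of Food-101 labels

# exp(-loss) is the geometric-mean probability assigned to the true class,
# ASSUMING the loss is mean cross-entropy measured in nats
typical_true_class_prob = math.exp(-eval_loss)

# Baseline for comparison: the loss of a uniform guess over all classes
uniform_loss = math.log(num_classes)

print(f"geometric-mean true-class probability: {typical_true_class_prob:.3f}")
print(f"uniform-guess loss: {uniform_loss:.3f}")
```

Under that assumption, the model usually ranks the correct class first but with modest confidence (roughly 0.2 on average), so any score threshold applied downstream should be chosen with low absolute scores in mind.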

## Intended Uses & Limitations

### Intended Use Cases
- **Image Classification**: This model is designed for classifying images into one of 101 food categories, making it suitable for applications like food recognition in diet tracking, restaurant menu analysis, or food-related search engines.

### Limitations
- **Dataset Bias**: The model's performance may degrade when applied to food images that are significantly different from those in the Food-101 dataset, such as non-Western cuisines or images captured in non-standard conditions.
- **Generalization**: While the model performs well on the Food-101 dataset, its ability to generalize to other food-related tasks or datasets is not guaranteed.
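- **Input Size**: The model operates on 224x224-pixel inputs. Images of other sizes must be resized first; the `image-classification` pipeline's image processor typically does this automatically, so manual resizing is needed only when calling the model directly.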
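
The resizing step mentioned above can be sketched with Pillow. This is a minimal illustration, not the model's actual preprocessing: `prepare_image` and `TARGET_SIZE` are hypothetical names, bilinear resampling is an assumption, and the resize ignores aspect ratio.

```python
from PIL import Image

TARGET_SIZE = (224, 224)  # input resolution named in the base model ID


def prepare_image(image: Image.Image) -> Image.Image:
    """Convert to 3-channel RGB and resize to the model's input resolution."""
    rgb = image.convert("RGB")  # handles grayscale, palette, and RGBA inputs
    return rgb.resize(TARGET_SIZE, Image.BILINEAR)


# Synthetic stand-in for a real photo: a 640x480 RGBA image
sample = Image.new("RGBA", (640, 480), color=(200, 120, 40, 255))
prepared = prepare_image(sample)
print(prepared.size, prepared.mode)
```

Because the image is squashed to a square, very wide or very tall photos will be distorted; cropping around the dish before resizing may work better for such inputs.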

## Training and Evaluation Data

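The model was fine-tuned on the Food-101 dataset. The full dataset consists of 101,000 images across 101 food categories; each category contains 1,000 images, with 750 used for training and 250 for testing. The logged step counts (about 62.5 optimizer steps per epoch at a total batch size of 64) imply roughly 4,000 training images per epoch, so this run appears to have used a subset of the full training split; the exact subset is not recorded here. The dataset includes diverse food items but may be skewed towards certain cuisines or food types.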
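
As a quick sanity check on the dataset figures above, the per-class counts can be multiplied out. The variable names are illustrative only.

```python
num_classes = 101
train_per_class = 750
test_per_class = 250

# Size of each official split and of the full dataset
train_images = num_classes * train_per_class
test_images = num_classes * test_per_class
total_images = train_images + test_images

print(train_images, test_images, total_images)
```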

## Training Procedure

### Training Hyperparameters

The model was fine-tuned using the following hyperparameters:
- **Learning Rate**: 5e-05
- **Train Batch Size**: 16
- **Eval Batch Size**: 16
- **Seed**: 42
- **Gradient Accumulation Steps**: 4
- **Total Train Batch Size**: 64
- **Optimizer**: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- **Learning Rate Scheduler**: Linear with a warmup ratio of 0.1
- **Number of Epochs**: 3
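
For readers who want to reproduce a similar setup, the list above can be sketched as keyword arguments using `transformers.TrainingArguments` parameter names, together with a plain-Python version of a linear warmup-then-decay schedule. This is a reconstruction, not the original training script: it assumes a single device, and `linear_warmup_lr` is a hypothetical helper that only approximates the library's linear scheduler.

```python
import math

# Hyperparameters from the list above, keyed by the matching
# transformers.TrainingArguments parameter names
training_kwargs = {
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,
    "seed": 42,
    "gradient_accumulation_steps": 4,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "warmup_ratio": 0.1,
    "num_train_epochs": 3,
}

# Effective batch size, ASSUMING a single device
effective_batch_size = (
    training_kwargs["per_device_train_batch_size"]
    * training_kwargs["gradient_accumulation_steps"]
)


def linear_warmup_lr(step, total_steps, base_lr, warmup_ratio):
    """Linear ramp from 0 to base_lr, then linear decay back to 0."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))


print(effective_batch_size)
print(linear_warmup_lr(10, 186, 5e-5, 0.1))  # a step inside the warmup phase
```

To use the settings with the Trainer, something like `TrainingArguments(output_dir="...", **training_kwargs)` should work, where the output directory is left for you to choose.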

### Training Results

| Training Loss | Epoch | Step | Validation Loss | Accuracy |
|---------------|-------|------|-----------------|----------|
| 2.7649        | 0.992 | 62   | 2.5733          | 0.831    |
| 1.888         | 2.0   | 125  | 1.7770          | 0.883    |
| 1.6461        | 2.976 | 186  | 1.6262          | 0.896    |
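
The step and epoch columns also reveal how much data each epoch covered. The following back-of-the-envelope calculation assumes the `Step` column counts optimizer updates at the total batch size of 64 listed in the hyperparameters.

```python
total_batch_size = 64      # "Total Train Batch Size" from the hyperparameters
step, epoch = 62, 0.992    # first row of the results table

# Optimizer steps needed to complete one full epoch
steps_per_epoch = step / epoch

# Training images seen per epoch, implied by the step count
implied_train_images = steps_per_epoch * total_batch_size

# Size of the official Food-101 training split (101 classes x 750 images)
full_train_split = 101 * 750

print(round(steps_per_epoch, 1), round(implied_train_images))
print(f"fraction of full training split: {implied_train_images / full_train_split:.1%}")
```

The implied figure of about 4,000 images per epoch is far below the 75,750 images in the full training split, which suggests the run used a subset; treat the reported accuracy accordingly.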

### Framework Versions
- **Transformers**: 4.42.4
- **PyTorch**: 2.4.0+cu121
- **Datasets**: 2.21.0
- **Tokenizers**: 0.19.1

## Inference Example

To run inference with this model, load an image (here, from a URL) and classify it with the `image-classification` pipeline:

```python
from io import BytesIO

import requests
from PIL import Image
from transformers import pipeline

# Download a sample image (replace the placeholder with a real image URL)
image_url = "https://example.com/path-to-your-image.jpg"
response = requests.get(image_url, timeout=30)
response.raise_for_status()  # Fail early on HTTP errors
image = Image.open(BytesIO(response.content)).convert("RGB")

# Load the fine-tuned model as an image-classification pipeline
classifier = pipeline(
    "image-classification",
    model="ashaduzzaman/vit-finetuned-food101",
)

# Run inference; the result is a list of {"label", "score"} dicts
result = classifier(image)
print(result)
```
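
The pipeline returns a list of dictionaries with `label` and `score` keys. A minimal sketch of post-processing that output follows; the scores are made up for illustration, and `top_prediction` and its `min_score` threshold are hypothetical names rather than part of the library.

```python
# Illustrative output only: the labels are Food-101 class names,
# but the scores are invented for this example
example_result = [
    {"label": "pizza", "score": 0.41},
    {"label": "lasagna", "score": 0.07},
    {"label": "bruschetta", "score": 0.05},
]


def top_prediction(result, min_score=0.3):
    """Return the best label, or None if its score is below min_score."""
    best = max(result, key=lambda item: item["score"])
    return best["label"] if best["score"] >= min_score else None


print(top_prediction(example_result))
print(top_prediction(example_result, min_score=0.5))
```

Returning `None` for low-scoring predictions gives an application a way to say "not sure" instead of forcing every image, including non-food images, into one of the 101 classes. The threshold itself would need tuning on real outputs.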

## Ethical Considerations

- **Bias**: Food-101 covers a fixed set of 101 dishes and may be skewed toward popular Western dishes, which can bias the model’s predictions for foods outside that set, including many non-Western dishes.
- **Privacy**: When using this model in applications, ensure that the images are sourced ethically and that privacy considerations are respected.

## Citation

If you use this model in your work, please cite it as:

```
@misc{vit_finetuned_food101,
  author = {Ashaduzzaman},
  title = {ViT Fine-tuned on Food-101},
  year = {2024},
  url = {https://huggingface.co/ashaduzzaman/vit-finetuned-food101},
}
```