---
license: apache-2.0
datasets:
- Hemg/cifake-real-and-ai-generated-synthetic-images
language:
- en
metrics:
- accuracy
library_name: transformers
tags:
- Diffusers
- GanDetectors
- Cifake
base_model:
- google/vit-base-patch16-224
inference: true
---
# AI Guard Vision Model Card

[![License: Apache 2.0](https://img.shields.io/badge/license-Apache--2.0-blue)](LICENSE)

## Overview

**AI Guard Vision** is a Vision Transformer (ViT)-based image classifier whose primary objective is to distinguish real photographs from AI-generated synthetic images. It addresses the growing challenge of detecting manipulated or fake visual content, helping preserve trust and integrity in digital media.

## Model Summary

- **Model Type:** Vision Transformer (ViT) – `vit-base-patch16-224`
- **Objective:** Real vs. AI-generated image classification
- **License:** Apache 2.0
- **Fine-tuned From:** `google/vit-base-patch16-224`
- **Training Dataset:** [CIFake Dataset](https://www.kaggle.com/datasets/birdy654/cifake-real-and-ai-generated-synthetic-images)
- **Developer:** Aashish Kumar, IIIT Manipur

## Applications & Use Cases

- **Content Moderation:** Identifying AI-generated images across media platforms.
- **Digital Forensics:** Verifying the authenticity of visual content for investigative purposes.
- **Trust Preservation:** Helping maintain the integrity of digital ecosystems by combating misinformation spread through fake images.

## How to Use the Model

```python
from transformers import AutoImageProcessor, ViTForImageClassification
import torch
from PIL import Image
from pillow_heif import register_heif_opener, register_avif_opener

# Enable HEIF/AVIF support in Pillow so iPhone photos and AVIF files can be opened.
register_heif_opener()
register_avif_opener()

# Load the processor and model once, rather than on every call.
image_processor = AutoImageProcessor.from_pretrained("AashishKumar/AIvisionGuard-v2")
model = ViTForImageClassification.from_pretrained("AashishKumar/AIvisionGuard-v2")
model.eval()

def get_prediction(img):
    image = Image.open(img).convert("RGB")
    inputs = image_processor(image, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits

    # Convert raw logits to probabilities so the scores are comparable.
    probs = logits.softmax(dim=-1)
    top2 = probs.topk(2)
    top2_labels = top2.indices.squeeze().tolist()
    top2_scores = top2.values.squeeze().tolist()

    return [
        {"label": model.config.id2label[label], "score": score}
        for label, score in zip(top2_labels, top2_scores)
    ]
```
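
A minimal call, assuming a local image file exists (the path here is illustrative); the label names come from the model's `id2label` mapping:

```python
# "sample.png" is a placeholder path; any format Pillow (with HEIF/AVIF enabled) can open works.
predictions = get_prediction("sample.png")
print(predictions)
# e.g. [{"label": "...", "score": 0.97}, {"label": "...", "score": 0.03}]
```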

## Dataset Information

The model was fine-tuned on the **CIFake dataset**, which contains both real and AI-generated synthetic images (a loading sketch follows the list):
- **Real Images:** Collected from the CIFAR-10 dataset.
- **Fake Images:** Generated using Stable Diffusion 1.4.
- **Training Data:** 100,000 images (50,000 per class).
- **Testing Data:** 20,000 images (10,000 per class).
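
This card's metadata points at a Hub mirror of CIFake; a minimal sketch of loading it with the `datasets` library, assuming the common image-classification layout (the split and column names are not verified against this mirror):

```python
from datasets import load_dataset

# Hub mirror listed in this card's metadata; the "train" split and
# "image"/"label" columns are assumptions, so inspect `ds` first.
ds = load_dataset("Hemg/cifake-real-and-ai-generated-synthetic-images")
print(ds)
example = ds["train"][0]
print(example["label"])
```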

## Model Architecture

- **Transformer Encoder Layers:** Apply self-attention over 16×16 image patches (a 224×224 input yields 14×14 = 196 patches plus a [CLS] token).
- **Positional Encodings:** Give the model information about where each patch sits in the image.
- **Pretrained Weights:** Pretrained on ImageNet-21k and fine-tuned on ImageNet-1k (ILSVRC 2012) before the CIFake fine-tuning.

### Why Vision Transformer?

- **Scalability and Performance:** Excels at extracting high-level global features from the whole image.
- **Strong Accuracy:** With sufficient pretraining data, transformer classifiers match or outperform comparable CNN baselines on image classification.

## Training Details

- **Learning Rate:** 1e-7
- **Batch Size:** 64
- **Epochs:** 100
- **Training Time:** 1 hr 36 min
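
The training script itself is not published with this card; purely as an illustration, the reported hyperparameters would map onto `transformers.TrainingArguments` roughly as follows (anything not listed above is an assumption):

```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="aivisionguard-finetune",  # hypothetical output directory
    learning_rate=1e-7,                   # reported learning rate
    per_device_train_batch_size=64,       # reported batch size
    num_train_epochs=100,                 # reported epoch count
)

# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=..., eval_dataset=...)  # datasets elided
# trainer.train()
```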

## Evaluation Metrics

The model was evaluated on the CIFake test split, with the following metrics (a reproduction sketch follows the comparison table):

- **Accuracy:** 92%
- **F1 Score:** 0.89
- **Precision:** 0.85
- **Recall:** 0.88

| Model         | Accuracy | F1-Score | Precision | Recall |
|---------------|----------|----------|-----------|--------|
| Baseline      | 85%      | 0.82     | 0.78      | 0.80   |
| Augmented     | 88%      | 0.85     | 0.83      | 0.84   |
| Fine-tuned ViT| **92%**  | **0.89** | **0.85**  | **0.88**|
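
To recompute metrics like these from model outputs, a minimal scikit-learn sketch; `y_true` and `y_pred` are placeholder label ids standing in for predictions over the full test split:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Placeholder label ids; in practice, collect these over the 20,000-image test split.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
```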

## Evaluation Figure
![Evaluation results](https://cdn-uploads.huggingface.co/production/uploads/640ed1fb06c3b5ca883d5ad5/vmiE8IhMLUwJIOLK-Q9dT.png)

## System Workflow

- **Frontend:** ReactJS
- **Backend:** Python Flask (a minimal endpoint sketch follows the list)
- **Database:** PostgreSQL (Supabase)
- **Model:** Deployed via the PyTorch and TensorFlow frameworks
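
A minimal sketch of such a Flask endpoint wrapping the `get_prediction` helper defined above; the route and the `file` field name are assumptions, not the deployed API:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/predict", methods=["POST"])  # hypothetical route
def predict():
    # Assumes the image arrives as a multipart upload under the "file" field;
    # Image.open() accepts the file-like object Flask provides.
    uploaded = request.files["file"]
    return jsonify(get_prediction(uploaded))

if __name__ == "__main__":
    app.run(port=5000)
```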

## Strengths and Limitations

### Strengths:
- **High Accuracy:** Achieves 92% accuracy on the CIFake test split when distinguishing real from synthetic images.
- **Pretrained on ImageNet-21k:** Allows for efficient transfer learning and robust generalization.

### Limitations:
- **Synthetic Image Diversity:** The model may underperform on novel or unseen synthetic images that are significantly different from the training data.
- **Data Bias:** Like all machine learning models, its predictions may reflect biases present in the training data.

## Conclusion and Future Work

This model provides a highly effective tool for detecting AI-generated synthetic images and has promising applications in content moderation, digital forensics, and trust preservation. Future improvements may include:
- **Hybrid Architectures:** Combining transformers with convolutional layers for improved performance.
- **Multimodal Detection:** Incorporating additional modalities (e.g., metadata or contextual information) for more comprehensive detection.