---
license: apache-2.0
datasets:
- Hemg/cifake-real-and-ai-generated-synthetic-images
language:
- en
metrics:
- accuracy
library_name: transformers
tags:
- Diffusers
- GanDetectors
- Cifake
base_model:
- google/vit-base-patch16-224
inference: true
---
# AI Guard Vision Model Card

[![License: Apache 2.0](https://img.shields.io/badge/license-Apache--2.0-blue)](LICENSE)

## Overview

**AI Guard Vision** is a Vision Transformer (ViT)-based image classifier whose primary objective is to distinguish real photographs from AI-generated synthetic images. It addresses the growing challenge of detecting manipulated or fake visual content, helping preserve trust and integrity in digital media.

## Model Summary

- **Model Type:** Vision Transformer (ViT) – `vit-base-patch16-224`
- **Objective:** Real vs. AI-generated image classification
- **License:** Apache 2.0
- **Fine-tuned From:** `google/vit-base-patch16-224`
- **Training Dataset:** [CIFake Dataset](https://www.kaggle.com/datasets/birdy654/cifake-real-and-ai-generated-synthetic-images)
- **Developer:** Aashish Kumar, IIIT Manipur

## Applications & Use Cases

- **Content Moderation:** Identifying AI-generated images across media platforms.
- **Digital Forensics:** Verifying the authenticity of visual content for investigative purposes.
- **Trust Preservation:** Helping maintain the integrity of digital ecosystems by combating misinformation spread through fake images.

## How to Use the Model

```python
from transformers import AutoImageProcessor, ViTForImageClassification
import torch
from PIL import Image
from pillow_heif import register_heif_opener, register_avif_opener

# Enable HEIF/AVIF support in Pillow so iPhone photos and AVIF files can be opened.
register_heif_opener()
register_avif_opener()

# Load the processor and model once, rather than on every call.
image_processor = AutoImageProcessor.from_pretrained("AashishKumar/AIvisionGuard-v2")
model = ViTForImageClassification.from_pretrained("AashishKumar/AIvisionGuard-v2")
model.eval()

def get_prediction(img):
    image = Image.open(img).convert("RGB")
    inputs = image_processor(image, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits

    # Convert raw logits to probabilities so the scores are comparable.
    probs = logits.softmax(dim=-1)
    top2 = probs.topk(2)
    top2_labels = top2.indices.squeeze().tolist()
    top2_scores = top2.values.squeeze().tolist()

    return [
        {"label": model.config.id2label[label], "score": score}
        for label, score in zip(top2_labels, top2_scores)
    ]
```
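
A minimal call, assuming a local image file exists (the path here is illustrative); the label names come from the model's `id2label` mapping:

```python
# "sample.png" is a placeholder path; any format Pillow (with HEIF/AVIF enabled) can open works.
predictions = get_prediction("sample.png")
print(predictions)
# e.g. [{"label": "...", "score": 0.97}, {"label": "...", "score": 0.03}]
```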

## Dataset Information

The model was fine-tuned on the **CIFake dataset**, which contains both real and AI-generated synthetic images (a loading sketch follows the list):
- **Real Images:** Collected from the CIFAR-10 dataset.
- **Fake Images:** Generated using Stable Diffusion 1.4.
- **Training Data:** 100,000 images (50,000 per class).
- **Testing Data:** 20,000 images (10,000 per class).
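
This card's metadata points at a Hub mirror of CIFake; a minimal sketch of loading it with the `datasets` library, assuming the common image-classification layout (the split and column names are not verified against this mirror):

```python
from datasets import load_dataset

# Hub mirror listed in this card's metadata; the "train" split and
# "image"/"label" columns are assumptions, so inspect `ds` first.
ds = load_dataset("Hemg/cifake-real-and-ai-generated-synthetic-images")
print(ds)
example = ds["train"][0]
print(example["label"])
```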

## Model Architecture

- **Transformer Encoder Layers:** Apply self-attention over 16×16 image patches (a 224×224 input yields 14×14 = 196 patches plus a [CLS] token).
- **Positional Encodings:** Give the model information about where each patch sits in the image.
- **Pretrained Weights:** Pretrained on ImageNet-21k and fine-tuned on ImageNet-1k (ILSVRC 2012) before the CIFake fine-tuning.

### Why Vision Transformer?

- **Scalability and Performance:** Excels at extracting high-level global features from the whole image.
- **Strong Accuracy:** With sufficient pretraining data, transformer classifiers match or outperform comparable CNN baselines on image classification.

## Training Details

- **Learning Rate:** 1e-7
- **Batch Size:** 64
- **Epochs:** 100
- **Training Time:** 1 hr 36 min
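
The training script itself is not published with this card; purely as an illustration, the reported hyperparameters would map onto `transformers.TrainingArguments` roughly as follows (anything not listed above is an assumption):

```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="aivisionguard-finetune",  # hypothetical output directory
    learning_rate=1e-7,                   # reported learning rate
    per_device_train_batch_size=64,       # reported batch size
    num_train_epochs=100,                 # reported epoch count
)

# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=..., eval_dataset=...)  # datasets elided
# trainer.train()
```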

## Evaluation Metrics

The model was evaluated on the CIFake test split, with the following metrics (a reproduction sketch follows the comparison table):

- **Accuracy:** 92%
- **F1 Score:** 0.89
- **Precision:** 0.85
- **Recall:** 0.88

| Model         | Accuracy | F1-Score | Precision | Recall |
|---------------|----------|----------|-----------|--------|
| Baseline      | 85%      | 0.82     | 0.78      | 0.80   |
| Augmented     | 88%      | 0.85     | 0.83      | 0.84   |
| Fine-tuned ViT| **92%**  | **0.89** | **0.85**  | **0.88**|
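
To recompute metrics like these from model outputs, a minimal scikit-learn sketch; `y_true` and `y_pred` are placeholder label ids standing in for predictions over the full test split:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Placeholder label ids; in practice, collect these over the 20,000-image test split.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
```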

## Evaluation Figure
![Evaluation results](https://cdn-uploads.huggingface.co/production/uploads/640ed1fb06c3b5ca883d5ad5/vmiE8IhMLUwJIOLK-Q9dT.png)

## System Workflow

- **Frontend:** ReactJS
- **Backend:** Python Flask (a minimal endpoint sketch follows the list)
- **Database:** PostgreSQL (Supabase)
- **Model:** Deployed via the PyTorch and TensorFlow frameworks
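
A minimal sketch of such a Flask endpoint wrapping the `get_prediction` helper defined above; the route and the `file` field name are assumptions, not the deployed API:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/predict", methods=["POST"])  # hypothetical route
def predict():
    # Assumes the image arrives as a multipart upload under the "file" field;
    # Image.open() accepts the file-like object Flask provides.
    uploaded = request.files["file"]
    return jsonify(get_prediction(uploaded))

if __name__ == "__main__":
    app.run(port=5000)
```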

## Strengths and Limitations

### Strengths:
- **High Accuracy:** Achieves 92% accuracy on the CIFake test split when distinguishing real from synthetic images.
- **Pretrained on ImageNet-21k:** Allows for efficient transfer learning and robust generalization.

### Limitations:
- **Synthetic Image Diversity:** The model may underperform on novel or unseen synthetic images that are significantly different from the training data.
- **Data Bias:** Like all machine learning models, its predictions may reflect biases present in the training data.

## Conclusion and Future Work

This model provides a highly effective tool for detecting AI-generated synthetic images and has promising applications in content moderation, digital forensics, and trust preservation. Future improvements may include:
- **Hybrid Architectures:** Combining transformers with convolutional layers for improved performance.
- **Multimodal Detection:** Incorporating additional modalities (e.g., metadata or contextual information) for more comprehensive detection.