|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- Hemg/cifake-real-and-ai-generated-synthetic-images |
|
language: |
|
- en |
|
metrics: |
|
- accuracy |
|
library_name: transformers |
|
tags: |
|
- Diffusers
|
- GanDetectors |
|
- Cifake |
|
base_model: |
|
- google/vit-base-patch16-224 |
|
inference: true
|
--- |
|
# AI Guard Vision Model Card |
|
|
|
[Apache 2.0 License](LICENSE)
|
|
|
## Overview |
|
|
|
This model, **AI Guard Vision**, is a Vision Transformer (ViT)-based image classifier. Its primary objective is to accurately distinguish real images from AI-generated synthetic ones, addressing the growing challenge of detecting manipulated or fake visual content and helping preserve trust and integrity in digital media.
|
|
|
## Model Summary |
|
|
|
- **Model Type:** Vision Transformer (ViT) – `vit-base-patch16-224` |
|
- **Objective:** Real vs. AI-generated image classification |
|
- **License:** Apache 2.0 |
|
- **Fine-tuned From:** `google/vit-base-patch16-224` |
|
- **Training Dataset:** [CIFake Dataset](https://www.kaggle.com/datasets/birdy654/cifake-real-and-ai-generated-synthetic-images) |
|
- **Developer:** Aashish Kumar, IIIT Manipur |
|
|
|
## Applications & Use Cases |
|
|
|
- **Content Moderation:** Identifying AI-generated images across media platforms. |
|
- **Digital Forensics:** Verifying the authenticity of visual content for investigative purposes. |
|
- **Trust Preservation:** Helping maintain the integrity of digital ecosystems by combating misinformation spread through fake images. |
|
|
|
## How to Use the Model |
|
|
|
```python
from transformers import AutoImageProcessor, ViTForImageClassification
import torch
from PIL import Image
from pillow_heif import register_heif_opener, register_avif_opener

# Register HEIF/AVIF openers so Pillow can also decode those formats
register_heif_opener()
register_avif_opener()

# Load the processor and model once so repeated predictions stay fast
image_processor = AutoImageProcessor.from_pretrained("AashishKumar/AIvisionGuard-v2")
model = ViTForImageClassification.from_pretrained("AashishKumar/AIvisionGuard-v2")


def get_prediction(img):
    image = Image.open(img).convert("RGB")
    inputs = image_processor(image, return_tensors="pt")

    with torch.no_grad():
        logits = model(**inputs).logits

    # Rank both classes by logit value, highest first
    top2 = logits.topk(2)
    top2_labels = top2.indices.squeeze().tolist()
    top2_scores = top2.values.squeeze().tolist()

    return [
        {"label": model.config.id2label[label], "score": score}
        for label, score in zip(top2_labels, top2_scores)
    ]
```
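
For example, assuming a local file named `example.jpg` (a placeholder path), the helper can be called directly:

```python
# "example.jpg" is a placeholder; any format Pillow (plus the registered
# HEIF/AVIF openers) can decode will work.
predictions = get_prediction("example.jpg")
for p in predictions:
    print(f"{p['label']}: {p['score']:.3f}")
```

Note that the returned scores are raw logits rather than softmax probabilities.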
|
|
|
## Dataset Information |
|
|
|
The model was fine-tuned on the **CIFake dataset**, which contains both real and AI-generated synthetic images (a loading sketch follows this list):
|
- **Real Images:** Collected from the CIFAR-10 dataset. |
|
- **Fake Images:** Generated using Stable Diffusion 1.4. |
|
- **Training Data:** 100,000 images (50,000 per class). |
|
- **Testing Data:** 20,000 images (10,000 per class). |
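
As a minimal sketch, the Hugging Face mirror of the dataset listed in this card's metadata can be loaded with the `datasets` library (split and feature names are assumptions and may differ from the original Kaggle release):

```python
from datasets import load_dataset

# Hugging Face mirror referenced in this card's metadata.
ds = load_dataset("Hemg/cifake-real-and-ai-generated-synthetic-images")
print(ds)  # inspect the available splits, image column, and label names
```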
|
|
|
## Model Architecture |
|
|
|
- **Transformer Encoder Layers:** Utilize self-attention mechanisms over image patches.

- **Positional Encodings:** Help the model retain the spatial layout of the patches (the patch arithmetic is sketched after this list).

- **Pretrained Weights:** Pretrained on ImageNet-21k and fine-tuned on ImageNet 2012 for enhanced performance.
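
To make the patch layout concrete, the arithmetic below shows how a 224x224 input becomes a token sequence in the `vit-base-patch16-224` configuration (these are standard ViT-Base numbers, not anything specific to this fine-tune):

```python
image_size, patch_size = 224, 16

# The image is cut into non-overlapping 16x16 patches ...
num_patches = (image_size // patch_size) ** 2  # 14 * 14 = 196 patch tokens
# ... and a [CLS] token is prepended, so the positional encodings
# cover 197 sequence positions.
seq_len = num_patches + 1
print(num_patches, seq_len)  # 196 197
```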
|
|
|
### Why Vision Transformer? |
|
|
|
- **Scalability and Performance:** Excels at high-level global feature extraction. |
|
- **State-of-the-Art Accuracy:** Transformer-based models can match or exceed traditional CNNs on this task when sufficient training data is available.
|
|
|
## Training Details |
|
|
|
- **Learning Rate:** 1e-7 (0.0000001; see the `TrainingArguments` sketch after this list)
|
- **Batch Size:** 64 |
|
- **Epochs:** 100 |
|
- **Training Time:** 1 hr 36 min |
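
A minimal sketch of these settings as Hugging Face `TrainingArguments`; the authors' training script is not published, so everything beyond the three reported values (learning rate, batch size, epochs) is an assumption:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="aivisionguard-vit",  # placeholder output path
    learning_rate=1e-7,              # reported learning rate (0.0000001)
    per_device_train_batch_size=64,  # reported batch size
    num_train_epochs=100,            # reported epoch count
)
```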
|
|
|
## Evaluation Metrics |
|
|
|
The model was evaluated on the CIFake test set, with the following metrics (a computation sketch follows the comparison table):
|
|
|
- **Accuracy:** 92% |
|
- **F1 Score:** 0.89 |
|
- **Precision:** 0.85 |
|
- **Recall:** 0.88 |
|
|
|
| Model | Accuracy | F1-Score | Precision | Recall | |
|
|---------------|----------|----------|-----------|--------| |
|
| Baseline | 85% | 0.82 | 0.78 | 0.80 | |
|
| Augmented | 88% | 0.85 | 0.83 | 0.84 | |
|
| Fine-tuned ViT | **92%**  | **0.89** | **0.85**  | **0.88** |
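
For reference, metrics like these can be computed from per-image predictions with scikit-learn; below is a minimal sketch with placeholder arrays (the label convention, 1 = fake, is an assumption):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Placeholder arrays standing in for the CIFake test labels and the
# model's predicted classes (1 = fake, 0 = real is assumed here).
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))
```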
|
|
|
|
|
|
## System Workflow |
|
|
|
- **Frontend:** ReactJS |
|
- **Backend:** Python Flask |
|
- **Database:** PostgreSQL (Supabase)

- **Model:** Deployed via the PyTorch and TensorFlow frameworks (a minimal Flask sketch follows this list)
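
As an illustration of how the pieces fit together, a minimal Flask endpoint wrapping `get_prediction` from the usage section might look like the sketch below; the route name and request format are assumptions, since the deployed service itself is not published:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # The ReactJS frontend is assumed to upload the image as multipart form data.
    img_file = request.files["image"]
    # get_prediction is the helper defined in "How to Use the Model" above.
    return jsonify(get_prediction(img_file))

if __name__ == "__main__":
    app.run()
```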
|
|
|
## Strengths and Limitations |
|
|
|
### Strengths: |
|
- **High Accuracy:** Achieves state-of-the-art performance in distinguishing real and synthetic images. |
|
- **Pretrained on ImageNet-21k:** Allows for efficient transfer learning and robust generalization. |
|
|
|
### Limitations: |
|
- **Synthetic Image Diversity:** The model may underperform on novel or unseen synthetic images that are significantly different from the training data. |
|
- **Data Bias:** Like all machine learning models, its predictions may reflect biases present in the training data. |
|
|
|
## Conclusion and Future Work |
|
|
|
This model provides a highly effective tool for detecting AI-generated synthetic images and has promising applications in content moderation, digital forensics, and trust preservation. Future improvements may include: |
|
- **Hybrid Architectures:** Combining transformers with convolutional layers for improved performance. |
|
- **Multimodal Detection:** Incorporating additional modalities (e.g., metadata or contextual information) for more comprehensive detection. |