---
license: apache-2.0
language:
- en
base_model:
- HuggingFaceTB/SmolLM2-360M
---
# SMOLLM_VISON_Image_Captioner

## Overview
This project implements an image captioning model that combines OpenAI's CLIP with a causal language model (LLM). CLIP extracts image features, and a fine-tuned LLM (based on SmolLM2-360M) generates the caption from them. The model is trained on the Flickr8k dataset.

## Requirements
Before running the code, ensure you have installed the necessary dependencies:
```bash
pip install transformers==4.47.0 torch opencv-python matplotlib pillow requests
```
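
To quickly verify the environment (optional), the following check confirms the pinned `transformers` version and that a CUDA GPU is visible:
```python
import torch
import transformers

print(transformers.__version__)   # expected: 4.47.0, as pinned above
print(torch.cuda.is_available())  # the examples below assume a CUDA GPU
```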

## Model and Tokenizer Configuration
The code uses the following models and tokenizer (also collected as constants in the snippet below):
- CLIP (image encoder): `openai/clip-vit-large-patch14`
- LLM (caption generator): `alibidaran/SMOLL_image_captioner`
- Tokenizer: `HuggingFaceTB/SmolLM2-360M`
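
For reference, these are the only Hugging Face Hub identifiers the snippets below rely on; collecting them as constants (the constant names are illustrative, not part of the repository) makes it easier to swap checkpoints:
```python
# Hub identifiers used throughout this card (constant names are illustrative)
CLIP_MODEL_ID = "openai/clip-vit-large-patch14"
LLM_MODEL_ID = "alibidaran/SMOLL_image_captioner"
TOKENIZER_ID = "HuggingFaceTB/SmolLM2-360M"
```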

## Installation and Setup
### Load Necessary Libraries
```python
import cv2
import matplotlib.pyplot as plt
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer, CLIPModel, CLIPProcessor
```

### Load CLIP Model
```python
print(torch.cuda.is_available())  # sanity check: the examples below assume a CUDA GPU

# CLIP vision encoder used to extract image features
clip_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to('cuda:0')
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
```
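
Optionally, you can confirm that CLIP produces the expected pooled image features; for `clip-vit-large-patch14` the feature vector is 768-dimensional. The image path below is a placeholder:
```python
from PIL import Image

img = Image.open('/content/example.jpg')  # placeholder: any RGB image
inputs = clip_processor(images=img, return_tensors='pt').to('cuda:0')
with torch.no_grad():
    feats = clip_model.get_image_features(**inputs)
print(feats.shape)  # torch.Size([1, 768])
```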

### Load Tokenizer and LLM Model
```python
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# SmolLM2 tokenizer and the fine-tuned causal LM used for caption generation
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M")
llm_model = AutoModelForCausalLM.from_pretrained("alibidaran/SMOLL_image_captioner").to(device)
```
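
As an optional sanity check (a minimal sketch, not part of the original workflow), confirm the model landed on the GPU and that the tokenizer round-trips text:
```python
print(next(llm_model.parameters()).device)  # expect cuda:0 when a GPU is available
print(tokenizer.decode(tokenizer("a dog runs across the grass")['input_ids']))
```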

### Download Pretrained Model Weights
```bash
wget https://huggingface.co/alibidaran/SMOLL_image_captioner/resolve/main/content/SMOLL_image_captioner.pt
```
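
Alternatively, the same checkpoint can be downloaded with `huggingface_hub`, which caches the file locally and returns its path:
```python
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="alibidaran/SMOLL_image_captioner",
    filename="content/SMOLL_image_captioner.pt",
)
print(ckpt_path)  # pass this path to torch.load below instead of '/content/...'
```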

## Image Captioning Model

### Load Model Weights
```python
from SMOLLM_VisionModel import SMOLLm_VISION_ImageCaptioning, SmoLLM_processor

# Captioning model built around the fine-tuned LLM, and the CLIP-based feature processor
image_captioning_model = SMOLLm_VISION_ImageCaptioning(llm_model=llm_model, hidden_dim=4096).to('cuda')
processor = SmoLLM_processor(image_model=clip_model, image_processor=clip_processor)

# Load the full pretrained checkpoint downloaded above and use it for inference
# (recent PyTorch versions may need torch.load(..., weights_only=False) for pickled modules)
saved_model = torch.load('/content/SMOLL_image_captioner.pt', map_location=torch.device('cuda'))
model = saved_model
```

## Image Caption Generation
### Load Image and Extract Features
```python
import cv2
import matplotlib.pyplot as plt

image_path = '/content/54322546688_71515f8335_w.jpg'  # local path to the image to caption

# Load the image for display (OpenCV reads BGR; convert to RGB for matplotlib)
image = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)
# Extract image features with the CLIP-based processor defined above
image_features = processor.get_features(image_path, device='cuda')
```

### Generate Caption
```python
tokenizer.pad_token = tokenizer.eos_token

# Prompt template expected by the fine-tuned model (kept verbatim, including spelling)
prompt = """
        ##User <image> Write a caption
        ##Assitant:"""

# Tokenize input
tokenized = tokenizer(prompt, return_tensors='pt')
label = tokenized['input_ids'].to('cuda')
att = tokenized['attention_mask'].to('cuda')

# Generate caption: the model combines the image features with the tokenized prompt
# and returns embeddings that are fed to the LLM via `inputs_embeds`
with torch.no_grad():
    _, embeds = model(image_features.unsqueeze(0).to('cuda'), label, att)
    generate_kwargs = {
        "input_ids": None,
        "inputs_embeds": embeds,
        "max_new_tokens": 50,
    }
    output = model.llm_model.generate(**generate_kwargs, do_sample=True, temperature=0.8, top_p=0.99, top_k=10)

# Decode and display the result alongside the input image
print(tokenizer.decode(output[0]))
plt.imshow(image)
plt.axis('off')
plt.show()
```
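
For repeated use, the steps above can be folded into a small helper. This is a sketch built from the objects already defined (`processor`, `model`, `tokenizer`, `prompt`); the function name is illustrative and not part of the repository:
```python
def caption_image(image_path, max_new_tokens=50):
    """Generate a caption for a local image file (illustrative helper)."""
    features = processor.get_features(image_path, device='cuda')
    tokens = tokenizer(prompt, return_tensors='pt')
    ids = tokens['input_ids'].to('cuda')
    mask = tokens['attention_mask'].to('cuda')
    with torch.no_grad():
        # Combine image features with the tokenized prompt, then generate from the embeddings
        _, embeds = model(features.unsqueeze(0).to('cuda'), ids, mask)
        output = model.llm_model.generate(
            inputs_embeds=embeds,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.8,
            top_p=0.99,
            top_k=10,
        )
    return tokenizer.decode(output[0], skip_special_tokens=True)

print(caption_image('/content/54322546688_71515f8335_w.jpg'))
```
The sampling settings match the example above; lowering the temperature or top-k yields more deterministic captions.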