---
license: apache-2.0
language:
- en
base_model:
- HuggingFaceTB/SmolLM2-360M
---
|
# SMOLLM_VISION_Image_Captioner

## Overview

This project implements an image-captioning model built from OpenAI's CLIP and a causal language model (LLM). CLIP extracts the image features, and a fine-tuned LLM generates a caption from them. The model is trained on the Flickr8k dataset.
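
The core idea, sketched minimally below: the wrapper projects CLIP's pooled image embedding into the LLM's token-embedding space and prepends it to the prompt embeddings, so the LLM decodes a caption conditioned on the image. The class and attribute names in this sketch (`VisionCaptioner`, `proj`) are illustrative, not the repository's actual implementation.

```python
import torch
import torch.nn as nn

class VisionCaptioner(nn.Module):
    """Illustrative wrapper: condition an LLM on a CLIP image embedding."""

    def __init__(self, llm_model, clip_dim=768):
        super().__init__()
        self.llm_model = llm_model
        # Map CLIP's pooled image embedding to the LLM's hidden size
        self.proj = nn.Linear(clip_dim, llm_model.config.hidden_size)

    def forward(self, image_features, input_ids, attention_mask):
        # image_features: (B, clip_dim) pooled CLIP embedding
        img_embeds = self.proj(image_features).unsqueeze(1)            # (B, 1, H)
        txt_embeds = self.llm_model.get_input_embeddings()(input_ids)  # (B, T, H)
        # Prepend the image "token" to the prompt embeddings
        embeds = torch.cat([img_embeds, txt_embeds], dim=1)            # (B, 1+T, H)
        return None, embeds  # mirrors the `_, embeds = model(...)` usage below
```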
|
## Requirements

Before running the code, ensure you have installed the necessary dependencies:

```bash
pip install transformers==4.47.0 torch opencv-python matplotlib pillow requests
```
|
## Model and Tokenizer Configuration

The code uses the following models:

- CLIP: `openai/clip-vit-large-patch14`
- LLM: `alibidaran/SMOLL_image_captioner`
- Tokenizer: `HuggingFaceTB/SmolLM2-360M`
|
## Installation and Setup

### Load Necessary Libraries

```python
import cv2
import matplotlib.pyplot as plt
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer, CLIPModel, CLIPProcessor
```
|
### Load CLIP Model

```python
# Select the device up front; a CUDA GPU is recommended
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

clip_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device)
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
```
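
As a quick sanity check, you can extract features for one image directly with CLIP. The image path here is a placeholder; for `openai/clip-vit-large-patch14` the projected image embedding has 768 dimensions.

```python
image = Image.open('sample.jpg')  # placeholder path; use any local image
inputs = clip_processor(images=image, return_tensors='pt').to(device)
with torch.no_grad():
    features = clip_model.get_image_features(**inputs)
print(features.shape)  # torch.Size([1, 768])
```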
|
### Load Tokenizer and LLM Model

```python
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M")
llm_model = AutoModelForCausalLM.from_pretrained("alibidaran/SMOLL_image_captioner").to(device)
```
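
Note that the tokenizer comes from the base model (`HuggingFaceTB/SmolLM2-360M`), while the LLM weights come from the fine-tuned repository (`alibidaran/SMOLL_image_captioner`).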
|
### Download Pretrained Model Weights

```bash
wget https://huggingface.co/alibidaran/SMOLL_image_captioner/resolve/main/content/SMOLL_image_captioner.pt
```
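
Alternatively, you can fetch the checkpoint with `huggingface_hub` (installed alongside `transformers`); note that it returns a path inside the local Hugging Face cache, so adjust the `torch.load` path below accordingly.

```python
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="alibidaran/SMOLL_image_captioner",
    filename="content/SMOLL_image_captioner.pt",
)
print(ckpt_path)  # local cache path to the checkpoint
```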
|
## Image Captioning Model

### Load Model Weights

```python
from SMOLLM_VisionModel import SMOLLm_VISION_ImageCaptioning, SmoLLM_processor

# Wrap the LLM with the vision head and build the CLIP-based feature processor
image_captioning_model = SMOLLm_VISION_ImageCaptioning(llm_model=llm_model, hidden_dim=4096).to(device)
processor = SmoLLM_processor(image_model=clip_model, image_processor=clip_processor)

# The checkpoint stores the full fine-tuned model; use it for inference
saved_model = torch.load('/content/SMOLL_image_captioner.pt', map_location=torch.device(device))
model = saved_model
```
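
Because the checkpoint appears to be a full pickled model object rather than a `state_dict`, `torch.load` needs the `SMOLLM_VisionModel` classes to be importable when it deserializes; keep the import above in place.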
|
## Image Caption Generation

### Load Image and Extract Features

```python
# Local path to the input image (not a URL)
image_path = '/content/54322546688_71515f8335_w.jpg'
image_features = processor.get_features(image_path, device=device)
```
|
### Generate Caption

```python
tokenizer.pad_token = tokenizer.eos_token

# Prompt template (must match the format used during fine-tuning)
prompt = """
##User <image> Write a caption
##Assitant:"""

# Tokenize the prompt
tokenized = tokenizer(prompt, return_tensors='pt')
input_ids = tokenized['input_ids'].to(device)
attention_mask = tokenized['attention_mask'].to(device)

# Build the combined image + prompt embeddings, then generate the caption
with torch.no_grad():
    _, embeds = model(image_features.unsqueeze(0).to(device), input_ids, attention_mask)
    generate_kwargs = {
        "input_ids": None,
        "inputs_embeds": embeds,
        "max_new_tokens": 50,
    }
    output = saved_model.llm_model.generate(**generate_kwargs, do_sample=True, temperature=0.8, top_p=0.99, top_k=10)

# Decode and display the result
print(tokenizer.decode(output[0], skip_special_tokens=True))
image = Image.open(image_path)
plt.imshow(image)
plt.axis('off')
plt.show()
```
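
Putting the pieces together, a small helper makes it easy to caption arbitrary images. This is an illustrative convenience wrapper around the objects defined above, not part of the repository:

```python
def caption_image(path, max_new_tokens=50):
    """Generate a caption for the image at `path` with the loaded model."""
    features = processor.get_features(path, device=device)
    tokenized = tokenizer(prompt, return_tensors='pt')
    input_ids = tokenized['input_ids'].to(device)
    attention_mask = tokenized['attention_mask'].to(device)
    with torch.no_grad():
        _, embeds = model(features.unsqueeze(0).to(device), input_ids, attention_mask)
        output = saved_model.llm_model.generate(
            inputs_embeds=embeds,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.8,
            top_p=0.99,
            top_k=10,
        )
    return tokenizer.decode(output[0], skip_special_tokens=True)

print(caption_image('/content/54322546688_71515f8335_w.jpg'))
```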