---
license: apache-2.0
language:
- en
base_model:
- HuggingFaceTB/SmolLM2-360M
---
|
# SMOLLM_VISION_Image_Captioner

## Overview

This project implements an image-captioning model built from OpenAI's CLIP and a causal language model (LLM). CLIP extracts the image features, and a fine-tuned LLM generates a caption from them. The model is trained on the Flickr8k dataset.
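
The core idea, sketched minimally below: the wrapper projects CLIP's pooled image embedding into the LLM's token-embedding space and prepends it to the prompt embeddings, so the LLM decodes a caption conditioned on the image. The class and attribute names in this sketch (`VisionCaptioner`, `proj`) are illustrative, not the repository's actual implementation.

```python
import torch
import torch.nn as nn

class VisionCaptioner(nn.Module):
    """Illustrative wrapper: condition an LLM on a CLIP image embedding."""

    def __init__(self, llm_model, clip_dim=768):
        super().__init__()
        self.llm_model = llm_model
        # Map CLIP's pooled image embedding to the LLM's hidden size
        self.proj = nn.Linear(clip_dim, llm_model.config.hidden_size)

    def forward(self, image_features, input_ids, attention_mask):
        # image_features: (B, clip_dim) pooled CLIP embedding
        img_embeds = self.proj(image_features).unsqueeze(1)            # (B, 1, H)
        txt_embeds = self.llm_model.get_input_embeddings()(input_ids)  # (B, T, H)
        # Prepend the image "token" to the prompt embeddings
        embeds = torch.cat([img_embeds, txt_embeds], dim=1)            # (B, 1+T, H)
        return None, embeds  # mirrors the `_, embeds = model(...)` usage below
```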
|
## Requirements

Before running the code, ensure you have installed the necessary dependencies:

```bash
pip install transformers==4.47.0 torch opencv-python matplotlib pillow requests
```
|
## Model and Tokenizer Configuration

The code uses the following models:

- CLIP: `openai/clip-vit-large-patch14`
- LLM: `alibidaran/SMOLL_image_captioner`
- Tokenizer: `HuggingFaceTB/SmolLM2-360M`
|
## Installation and Setup

### Load Necessary Libraries

```python
import cv2
import matplotlib.pyplot as plt
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoTokenizer, CLIPModel, CLIPProcessor
```
|
### Load CLIP Model

```python
# Select the device up front; a CUDA GPU is recommended
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using device: {device}")

clip_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device)
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
```
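
As a quick sanity check, you can extract features for one image directly with CLIP. The image path here is a placeholder; for `openai/clip-vit-large-patch14` the projected image embedding has 768 dimensions.

```python
image = Image.open('sample.jpg')  # placeholder path; use any local image
inputs = clip_processor(images=image, return_tensors='pt').to(device)
with torch.no_grad():
    features = clip_model.get_image_features(**inputs)
print(features.shape)  # torch.Size([1, 768])
```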
|
### Load Tokenizer and LLM Model

```python
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M")
llm_model = AutoModelForCausalLM.from_pretrained("alibidaran/SMOLL_image_captioner").to(device)
```
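
Note that the tokenizer comes from the base model (`HuggingFaceTB/SmolLM2-360M`), while the LLM weights come from the fine-tuned repository (`alibidaran/SMOLL_image_captioner`).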
|
### Download Pretrained Model Weights

```bash
wget https://huggingface.co/alibidaran/SMOLL_image_captioner/resolve/main/content/SMOLL_image_captioner.pt
```
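
Alternatively, you can fetch the checkpoint with `huggingface_hub` (installed alongside `transformers`); note that it returns a path inside the local Hugging Face cache, so adjust the `torch.load` path below accordingly.

```python
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="alibidaran/SMOLL_image_captioner",
    filename="content/SMOLL_image_captioner.pt",
)
print(ckpt_path)  # local cache path to the checkpoint
```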
|
## Image Captioning Model

### Load Model Weights

```python
from SMOLLM_VisionModel import SMOLLm_VISION_ImageCaptioning, SmoLLM_processor

# Wrap the LLM with the vision head and build the CLIP-based feature processor
image_captioning_model = SMOLLm_VISION_ImageCaptioning(llm_model=llm_model, hidden_dim=4096).to(device)
processor = SmoLLM_processor(image_model=clip_model, image_processor=clip_processor)

# The checkpoint stores the full fine-tuned model; use it for inference
saved_model = torch.load('/content/SMOLL_image_captioner.pt', map_location=torch.device(device))
model = saved_model
```
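
Because the checkpoint appears to be a full pickled model object rather than a `state_dict`, `torch.load` needs the `SMOLLM_VisionModel` classes to be importable when it deserializes; keep the import above in place.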
|
## Image Caption Generation

### Load Image and Extract Features

```python
# Local path to the input image (not a URL)
image_path = '/content/54322546688_71515f8335_w.jpg'
image_features = processor.get_features(image_path, device=device)
```
|
### Generate Caption

```python
tokenizer.pad_token = tokenizer.eos_token

# Prompt template (must match the format used during fine-tuning)
prompt = """
##User <image> Write a caption
##Assitant:"""

# Tokenize the prompt
tokenized = tokenizer(prompt, return_tensors='pt')
input_ids = tokenized['input_ids'].to(device)
attention_mask = tokenized['attention_mask'].to(device)

# Build the combined image + prompt embeddings, then generate the caption
with torch.no_grad():
    _, embeds = model(image_features.unsqueeze(0).to(device), input_ids, attention_mask)
    generate_kwargs = {
        "input_ids": None,
        "inputs_embeds": embeds,
        "max_new_tokens": 50,
    }
    output = saved_model.llm_model.generate(**generate_kwargs, do_sample=True, temperature=0.8, top_p=0.99, top_k=10)

# Decode and display the result
print(tokenizer.decode(output[0], skip_special_tokens=True))
image = Image.open(image_path)
plt.imshow(image)
plt.axis('off')
plt.show()
```
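
Putting the pieces together, a small helper makes it easy to caption arbitrary images. This is an illustrative convenience wrapper around the objects defined above, not part of the repository:

```python
def caption_image(path, max_new_tokens=50):
    """Generate a caption for the image at `path` with the loaded model."""
    features = processor.get_features(path, device=device)
    tokenized = tokenizer(prompt, return_tensors='pt')
    input_ids = tokenized['input_ids'].to(device)
    attention_mask = tokenized['attention_mask'].to(device)
    with torch.no_grad():
        _, embeds = model(features.unsqueeze(0).to(device), input_ids, attention_mask)
        output = saved_model.llm_model.generate(
            inputs_embeds=embeds,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.8,
            top_p=0.99,
            top_k=10,
        )
    return tokenizer.decode(output[0], skip_special_tokens=True)

print(caption_image('/content/54322546688_71515f8335_w.jpg'))
```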