---
license: apache-2.0
language:
- en
base_model:
- HuggingFaceTB/SmolLM2-360M
---

# SMOLLM_VISON_Image_Captioner

## Overview

This project implements an image captioning model that combines OpenAI's CLIP with a causal language model (LLM). CLIP extracts image features, and a fine-tuned LLM generates the caption. The model is trained on the Flickr8k dataset.

## Requirements

Before running the code, ensure you have installed the necessary dependencies:

```bash
pip install transformers==4.47.0 torch opencv-python matplotlib pillow requests
```

## Model and Tokenizer Configuration

The code uses the following models:

- CLIP: `openai/clip-vit-large-patch14`
- LLM: `alibidaran/SMOLL_image_captioner`
- Tokenizer: `HuggingFaceTB/SmolLM2-360M`

## Installation and Setup

### Load Necessary Libraries

```python
import cv2
import torch
import requests
import matplotlib.pyplot as plt
from PIL import Image
from transformers import CLIPProcessor, CLIPModel
from transformers import AutoTokenizer, AutoModelForCausalLM
```

### Load CLIP Model

```python
clip_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to('cuda:0')
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# A CUDA-capable GPU is expected by the snippets below
print(torch.cuda.is_available())
```

### Load Tokenizer and LLM Model

```python
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M")
llm_model = AutoModelForCausalLM.from_pretrained("alibidaran/SMOLL_image_captioner").to(device)
```

### Download Pretrained Model Weights

```bash
wget https://huggingface.co/alibidaran/SMOLL_image_captioner/resolve/main/content/SMOLL_image_captioner.pt
```

## Image Captioning Model

### Load Model Weights

```python
from SMOLLM_VisionModel import SMOLLm_VISION_ImageCaptioning, SmoLLM_processor

# Vision-language wrapper around the fine-tuned LLM
image_captioning_model = SMOLLm_VISION_ImageCaptioning(llm_model=llm_model, hidden_dim=4096).to('cuda')
model = image_captioning_model

# Processor that turns an image into CLIP features for the captioner
processor = SmoLLM_processor(image_model=clip_model, image_processor=clip_processor)

# Pretrained captioning model downloaded in the previous step
saved_model = torch.load('/content/SMOLL_image_captioner.pt', map_location=torch.device('cuda'))
```

## Image Caption Generation

### Load Image and Extract Features

```python
# Local path to the image to caption
image_url = '/content/54322546688_71515f8335_w.jpg'
image_features = processor.get_features(image_url, device='cuda')
```

### Generate Caption

```python
tokenizer.pad_token = tokenizer.eos_token

# Prompt template, kept verbatim as used during fine-tuning
prompt = """
##User Write a caption
##Assitant:"""

# Tokenize the prompt
tokenized = tokenizer(prompt, return_tensors='pt')
label = tokenized['input_ids'].to('cuda')
att = tokenized['attention_mask'].to('cuda')

# Build the multimodal input embeddings with the pretrained captioner
with torch.no_grad():
    _, embeds = saved_model(image_features.unsqueeze(0).to('cuda'), label, att)

# Generate the caption
generate_kwargs = {
    "input_ids": None,
    "inputs_embeds": embeds,
    "max_new_tokens": 50,
}
output = saved_model.llm_model.generate(**generate_kwargs, do_sample=True, temperature=0.8, top_p=0.99, top_k=10)

# Decode and display the result
print(tokenizer.decode(output[0]))

image = cv2.cvtColor(cv2.imread(image_url), cv2.COLOR_BGR2RGB)
plt.imshow(image)
plt.axis('off')
plt.show()
```
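
## Putting It Together

For convenience, the individual steps above can be wrapped into a single helper. The sketch below is illustrative rather than part of the released code: it assumes the objects created earlier in this README (`processor`, `saved_model`, `tokenizer`) are already in scope, and the `caption_image` name is a hypothetical helper introduced here.

```python
def caption_image(image_path, max_new_tokens=50):
    """Generate a caption for a local image file (illustrative helper, not part of the released code)."""
    # CLIP features for the image
    features = processor.get_features(image_path, device='cuda')

    # Same prompt template as in the walkthrough above, kept verbatim
    prompt = """
##User Write a caption
##Assitant:"""
    tokenized = tokenizer(prompt, return_tensors='pt')
    ids = tokenized['input_ids'].to('cuda')
    att = tokenized['attention_mask'].to('cuda')

    # Build the multimodal embeddings and sample a caption
    with torch.no_grad():
        _, embeds = saved_model(features.unsqueeze(0).to('cuda'), ids, att)
        output = saved_model.llm_model.generate(
            input_ids=None,
            inputs_embeds=embeds,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.8,
            top_p=0.99,
            top_k=10,
        )

    return tokenizer.decode(output[0], skip_special_tokens=True)
```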
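
Example usage, reusing the image path from the walkthrough (any local image path works):

```python
caption = caption_image('/content/54322546688_71515f8335_w.jpg')
print(caption)
```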