|
---
license: mit
datasets:
- AnyModal/flickr30k
base_model:
- meta-llama/Llama-3.2-1B
- google/vit-base-patch16-224
language:
- en
pipeline_tag: image-to-text
library_name: AnyModal
tags:
- vlm
- vision
- multimodal
- AnyModal
---
|
# AnyModal/Image-Captioning-Llama-3.2-1B |
|
|
|
**AnyModal/Image-Captioning-Llama-3.2-1B** is an image captioning model built with the [AnyModal](https://github.com/ritabratamaiti/AnyModal) framework. It pairs a Vision Transformer (ViT) encoder with the Llama 3.2-1B language model and was trained on the Flickr30k dataset, demonstrating how pre-trained vision and language components can be combined to generate descriptive captions for natural images.
|
|
|
--- |
|
|
|
## Trained On |
|
|
|
This model was trained on the [Flickr30k Dataset](https://huggingface.co/datasets/AnyModal/flickr30k): |
|
|
|
**From Image Descriptions to Visual Denotations: New Similarity Metrics for Semantic Inference Over Event Descriptions** |
|
*Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, Svetlana Lazebnik* |
|
|
|
The dataset contains 31,000 images collected from Flickr, each annotated with five descriptive sentences written by human annotators, covering a variety of real-world scenes and events. |
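
If you want to inspect the training data, it can be pulled directly from the Hugging Face Hub. The snippet below is a minimal sketch; the split and column names are assumptions, so check the dataset card for the exact schema:

```python
from datasets import load_dataset

# Sketch: load the Flickr30k copy hosted under AnyModal
# (split and column names are assumptions; see the dataset card for the exact schema)
dataset = load_dataset("AnyModal/flickr30k", split="train")

sample = dataset[0]
print(sample.keys())  # expect an image field plus its human-written captions
```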
|
|
|
--- |
|
|
|
## How to Use |
|
|
|
### Installation |
|
|
|
Install the necessary dependencies: |
|
|
|
```bash
pip install torch transformers torchvision huggingface_hub tqdm matplotlib Pillow
```
|
|
|
### Inference |
|
|
|
Below is an example of generating a caption for an image with this model. The `llm`, `vision`, and `anymodal` modules are part of the [AnyModal](https://github.com/ritabratamaiti/AnyModal) repository, so run the script from the project's Image Captioning directory (or otherwise place those files on your Python path):
|
|
|
```python
import os

import torch
from PIL import Image
from huggingface_hub import hf_hub_download

# Modules from the AnyModal repository
import llm
import vision
import anymodal

# Load language model and tokenizer
llm_tokenizer, llm_model = llm.get_llm(
    "meta-llama/Llama-3.2-1B",
    access_token="GET_YOUR_OWN_TOKEN_FROM_HUGGINGFACE",
    use_peft=False,
)
llm_hidden_size = llm.get_hidden_size(llm_tokenizer, llm_model)

# Load vision model components
image_processor, vision_model, vision_hidden_size = vision.get_image_encoder(
    "google/vit-base-patch16-224", use_peft=False
)

# Initialize vision encoder and projector (the vision tokenizer)
vision_encoder = vision.VisionEncoder(vision_model)
vision_tokenizer = vision.Projector(vision_hidden_size, llm_hidden_size, num_hidden=1)

# Initialize the multimodal model
multimodal_model = anymodal.MultiModalModel(
    input_processor=None,
    input_encoder=vision_encoder,
    input_tokenizer=vision_tokenizer,
    language_tokenizer=llm_tokenizer,
    language_model=llm_model,
    input_start_token="<|imstart|>",
    input_end_token="<|imend|>",
    prompt_text="The description of the given image is: ",
)

# Download pre-trained model weights
if not os.path.exists("image_captioning_model"):
    os.makedirs("image_captioning_model")

hf_hub_download(
    "AnyModal/Image-Captioning-Llama-3.2-1B",
    filename="input_tokenizer.pt",
    local_dir="image_captioning_model",
)
multimodal_model._load_model("image_captioning_model")

# Prepare an input image
image_path = "example_image.jpg"  # Path to your image
image = Image.open(image_path).convert("RGB")
processed_image = image_processor(image, return_tensors="pt")
processed_image = {key: val.squeeze(0) for key, val in processed_image.items()}  # Remove batch dimension

# Generate a caption
generated_caption = multimodal_model.generate(processed_image, max_new_tokens=120)
print("Generated Caption:", generated_caption)
```
|
|
|
--- |
|
|
|
## Project and Training Scripts |
|
|
|
This model is part of the [AnyModal Image Captioning Project](https://github.com/ritabratamaiti/AnyModal/tree/main/Image%20Captioning). |
|
|
|
- **Training Script**: [train.py](https://github.com/ritabratamaiti/AnyModal/blob/main/Image%20Captioning/train.py) |
|
- **Inference Script**: [inference.py](https://github.com/ritabratamaiti/AnyModal/blob/main/Image%20Captioning/inference.py) |
|
|
|
Refer to the project repository for further implementation details and customization. |
|
|
|
--- |
|
|
|
## Project Details |
|
|
|
- **Vision Encoder**: Pre-trained Vision Transformer (ViT) model for visual feature extraction. |
|
- **Projector Network**: Projects visual features into a token space compatible with Llama 3.2-1B using a dense network (a sketch of the idea follows this list).
|
- **Language Model**: Llama 3.2-1B, a pre-trained causal language model for text generation. |
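
Conceptually, the projector is a small MLP that maps ViT patch embeddings into the LLM's embedding space so they can be consumed as "image tokens". The sketch below illustrates that idea only; the class and layer choices are assumptions, not the exact AnyModal `vision.Projector` implementation:

```python
import torch.nn as nn


class ProjectorSketch(nn.Module):
    """Illustrative projector: maps ViT features to the LLM embedding size.

    A sketch of the general idea, not the exact AnyModal `vision.Projector`.
    """

    def __init__(self, vision_hidden_size: int, llm_hidden_size: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_hidden_size, llm_hidden_size),
            nn.GELU(),
            nn.Linear(llm_hidden_size, llm_hidden_size),
        )

    def forward(self, patch_features):
        # patch_features: (batch, num_patches, vision_hidden_size)
        # returns:        (batch, num_patches, llm_hidden_size), used as image tokens
        return self.net(patch_features)
```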
|
|
|
This implementation serves as a proof of concept, combining a ViT-based image encoder and a small language model. Future iterations could achieve improved performance by incorporating text-conditioned image encoders and larger-scale language models. |
|
|
|
|