---
datasets:
- AnyModal/flickr30k
base_model:
- meta-llama/Llama-3.2-1B
- google/vit-base-patch16-224
language:
- en
pipeline_tag: image-to-text
library_name: AnyModal
tags:
- vlm
- vision
- multimodal
---

# AnyModal/Image-Captioning-Llama-3.2-1B

**AnyModal/Image-Captioning-Llama-3.2-1B** generates descriptive captions for natural images by coupling a Vision Transformer (ViT) encoder with the Llama 3.2-1B language model through a lightweight projection network. Built within the [AnyModal](https://github.com/ritabratamaiti/AnyModal) framework and fine-tuned on the Flickr30k dataset, it demonstrates a practical way to bridge visual and textual modalities.

---

## Trained On

This model was trained on the [Flickr30k Dataset](https://huggingface.co/datasets/AnyModal/flickr30k):

**From Image Descriptions to Visual Denotations: New Similarity Metrics for Semantic Inference Over Event Descriptions**
*Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, Svetlana Lazebnik*

The dataset comprises 31,000 images collected from Flickr, each annotated with five descriptive sentences written by human annotators. The captions describe real-world scenes and actions from multiple perspectives, providing a solid basis for image-captioning experiments.
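
To inspect the training data, the dataset can be loaded straight from the Hub with the `datasets` library. This is only a quick sketch; check the dataset card for the actual split and column names before relying on a specific schema:

```python
from datasets import load_dataset

# Download the Flickr30k captioning dataset used to fine-tune this model
ds = load_dataset("AnyModal/flickr30k")

# Print the available splits and their columns before assuming a schema
print(ds)
```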

---

## How to Use

### Installation

Install the necessary dependencies:

```bash
pip install torch transformers torchvision huggingface_hub tqdm matplotlib Pillow
```
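
Note that `meta-llama/Llama-3.2-1B` is a gated model, so downloading it requires a Hugging Face access token with approved access. One option (purely a workflow suggestion) is to authenticate once with `huggingface_hub`; alternatively, pass the token directly to `llm.get_llm` as in the inference example below:

```python
from huggingface_hub import login

# Authenticate so gated checkpoints such as meta-llama/Llama-3.2-1B can be downloaded
login(token="GET_YOUR_OWN_TOKEN_FROM_HUGGINGFACE")
```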

### Inference

Below is an example of generating a caption for an image with this model:

```python
import os

from PIL import Image
from huggingface_hub import hf_hub_download

# AnyModal modules from https://github.com/ritabratamaiti/AnyModal
import anymodal
import llm
import vision

# Load the language model and its tokenizer
llm_tokenizer, llm_model = llm.get_llm(
    "meta-llama/Llama-3.2-1B",
    access_token="GET_YOUR_OWN_TOKEN_FROM_HUGGINGFACE",
    use_peft=False,
)
llm_hidden_size = llm.get_hidden_size(llm_tokenizer, llm_model)

# Load the vision encoder components
image_processor, vision_model, vision_hidden_size = vision.get_image_encoder(
    "google/vit-base-patch16-224", use_peft=False
)

# Initialize the vision encoder and the projector that maps ViT features
# into the language model's embedding space
vision_encoder = vision.VisionEncoder(vision_model)
vision_tokenizer = vision.Projector(vision_hidden_size, llm_hidden_size, num_hidden=1)

# Assemble the multimodal model
multimodal_model = anymodal.MultiModalModel(
    input_processor=None,
    input_encoder=vision_encoder,
    input_tokenizer=vision_tokenizer,
    language_tokenizer=llm_tokenizer,
    language_model=llm_model,
    prompt_text="The description of the given image is: ",
)

# Download the pre-trained projector weights and load them into the model
os.makedirs("image_captioning_model", exist_ok=True)
hf_hub_download(
    "AnyModal/Image-Captioning-Llama-3.2-1B",
    filename="input_tokenizer.pt",
    local_dir="image_captioning_model",
)
multimodal_model._load_model("image_captioning_model")

# Preprocess the input image
image_path = "example_image.jpg"  # Path to your image
image = Image.open(image_path).convert("RGB")
processed_image = image_processor(image, return_tensors="pt")
processed_image = {key: val.squeeze(0) for key, val in processed_image.items()}  # Remove the batch dimension

# Generate and print a caption
generated_caption = multimodal_model.generate(processed_image, max_new_tokens=120)
print("Generated Caption:", generated_caption)
```
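
The loaded model can be reused for more than one image. As a small follow-up sketch (not part of the original scripts), the loop below captions every JPEG in a local `images/` directory and assumes `multimodal_model`, `image_processor`, and `Image` from the snippet above are still in scope:

```python
import glob

# Caption each JPEG in the folder, reusing the already-loaded model
for path in sorted(glob.glob("images/*.jpg")):
    img = Image.open(path).convert("RGB")
    inputs = image_processor(img, return_tensors="pt")
    inputs = {key: val.squeeze(0) for key, val in inputs.items()}
    caption = multimodal_model.generate(inputs, max_new_tokens=120)
    print(f"{path}: {caption}")
```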

---

## Project and Training Scripts

This model is part of the [AnyModal Image Captioning Project](https://github.com/ritabratamaiti/AnyModal/tree/main/Image%20Captioning).

- **Training Script**: [train.py](https://github.com/ritabratamaiti/AnyModal/blob/main/Image%20Captioning/train.py)
- **Inference Script**: [inference.py](https://github.com/ritabratamaiti/AnyModal/blob/main/Image%20Captioning/inference.py)

See the project repository for additional details and customization options.

---

## Project Details

- **Vision Encoder**: A pre-trained Vision Transformer (`google/vit-base-patch16-224`) extracts visual features, providing a strong baseline for processing visual information.
- **Projector Network**: Maps the ViT features into the token-embedding space of Llama 3.2-1B so the language model can condition on them (a minimal sketch of this component appears at the end of this card).
- **Language Model**: Llama 3.2-1B, a pre-trained causal language model, generates coherent, context-sensitive captions from the projected visual tokens.

Although trained only on the Flickr30k dataset, the model shows that a pre-trained vision encoder, a small projector, and a compact language model can be combined into a working captioning pipeline within the AnyModal framework.
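
For a concrete picture of what the projector does, here is a minimal, illustrative sketch. It is not AnyModal's `vision.Projector` implementation (see the repository for that); it only assumes the usual hidden sizes of 768 for ViT-Base and 2048 for Llama 3.2-1B:

```python
import torch
import torch.nn as nn


class ProjectorSketch(nn.Module):
    """Illustrative stand-in for a vision-to-LLM projector (not AnyModal's actual code)."""

    def __init__(self, vision_hidden_size: int, llm_hidden_size: int, num_hidden: int = 1):
        super().__init__()
        layers = [nn.Linear(vision_hidden_size, llm_hidden_size)]
        for _ in range(num_hidden):
            layers += [nn.GELU(), nn.Linear(llm_hidden_size, llm_hidden_size)]
        self.net = nn.Sequential(*layers)

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, num_patches, vision_hidden_size)
        # returns:         (batch, num_patches, llm_hidden_size)
        return self.net(vision_features)


# Project ViT-Base features (197 patch tokens of size 768 for a 224x224 image)
# into Llama 3.2-1B's 2048-dimensional embedding space.
projector = ProjectorSketch(vision_hidden_size=768, llm_hidden_size=2048)
visual_tokens = projector(torch.randn(1, 197, 768))
print(visual_tokens.shape)  # torch.Size([1, 197, 2048])
```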