ritabratamaiti committed
Commit: d9893c6
Parent: b3b84f0

Update README.md

Files changed (1): README.md (+13, -9)
README.md CHANGED
@@ -16,7 +16,7 @@ tags:
 ---
 # AnyModal/Image-Captioning-Llama-3.2-1B
 
-**AnyModal/Image-Captioning-Llama-3.2-1B** explores the potential of combining visual feature extraction and language modeling techniques to generate descriptive captions for natural images. Built within the [AnyModal](https://github.com/ritabratamaiti/AnyModal) framework, this model integrates a Vision Transformer (ViT) encoder with the Llama 3.2-1B language model, fine-tuned on the Flickr30k dataset. The model demonstrates a promising approach to bridging visual and textual modalities.
+**AnyModal/Image-Captioning-Llama-3.2-1B** is an image captioning model built within the [AnyModal](https://github.com/ritabratamaiti/AnyModal) framework. It integrates a Vision Transformer (ViT) encoder with the Llama 3.2-1B language model and has been trained on the Flickr30k dataset. The model demonstrates the integration of pre-trained vision and language components for generating descriptive captions from natural images.
 
 ---
 
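As a rough, non-normative sketch of the pipeline that description implies (ViT features are projected into the language model's embedding space and prepended to a text prompt), the snippet below uses generic Hugging Face components. The checkpoint names, the bare `nn.Linear` projection, and the `inputs_embeds` generation path are assumptions for illustration, not the AnyModal implementation itself.

```python
# Illustrative sketch only -- not the AnyModal code path. Checkpoint names are
# placeholders; meta-llama/Llama-3.2-1B is a gated repository.
import torch
from torch import nn
from transformers import AutoModelForCausalLM, AutoTokenizer, ViTImageProcessor, ViTModel
from PIL import Image

vit = ViTModel.from_pretrained("google/vit-base-patch16-224")            # assumed encoder
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
llm_name = "meta-llama/Llama-3.2-1B"
tokenizer = AutoTokenizer.from_pretrained(llm_name)
llm = AutoModelForCausalLM.from_pretrained(llm_name)

# Project ViT hidden states (768-d for ViT-Base) into the LLM embedding space.
projector = nn.Linear(vit.config.hidden_size, llm.config.hidden_size)

image = Image.open("example.jpg").convert("RGB")                         # any local image
pixel_values = processor(images=image, return_tensors="pt").pixel_values
with torch.no_grad():
    visual_tokens = projector(vit(pixel_values).last_hidden_state)       # (1, 197, d_llm)

prompt_ids = tokenizer("The description of the given image is: ", return_tensors="pt").input_ids
prompt_embeds = llm.get_input_embeddings()(prompt_ids)
inputs_embeds = torch.cat([visual_tokens, prompt_embeds], dim=1)

# With an untrained projection this produces noise; the released weights supply
# the trained mapping between the two spaces.
out = llm.generate(inputs_embeds=inputs_embeds, max_new_tokens=30)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```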
 
@@ -27,7 +27,7 @@ This model was trained on the [Flickr30k Dataset](https://huggingface.co/dataset
 **From Image Descriptions to Visual Denotations: New Similarity Metrics for Semantic Inference Over Event Descriptions**
 *Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, Svetlana Lazebnik*
 
-The dataset comprises 31,000 images collected from Flickr, each annotated with five descriptive sentences written by human annotators. These annotations offer diverse perspectives on real-world scenes and actions, forming a robust basis for image captioning experiments.
+The dataset contains 31,000 images collected from Flickr, each annotated with five descriptive sentences written by human annotators, covering a variety of real-world scenes and events.
 
 ---
 
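To get a feel for the data described above, the sketch below loads a public Flickr30k mirror with the `datasets` library. The dataset ID `nlphuji/flickr30k` is an assumption (the copy linked from this card may differ), and depending on the `datasets` version the call may need extra arguments such as `trust_remote_code=True`.

```python
# Sketch: peek at Flickr30k-style image/caption pairs. "nlphuji/flickr30k" is
# an assumed public mirror, not necessarily the copy used for training.
from datasets import load_dataset

ds = load_dataset("nlphuji/flickr30k")
print(ds)                          # available splits and column names
first_split = next(iter(ds.values()))
print(first_split[0].keys())       # an image plus its reference captions
```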
 
@@ -75,10 +75,13 @@ vision_tokenizer = vision.Projector(vision_hidden_size, llm_hidden_size, num_hidden=1)
 multimodal_model = anymodal.MultiModalModel(
     input_processor=None,
     input_encoder=vision_encoder,
-    input_tokenizer=vision.Projector(vision_hidden_size, llm_hidden_size, num_hidden=1),
+    input_tokenizer=vision_tokenizer,
     language_tokenizer=llm_tokenizer,
     language_model=llm_model,
-    prompt_text="The description of the given image is: ")
+    input_start_token="<|imstart|>",
+    input_end_token="<|imend|>",
+    prompt_text="The description of the given image is: ",
+)
 
 # Download pre-trained model weights
 if not os.path.exists("image_captioning_model"):
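The snippet above stops at the check for a local `image_captioning_model` directory. One way to populate it (the repo ID is assumed from this card's title) is with `huggingface_hub`; the actual loading and generation calls live in the project's inference.py and are not reproduced here.

```python
# Fetch the released weights referenced by the snippet above. The repo ID is
# assumed from the model card title; adjust if the hosting location differs.
import os
from huggingface_hub import snapshot_download

if not os.path.exists("image_captioning_model"):
    snapshot_download(
        repo_id="AnyModal/Image-Captioning-Llama-3.2-1B",
        local_dir="image_captioning_model",
    )

# Loading these weights into `multimodal_model` and generating a caption are
# handled by the project's inference.py; the exact method names are not shown
# in this diff, so they are not guessed at here.
print(sorted(os.listdir("image_captioning_model")))
```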
@@ -107,14 +110,15 @@ This model is part of the [AnyModal Image Captioning Project](https://github.com
 - **Training Script**: [train.py](https://github.com/ritabratamaiti/AnyModal/blob/main/Image%20Captioning/train.py)
 - **Inference Script**: [inference.py](https://github.com/ritabratamaiti/AnyModal/blob/main/Image%20Captioning/inference.py)
 
-Explore the full project repository for additional details and potential customization.
+Refer to the project repository for further implementation details and customization.
 
 ---
 
 ## Project Details
 
-- **Vision Encoder**: Uses a pre-trained Vision Transformer (ViT) model for feature extraction, offering a strong baseline for processing visual information.
-- **Projector Network**: Maps visual features into a token space that aligns with the Llama 3.2-1B language model.
-- **Language Model**: Utilizes Llama 3.2-1B, a pre-trained causal language model, to construct coherent and context-sensitive captions.
+- **Vision Encoder**: Pre-trained Vision Transformer (ViT) model for visual feature extraction.
+- **Projector Network**: Projects visual features into a token space compatible with Llama 3.2-1B.
+- **Language Model**: Llama 3.2-1B, a pre-trained causal language model for text generation.
+
+This implementation serves as a proof of concept, combining a ViT-based image encoder and a small language model. Future iterations could achieve improved performance by incorporating text-conditioned image encoders and larger-scale language models.
 
-While trained on the Flickr30k dataset, the model's design highlights the possibilities for integrating vision and language models for captioning tasks, showcasing the feasibility of this approach within the AnyModal framework.
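The **Projector Network** bullet corresponds to the `vision.Projector(vision_hidden_size, llm_hidden_size, num_hidden=1)` call shown earlier; its real definition lives in the project's vision module. Purely as a hedged illustration, a projector with one hidden layer could look roughly like the sketch below (the class name, GELU activation, and the 768/2048 hidden sizes are assumptions).

```python
# Hedged sketch of a projector with one hidden layer. This is NOT the
# repository's vision.Projector, only an illustration of the
# ViT-hidden-size -> LLM-hidden-size mapping it performs.
import torch
from torch import nn

class ToyProjector(nn.Module):
    def __init__(self, vision_hidden_size: int, llm_hidden_size: int, num_hidden: int = 1):
        super().__init__()
        layers, in_dim = [], vision_hidden_size
        for _ in range(num_hidden):           # num_hidden=1 in the model card snippet
            layers += [nn.Linear(in_dim, llm_hidden_size), nn.GELU()]
            in_dim = llm_hidden_size
        layers.append(nn.Linear(in_dim, llm_hidden_size))
        self.net = nn.Sequential(*layers)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, num_patches, vision_hidden_size) from the ViT encoder
        return self.net(features)

# 768 (ViT-Base) and 2048 (Llama 3.2-1B) are assumed hidden sizes for illustration.
proj = ToyProjector(768, 2048)
print(proj(torch.randn(1, 197, 768)).shape)   # torch.Size([1, 197, 2048])
```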
 