ritabratamaiti committed
Commit: d9893c6
Parent(s): b3b84f0
Update README.md
README.md (CHANGED)
```diff
@@ -16,7 +16,7 @@ tags:
 ---
 # AnyModal/Image-Captioning-Llama-3.2-1B
 
-**AnyModal/Image-Captioning-Llama-3.2-1B**
+**AnyModal/Image-Captioning-Llama-3.2-1B** is an image captioning model built within the [AnyModal](https://github.com/ritabratamaiti/AnyModal) framework. It integrates a Vision Transformer (ViT) encoder with the Llama 3.2-1B language model and has been trained on the Flickr30k dataset. The model demonstrates the integration of pre-trained vision and language components for generating descriptive captions from natural images.
 
 ---
 
```
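The vision side of the pipeline described above is a ViT feature extractor feeding a projector and then the language model. As a minimal sketch of the feature-extraction step, assuming the Hugging Face `transformers` ViT classes and the generic `google/vit-base-patch16-224` checkpoint (the checkpoint actually used by the project is not stated in this hunk):

```python
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

# Assumption: a generic ViT-base checkpoint; the project may use a different one.
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224")

image = Image.open("example.jpg").convert("RGB")   # any natural image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    features = vit(**inputs).last_hidden_state     # (1, 197, 768) patch features

print(features.shape)  # these visual features are what a projector maps into the LLM's token space
```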
```diff
@@ -27,7 +27,7 @@ This model was trained on the [Flickr30k Dataset](https://huggingface.co/dataset
 **From Image Descriptions to Visual Denotations: New Similarity Metrics for Semantic Inference Over Event Descriptions**
 *Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, Svetlana Lazebnik*
 
-The dataset
+The dataset contains 31,000 images collected from Flickr, each annotated with five descriptive sentences written by human annotators, covering a variety of real-world scenes and events.
 
 ---
 
```
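For reference, a minimal sketch of browsing the Flickr30k captions with the Hugging Face `datasets` library; the dataset id `nlphuji/flickr30k` and the field names are assumptions, since the dataset link in this card is truncated here:

```python
from datasets import load_dataset

# Assumptions: dataset id and field names may differ from the repository
# actually linked in the model card.
ds = load_dataset("nlphuji/flickr30k")   # assumed dataset id
split = next(iter(ds.values()))          # take whichever split is present

example = split[0]
print(example["caption"])                # assumed field: list of five caption strings
print(example["image"].size)             # assumed field: a PIL image
```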
```diff
@@ -75,10 +75,13 @@ vision_tokenizer = vision.Projector(vision_hidden_size, llm_hidden_size, num_hid
 multimodal_model = anymodal.MultiModalModel(
     input_processor=None,
     input_encoder=vision_encoder,
-    input_tokenizer=
+    input_tokenizer=vision_tokenizer,
     language_tokenizer=llm_tokenizer,
     language_model=llm_model,
-
+    input_start_token="<|imstart|>",
+    input_end_token="<|imend|>",
+    prompt_text="The description of the given image is: ",
+)
 
 # Download pre-trained model weights
 if not os.path.exists("image_captioning_model"):
```
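The snippet guards the weight download with an `os.path.exists` check. One way to populate that directory, sketched here with `huggingface_hub` and this card's repo id (the exact file layout of the weight repository is an assumption):

```python
import os
from huggingface_hub import snapshot_download

# Fetch the released weights once; skip if a local copy already exists.
# Assumption: the repository contents are simply mirrored into
# "image_captioning_model", matching the os.path.exists() guard above.
if not os.path.exists("image_captioning_model"):
    snapshot_download(
        repo_id="AnyModal/Image-Captioning-Llama-3.2-1B",
        local_dir="image_captioning_model",
    )
```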
```diff
@@ -107,14 +110,15 @@ This model is part of the [AnyModal Image Captioning Project](https://github.com
 - **Training Script**: [train.py](https://github.com/ritabratamaiti/AnyModal/blob/main/Image%20Captioning/train.py)
 - **Inference Script**: [inference.py](https://github.com/ritabratamaiti/AnyModal/blob/main/Image%20Captioning/inference.py)
 
-
+Refer to the project repository for further implementation details and customization.
 
 ---
 
 ## Project Details
 
-- **Vision Encoder**:
-- **Projector Network**:
-- **Language Model**:
+- **Vision Encoder**: Pre-trained Vision Transformer (ViT) model for visual feature extraction.
+- **Projector Network**: Projects visual features into a token space compatible with Llama 3.2-1B.
+- **Language Model**: Llama 3.2-1B, a pre-trained causal language model for text generation.
+
+This implementation serves as a proof of concept, combining a ViT-based image encoder and a small language model. Future iterations could achieve improved performance by incorporating text-conditioned image encoders and larger-scale language models.
 
-While trained on the Flickr30k dataset, the model's design highlights the possibilities for integrating vision and language models for captioning tasks, showcasing the feasibility of this approach within the AnyModal framework.
```
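The projector named in the bullets above is conceptually a small learned mapping from the ViT feature space into the language model's embedding space. The sketch below is illustrative only, not AnyModal's actual `vision.Projector` implementation; the hidden sizes are examples (768 for ViT-base, 2048 for Llama 3.2-1B):

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Illustrative sketch: maps ViT patch features of shape
    (batch, num_patches, vision_hidden_size) to pseudo-token embeddings of shape
    (batch, num_patches, llm_hidden_size). Generic MLP, not AnyModal's exact code."""

    def __init__(self, vision_hidden_size: int, llm_hidden_size: int, num_hidden: int = 1):
        super().__init__()
        layers, in_dim = [], vision_hidden_size
        for _ in range(num_hidden):
            layers += [nn.Linear(in_dim, llm_hidden_size), nn.GELU()]
            in_dim = llm_hidden_size
        layers.append(nn.Linear(in_dim, llm_hidden_size))
        self.mlp = nn.Sequential(*layers)

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        return self.mlp(vision_features)

# Example shapes: ViT-base features (768-dim) projected to Llama 3.2-1B's 2048-dim embedding space.
features = torch.randn(1, 196, 768)
tokens = Projector(vision_hidden_size=768, llm_hidden_size=2048)(features)
print(tokens.shape)  # torch.Size([1, 196, 2048])
```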