ashok2216 committed · verified
Commit a59c44b · 1 Parent(s): 4361822

Update README.md

Files changed (1): README.md (+12 -8)

README.md CHANGED
tags:
- image-captioning
---

# vit-gpt2-image-captioning_COCO_FineTuned

This repository contains the fine-tuned ViT-GPT2 model for image captioning, trained on the COCO dataset. The model combines a Vision Transformer (ViT) for image feature extraction and GPT-2 for text generation to create descriptive captions from images.
# Model Overview

- Model Type: Vision Transformer (ViT) + GPT-2
- Dataset: COCO (Common Objects in Context)
- Task: Image Captioning

This model generates captions for input images based on the objects and contexts identified within them. It has been fine-tuned on the COCO dataset, which includes a wide variety of images with detailed annotations, making it suitable for diverse image captioning tasks.

# Model Details

The model architecture consists of two main components:

- Vision Transformer (ViT): a powerful image encoder that extracts feature maps from input images.
- GPT-2: a text decoder that turns the extracted image features into grammatically correct and contextually accurate captions.
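As a quick check of this pairing, the published checkpoint can be loaded and its two halves inspected. The snippet below is a minimal sketch, not part of the original card; the values noted in the comments are what the description above and the fine-tuning details below imply.

```python
from transformers import VisionEncoderDecoderModel

# Load the published checkpoint and inspect its encoder and decoder.
model = VisionEncoderDecoderModel.from_pretrained(
    "ashok2216/vit-gpt2-image-captioning_COCO_FineTuned"
)
print(model.encoder.config.model_type)  # expected: "vit"  (image encoder)
print(model.decoder.config.model_type)  # expected: "gpt2" (caption decoder)
print(model.config.encoder.image_size)  # expected: 224 (see Fine-Tuning Details)
```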
# Usage

You can use this model for image captioning tasks with the Hugging Face transformers library. Below is sample code to load the model and generate captions for input images.
# Installation

To use this model, you need to install the following libraries:
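A typical setup (assumed here; the exact package versions are not pinned in the card) covers transformers, PyTorch, and Pillow, which the usage example below relies on:

```bash
# Assumed dependencies for the usage example (not an exhaustive or pinned list).
pip install transformers torch pillow
```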
```python
import torch
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, GPT2Tokenizer

# Load the fine-tuned model, its image processor, and the GPT-2 tokenizer
model = VisionEncoderDecoderModel.from_pretrained("ashok2216/vit-gpt2-image-captioning_COCO_FineTuned")
processor = ViTImageProcessor.from_pretrained("ashok2216/vit-gpt2-image-captioning_COCO_FineTuned")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Preprocess the image
image = Image.open("path_to_image.jpg")
inputs = processor(images=image, return_tensors="pt")

# Generate caption
pixel_values = inputs.pixel_values
output = model.generate(pixel_values)
caption = tokenizer.decode(output[0], skip_special_tokens=True)
```
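`generate` uses greedy decoding by default. As an optional refinement (an assumption, not part of the original card), standard `generate` arguments such as beam search and a length cap often produce more fluent captions:

```python
# Optional decoding settings (assumed): beam search with a modest length cap.
output = model.generate(pixel_values, max_length=16, num_beams=4)
caption = tokenizer.decode(output[0], skip_special_tokens=True)
print(caption)
```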
Output: A text string representing the generated caption for the image.

# Example

For an input image, the model might generate a caption like:

Input Image:

Generated Caption:
"A group of people walking down the street with umbrellas in their hands."
# Fine-Tuning Details

- Dataset: COCO (Common Objects in Context)
- Image Size: 224x224 pixels
- Training Time: ~12 hours on a GPU (depending on batch size and hardware)
- Fine-Tuning Strategy: the ViT-GPT2 model was fine-tuned for 5 epochs on the COCO dataset.
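The training script itself is not part of this card. The sketch below shows how such a fine-tuning step can be set up with `VisionEncoderDecoderModel`; the base checkpoints, learning rate, and the dummy image/caption pair are assumptions, not details taken from the card.

```python
import torch
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTImageProcessor, GPT2Tokenizer

# Pair a ViT encoder with a GPT-2 decoder (base checkpoint names are assumptions;
# the card does not state which starting checkpoints were used).
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "gpt2"
)
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token            # GPT-2 has no pad token by default
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# One dummy (image, caption) pair standing in for a COCO example.
image = Image.new("RGB", (224, 224))                 # 224x224 inputs, as stated above
caption = "A group of people walking down the street."
pixel_values = processor(images=image, return_tensors="pt").pixel_values
labels = tokenizer(caption, return_tensors="pt").input_ids

# A single optimization step; in practice this runs over the COCO training set
# for 5 epochs, as described in the fine-tuning details above.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)   # assumed learning rate
model.train()
loss = model(pixel_values=pixel_values, labels=labels).loss  # cross-entropy over caption tokens
loss.backward()
optimizer.step()
```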
# Model Performance

This model performs well on various image captioning benchmarks. However, its performance depends heavily on the diversity and quality of the input images, so further fine-tuning or retraining is recommended for more specific domains.

# Limitations

- The model may struggle to generate accurate captions for highly ambiguous or abstract images.
- It is trained primarily on the COCO dataset, so it performs best on images whose contexts resemble the training data.

# License

This model is licensed under the MIT License.
# Acknowledgments

- COCO Dataset: The model was trained on the COCO dataset, which is widely used for image captioning tasks.
- Hugging Face: For providing the platform to share models and facilitating easy use of transformer-based models.

# Contact