A Vision Encoder Decoder model (ViT encoder + GPT2 decoder) fine-tuned on the Flickr8k dataset for the image-to-text (image captioning) task.

Example:

from transformers import VisionEncoderDecoderModel, ViTImageProcessor, AutoTokenizer
import torch
from PIL import Image

# load models
feature_extractor = ViTImageProcessor.from_pretrained("atasoglu/vit-gpt2-flickr8k")
tokenizer = AutoTokenizer.from_pretrained("atasoglu/vit-gpt2-flickr8k")
model = VisionEncoderDecoderModel.from_pretrained("atasoglu/vit-gpt2-flickr8k")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# load image
img = Image.open("example.jpg").convert("RGB")  # ensure 3-channel RGB input (handles grayscale/RGBA files)

# encode (extracting features)
pixel_values = feature_extractor(images=[img], return_tensors="pt").pixel_values
pixel_values = pixel_values.to(device)

# generate caption
output_ids = model.generate(pixel_values)

# decode
preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)

print(preds)
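
To caption several images at once, the same encode, generate, and decode steps can be wrapped in a small helper. The following is only a sketch: `caption_batches` and its `batch_size` parameter are hypothetical names introduced here, not part of the model card or of `transformers`. It assumes the `feature_extractor`, `tokenizer`, `model`, and `device` objects from the example above, and forwards any extra keyword arguments (for instance `max_length` or `num_beams`) to `model.generate`.

```python
def caption_batches(images, feature_extractor, tokenizer, model, device,
                    batch_size=8, **generate_kwargs):
    """Caption a list of PIL images in fixed-size batches (hypothetical helper)."""
    captions = []
    for start in range(0, len(images), batch_size):
        batch = images[start:start + batch_size]
        # encode: same call as in the single-image example, on a list of images
        pixel_values = feature_extractor(images=batch, return_tensors="pt").pixel_values
        pixel_values = pixel_values.to(device)
        # generate: extra keyword arguments are passed through unchanged
        output_ids = model.generate(pixel_values, **generate_kwargs)
        # decode and strip surrounding whitespace from each caption
        decoded = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
        captions.extend(text.strip() for text in decoded)
    return captions


# Usage, assuming the objects loaded in the example above:
# images = [Image.open(p).convert("RGB") for p in ["a.jpg", "b.jpg"]]
# print(caption_batches(images, feature_extractor, tokenizer, model, device,
#                       batch_size=2, max_length=32, num_beams=4))
```

Batching keeps memory use bounded on large image lists; the values shown for `max_length` and `num_beams` are illustrative, not settings recommended by the model author.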

For more, see this awesome blog.

Model size: 264M params (Safetensors)
Tensor type: F32 · BOOL