Overview

This model generates a text description (caption) for an input image. It uses an encoder-decoder architecture, with a BEiT model as the encoder and a GPT model as the decoder.
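To make the encoder-decoder wiring concrete, the following is a minimal sketch that builds a tiny, randomly initialised BEiT-plus-GPT pair with the `transformers` library and runs one forward pass. It is not the released model: the layer sizes, the vocabulary size, and the choice of GPT-2 as the "GPT" decoder are all assumptions made for illustration.

```python
import torch
from transformers import (
    BeitConfig,
    GPT2Config,
    VisionEncoderDecoderConfig,
    VisionEncoderDecoderModel,
)

# Deliberately tiny, illustrative sizes (NOT the released model's configuration).
encoder_config = BeitConfig(
    image_size=224,
    patch_size=16,
    hidden_size=64,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=128,
)
# Assumption: GPT-2 stands in for the "GPT" decoder named above.
decoder_config = GPT2Config(
    vocab_size=1000, n_positions=64, n_embd=64, n_layer=2, n_head=2
)

# Combining the configs marks the decoder as a decoder and enables
# cross-attention over the encoder's image patch embeddings.
config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(
    encoder_config, decoder_config
)
toy_model = VisionEncoderDecoderModel(config=config)  # random weights
toy_model.eval()

pixel_values = torch.randn(1, 3, 224, 224)          # one fake RGB image
decoder_input_ids = torch.randint(0, 1000, (1, 5))  # five fake caption tokens
with torch.no_grad():
    outputs = toy_model(
        pixel_values=pixel_values, decoder_input_ids=decoder_input_ids
    )

# One score per vocabulary entry for each caption position.
logits_shape = tuple(outputs.logits.shape)
print(logits_shape)
```

The encoder turns the image into a sequence of patch embeddings, and the decoder attends to that sequence while predicting the caption token by token.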

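The model was trained on the Chinese Muge dataset for 5k steps. On the validation set it reaches a final loss of 0.3737, with ROUGE-1 of 20.419, ROUGE-2 of 7.3553, ROUGE-L of 17.3753, and ROUGE-Lsum of 17.376.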
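For readers unfamiliar with the metric, here is a minimal sketch of how a ROUGE-1 F1 score can be computed. It assumes character-level unigrams, which is one common choice for Chinese text; the tokenisation and ROUGE implementation actually used for the numbers above are not stated, and the two captions below are invented for illustration. The reported scores appear to be on a 0 to 100 scale, so the sketch multiplies by 100 when printing.

```python
from collections import Counter


def rouge1_f1(prediction: str, reference: str) -> float:
    """ROUGE-1 F1 over character unigrams, returned as a fraction in [0, 1]."""
    pred_tokens = [ch for ch in prediction if not ch.isspace()]
    ref_tokens = [ch for ch in reference if not ch.isspace()]
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each character counts at most as often as it
    # appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical captions: 4 shared characters, lengths 5 and 6.
score = rouge1_f1("红色高跟鞋", "黑色高跟鞋女")
print(round(100 * score, 2))
```

ROUGE-2 follows the same pattern with bigrams, and ROUGE-L replaces the unigram overlap with the length of the longest common subsequence.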

GitHub project page

How to use

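from io import BytesIO

import requests
from PIL import Image
from transformers import VisionEncoderDecoderModel, ViTFeatureExtractor, AutoTokenizer

pretrained = "Maciel/Muge-Image-Caption"
model = VisionEncoderDecoderModel.from_pretrained(pretrained)
feature_extractor = ViTFeatureExtractor.from_pretrained(pretrained)
tokenizer = AutoTokenizer.from_pretrained(pretrained)

# Image.open() cannot read a URL directly, so download the bytes first.
# The "/blob/" link is the Hub's HTML viewer page; "/resolve/" serves the raw file.
image_url = "https://huggingface.co/Maciel/Muge-Image-Caption/blob/main/%E9%AB%98%E8%B7%9F%E9%9E%8B.jpg"
response = requests.get(image_url.replace("/blob/", "/resolve/"), timeout=30)
response.raise_for_status()
image = Image.open(BytesIO(response.content))
if image.mode != "RGB":
    image = image.convert("RGB")
pixel_values = feature_extractor(images=image, return_tensors="pt").pixel_values

# Generation settings: illustrative values (an assumption), tune as needed.
gen_kwargs = {"max_length": 64, "num_beams": 4}
output_ids = model.generate(pixel_values, **gen_kwargs)
preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
preds = [pred.strip() for pred in preds]
print(preds)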