---
license: cc-by-nc-4.0
language:
- ja
pipeline_tag: image-to-text
tags:
- vision
- image-captioning
- VQA
---

# Chat-Vector-LLaVA-v1.5-7b-JA Model Card

## Model details

**Model type:**

Chat-Vector-LLaVA-v1.5-7b-JA is a vision-language model that can converse about input images in Japanese.<br>
It was created by adding and subtracting the weights of [llava-v1.5-7b](https://huggingface.co/liuhaotian/llava-v1.5-7b), [Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf), and [ELYZA-japanese-Llama-2-7b](https://huggingface.co/elyza/ELYZA-japanese-Llama-2-7b) with the Chat Vector method, as follows:

```
ELYZA-japanese-Llama-2-7b + (llava-v1.5-7b - Llama-2-7b-hf)
```
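
The weight arithmetic above can be sketched as a loop over parameter names. This is a minimal illustration, not the script used to build this model: the function name `apply_chat_vector`, the toy parameter names, and the rule of copying parameters that exist only in the fine-tuned model (for example a vision projector) are assumptions. Plain floats stand in for weight tensors so the snippet runs without downloading any checkpoint; the same `+` and `-` arithmetic applies to `torch` tensors taken from a `state_dict()`.

```python
def apply_chat_vector(target, base, tuned):
    """Return target + (tuned - base) for every parameter shared by all three.

    target: weights of the model to adapt (here: the Japanese LLM)
    base:   weights of the common ancestor (here: the original LLM)
    tuned:  weights of the fine-tuned model (here: the vision-language model)
    """
    merged = {}
    for name, tuned_weight in tuned.items():
        if name in base and name in target:
            # Chat vector = what fine-tuning changed relative to the base model.
            chat_vector = tuned_weight - base[name]
            merged[name] = target[name] + chat_vector
        else:
            # Assumption: parameters that only the fine-tuned model has
            # (e.g. a vision projector) are copied unchanged.
            merged[name] = tuned_weight
    return merged


# Toy values standing in for real weight tensors (hypothetical names).
base = {"layer.weight": 1.0}
tuned = {"layer.weight": 1.5, "projector.weight": 9.0}
target = {"layer.weight": 10.0}

merged = apply_chat_vector(target, base, tuned)
print(merged)  # {'layer.weight': 10.5, 'projector.weight': 9.0}
```

With real checkpoints, the three dictionaries would hold tensors of identical shape for every shared parameter name; the sketch does not check shapes.
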
**Comparing VLMs**

|Model|JA-VG-VQA-500<br>(ROUGE-L)|JA-VLM-Bench-In-the-Wild<br>(ROUGE-L)|Heron-Bench (Detail)|Heron-Bench (Conv)|Heron-Bench (Complex)|Heron-Bench (Average)|
|-|-|-|-|-|-|-|
|[Japanese Stable VLM](https://huggingface.co/stabilityai/japanese-stable-vlm)|-|40.50|25.15|51.23|37.84|38.07|
|[EvoVLM-JP-v1-7B](https://huggingface.co/SakanaAI/EvoVLM-JP-v1-7B)|**19.70**|**51.25**|50.31|44.42|40.47|45.07|
|[Heron BLIP Japanese StableLM Base 7B llava-620k](https://huggingface.co/turing-motors/heron-chat-blip-ja-stablelm-base-7b-v1-llava-620k)|14.51|33.26|49.09|41.51|45.72|45.44|
|[Heron GIT Japanese StableLM Base 7B](https://huggingface.co/turing-motors/heron-chat-git-ja-stablelm-base-7b-v1)|15.18|37.82|42.77|**54.20**|43.53|46.83|
|[llava-jp-1.3b-v1.0-620k](https://huggingface.co/toshi456/llava-jp-1.3b-v1.0-620k)|12.69|44.58|51.21|41.05|45.95|44.84|
|[llava-jp-1.3b-v1.1](https://huggingface.co/toshi456/llava-jp-1.3b-v1.1)|13.33|44.40|50.00|51.83|**48.98**|**50.39**|
|[chat-vector-llava-v1.5-7b-ja](https://huggingface.co/toshi456/chat-vector-llava-v1.5-7b-ja)|18.64|42.23|**53.61**|44.36|44.48|46.10|

![image/png](https://cdn-uploads.huggingface.co/production/uploads/630af71ffaaea618ebc973db/jSW9RYPccrxaqrxntwtUb.png)

## How to use the model

> [!WARNING]
> The demo code worked with transformers 4.34.1 but did not work properly with 4.37.2. Versions in between, and the latest version, have not been tested.
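
Before running the demo, it may help to check which transformers version is installed. The snippet below is a small sketch that classifies a version string against the two versions named in the warning; the helper name `describe_transformers_version` and the three labels are hypothetical, and anything other than the two reported versions is simply treated as untested.

```python
from importlib.metadata import PackageNotFoundError, version

TESTED_VERSION = "4.34.1"        # reported as working
KNOWN_BROKEN_VERSION = "4.37.2"  # reported as not working properly


def describe_transformers_version(installed):
    """Classify a transformers version string against the reported versions."""
    if installed == TESTED_VERSION:
        return "tested"
    if installed == KNOWN_BROKEN_VERSION:
        return "known-broken"
    return "untested"


try:
    installed_version = version("transformers")
except PackageNotFoundError:
    installed_version = None

if installed_version is None:
    print("transformers is not installed")
else:
    status = describe_transformers_version(installed_version)
    print(f"transformers {installed_version}: {status}")
```
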

**1. Download dependencies**

```
git clone https://github.com/tosiyuki/vlm-chat-vector-ja.git
```

**2. Inference**

```python
import requests
import torch
import transformers
from PIL import Image

from transformers.generation.streamers import TextStreamer
# The `llava` package is assumed to come from the repository cloned in step 1.
from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates, SeparatorStyle
from llava.model.language_model.llava_llama import LlavaLlamaForCausalLM
from llava.mm_utils import tokenizer_image_token, process_images


if __name__ == "__main__":
    model_path = 'toshi456/chat-vector-llava-v1.5-7b-ja'
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # float16 on GPU (as in the original calls); float32 on CPU, where
    # half-precision ops may be unavailable.
    torch_dtype = torch.float16 if device == "cuda" else torch.float32

    # load the model, the tokenizer, and the vision tower
    model = LlavaLlamaForCausalLM.from_pretrained(
        model_path,
        device_map=device,
        low_cpu_mem_usage=True,
        use_safetensors=True,
        torch_dtype=torch_dtype,
    ).eval()
    tokenizer = transformers.AutoTokenizer.from_pretrained(
        model_path,
        model_max_length=1024,
        padding_side="right",
        use_fast=False,
    )
    model.get_model().vision_tower.load_model()
    model = model.to(device)

    # generation stops on either token (the sample output ends with <s>)
    eos_token_id_list = [
        tokenizer.eos_token_id,
        tokenizer.bos_token_id,
    ]

    # image pre-process
    image_url = "https://huggingface.co/rinna/bilingual-gpt-neox-4b-minigpt4/resolve/main/sample.jpg"
    image = Image.open(requests.get(image_url, stream=True).raw).convert('RGB')

    if not isinstance(image, list):
        image = [image]

    image_tensor = process_images(image, model.get_model().vision_tower.image_processor, model.config)
    if isinstance(image_tensor, list):
        image_tensor = [img.to(model.device, dtype=torch_dtype) for img in image_tensor]
    else:
        image_tensor = image_tensor.to(model.device, dtype=torch_dtype)

    # create prompt
    # User turn: <image>\n{prompt}
    conv_mode = "llava_llama_2"
    conv = conv_templates[conv_mode].copy()
    prompt = "猫の隣には何がありますか?"  # "What is next to the cat?"
    inp = DEFAULT_IMAGE_TOKEN + '\n' + prompt
    conv.append_message(conv.roles[0], inp)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()

    input_ids = tokenizer_image_token(
        prompt,
        tokenizer,
        IMAGE_TOKEN_INDEX,
        return_tensors='pt'
    ).unsqueeze(0)
    if device == "cuda":
        input_ids = input_ids.to(device)

    # stop string for this conversation template (not passed to generate() in this demo)
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    keywords = [stop_str]
    # print tokens to stdout as they are generated
    streamer = TextStreamer(tokenizer, skip_prompt=True, timeout=20.0)

    # parameter
    temperature = 0.0
    top_p = 1.0
    max_new_tokens = 256

    # predict (greedy decoding, because temperature is 0)
    with torch.inference_mode():
        model.generate(
            inputs=input_ids,
            images=image_tensor,
            do_sample=temperature > 0,
            temperature=temperature,
            top_p=top_p,
            max_new_tokens=max_new_tokens,
            streamer=streamer,
            use_cache=True,
            eos_token_id=eos_token_id_list,
        )

    # Example output:
    # 猫の隣には、コンピューター(パソコン)があります。<s>
    # ("Next to the cat, there is a computer (PC).")
```
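
The sample output above ends with `<s>`, because the BOS token is one of the stop tokens. If the generated text is captured as a string rather than streamed, a small helper can trim such trailing markers. This is a sketch under assumptions: the helper name `strip_special_tokens` is hypothetical, and the token list covers only `<s>` and `</s>`, the Llama 2 BOS and EOS strings.

```python
def strip_special_tokens(text, special_tokens=("<s>", "</s>")):
    """Remove special tokens (and surrounding whitespace) from the end of text."""
    cleaned = text.rstrip()
    stripped = True
    while stripped:
        stripped = False
        for token in special_tokens:
            if cleaned.endswith(token):
                # Drop the token and any whitespace that preceded it.
                cleaned = cleaned[: -len(token)].rstrip()
                stripped = True
    return cleaned


raw_output = "猫の隣には、コンピューター(パソコン)があります。<s>"
print(strip_special_tokens(raw_output))
```
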

## Acknowledgement

- [LLaVA](https://llava-vl.github.io/)
- [Chat Vector](https://arxiv.org/abs/2310.04799)

## License

cc-by-nc-4.0