---
inference: false
language:
- th
- en
library_name: transformers
tags:
- instruct
- chat
license: llama3
---
# **Typhoon-Vision Research Preview**
**llama-3-typhoon-v1.5-8b-vision-preview** is a 🇹🇭 Thai *vision-language* model. It natively supports both text and image inputs, while its output is text. This version (August 2024) is our first vision-language model as part of our multimodal effort, and it is a research *preview* release. The base language model is our [llama-3-typhoon-v1.5-8b-instruct](https://huggingface.co/scb10x/llama-3-typhoon-v1.5-8b-instruct).
More details can be found in our [release blog](). *To acknowledge Meta's effort in creating the foundation model and to comply with the license, we explicitly include "llama-3" in the model name.*
# **Model Description**
Here we provide **Llama3 Typhoon Instruct Vision Preview**, which is built upon [Llama-3-Typhoon-1.5-8B-instruct](https://huggingface.co/scb10x/llama-3-typhoon-v1.5-8b-instruct) and [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384).
Our architecture is based on [Bunny by BAAI](https://github.com/BAAI-DCAI/Bunny).
- **Model type**: An 8B instruct decoder-only model with a vision encoder, based on the Llama architecture.
- **Requirement**: transformers 4.38.0 or newer.
- **Primary Language(s)**: Thai 🇹🇭 and English 🇬🇧
- **License**: [Llama 3 Community License](https://llama.meta.com/llama3/license/)
# **Quickstart**
Below is a code snippet showing how to use the model with the transformers library.
Before running it, install the following dependencies:
```shell
pip install torch transformers accelerate pillow
```
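As noted in the model description, loading this model requires transformers 4.38.0 or newer. If you want to verify your environment first, here is a minimal sketch of a version check (it assumes the `packaging` package is available, which normally ships alongside pip and transformers):

```python
import transformers
from packaging import version

# Loading with trust_remote_code targets transformers >= 4.38.0
if version.parse(transformers.__version__) < version.parse("4.38.0"):
    raise RuntimeError(
        f"transformers {transformers.__version__} is too old; "
        "upgrade with: pip install -U 'transformers>=4.38.0'"
    )
```

With the environment ready, the full example is: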
```python
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import warnings
import io
import requests

# Disable some warnings
transformers.logging.set_verbosity_error()
transformers.logging.disable_progress_bar()
warnings.filterwarnings('ignore')

# Set device
device = 'cuda'  # or 'cpu'
torch.set_default_device(device)

# Create model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    'scb10x/llama-3-typhoon-v1.5-8b-instruct-vision-preview',
    torch_dtype=torch.float16,  # use torch.float32 for CPU
    device_map='auto',
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    'scb10x/llama-3-typhoon-v1.5-8b-instruct-vision-preview',
    trust_remote_code=True,
)

def prepare_inputs(text, has_image=False, device='cuda'):
    messages = [
        {"role": "system", "content": "You are a helpful vision-capable assistant who eagerly converses with the user in their language."},
    ]
    if has_image:
        messages.append({"role": "user", "content": "<|image|>\n" + text})
    else:
        messages.append({"role": "user", "content": text})
    inputs_formatted = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=False
    )
    if has_image:
        # Split around the <|image|> placeholder and splice in the image token index (-200)
        text_chunks = [tokenizer(chunk).input_ids for chunk in inputs_formatted.split('<|image|>')]
        input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1][1:], dtype=torch.long).unsqueeze(0).to(device)
    else:
        input_ids = torch.tensor(tokenizer(inputs_formatted).input_ids, dtype=torch.long).unsqueeze(0).to(device)
    attention_mask = torch.ones_like(input_ids).to(device)
    return input_ids, attention_mask

# Example inputs (try replacing the URL with your own image)
prompt = 'บอกทุกอย่างที่เห็นในรูป'  # "Tell me everything you see in the picture"
img_url = "https://img.traveltriangle.com/blog/wp-content/uploads/2020/01/cover-for-Thailand-In-May_27th-Jan.jpg"
image = Image.open(io.BytesIO(requests.get(img_url).content))
image_tensor = model.process_images([image], model.config).to(dtype=model.dtype, device=device)
input_ids, attention_mask = prepare_inputs(prompt, has_image=True, device=device)

# Generate
output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=1000,
    use_cache=True,
    temperature=0.2,
    top_p=0.2,
    repetition_penalty=1.0,  # increase this to reduce repetitive output
)[0]
print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
```
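The `prepare_inputs` helper above also builds text-only prompts. The following is a hedged sketch of a text-only call that reuses the objects created in the snippet; it assumes the model's remote `generate` code accepts calls without the `images` argument, which you should verify against the repository code:

```python
# Text-only usage (illustrative sketch; assumes `images` may be omitted)
text_prompt = 'ช่วยแนะนำสถานที่ท่องเที่ยวในประเทศไทยหน่อย'  # "Please recommend tourist attractions in Thailand"
input_ids, attention_mask = prepare_inputs(text_prompt, has_image=False, device=device)

output_ids = model.generate(
    input_ids,
    max_new_tokens=500,
    use_cache=True,
    temperature=0.2,
    top_p=0.2,
)[0]
print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
```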
# Intended Uses & Limitations
This model is experimental and might not be fully evaluated for all use cases. Developers should assess risks in the context of their specific applications.
# Follow us
https://twitter.com/opentyphoon
# Support
https://discord.gg/CqyBscMFpg