# **Typhoon-Vision Research Preview**

**llama-3-typhoon-v1.5-8b-vision-preview** is a 🇹🇭 Thai *vision-language* model. It supports both text and image input natively, while its output is text. This version (August 2024) is our first vision-language model as part of our multimodal effort, and it is a research *preview*. The base language model is our [llama-3-typhoon-v1.5-8b-instruct](https://huggingface.co/scb10x/llama-3-typhoon-v1.5-8b-instruct).

More details can be found in our [release blog]().

*To acknowledge Meta's effort in creating the foundation model and to comply with the license, we explicitly include "llama-3" in the model name.*

# **Model Description**

Here we provide **Llama3 Typhoon Instruct Vision Preview**, which is built upon [Llama-3-Typhoon-1.5-8B-instruct](https://huggingface.co/scb10x/llama-3-typhoon-v1.5-8b-instruct) and [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384).

We base our architecture on [Bunny by BAAI](https://github.com/BAAI-DCAI/Bunny).
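
As a rough mental model, here is a minimal sketch of the Bunny-style wiring: the SigLIP vision tower produces patch features, an MLP projector maps them into the language model's embedding space, and the Llama decoder consumes them as extra tokens. Everything below (class name, two-layer projector, patch count) is illustrative, not the exact implementation:

```python
import torch
import torch.nn as nn

class VisionLanguageSketch(nn.Module):
    """Illustrative only: Bunny-style vision encoder -> projector -> LLM wiring."""

    def __init__(self, vision_dim=1152, llm_dim=4096):  # SigLIP-SO400M / Llama-3-8B widths
        super().__init__()
        self.projector = nn.Sequential(       # maps vision features into the
            nn.Linear(vision_dim, llm_dim),   # LLM embedding space
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features, text_embeds):
        # image_features: (batch, num_patches, vision_dim) from the vision tower
        # text_embeds:    (batch, seq_len, llm_dim) from the LLM embedding table
        image_embeds = self.projector(image_features)
        # In the real model the image tokens replace a placeholder inside the
        # prompt; for simplicity this sketch just prepends them.
        return torch.cat([image_embeds, text_embeds], dim=1)

# Shape check with dummy tensors (patch and sequence counts are arbitrary):
fused = VisionLanguageSketch()(torch.randn(1, 729, 1152), torch.randn(1, 16, 4096))
print(fused.shape)  # torch.Size([1, 745, 4096])
```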

- **Model type**: An 8B instruct decoder-only model with a vision encoder, based on the Llama architecture.
- **Requirement**: transformers 4.38.0 or newer.
- **Primary Language(s)**: Thai 🇹🇭 and English 🇬🇧

Before running the snippet, you need to install the following dependencies:

```
pip install torch transformers accelerate pillow
```
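
If you want to confirm the transformers requirement is met before loading the model, a quick check like the following works; this helper is our suggestion, not part of the original snippet:

```python
# Our suggestion (not in the original snippet): verify transformers >= 4.38.0.
from packaging import version
import transformers

assert version.parse(transformers.__version__) >= version.parse("4.38.0"), \
    f"transformers >= 4.38.0 is required, found {transformers.__version__}"
```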

```python
import torch
import transformers
# The rest of the import block is elided here; these imports are restored
# because the code below uses them:
from transformers import AutoModelForCausalLM
from PIL import Image
import warnings
import io
import requests

transformers.logging.set_verbosity_error()
transformers.logging.disable_progress_bar()
warnings.filterwarnings('ignore')

# Set Device
device = 'cuda'  # or 'cpu'
torch.set_default_device(device)
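# Alternative (our suggestion, not in the original snippet): pick the device
# automatically so the code also runs on machines without a GPU.
# device = 'cuda' if torch.cuda.is_available() else 'cpu'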

# Create Model
model = AutoModelForCausalLM.from_pretrained(
    'scb10x/llama-3-typhoon-v1.5-8b-instruct-vision-preview',
    torch_dtype=torch.float16,  # float32 for cpu
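    # Our assumption, not shown here: the elided arguments likely include
    # trust_remote_code=True, since the custom process_images() method used
    # below lives in the repo's own modeling code.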
    # ... (remaining from_pretrained arguments are elided here)
)

# ... (intermediate setup code is elided here)

def prepare_inputs(text, has_image=False, device='cuda'):
    # ... (function body elided here)
    return input_ids, attention_mask

# Example Inputs (try replacing with your own URL)
prompt = 'บอกทุกอย่างที่เห็นในรูป'  # Thai: "Describe everything you see in the picture"
img_url = "https://img.traveltriangle.com/blog/wp-content/uploads/2020/01/cover-for-Thailand-In-May_27th-Jan.jpg"
image = Image.open(io.BytesIO(requests.get(img_url).content))
image_tensor = model.process_images([image], model.config).to(dtype=model.dtype, device=device)
input_ids, attention_mask = prepare_inputs(prompt, has_image=True, device=device)

# Generate
output_ids = model.generate(
    input_ids,
    images=image_tensor,
    # ... (remaining generation arguments are elided here)
)
```
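
The snippet above is truncated before decoding. Assuming a `tokenizer` was created in the elided setup (e.g. with `AutoTokenizer.from_pretrained` on the same repo), turning `output_ids` back into text would look like this sketch:

```python
# Sketch (assumes `tokenizer` was loaded in the elided setup above). Skip the
# prompt tokens so only the newly generated answer is decoded.
answer = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)
print(answer)
```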