---
inference: false
language:
- th
- en
library_name: transformers
license: llama3
pipeline_tag: text-generation
---

# **Typhoon-Vision Preview**

**llama-3-typhoon-v1.5-8b-vision-preview** is a 🇹🇭 Thai *vision-language* model. It natively accepts both text and image inputs and produces text output. This version (August 2024) is our first vision-language model, released as part of our multimodal effort, and it is a research *preview*. The base language model is our [llama-3-typhoon-v1.5-8b-instruct](https://huggingface.co/scb10x/llama-3-typhoon-v1.5-8b-instruct).

More details can be found in our [release blog](https://medium.com/opentyphoon/typhoon-vision-preview-release-0bdef028ca55) and technical report (coming soon). *To acknowledge Meta's effort in creating the foundation model and to comply with the license, we explicitly include "llama-3" in the model name.*

# **Model Description**
Here we provide **Llama3 Typhoon Instruct Vision Preview**, which is built upon [Llama-3-Typhoon-1.5-8B-instruct](https://huggingface.co/scb10x/llama-3-typhoon-v1.5-8b-instruct) and [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384).

Our training recipe is based on [Bunny by BAAI](https://github.com/BAAI-DCAI/Bunny).

- **Model type**: An 8B instruct decoder-only model with a vision encoder, based on the Llama architecture.
- **Requirement**: transformers 4.38.0 or newer (a quick version check is sketched after this list).
- **Primary Language(s)**: Thai 🇹🇭 and English 🇬🇧
- **Demo**: [https://vision.opentyphoon.ai/](https://vision.opentyphoon.ai/)
- **License**: [Llama 3 Community License](https://llama.meta.com/llama3/license/)
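
An easy way to confirm the transformers requirement mentioned above is a small version check. This is only a sketch; it assumes the `transformers` and `packaging` packages from PyPI (the latter is normally installed alongside `transformers`):

```python
# Minimal check for the "transformers 4.38.0 or newer" requirement noted above
import transformers
from packaging import version

required = version.parse("4.38.0")
installed = version.parse(transformers.__version__)
if installed < required:
    raise RuntimeError(
        f"transformers {installed} found, but this model card requires >= {required}"
    )
print(f"transformers {installed} satisfies the requirement")
```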

# **Quickstart**

The code snippet below shows how to use the model with `transformers`.

Before running the snippet, you need to install the following dependencies:

```shell
pip install torch transformers accelerate pillow
```

```python
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
import warnings
import io
import requests

# disable some warnings
transformers.logging.set_verbosity_error()
transformers.logging.disable_progress_bar()
warnings.filterwarnings('ignore')

# Set Device
device = 'cuda'  # or cpu
torch.set_default_device(device)

# Create Model
model = AutoModelForCausalLM.from_pretrained(
    'scb10x/llama-3-typhoon-v1.5-8b-instruct-vision-preview',
    torch_dtype=torch.float16, # float32 for cpu
    device_map='auto',
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    'scb10x/llama-3-typhoon-v1.5-8b-instruct-vision-preview',
    trust_remote_code=True)

def prepare_inputs(text, has_image=False, device='cuda'):
    messages = [
        {"role": "system", "content": "You are a helpful vision-capable assistant who eagerly converses with the user in their language."},
    ]
    
    if has_image:
        messages.append({"role": "user", "content": "<|image|>\n" + text})
    else:
        messages.append({"role": "user", "content": text})
    
    inputs_formatted = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=False
    )

    if has_image:
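        # Split the prompt around <|image|> and splice in -200, the placeholder
        # token id the model's remote code expects at the image position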
        text_chunks = [tokenizer(chunk).input_ids for chunk in inputs_formatted.split('<|image|>')]
        input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1][1:], dtype=torch.long).unsqueeze(0).to(device)
        attention_mask = torch.ones_like(input_ids).to(device)
    else:
        input_ids = torch.tensor(tokenizer(inputs_formatted).input_ids, dtype=torch.long).unsqueeze(0).to(device)
        attention_mask = torch.ones_like(input_ids).to(device)

    return input_ids, attention_mask

# Example inputs (try replacing the URL with your own image)
prompt = 'บอกทุกอย่างที่เห็นในรูป'
img_url = "https://img.traveltriangle.com/blog/wp-content/uploads/2020/01/cover-for-Thailand-In-May_27th-Jan.jpg"
image = Image.open(io.BytesIO(requests.get(img_url).content))
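# Preprocess the image for the vision encoder (helper provided by the model's remote code)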
image_tensor = model.process_images([image], model.config).to(dtype=model.dtype, device=device)
input_ids, attention_mask = prepare_inputs(prompt, has_image=True, device=device)

# Generate
output_ids = model.generate(
    input_ids,
    images=image_tensor,
    max_new_tokens=1000,
    use_cache=True,
    temperature=0.2,
    top_p=0.2,
    repetition_penalty=1.0,  # increase this to reduce repetition
)[0]

print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
```
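
Since `prepare_inputs` also handles prompts without an image, a text-only query can reuse the objects defined above. The sketch below assumes the model's remote-code `generate` accepts calls without an `images` tensor:

```python
# Text-only usage, reusing model, tokenizer, and prepare_inputs from the snippet above
# (assumes the remote-code generate() works when no images tensor is passed)
prompt = 'Suggest three places to visit in Bangkok.'
input_ids, attention_mask = prepare_inputs(prompt, has_image=False, device=device)

output_ids = model.generate(
    input_ids,
    max_new_tokens=500,
    use_cache=True,
    temperature=0.2,
    top_p=0.2,
)[0]

print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
```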

# Evaluation Results
| Model | MMBench (Dev) | POPE | GQA | GQA (Thai) |
|:--|:--|:--|:--|:--|
| Typhoon-Vision 8B Preview | 70.9 | 84.8 | 62.0 | 43.6 |
| SeaLMMM 7B v0.1 | 64.8 | 86.3 | 61.4 | 25.3 |
| Bunny Llama3 8B Vision | 76.0 | 86.9 | 64.8 | 24.0 |
| GPT-4o Mini | 69.8 | 45.4 | 42.6 | 18.1 |

# Intended Uses & Limitations
This model is experimental; it may not always follow human instructions accurately and is prone to hallucination. It also lacks moderation mechanisms, so it may produce harmful or inappropriate responses. Developers should carefully assess these risks for their specific applications.

# Follow Us & Support
- https://twitter.com/opentyphoon
- https://discord.gg/CqyBscMFpg

# Acknowledgements
We would like to thank the Bunny team for open-sourcing their code and data, and the Google team for releasing the fine-tuned SigLIP model that we adopt as our vision encoder. We also thank the many other open-source projects for sharing useful knowledge, data, code, and model weights.

## Typhoon Team
*Parinthapat Pengpun*, Potsawee Manakul, Sittipong Sripaisarnmongkol, Natapong Nitarach, Warit Sirichotedumrong, Adisai Na-Thalang, Phatrasek Jirabovonvisut, 
Krisanapong Jirayoot, Pathomporn Chokchainant, Kasima Tharnpipitchai, *Kunat Pipatanakul*