---
language:
- en
- ko
license: cc-by-nc-4.0
library_name: transformers
tags:
- mergekit
- merge
base_model:
- mistral-community/pixtral-12b
pipeline_tag: image-text-to-text
---

# Pixtral-12b-korean-preview

Fine-tuned on Korean and English data to improve Korean performance.

# Model Card

Merged model using [mergekit](https://github.com/arcee-ai/mergekit/tree/main/mergekit)

This model hasn't been fully tested, so your feedback will be invaluable in improving it.

## Merge Format

```yaml
models:
  - model: spow12/Pixtral-12b-korean-base(private)
    layer_range: [0, 40]
  - model: mistral-community/pixtral-12b
    layer_range: [0, 40]
merge_method: slerp
base_model: mistral-community/pixtral-12b
parameters:
  t:
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]
    - value: 0.5 # fallback for rest of tensors
dtype: bfloat16
```
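The `slerp` method interpolates each pair of tensors along the unit hypersphere rather than linearly, and the `t` schedule above blends the self-attention and MLP tensors with different weights across the layer range. A minimal sketch of spherical linear interpolation in plain Python (not mergekit's actual implementation, which operates on model tensors):

```python
import math

def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation between two weight vectors.

    t=0 returns v0, t=1 returns v1; intermediate t follows the arc
    between them instead of the straight chord used by linear merging.
    """
    dot = sum(a * b for a, b in zip(v0, v1))
    n0 = math.sqrt(sum(a * a for a in v0))
    n1 = math.sqrt(sum(b * b for b in v1))
    cos_omega = max(-1.0, min(1.0, dot / (n0 * n1)))
    omega = math.acos(cos_omega)
    if omega < eps:
        # Nearly parallel vectors: fall back to linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * omega) / math.sin(omega)
    s1 = math.sin(t * omega) / math.sin(omega)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]
```

For orthogonal unit vectors, `slerp(0.5, [1, 0], [0, 1])` lands on the arc midpoint `[√2/2, √2/2]`, preserving the unit norm that a plain average would shrink.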

## Model Details

### Model Description

- **Developed by:** spow12(yw_nam)
- **Shared by:** spow12(yw_nam)
- **Model type:** LLaVA
- **Language(s) (NLP):** Korean, English
- **Fine-tuned from model:** [mistral-community/pixtral-12b](https://huggingface.co/mistral-community/pixtral-12b)

## Usage

### Single image inference 

![image](https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSXVmCeFm5GRrciuGCM502uv9xXVSrS9zDJZ1umCfoMero2MLxT)

```python
import torch
import requests
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image

model_id = 'spow12/Pixtral-12b-korean-preview'
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    device_map='auto',
    torch_dtype=torch.bfloat16,
).eval()
model.tie_weights()
processor = AutoProcessor.from_pretrained(model_id)

system = "You are helpful assistant create by Yw nam"


chat = [
    {
        'content': system,
        'role': 'system'
    },
    {
        "role": "user", "content": [
        {"type": "image"},  
        {"type": "text", "content": "이 이미지에 λ‚˜μ™€μžˆλŠ” 풍경을 μ„€λͺ…ν•΄μ€˜"}, 
        ]
    }
]
url = "https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSXVmCeFm5GRrciuGCM502uv9xXVSrS9zDJZ1umCfoMero2MLxT"
image = Image.open(requests.get(url, stream=True).raw)

images = [[image]]
prompt = processor.apply_chat_template(chat, tokenize=False)

inputs = processor(text=prompt, images=images, return_tensors="pt").to(model.device)
generate_ids = model.generate(**inputs, max_new_tokens=500, do_sample=True, min_p=0.1, temperature=0.9)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(output[0])

# Output
"""이 μ΄λ―Έμ§€λŠ” λ°”μœ„ ν•΄μ•ˆμ— μœ„μΉ˜ν•œ μž‘μ€ 섬에 μœ„μΉ˜ν•œ κ³ μš”ν•œ ν•΄μ•ˆ 경치λ₯Ό λ³΄μ—¬μ€λ‹ˆλ‹€. 이 섬은 ν‘Έλ₯Έ 물둜 λ‘˜λŸ¬μ‹Έμ—¬ 있으며, κ·Έ μœ„μ—λŠ” 뢉은 지뢕이 μžˆλŠ” ν•˜μ–€ λ“±λŒ€κ°€ μ„œ μžˆμŠ΅λ‹ˆλ‹€. λ“±λŒ€λŠ” μ„¬μ˜ 쀑앙에 μœ„μΉ˜ν•΄ 있으며, λ°”μœ„ 절벽과 μ—°κ²°λœ λŒλ‹€λ¦¬κ°€ 이어져 μžˆμ–΄ μ ‘κ·Όν•  수 μžˆμŠ΅λ‹ˆλ‹€. λ“±λŒ€ μ£Όλ³€μ˜ λ°”μœ„ μ ˆλ²½μ€ νŒŒλ„κ°€ λΆ€λ”ͺ히며 μž₯면에 역동적인 μš”μ†Œλ₯Ό λ”ν•©λ‹ˆλ‹€. λ“±λŒ€ λ„ˆλ¨Έλ‘œλŠ” ν•˜λŠ˜μ΄ 맑고 ν‘Έλ₯΄λ©°, 전체적인 μž₯면은 평화둭고 κ³ μš”ν•œ λΆ„μœ„κΈ°λ₯Ό μžμ•„λƒ…λ‹ˆλ‹€."""
```
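Note that `batch_decode` above decodes the full sequence, i.e. the prompt followed by the generated continuation. If you only want the newly generated text, a common pattern is to slice off the prompt tokens before decoding. A sketch with plain lists standing in for the tensors (`prompt_len` plays the role of `inputs.input_ids.shape[1]` from the snippet above):

```python
# generate() returns the echoed prompt tokens followed by the new tokens,
# so drop the first prompt_len tokens of each batch element before decoding.
prompt_len = 5                                    # inputs.input_ids.shape[1] in practice
generate_ids = [[101, 7, 8, 9, 102, 10, 11, 12]]  # one sequence: 5 prompt + 3 new tokens
new_tokens = [seq[prompt_len:] for seq in generate_ids]
print(new_tokens)  # [[10, 11, 12]]
```

With the real tensors this becomes `processor.batch_decode(generate_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)`.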

### Multi image inference


<p align="center">
  <img src="https://cloud.shopback.com/c_fit,h_750,w_750/store-service-tw/assets/20185/0476e480-b6c3-11ea-b541-2ba549204a69.png"  width="300" style="display:inline-block;"/>
  <img src="https://pbs.twimg.com/profile_images/1268196215587397634/sgD5ZWuO_400x400.png"  width="300" style="display:inline-block;"/>
</p>

```python
url_apple = "https://cloud.shopback.com/c_fit,h_750,w_750/store-service-tw/assets/20185/0476e480-b6c3-11ea-b541-2ba549204a69.png"
image_1 = Image.open(requests.get(url_apple, stream=True).raw)
url_microsoft = "https://pbs.twimg.com/profile_images/1268196215587397634/sgD5ZWuO_400x400.png"
image_2 = Image.open(requests.get(url_microsoft, stream=True).raw)
chat = [
    {
        'content': system,
        'role': 'system'
    },
    {
        "role": "user", "content": [
        {"type": "image"},  
        {"type": "image"},  
        {"type": "text", "content": "두 기업에 λŒ€ν•΄μ„œ μ•„λŠ”κ±Έ μ„€λͺ…ν•΄μ€˜."}, 
        ]
    }
]

images = [[image_1, image_2]]
prompt = processor.apply_chat_template(chat, tokenize=False)
inputs = processor(text=prompt, images=images, return_tensors="pt").to(model.device)
generate_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.7, min_p=0.1)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(output[0])

# Output
"""두 기업은 각각 Appleκ³Ό Microsoftμž…λ‹ˆλ‹€.

1. μ• ν”Œ:
μ• ν”Œμ€ 1976년에 μŠ€ν‹°λΈŒ 작슀, μŠ€ν‹°λΈŒ μ›Œμ¦ˆλ‹ˆμ•…, λ‘œλ„λ“œ μ›¨μΈμ—κ²Œ μ„€λ¦½λœ 미ꡭ의 닀ꡭ적 기술 κΈ°μ—…μž…λ‹ˆλ‹€. μ• ν”Œμ˜ μ£Όμš” μ œν’ˆμœΌλ‘œλŠ” iPhone, iPad, Mac, Apple Watchκ°€ μžˆμŠ΅λ‹ˆλ‹€. 이 νšŒμ‚¬λŠ” ν˜μ‹ μ μΈ λ””μžμΈ, μ‚¬μš©μž μΉœν™”μ μΈ μΈν„°νŽ˜μ΄μŠ€, κ³ ν’ˆμ§ˆμ˜ ν•˜λ“œμ›¨μ–΄λ‘œ 유λͺ…ν•©λ‹ˆλ‹€. μ• ν”Œμ€ λ˜ν•œ Apple Music, iCloud, App Store와 같은 λ‹€μ–‘ν•œ μ†Œν”„νŠΈμ›¨μ–΄ μ„œλΉ„μŠ€μ™€ ν”Œλž«νΌμ„ μ œκ³΅ν•©λ‹ˆλ‹€. μ• ν”Œμ€ ν˜μ‹ μ μΈ μ œν’ˆκ³Ό κ°•λ ₯ν•œ λΈŒλžœλ“œλ‘œ 잘 μ•Œλ €μ Έ 있으며, 2010λ…„λŒ€ 이후 μ„Έκ³„μ—μ„œ κ°€μž₯ κ°€μΉ˜ μžˆλŠ” κΈ°μ—… 쀑 ν•˜λ‚˜λ‘œ μžλ¦¬λ§€κΉ€ν–ˆμŠ΅λ‹ˆλ‹€.

2. λ§ˆμ΄ν¬λ‘œμ†Œν”„νŠΈ:
λ§ˆμ΄ν¬λ‘œμ†Œν”„νŠΈλŠ” 1975년에 빌 κ²Œμ΄μΈ μ™€ 폴 μ•Œλ Œμ— μ˜ν•΄ μ„€λ¦½λœ 미ꡭ의 닀ꡭ적 기술 κΈ°μ—…μž…λ‹ˆλ‹€. 이 νšŒμ‚¬λŠ” 운영 체제, μ†Œν”„νŠΈμ›¨μ–΄, 개인용 컴퓨터, μ „μžμ œν’ˆ κ°œλ°œμ— 쀑점을 λ‘‘λ‹ˆλ‹€. λ§ˆμ΄ν¬λ‘œμ†Œν”„νŠΈμ˜ μ£Όμš” μ œν’ˆμœΌλ‘œλŠ” Windows 운영 체제, Microsoft Office μ œν’ˆκ΅°, Xbox κ²Œμž„ μ½˜μ†”μ΄ μžˆμŠ΅λ‹ˆλ‹€. 이 νšŒμ‚¬λŠ” μ†Œν”„νŠΈμ›¨μ–΄ 개발, ν΄λΌμš°λ“œ μ»΄ν“¨νŒ…, 인곡지λŠ₯ 연ꡬ와 같은 λΆ„μ•Όμ—μ„œλ„ μ€‘μš”ν•œ 역할을 ν•˜κ³  μžˆμŠ΅λ‹ˆλ‹€. λ§ˆμ΄ν¬λ‘œμ†Œν”„νŠΈλŠ” ν˜μ‹ μ μΈ 기술과 κ°•λ ₯ν•œ λΉ„μ¦ˆλ‹ˆμŠ€ μ†”λ£¨μ…˜μœΌλ‘œ 잘 μ•Œλ €μ Έ 있으며, μ„Έκ³„μ—μ„œ κ°€μž₯ κ°€μΉ˜ μžˆλŠ” κΈ°μ—… 쀑 ν•˜λ‚˜λ‘œ μžλ¦¬λ§€κΉ€ν–ˆμŠ΅λ‹ˆλ‹€"""
```

## Limitation

Overall, the performance seems reasonable.

However, it declines when processing images that contain non-English text.

This is likely because the model was trained primarily on English text and landscape images.

Adding Korean data in future training is expected to improve performance.

## Citation

```bibtex
@misc {spow12/Pixtral-12b-korean-preview,
    author       = { YoungWoo Nam },
    title        = { spow12/Pixtral-12b-korean-preview },
    year         = 2024,
    url          = { https://huggingface.co/spow12/Pixtral-12b-korean-preview },
    publisher    = { Hugging Face }
}
```