
Improve model card

#1 by nielsr - opened
Files changed (1)
README.md +134 -3
README.md CHANGED
@@ -1,3 +1,134 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ pipeline_tag: feature-extraction
+ ---
+
+ # UniTok: A Unified Tokenizer for Visual Generation and Understanding
+
+ This repository contains UniTok, a unified visual tokenizer for both image generation and understanding tasks, as presented in [UniTok: A Unified Tokenizer for Visual Generation and Understanding](https://hf.co/papers/2502.20321).
+
+ Project Page: https://foundationvision.github.io/UniTok/
+
+ Code: https://github.com/FoundationVision/UniTok
+
+ ![teaser](assets/teaser.png)
+
+ UniTok encodes fine-grained details for generation and captures high-level semantics for understanding. It's compatible with autoregressive generative models (e.g., LlamaGen), multimodal understanding models (e.g., LLaVA), and unified MLLMs (e.g., Chameleon and Liquid).
+
+
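+ As a rough illustration only, the snippet below sketches how the two readouts could be used. The `UniTok` wrapper, its loader, and its method names are hypothetical placeholders rather than the released API; see the Usage section and the GitHub repository for the actual interface. Only the image preprocessing is standard PyTorch/torchvision code:
+
+ ```python
+ # Hypothetical sketch: the UniTok class, its loader, and the method names below
+ # are assumptions for illustration; only the preprocessing is standard PyTorch.
+ import torch
+ from PIL import Image
+ from torchvision import transforms
+
+ # Resize/crop to the 256x256 resolution listed in the Model Weights table.
+ preprocess = transforms.Compose([
+     transforms.Resize(256),
+     transforms.CenterCrop(256),
+     transforms.ToTensor(),
+ ])
+ pixels = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)  # (1, 3, 256, 256)
+
+ # The calls below are placeholders for the repo's actual interface:
+ # tokenizer = UniTok.from_checkpoint("unitok_tokenizer.pth")   # assumed loader
+ # codes = tokenizer.encode(pixels)             # discrete codes for autoregressive generation
+ # feats = tokenizer.extract_features(pixels)   # continuous semantics for understanding (LLaVA-style)
+ # recon = tokenizer.decode(codes)              # image reconstruction (what rFID measures)
+ ```
+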
+ Building on UniTok, we construct an MLLM capable of both multimodal generation and understanding, which sets a new state of the art among unified autoregressive MLLMs. The weights of our MLLM will be released soon.
+
+ ![samples](assets/samples.png)
+
+ ## Performance
+
+ <table>
+   <thead>
+     <tr>
+       <th>Method</th>
+       <th>#Tokens</th>
+       <th>rFID &darr;</th>
+       <th>Zero-shot Acc. &uarr;</th>
+     </tr>
+   </thead>
+   <tbody>
+     <tr>
+       <td colspan="4"><i>VQVAE Model</i></td>
+     </tr>
+     <tr align="center">
+       <td>VQ-GAN</td>
+       <td>256</td>
+       <td>4.98</td>
+       <td>--</td>
+     </tr>
+     <tr align="center">
+       <td>RQ-VAE</td>
+       <td>256</td>
+       <td>1.30</td>
+       <td>--</td>
+     </tr>
+     <tr align="center">
+       <td>VAR</td>
+       <td>680</td>
+       <td>0.90</td>
+       <td>--</td>
+     </tr>
+     <tr>
+       <td colspan="4"><i>CLIP Model</i></td>
+     </tr>
+     <tr align="center">
+       <td>CLIP</td>
+       <td>256</td>
+       <td>--</td>
+       <td>76.2</td>
+     </tr>
+     <tr align="center">
+       <td>SigLIP</td>
+       <td>256</td>
+       <td>--</td>
+       <td>80.5</td>
+     </tr>
+     <tr align="center">
+       <td>ViTamin</td>
+       <td>256</td>
+       <td>--</td>
+       <td>81.2</td>
+     </tr>
+     <tr>
+       <td colspan="4"><i>Unified Model</i></td>
+     </tr>
+     <tr align="center">
+       <td>TokenFlow &dagger;</td>
+       <td>680</td>
+       <td>1.37</td>
+       <td>--</td>
+     </tr>
+     <tr align="center">
+       <td>VILA-U &dagger;</td>
+       <td>256</td>
+       <td>1.80</td>
+       <td>73.3</td>
+     </tr>
+     <tr align="center">
+       <td>UniTok</td>
+       <td>256</td>
+       <td>0.39</td>
+       <td>70.5</td>
+     </tr>
+     <tr align="center">
+       <td>UniTok &dagger;</td>
+       <td>256</td>
+       <td>0.38</td>
+       <td>78.6</td>
+     </tr>
+   </tbody>
+ </table>
+
+ &dagger; indicates that the model uses pretrained CLIP weights for initialization. Although CLIP weight initialization boosts ImageNet zero-shot accuracy, we find that random initialization leads to better downstream understanding performance. We therefore release the UniTok checkpoint trained from scratch.
+
+
+ ## Model Weights
+
+ | Model        | Res. | #Token | Code Shape                | rFID | Checkpoint |
+ |:------------:|:----:|:------:|:-------------------------:|:----:|:----------:|
+ | UniTok-Large | 256  | 256    | 16 $\times$ 16 $\times$ 8 | 0.39 | [Download](https://huggingface.co/FoundationVision/UniTok/blob/main/unitok_tokenizer.pth) |
+
+
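+ The code shape above presumably corresponds to a 16 $\times$ 16 spatial grid (16 $\times$ 16 = 256, matching the #Token column) with 8 discrete codes per grid position. As a convenience sketch only (not part of the official usage instructions), the checkpoint can be fetched and inspected with standard tooling; `hf_hub_download` and `torch.load` are standard APIs, while treating the file as a plain PyTorch checkpoint is our assumption:
+
+ ```python
+ # Sketch: download and inspect the released tokenizer checkpoint.
+ # Assumes the .pth file is a standard PyTorch checkpoint; see the repo's
+ # Usage section for the official loading code.
+ import torch
+ from huggingface_hub import hf_hub_download
+
+ ckpt_path = hf_hub_download(
+     repo_id="FoundationVision/UniTok",
+     filename="unitok_tokenizer.pth",
+ )
+ ckpt = torch.load(ckpt_path, map_location="cpu")  # newer torch may require weights_only=False
+
+ # Print the top-level structure to see what the checkpoint contains.
+ if isinstance(ckpt, dict):
+     for key in list(ckpt)[:10]:
+         value = ckpt[key]
+         shape = tuple(value.shape) if torch.is_tensor(value) else type(value).__name__
+         print(key, shape)
+ ```
+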
+ ## Usage
+
+ (... rest of README content ...)
+
+ ## Citation
+
+ ```bibtex
+ @article{unitok,
+   title={UniTok: A Unified Tokenizer for Visual Generation and Understanding},
+   author={Ma, Chuofan and Jiang, Yi and Wu, Junfeng and Yang, Jihan and Yu, Xin and Yuan, Zehuan and Peng, Bingyue and Qi, Xiaojuan},
+   journal={arXiv preprint arXiv:2502.20321},
+   year={2025}
+ }
+ ```