---
license: other
license_name: license-seed-x-17b
license_link: LICENSE
---
# SEED-X
[![arXiv](https://img.shields.io/badge/arXiv-2404.14396-b31b1b.svg)](https://arxiv.org/abs/2404.14396)
[![Demo](https://img.shields.io/badge/Gradio-Demo-orange)](https://139a5c1d085953f17b.gradio.live/)
[![Model](https://img.shields.io/badge/Model-Huggingface-yellow)](https://huggingface.co/AILab-CVC/SEED-X-17B)

We introduce SEED-X, a unified and versatile foundation model that can serve as various multimodal AI assistants **in the real world** after different instruction tuning. By unifying **multi-granularity comprehension and generation**, SEED-X can respond to a wide variety of user needs.

All models and inference code are released!
## News
**2024-04-22** :hugs: We release the [models](https://huggingface.co/AILab-CVC/SEED-X-17B), including the pre-trained foundation model **SEED-X**, the general instruction-tuned model **SEED-X-I**, the editing model **SEED-X-Edit**, and our de-tokenizer, which can generate realistic images from ViT features (with or without a condition image).

**2024-04-22** :hugs: We release an online [gradio demo](https://139a5c1d085953f17b.gradio.live/) of the general instruction-tuned model SEED-X-I. SEED-X-I can follow multimodal instructions (including images with dynamic resolutions) and respond with images, text, and bounding boxes in multi-turn conversations. Note that SEED-X-I **does not support image manipulation**; the inference code and model of SEED-X-Edit, which performs high-precision image editing, will be released soon.
## TODOs
- [x] Release the multimodal foundation model SEED-X.
- [x] Release the instruction-tuned model SEED-X-Edit for high-precision image editing.
- [ ] Release 3.7M in-house image editing data.

![image](https://github.com/AILab-CVC/SEED-X/blob/main/demos/teaser.jpg?raw=true)

![image](https://github.com/AILab-CVC/SEED-X/blob/main/demos/case_example.jpg?raw=true)
## Usage
### Dependencies
- Python >= 3.8 ([Anaconda](https://www.anaconda.com/download/#linux) is recommended; see the setup sketch below this list)
- [PyTorch >= 2.0.1](https://pytorch.org/)
- NVIDIA GPU + [CUDA](https://developer.nvidia.com/cuda-downloads)
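
If you use conda, a minimal environment setup might look like the sketch below; the environment name and the exact Python/CUDA versions are illustrative choices, not pinned by this repo:

```bash
# Create and activate a fresh environment (the name "seedx" is arbitrary)
conda create -n seedx python=3.10 -y
conda activate seedx

# Install a PyTorch >= 2.0.1 build with CUDA support; pick the wheel index
# matching your CUDA version (see https://pytorch.org/get-started/locally/)
pip install "torch>=2.0.1" torchvision --index-url https://download.pytorch.org/whl/cu118
```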
### Installation
Clone the repo and install the required packages:
```bash
git clone https://github.com/AILab-CVC/SEED-X.git
cd SEED-X
pip install -r requirements.txt
```
### Model Weights
We release the pretrained de-tokenizer, the pre-trained foundation model **SEED-X**, the general instruction-tuned model **SEED-X-I**, and the editing model **SEED-X-Edit** in [SEED-X-17B on Hugging Face](https://huggingface.co/AILab-CVC/SEED-X-17B).

Please download the checkpoints and save them under the folder `./pretrained`. For example, `./pretrained/seed_x`.
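One way to fetch the checkpoints, assuming `huggingface-cli` (shipped with `huggingface_hub`) is installed, is the sketch below; depending on how the repository is organized, you may need to rearrange the downloaded folders to match the paths expected by the inference scripts:

```bash
# Download the SEED-X checkpoints from the Hugging Face Hub into ./pretrained
huggingface-cli download AILab-CVC/SEED-X-17B --local-dir ./pretrained
```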
You also need to download [stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) and [Qwen-VL-Chat](https://huggingface.co/Qwen/Qwen-VL-Chat) and save them under the folder `./pretrained`. Then use the following script to extract the visual encoder weights from Qwen-VL-Chat:
```bash
python3 src/tools/reload_qwen_vit.py
```
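After these steps, `./pretrained` should contain roughly the entries below. The `seed_x` name comes from the example above, while the other two simply mirror the repository names; double-check them against the paths the inference scripts expect:

```text
./pretrained
├── seed_x                        # plus the other released SEED-X checkpoints
├── stable-diffusion-xl-base-1.0
└── Qwen-VL-Chat
```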
### Inference with SEED-X De-tokenizer
```bash
# For image reconstruction with ViT image features
python3 src/inference/eval_seed_x_detokenizer.py

# For image reconstruction with ViT image features and conditional image
python3 src/inference/eval_seed_x_detokenizer_with_condition.py
```
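Each eval script is launched directly with `python3`; on a multi-GPU machine you can pin a run to a single device with the standard `CUDA_VISIBLE_DEVICES` environment variable (assuming the script does not override device placement internally):

```bash
# Restrict the reconstruction script to GPU 0
CUDA_VISIBLE_DEVICES=0 python3 src/inference/eval_seed_x_detokenizer.py
```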
### Inference with the pre-trained model SEED-X
```bash
# For image comprehension and detection
python3 src/inference/eval_img2text_seed_x.py

# For image generation
python3 src/inference/eval_text2img_seed_x.py
```
### Inference with the general instruction-tuned model SEED-X-I
```bash
# For image comprehension and detection
python3 src/inference/eval_img2text_seed_x_i.py

# For image generation
python3 src/inference/eval_text2img_seed_x_i.py
```
### Inference with the editing model SEED-X-Edit
```bash
# For image editing
python3 src/inference/eval_img2edit_seed_x_edit.py
```
## Citation
If you find the work helpful, please consider citing:
```bibtex
@article{ge2024seed,
  title={SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation},
  author={Ge, Yuying and Zhao, Sijie and Zhu, Jinguo and Ge, Yixiao and Yi, Kun and Song, Lin and Li, Chen and Ding, Xiaohan and Shan, Ying},
  journal={arXiv preprint arXiv:2404.14396},
  year={2024}
}
```
## License
`SEED` is licensed under the Apache License Version 2.0, except for the third-party components listed in [License](License_Seed-X.txt).

During the training of SEED-X, we freeze the original parameters of LLaMA2 and optimize the LoRA module.
|