---
license: apache-2.0
---
# O2-MAGVIT2
<div align="center">
<figure>
<img src="attachments/reconstruction.gif">

<span style="font-style:italic">Video reconstruction with O2-MAGVIT2-preview (under 720p)</span>
</figure>
</div>

## Introduction
We present O2-MAGVIT2, an open-source PyTorch implementation of Google's MAGVIT-v2 visual tokenizer. The name reflects its dual-modality design: a single tokenizer handles both image and video tokenization. O2-MAGVIT2 is aligned with MAGVIT-v2 to a large extent: it uses a lookup-free quantizer (LFQ) with a codebook size of $2^{18}$ and the exact encoder, decoder, and discriminator architecture described in the original paper. To facilitate training, we wrap the trainer with Hugging Face's `accelerate`.

We also release a preview version of the video tokenizer, trained on a Panda-70M subset, to validate its performance.
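Conceptually, lookup-free quantization replaces the usual nearest-neighbor codebook search with per-channel sign binarization: the sign bits of the latent vector directly form the token index, so a latent of dimension 18 addresses all $2^{18}$ codes. A minimal sketch of this idea (illustrative only, plain Python; not the repository's actual implementation):

```python
def lfq_quantize(z):
    """Lookup-free quantization: binarize each latent channel to +/-1
    and read the sign bits off as an integer token index."""
    bits = [1 if v > 0 else 0 for v in z]            # one sign bit per channel
    index = sum(b << i for i, b in enumerate(bits))  # token id in [0, 2**len(z))
    quantized = [1.0 if b else -1.0 for b in bits]   # the quantized latent
    return quantized, index

# An 18-dim latent maps to one of 2**18 = 262,144 tokens.
q, idx = lfq_quantize([0.3, -1.2, 0.7] + [-0.1] * 15)
```

Because the "codebook" is implicit in the sign pattern, no embedding table lookup is needed, which is what makes the very large $2^{18}$ vocabulary practical.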

## Architecture
We re-implemented MAGVIT-v2's architecture exactly. The figure below is taken from the attachments of the [MAGVIT-v2 paper](https://arxiv.org/pdf/2310.05737).

<div align="center">
<img src="attachments/architecture.png" width="80%">
</div>
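One architectural detail behind the dual-modality design: MAGVIT-v2's encoder uses causal convolutions along the temporal axis, so each frame's code depends only on the current and past frames, letting the same model tokenize a still image as a one-frame video. A 1-D sketch of causal padding (illustrative, plain Python; the real model applies this on the temporal axis of 3-D convolutions):

```python
def causal_conv1d(x, kernel):
    """Causal convolution: left-pad so output[t] depends only on x[:t+1]."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)  # pad on the past side only
    return [sum(kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(x))]
```

With this padding, the first output element is computed from the first input frame alone, which is why image and video tokenization can share one network.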

## Quick start
- **Inference**: edit the arguments in `scripts/run_inference.sh`, then run the following command to see the reconstruction result:
```bash
bash scripts/run_inference.sh
```
> Run `python inference.py -h` for more details.

- **Training**: edit the configs under `configs/`, then run the following command to train the model:
```bash
NODE_RANK=0
MASTER_ADDR=localhost:25001
NUM_NODES=1
NUM_GPUS=8

bash scripts/run_train_3d.sh $NODE_RANK $MASTER_ADDR $NUM_NODES $NUM_GPUS
```

## Training Procedure
Training proceeds in two stages. In stage I, we train an image tokenizer on the OpenImages dataset (about 8M training samples) for 10 epochs with a batch size of 256. In stage II, we randomly sample 9.3M samples from Panda-70M and train the video tokenizer for 1 epoch with a batch size of 128.
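For reference, these dataset and batch sizes imply roughly the following optimizer-step counts (back-of-the-envelope accounting, ignoring gradient accumulation and dropped partial batches):

```python
# Stage I: 8M images, batch size 256, 10 epochs
stage1_steps = 8_000_000 // 256 * 10   # 31,250 steps/epoch -> 312,500 total
# Stage II: 9.3M samples, batch size 128, 1 epoch
stage2_steps = 9_300_000 // 128        # ~72,656 steps
print(stage1_steps, stage2_steps)
```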

## Hyperparameters
We adopt almost the same hyperparameter settings as MAGVIT-v2, with minimal changes. See `configs/magvit2_3d_model_config.yaml` for the model setup and `configs/magvit2_3d_train_config.yaml` for the training setup.

## Pretrained Models
We release a pretrained checkpoint of the video tokenizer on Hugging Face as a preview. Note that, due to the limited number of training steps, the model is under-trained and may not perform well enough if used directly. We recommend treating it as a stepping stone and continuing training for better results.

The checkpoint of O2-MAGVIT2-preview can be found [here](https://huggingface.co/CofeAI/O2-MAGVIT2-preview).

## Acknowledgement
We borrow ideas and implementations from [MAGVIT](https://github.com/google-research/magvit), [vector-quantize-pytorch](https://github.com/lucidrains/vector-quantize-pytorch), [praxis](https://github.com/google/praxis), [LlamaGen](https://github.com/FoundationVision/LlamaGen), [pytorch-image-models](https://github.com/huggingface/pytorch-image-models), and [VQGAN](https://github.com/CompVis/taming-transformers). Many thanks for their excellent work.

## Citation
If you find our work interesting, please cite the following reference and give us a star.
```
@misc{Fang_O2-MAGVIT2,
  author = {Fang, Xuezhi and Yao, Yiqun and Jiang, Xin and Li, Xiang and Yu, Naitong and Wang, Yequan},
  license = {Apache-2.0},
  title = {O2-MAGVIT2},
  year = {2024},
  url = {https://github.com/cofe-ai/O2-MAGVIT2}
}
```