Update README.md
README.md (changed)
@@ -36,6 +36,10 @@ We introduce **Emu3**, a new suite of state-of-the-art multimodal models trained
- **Emu3** simply generates a video causally by predicting the next token in a video sequence, unlike video diffusion models such as Sora. With a video in context, Emu3 can also naturally extend the video and predict what will happen next.

+### Model Information
+
+The **Emu3-Stage1** model contains the pre-trained weights from the first stage of Emu3's two-stage pre-training process. This first stage, **which does not use video data**, trains from scratch with a context length of 5120 on text and image data. The resulting model supports image captioning and can generate images at a resolution of 512x512. You can use our [training scripts](https://github.com/baaivision/Emu3/tree/main/scripts) for further instruction tuning on **image generation and perception tasks**.
+

#### Quickstart
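
Emu3-Stage1 checkpoints load like any Hugging Face causal language model. The snippet below is a minimal sketch, not the repository's official quickstart: it assumes the checkpoint id `BAAI/Emu3-Stage1` and that the checkpoint ships its own modeling code, hence `trust_remote_code=True`. Building an image-captioning prompt additionally requires the Emu3 vision tokenizer and processor from the repo; only the generation step on prepared token ids is shown here.

```python
# Minimal loading sketch (assumptions: checkpoint id "BAAI/Emu3-Stage1",
# custom modeling code shipped with the checkpoint).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_HUB = "BAAI/Emu3-Stage1"  # assumed Hugging Face checkpoint id

tokenizer = AutoTokenizer.from_pretrained(MODEL_HUB, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_HUB,
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",
    trust_remote_code=True,
).eval()

# Emu3 treats images as sequences of discrete vision tokens, so captioning
# and generation are both plain next-token prediction once the prompt ids
# are assembled. Here we only run generation on a text prompt.
input_ids = tokenizer("A photo of", return_tensors="pt").input_ids.to(model.device)
with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```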
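The overview above frames video generation as causal next-token prediction: frames are encoded into discrete vision tokens, and extending a video just means generating more tokens after the context. The sketch below illustrates that loop under stated assumptions; `encode_frames` and `decode_tokens` are hypothetical stand-ins for the vision tokenizer's encode and decode steps, and `tokens_per_frame` depends on the tokenizer's spatial compression rate.

```python
# Illustrative sketch of causal video extension; encode_frames()/decode_tokens()
# are hypothetical helpers, not part of the Emu3 API.
import torch


def extend_video(model, context_tokens: torch.LongTensor,
                 tokens_per_frame: int, new_frames: int) -> torch.LongTensor:
    """Autoregressively predict the tokens of future frames.

    context_tokens: (1, T) tensor of discrete vision-token ids for the
    frames seen so far, as produced by a vision tokenizer.
    """
    out = model.generate(
        context_tokens,
        max_new_tokens=tokens_per_frame * new_frames,
        do_sample=True,
        top_p=0.9,
    )
    # generate() returns context + continuation; keep only the new ids,
    # which encode the predicted future frames.
    return out[:, context_tokens.shape[1]:]


# Usage with the hypothetical helpers:
# video_tokens = encode_frames(frames)                    # pixels -> token ids
# future = extend_video(model, video_tokens,
#                       tokens_per_frame=1024, new_frames=8)
# future_frames = decode_tokens(future)                   # token ids -> pixels
```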