Model Architecture for Gemini
Has anyone seen any post explaining or discussing the model architecture of Gemini?
They didn't really say much about the model architecture in the technical report. Most of what's known is general: it's a Transformer-based decoder, optimized for inference on TPUs.
Yeah, there's not much info.
But, there are four input modalities: text (encoded with SentencePiece), images (ViT), video (multiple image frames), and audio (Universal Speech Model features @ 16kHz). I suspect it uses PaLI-style encoding, because of the three papers it says it's inspired by (Flamingo, CoCa, PaLI), PaLI is the most recent and the simplest to implement.
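To make the PaLI-style guess concrete, here's a minimal sketch of what that kind of input fusion could look like, assuming ViT patch embeddings get projected into the decoder's embedding space and concatenated with the SentencePiece token embeddings (all module names and dimensions below are made up, not from the report):

```python
import torch
import torch.nn as nn

class PaLIStyleInputFusion(nn.Module):
    """Hypothetical sketch: project ViT patch embeddings into the decoder's
    embedding space and prepend them to the text token embeddings."""

    def __init__(self, vit_dim=1024, lm_dim=4096, vocab_size=256_000):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, lm_dim)  # SentencePiece ids -> embeddings
        self.vision_proj = nn.Linear(vit_dim, lm_dim)        # ViT patch features -> LM space

    def forward(self, patch_embeds, text_ids):
        # patch_embeds: (batch, num_patches, vit_dim) from a ViT encoder
        # text_ids:     (batch, seq_len) SentencePiece token ids
        vision_tokens = self.vision_proj(patch_embeds)
        text_tokens = self.text_embed(text_ids)
        # Single interleaved sequence fed to the decoder-only transformer.
        return torch.cat([vision_tokens, text_tokens], dim=1)
```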
There are two output modalities: text and image (the paper cites DALL-E and Parti; it probably uses Parti because it's auto-regressive).
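If it really is Parti-style, image output would just be extra vocabulary: a VQ tokenizer turns images into discrete codes, and the decoder samples those codes autoregressively before a VQ decoder turns them back into pixels. A hypothetical decoding loop (the `decoder` interface and the codebook layout here are assumptions on my part):

```python
import torch

def generate_image_tokens(decoder, prompt_ids, image_vocab_size=8192, num_image_tokens=1024):
    """Hypothetical Parti-style loop: the LM autoregressively samples discrete
    image-codebook ids, which a VQ decoder would later turn into pixels."""
    tokens = prompt_ids                                      # (1, prompt_len) text conditioning
    for _ in range(num_image_tokens):
        # Assumes decoder(tokens) returns logits of shape (batch, seq, vocab)
        # and that image codebook ids occupy the first image_vocab_size slots.
        logits = decoder(tokens)[:, -1, :image_vocab_size]
        next_id = torch.multinomial(logits.softmax(-1), 1)
        tokens = torch.cat([tokens, next_id], dim=1)
    return tokens[:, prompt_ids.shape[1]:]                   # image token ids only
```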
It also uses multi-query attention and possibly other new efficient transformer techniques.
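Multi-query attention at least is well documented (Shazeer, 2019): all query heads share a single key/value head, which shrinks the KV cache and speeds up decoding. A minimal PyTorch sketch of the idea (not Gemini's actual implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiQueryAttention(nn.Module):
    """Multi-query attention: many query heads, one shared key/value head."""

    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)              # per-head queries
        self.kv_proj = nn.Linear(d_model, 2 * self.head_dim)   # single shared K and V
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k, v = self.kv_proj(x).split(self.head_dim, dim=-1)
        # Broadcast the single K/V head across all query heads.
        k = k.unsqueeze(1).expand(-1, self.num_heads, -1, -1)
        v = v.unsqueeze(1).expand(-1, self.num_heads, -1, -1)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out_proj(attn.transpose(1, 2).reshape(b, t, -1))
```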
The paper says the model was trained to be multimodal from the start, so my feeling is that they trained the vision, text, and audio components from scratch. As far as I can remember, the CLIP paper did something similar.
So my guess is that they used contrastive learning to train the network, with a loss function similar to CLIP's.
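For reference, the CLIP objective is a symmetric cross-entropy over the image-text similarity matrix of a batch. Whether Gemini used anything like it is pure speculation on my part, but the loss itself looks like this:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    """Symmetric InfoNCE loss from the CLIP paper: matching image/text pairs
    in a batch are positives, every other pairing is a negative."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = image_embeds @ text_embeds.t() / temperature     # (batch, batch) similarities
    targets = torch.arange(logits.shape[0], device=logits.device)
    loss_i = F.cross_entropy(logits, targets)                 # image -> text
    loss_t = F.cross_entropy(logits.t(), targets)             # text -> image
    return (loss_i + loss_t) / 2
```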
Then I guess they used RLHF to further fine-tune the model.
For the dataset, they mention using web documents, images, etc. I think something like LAION-400M could be a good starting point for an open-source replication.
Let me know what you think, guys.
If you read the latest papers on LMMs, e.g. Flamingo, PaLI, Qwen-VL, BLIP-2, LLaVA-1.5, CogVLM, you'll notice that the vision encoder is usually a pre-trained CLIP model.
I doubt that everything is trained from scratch, given that even Google's previous LMM papers don't do that.
The vision encoder is probably a pre-trained SigLIP model, just like PaLI-3. (Hint: Google tends to use other Google models.) Lots of other LMMs have used pre-trained CLIP ViT-H weights, so it's probably a model along those lines. They say they used USM "features", which doesn't make sense unless they used a pre-trained USM model.
The language model itself is probably trained from scratch, especially since it has an image output head.
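That combination, a frozen pre-trained vision encoder, a small learned projection, and the language model, is the pattern most of the open LMMs above follow. A hypothetical sketch of that wiring (nothing Gemini-specific; dimensions and interfaces are assumed):

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """LLaVA/PaLI-style pattern: keep a pre-trained vision encoder (e.g. a SigLIP/CLIP ViT)
    frozen and learn only a projection into the language model's embedding space."""

    def __init__(self, vision_encoder, lm, vision_dim=1152, lm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder
        for p in self.vision_encoder.parameters():
            p.requires_grad = False                        # keep pre-trained weights fixed
        self.proj = nn.Linear(vision_dim, lm_dim)           # the only new vision-side weights
        self.lm = lm                                        # decoder-only LM

    def forward(self, pixel_values, text_embeds):
        with torch.no_grad():
            patches = self.vision_encoder(pixel_values)     # (b, num_patches, vision_dim)
        vision_tokens = self.proj(patches)
        return self.lm(torch.cat([vision_tokens, text_embeds], dim=1))
```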
Do you have 8x[A|H]100 GPUs?
I see; they stated it's multimodal from the beginning. I haven't read the Flamingo paper, so I can't comment on that, but from your description it seems that paper used two different pre-trained encoders and then let them learn a joint representation. In the Gemini report, though, they explicitly say the model is "multimodal" by default, which is why I concluded they might have trained the network from scratch.
And no, I don't have access to 8x[A|H]100 GPUs 😅.