James Le
khanhnamle1994
1 follower · 6 following
https://jameskle.com/
le_james94
khanhnamle1994
AI & ML interests
Multimodal AI, Video Understanding
Recent Activity
upvoted the paper "Apollo: An Exploration of Video Understanding in Large Multimodal Models" about 2 months ago
upvoted the collection "Nucleotide Transformer" 2 months ago
reacted to merve's post 5 months ago:
NVIDIA just dropped NVEagle 🦅 Super impressive vision language model that comes in 7B, 13B, and 13B fine-tuned on chat 💬
Model repositories: https://huggingface.co/collections/merve/nveagle-66d0705108582d73bb235c26
Try it: https://huggingface.co/spaces/NVEagle/Eagle-X5-13B-Chat 💬 (works very well! 🤯)
This model essentially explores having different experts (MoE) for the image encoder part of a vision language model.
How? 🧐 The authors concatenate the output tokens of the vision encoders together and apply "pre-alignment": essentially fine-tuning the experts while the language model stays frozen. Then they freeze both the experts and the decoder and train just the projection layer, and finally they unfreeze everything for supervised fine-tuning ✨
In the paper they explore different fusion strategies and vision encoders, extending the basic CLIP encoder, and find that simply concatenating the visual tokens works well. The rest of the architecture is quite similar to LLaVA.
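The staged recipe described in the post (channel-concatenated vision experts, pre-alignment, projection-only training, then full supervised fine-tuning) can be sketched roughly as below. This is a minimal, hypothetical PyTorch sketch, not the actual NVEagle/Eagle code: the class MultiExpertVLM, the out_dim attribute on each expert, and the configure_stage helper are all invented for illustration, and it assumes every expert produces the same number of visual tokens so features can be concatenated along the channel dimension.

```python
# Hypothetical sketch of multi-expert vision encoding and the staged freezing
# schedule described in the post. Names are illustrative, not the NVEagle API.
import torch
import torch.nn as nn


class MultiExpertVLM(nn.Module):
    """Toy LLaVA-style model with several vision experts whose output tokens
    are concatenated along the channel dimension before a shared projection."""

    def __init__(self, vision_experts, llm, hidden_dim):
        super().__init__()
        self.vision_experts = nn.ModuleList(vision_experts)  # e.g. CLIP plus other encoders
        # Channel-concatenation fusion: project the concatenated features
        # into the language model's embedding space.
        total_dim = sum(e.out_dim for e in vision_experts)   # out_dim assumed per expert
        self.projection = nn.Linear(total_dim, hidden_dim)
        self.llm = llm

    def encode_image(self, image):
        # Each expert returns (batch, num_tokens, expert_dim); experts are assumed
        # to share the same token grid so features can be concatenated on channels.
        feats = [expert(image) for expert in self.vision_experts]
        fused = torch.cat(feats, dim=-1)       # (B, N, sum of expert dims)
        return self.projection(fused)          # (B, N, LLM hidden dim)


def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag


def configure_stage(model, stage):
    """Freezing schedule, following the three stages described in the post."""
    if stage == "pre_alignment":
        # Fine-tune the vision experts (and projection) against a frozen LLM.
        set_trainable(model.vision_experts, True)
        set_trainable(model.projection, True)
        set_trainable(model.llm, False)
    elif stage == "projection_only":
        # Freeze experts and the decoder; train only the projection layer.
        set_trainable(model.vision_experts, False)
        set_trainable(model.projection, True)
        set_trainable(model.llm, False)
    elif stage == "sft":
        # Unfreeze everything for supervised fine-tuning.
        set_trainable(model.vision_experts, True)
        set_trainable(model.projection, True)
        set_trainable(model.llm, True)
```

A real implementation would also interpolate each expert's feature map to a common resolution before concatenation and use a chat-formatted instruction dataset for the final stage; the sketch only shows the fusion and freezing logic.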
Organizations
Papers (1)
arxiv: 2404.14687
Models: none public yet
Datasets: none public yet