Ahmed Masry PRO

ahmed-masry

AI & ML interests

Multimodal Chart Understanding, Multimodal Document AI, Multimodal Vision-Language Models

Recent Activity

posted an update 13 minutes ago


Organizations

Visualizations + NLP

Posts (4)

Happy to announce AlignVLM 📏 – a novel approach to bridging vision and language latent spaces for multimodal understanding in Vision-Language Models (VLMs) 🌍📄🖼

🔗 Read the paper: AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding (2502.01341)

🧐 What's the challenge?
Aligning visual features with language embeddings remains a major bottleneck in VLMs. Existing connectors such as multi-layer perceptrons (MLPs) often introduce noise that degrades performance. ❌

🎯 Our Solution: ALIGN Connector
We propose AlignVLM, a method that maps vision features into a weighted average of LLM text embeddings, ensuring they remain in a space that the LLM can effectively interpret. ✅
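To make the idea concrete, here is a minimal sketch of an ALIGN-style connector based on my reading of the description above (not the authors' released code; the layer names and the single linear projection are assumptions): each vision feature is turned into a softmax distribution over the LLM vocabulary, and the output is the correspondingly weighted average of the LLM's text-embedding table.

```python
import torch
import torch.nn as nn

class AlignConnectorSketch(nn.Module):
    """Illustrative sketch of an ALIGN-style connector (not the official code)."""

    def __init__(self, vision_dim: int, text_embeddings: torch.Tensor):
        super().__init__()
        vocab_size, llm_dim = text_embeddings.shape
        # Frozen LLM input-embedding table of shape (vocab_size, llm_dim).
        self.register_buffer("text_embeddings", text_embeddings)
        # Project each vision feature to logits over the LLM vocabulary.
        self.to_vocab_logits = nn.Linear(vision_dim, vocab_size)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim)
        weights = self.to_vocab_logits(vision_feats).softmax(dim=-1)  # (B, P, V)
        # Convex combination of text embeddings: the output stays inside
        # the LLM's own embedding space, which is the point of ALIGN.
        return weights @ self.text_embeddings  # (B, P, llm_dim)
```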

🔬 How does it perform?
We compared ALIGN against common connectors such as MLPs, the Perceiver Resampler, and Ovis, all trained under similar configurations. The results? ALIGN outperforms them all 🏆 on diverse document understanding tasks 📄.

📊 Meet the AlignVLM Model Family!
We trained Llama 3.1 (1B, 3B, 8B) using our connector and benchmarked them against various models. The results:
✅ AlignVLM surpasses all Base VLMs trained under similar configurations.
✅ Our models also perform competitively against Instruct VLMs such as Qwen2-VL and InternVL-2.5 🚀.

🤔 What about robustness to noise?
We injected Gaussian noise (μ=0, σ=3) into the vision encoder's outputs before feeding them to the connector (a minimal sketch follows the results):
✅ ALIGN Connector: Minimal drop (↓1.67%) – proving its high robustness!
❌ MLP Connector: Severe degradation (↓25.54%) – struggling with noisy inputs.
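For concreteness, the corruption in this experiment is plain additive Gaussian noise on the vision features; a one-function sketch (the function and tensor names are my own):

```python
import torch

def perturb_vision_features(feats: torch.Tensor, sigma: float = 3.0) -> torch.Tensor:
    """Add zero-mean Gaussian noise (mu=0, sigma=3, as in the experiment above)
    to the vision encoder's outputs before they reach the connector."""
    return feats + sigma * torch.randn_like(feats)
```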

Code & model weights coming soon! Stay tuned! 🔥
🚀 Introducing ColFlor: An Efficient, OCR-Free Vision-Language Document Retrieval Model 🌟

Earlier this year, ColPali revolutionized document retrieval by eliminating the need for error-prone OCR pipelines: it processes document images directly. However, with its 3 billion parameters, ColPali is computationally heavy for large-scale applications.

That's where ColFlor comes in: a smaller, faster alternative! 🎉 At 17x smaller than ColPali, ColFlor offers a more efficient, OCR-free document retrieval solution, making it ideal for users with limited computing resources (GPU Poor). 💡
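ColPali-style retrievers score relevance with ColBERT-style late interaction: the query and each page image are encoded into multi-vector embeddings, and the score sums each query token's best match over the page's patches (MaxSim). A minimal sketch of that scoring, assuming ColFlor follows the same recipe (shapes and names here are illustrative, not the library's API):

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) relevance score.

    query_emb: (num_query_tokens, dim) and doc_emb: (num_doc_patches, dim),
    both assumed L2-normalized so dot products are cosine similarities.
    """
    sim = query_emb @ doc_emb.T          # (Q, P) token-to-patch similarities
    return sim.max(dim=-1).values.sum()  # best patch per query token, summed
```

Pages are then ranked by this score against the query, with no OCR step anywhere in the pipeline.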

Key Highlights:
🧠 174M parameters (vs. 3B for ColPali)
⚡ 9.8x faster query encoding, 5.25x faster image encoding
📉 Only a 1.8% performance drop on text-rich English documents

Check out the full blog post for more insights on modeling, training, and evaluations across various document retrieval tasks! 🚀
Also, feel free to try our demo on Hugging Face 🤗

🔗 Resources:
📄 Blog post: https://huggingface.co/blog/ahmed-masry/colflor
🧠 Model: ahmed-masry/ColFlor
💻 Demo: ahmed-masry/ColFlor-Demo
🏋️‍♂️ Training code: https://github.com/AhmedMasryKU/colflor
📊 Evaluation code: https://github.com/AhmedMasryKU/vidore-benchmark-colflor