Happy to announce AlignVLM, a novel approach to bridging vision and language latent spaces for multimodal understanding in Vision-Language Models (VLMs).
What's the challenge? Aligning visual features with language embeddings remains a major bottleneck in VLMs. Existing connectors such as multi-layer perceptrons (MLPs) often introduce noise that degrades performance.
Our solution: the ALIGN connector. We propose AlignVLM, a method that maps vision features into a weighted average of LLM text embeddings, ensuring they remain in a space that the LLM can effectively interpret.
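To make the "weighted average of text embeddings" idea concrete, here is a minimal NumPy sketch. It assumes the weights come from a softmax over a linear projection of each vision feature onto the vocabulary; the function name `align_connector`, the matrix `w_proj`, and the toy dimensions are hypothetical choices for illustration, not the released implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def align_connector(vision_feats, w_proj, text_embeddings):
    """Map vision features to weighted averages of text embeddings.

    vision_feats:    (n_patches, d_vision)
    w_proj:          (d_vision, vocab_size)  hypothetical learned projection
    text_embeddings: (vocab_size, d_llm)     the LLM's input embedding matrix
    """
    logits = vision_feats @ w_proj        # (n_patches, vocab_size)
    weights = softmax(logits, axis=-1)    # non-negative, each row sums to 1
    aligned = weights @ text_embeddings   # (n_patches, d_llm)
    return aligned, weights

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
vision_feats = rng.normal(size=(4, 16))
w_proj = rng.normal(size=(16, 100))
text_embeddings = rng.normal(size=(100, 32))

aligned, weights = align_connector(vision_feats, w_proj, text_embeddings)
print(aligned.shape)
```

Because every row of `weights` is non-negative and sums to 1, each output vector is a convex combination of text embeddings, which is one way to read "remain in a space that the LLM can effectively interpret".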
How does it perform? We compared ALIGN against common connectors such as MLPs, Perceiver Resampler, and Ovis, all trained under similar configurations. The result: ALIGN outperforms them all on diverse document understanding tasks.
Meet the AlignVLM model family! We trained Llama 3.2 (1B, 3B) and Llama 3.1 (8B) using our connector and benchmarked them against various models. The results: AlignVLM surpasses all Base VLMs trained under similar configurations, and our models also perform competitively against Instruct VLMs such as Qwen2-VL and InternVL-2.5.
What about robustness to noise? We injected Gaussian noise (μ=0, σ=3) into the vision encoder's outputs before feeding them to the connector. ALIGN connector: minimal drop (-1.67%), indicating high robustness. MLP connector: severe degradation (-25.54%), struggling with noisy inputs.
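For anyone who wants to try a similar stress test, the noise injection itself can be sketched in a few lines. This is a minimal illustration that assumes the vision encoder's outputs are available as a NumPy array; `add_gaussian_noise` is a hypothetical helper, not part of any released code.

```python
import numpy as np

def add_gaussian_noise(features, mean=0.0, std=3.0, rng=None):
    """Return a copy of `features` with i.i.d. Gaussian noise added."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(loc=mean, scale=std, size=features.shape)
    return features + noise

# Stand-in for vision encoder outputs (toy shape, for illustration only).
rng = np.random.default_rng(0)
clean = rng.normal(size=(1000, 64))
noisy = add_gaussian_noise(clean, mean=0.0, std=3.0, rng=rng)

# The noisy features would then be fed to the connector in place of `clean`.
print(noisy.shape)
```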
Code & model weights coming soon! Stay tuned!
Introducing ColFlor: An Efficient, OCR-Free Vision-Language Document Retrieval Model
Earlier this year, ColPali revolutionized document retrieval by eliminating the need for error-prone OCR pipelines. Instead, it directly processes the document images. However, with its 3 billion parameters, ColPali is computationally heavy for large-scale applications.
That's where ColFlor comes in: a smaller, faster alternative! At 17x smaller than ColPali, ColFlor offers a more efficient, OCR-free document retrieval solution, making it ideal for users with limited computing resources (GPU Poor). Key highlights: 174M parameters (vs. 3B for ColPali); 9.8x faster query encoding and 5.25x faster image encoding; only a 1.8% performance drop on text-rich English documents.
Check out the full blog post for more insights on modeling, training, and evaluations across various document retrieval tasks! Also, feel free to try our demo on Hugging Face.