HaploVL - A Single-Transformer Baseline for Multi-Modal Understanding

Project page 

HaploVL is a multimodal foundation model that performs cross-modal understanding of text, image, and video inputs through a single transformer architecture.

Highlights

This repository contains the PyTorch implementation, model weights, and training code for HaploVL.

🌟 Unified Architecture: Single transformer model supporting early fusion of multi-modal inputs and auto-regressive response generation
🌟 Efficient Training: Optimized training recipe leveraging pre-trained knowledge with reduced resource consumption
🌟 Scalable Design: Flexible framework supporting both Ascend NPU and GPU environments (see the device-selection sketch below)
🌟 Extended Capabilities: Native support for multi-image understanding and video processing (see the sketch after the Quick Start example)
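
Since the card does not document how the Ascend backend is selected, the following is only a minimal device-selection sketch. It assumes Ascend support is provided through the torch_npu extension and that the model can be moved with the usual .to(device) call; both are assumptions, not documented behavior of this repository.

import torch

# Prefer CUDA; fall back to the Ascend NPU via the torch_npu extension
# (an assumption -- the card does not document the NPU setup), else CPU.
if torch.cuda.is_available():
    device = 'cuda'
else:
    try:
        import torch_npu  # noqa: F401  Ascend PyTorch adapter (assumed)
        device = 'npu'
    except ImportError:
        device = 'cpu'

# The model from Quick Start would then be moved with model.to(device).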

Getting Started

Installation

# Option 1: install directly from GitHub
pip install git+https://github.com/Tencent/HaploVLM.git

# Option 2: clone the repository and install in editable mode
git clone https://github.com/Tencent/HaploVLM.git
cd HaploVLM
pip install -e . -v
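
A quick sanity check after either option (a minimal sketch; it only confirms that the haplo package and the two classes used in Quick Start import cleanly):

# Confirm the installation by importing the classes used below.
from haplo import HaploProcessor, HaploForConditionalGeneration
print(HaploProcessor.__name__, HaploForConditionalGeneration.__name__)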

Quick Start

Basic usage example:

import torch
from haplo import HaploProcessor, HaploForConditionalGeneration

# Load the processor and the model weights, casting to bfloat16 on GPU.
processor = HaploProcessor.from_pretrained('stevengrove/Haplo-7B-Pro')
model = HaploForConditionalGeneration.from_pretrained(
    'stevengrove/Haplo-7B-Pro',
    torch_dtype=torch.bfloat16
).to('cuda')

conversation = [
    {'role': 'user', 'content': [
        {'type': 'text', 'text': 'Describe this image.'},
        {'type': 'image', 'path': 'assets/example-image.png'}
    ]}
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors='pt'
).to('cuda')

outputs = model.generate(inputs)
print(processor.decode(outputs[0]))
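
The highlights above advertise multi-image and video inputs (see "Extended Capabilities"). The sketch below extrapolates the single-image chat schema to those cases; multiple 'image' entries per turn and a 'video' content type with a 'path' field are assumptions, not documented API, and reuse the processor and model loaded above.

# Hypothetical multi-image turn: assumes several 'image' entries are
# allowed in a single user message (not confirmed by the card).
conversation = [
    {'role': 'user', 'content': [
        {'type': 'text', 'text': 'What differs between these two images?'},
        {'type': 'image', 'path': 'assets/example-image-1.png'},
        {'type': 'image', 'path': 'assets/example-image-2.png'}
    ]}
]

# Hypothetical video turn: assumes a 'video' content type mirroring the
# image entry (also not confirmed by the card).
# conversation = [
#     {'role': 'user', 'content': [
#         {'type': 'text', 'text': 'Summarize this clip.'},
#         {'type': 'video', 'path': 'assets/example-video.mp4'}
#     ]}
# ]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors='pt'
).to('cuda')
outputs = model.generate(inputs)
print(processor.decode(outputs[0]))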

Citation

If you find HaploVL useful, please cite:

@article{yang2024haplo,
  title={HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding},
  author={Yang, Rui and Song, Lin and Xiao, Yicheng and Huang, Runhui and Ge, Yixiao and Shan, Ying and Zhao, Hengshuang},
  journal={arXiv preprint arXiv:xxxx.xxxxx},
  year={2025}
}