---
license: apache-2.0
---

<div align="center">

# HaploVL - A Single-Transformer Baseline for Multi-Modal Understanding

[Project Page](https://haplo-vl.github.io/)

</div>
HaploVL is a multimodal understanding foundation model that delivers comprehensive cross-modal understanding capabilities for text, images, and video inputs through a single transformer architecture.
## Highlights
This repository contains the PyTorch implementation, model weights, and training code for **Haplo**.

- **Unified Architecture**: A single transformer supporting early fusion of multi-modal inputs and auto-regressive response generation
- **Efficient Training**: An optimized training recipe that leverages pre-trained knowledge to reduce resource consumption
- **Scalable Design**: A flexible framework that runs on both Ascend NPU and GPU environments
- **Extended Capabilities**: Native support for multi-image understanding and video processing
## Getting Started
### Installation
```bash
# Option 1: install directly from GitHub
pip install git+https://github.com/Tencent/HaploVLM.git

# Option 2: clone the repository and install in editable mode
git clone https://github.com/Tencent/HaploVLM.git
cd HaploVLM
pip install -e . -v
```
### Quick Start
Basic usage example:
```python
import torch

from haplo import HaploProcessor, HaploForConditionalGeneration

# Load the processor and model weights from the Hugging Face Hub.
processor = HaploProcessor.from_pretrained('stevengrove/Haplo-7B-Pro-Video')
model = HaploForConditionalGeneration.from_pretrained(
    'stevengrove/Haplo-7B-Pro-Video',
    torch_dtype=torch.bfloat16,
).to('cuda')

# A single-turn conversation mixing a text prompt with an image.
conversation = [
    {'role': 'user', 'content': [
        {'type': 'text', 'text': 'Describe this image.'},
        {'type': 'image', 'path': 'assets/example-image.png'},
    ]}
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors='pt',
).to('cuda')

outputs = model.generate(inputs)
print(processor.decode(outputs[0]))
```
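Since Haplo supports multi-image understanding, the same chat-template convention extends to several images in one turn. The helper below is a minimal sketch, assuming the `{'type': 'image', 'path': ...}` entry format from the example above; the function name `build_conversation` is our own, not part of the Haplo API.

```python
# Build a user turn containing a text prompt plus several images,
# following the chat-template entry format from the example above.
# (build_conversation is a hypothetical helper, not a Haplo API.)
def build_conversation(prompt, image_paths):
    content = [{'type': 'text', 'text': prompt}]
    content += [{'type': 'image', 'path': p} for p in image_paths]
    return [{'role': 'user', 'content': content}]

conversation = build_conversation(
    'What differs between these two images?',
    ['assets/example-image-1.png', 'assets/example-image-2.png'],
)
# The resulting list is passed to processor.apply_chat_template(...)
# exactly as in the single-image example.
```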
## Citation
If you find HaploVL useful in your research, please consider citing:
```bibtex
@article{yang2024haplo,
  title={HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding},
  author={Yang, Rui and Song, Lin and Xiao, Yicheng and Huang, Runhui and Ge, Yixiao and Shan, Ying and Zhao, Hengshuang},
  journal={arXiv preprint arXiv:xxxx.xxxxx},
  year={2025}
}
```