---
license: apache-2.0
---
![HaploVL logo](assets/logo.jpeg)

<div align="center">

# HaploVL - A Single-Transformer Baseline for Multi-Modal Understanding

[![Project page](https://img.shields.io/badge/Project_page-green)](https://haplo-vl.github.io/)

</div>

HaploVL is a multimodal foundation model that delivers unified cross-modal understanding of text, image, and video inputs through a single transformer architecture.

## Highlights
This repository contains the PyTorch implementation, model weights, and training code for **Haplo**.

![HaploVL framework overview](assets/framework.png)

🌟 **Unified Architecture**: Single transformer model supporting early fusion of multi-modal inputs and auto-regressive response generation  
🌟 **Efficient Training**: Optimized training recipe leveraging pre-trained knowledge with reduced resource consumption  
🌟 **Scalable Design**: Flexible framework supporting both Ascend NPU and GPU environments  
🌟 **Extended Capabilities**: Native support for multi-image understanding and video processing

## Getting Started

### Installation

```bash
# Option 1: install directly from GitHub
pip install git+https://github.com/Tencent/HaploVLM.git

# Option 2: clone and install in editable mode
git clone https://github.com/Tencent/HaploVLM.git
cd HaploVLM
pip install -e . -v
```
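
To verify the install, check that the package's main entry points import cleanly (the class names below come from the Quick Start example that follows):

```python
# Quick sanity check that the haplo package is importable.
from haplo import HaploProcessor, HaploForConditionalGeneration
print('haplo import OK')
```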

### Quick Start
Basic usage example:
```python
import torch

from haplo import HaploProcessor, HaploForConditionalGeneration

# Load the processor and model weights (bfloat16 halves memory vs. fp32).
processor = HaploProcessor.from_pretrained('stevengrove/Haplo-7B-Pro-Video')
model = HaploForConditionalGeneration.from_pretrained(
    'stevengrove/Haplo-7B-Pro-Video',
    torch_dtype=torch.bfloat16
).to('cuda')

# A chat-style conversation mixing text and an image in a single user turn.
conversation = [
    {'role': 'user', 'content': [
        {'type': 'text', 'text': 'Describe this image.'},
        {'type': 'image', 'path': 'assets/example-image.png'}
    ]}
]

# Render the conversation with the model's chat template and tokenize it.
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors='pt'
).to('cuda')

outputs = model.generate(inputs)
print(processor.decode(outputs[0]))
```
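
Since the checkpoint above is the Pro-Video variant and the highlights list video processing, a video prompt should work analogously. The sketch below is an unverified assumption: it presumes the chat template accepts a `'video'` content type with a `'path'` key, mirroring the `'image'` entry above, and `assets/example-video.mp4` is a hypothetical path.

```python
# Hedged sketch: video input, assuming a 'video' content type analogous to
# the 'image' type above (not confirmed by this README).
conversation = [
    {'role': 'user', 'content': [
        {'type': 'text', 'text': 'Describe what happens in this video.'},
        {'type': 'video', 'path': 'assets/example-video.mp4'}  # hypothetical path
    ]}
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors='pt'
).to('cuda')

outputs = model.generate(inputs)
print(processor.decode(outputs[0]))
```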

## Citation

```bibtex
@article{yang2025haplo,
  title={HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding},
  author={Yang, Rui and Song, Lin and Xiao, Yicheng and Huang, Runhui and Ge, Yixiao and Shan, Ying and Zhao, Hengshuang},
  journal={arXiv preprint arXiv:xxxx.xxxxx},
  year={2025}
}
```