---
license: bsd-3-clause
pipeline_tag: video-text-to-text
---

# E.T. Chat

[arXiv](https://arxiv.org/abs/2409.18111) | [Project Page](https://polyu-chenlab.github.io/etbench) | [GitHub](https://github.com/PolyU-ChenLab/ETBench)

E.T. Chat is a novel time-sensitive Video-LLM that reformulates timestamp prediction as an embedding matching problem, serving as a strong baseline on E.T. Bench. It consists of a visual encoder, a frame compressor, and an LLM. A special token \<vid\> is introduced to trigger frame embedding matching for timestamp prediction.
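
The embedding matching idea can be illustrated with a minimal sketch. This is not the released E.T. Chat implementation: the function names (`cosine`, `match_timestamp`), the use of cosine similarity, the toy 2-D embeddings, and the `fps` parameter are all assumptions made for illustration. The sketch compares an embedding associated with the \<vid\> token against one embedding per frame and reports the time of the best-matching frame.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_timestamp(vid_embedding, frame_embeddings, fps=1.0):
    """Hypothetical helper: pick the frame most similar to the <vid> embedding.

    Returns the matched time in seconds (frame index / fps) and all scores.
    The similarity measure and the index-to-time mapping are assumptions.
    """
    scores = [cosine(vid_embedding, frame) for frame in frame_embeddings]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index / fps, scores


# Toy data: three frame embeddings sampled at 2 frames per second.
frames = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
timestamp, scores = match_timestamp([0.1, 0.9], frames, fps=2.0)
print(timestamp)
```

In this toy case the second frame is the closest match, so the predicted timestamp is its index divided by the sampling rate. Please refer to the GitHub repository for how the model actually performs matching.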

## 🔖 Model Details

### Model Description

- **Developed by:** Ye Liu
- **Model type:** Multi-modal Large Language Model
- **Language(s):** English
- **License:** BSD-3-Clause

### Training Data

The stage-1 checkpoint of E.T. Chat was trained on the [WebVid](https://maxbain.com/webvid-dataset/) and [LCS-558K](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) datasets.

### More Details

Please refer to our [GitHub Repository](https://github.com/PolyU-ChenLab/ETBench) for more details about this model.

## 📖 Citation

Please cite our paper if you find this project helpful.

```bibtex
@inproceedings{liu2024etbench,
  title={E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding},
  author={Liu, Ye and Ma, Zongyang and Qi, Zhongang and Wu, Yang and Chen, Chang Wen and Shan, Ying},
  booktitle={Neural Information Processing Systems (NeurIPS)},
  year={2024}
}
```