---
language: en
tags:
- tvp
license: other
datasets:
- charades
---

# TVP base model

The TVP model was proposed in [Text-Visual Prompting for Efficient 2D Temporal Video Grounding](https://arxiv.org/abs/2209.14156) by Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, and Ke Ding. The model incorporates trainable prompts into both the visual inputs and the textual features of a temporal video grounding (TVG) model. It was introduced in [this paper](https://arxiv.org/pdf/2303.04995.pdf).

TVP was accepted to the [CVPR'23](https://cvpr2023.thecvf.com/) conference.
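
As a rough, hypothetical sketch of the prompting idea described above (not the actual TVP implementation; the class name, shapes, and default sizes below are made up for illustration), a visual prompt can be viewed as a trainable perturbation applied around the border of each 2D frame, and a text prompt as trainable embeddings prepended to the token embeddings:

```python
import torch
import torch.nn as nn


class ToyTextVisualPrompt(nn.Module):
    """Toy sketch of TVP-style prompts: a trainable border perturbation for 2D frames
    plus trainable embeddings prepended to the text tokens. Shapes are illustrative."""

    def __init__(self, frame_size=448, pad_size=96, num_text_prompts=10, text_dim=768):
        super().__init__()
        # Trainable "pad" prompt, applied only around the border of each frame.
        self.visual_prompt = nn.Parameter(torch.zeros(3, frame_size, frame_size))
        border = torch.zeros(1, frame_size, frame_size)
        border[:, :pad_size, :] = 1
        border[:, -pad_size:, :] = 1
        border[:, :, :pad_size] = 1
        border[:, :, -pad_size:] = 1
        self.register_buffer("border_mask", border)
        # Trainable text prompt embeddings prepended to the token embeddings.
        self.text_prompt = nn.Parameter(torch.zeros(num_text_prompts, text_dim))

    def forward(self, frames, text_embeds):
        # frames: (num_frames, 3, H, W); text_embeds: (seq_len, text_dim)
        prompted_frames = frames + self.visual_prompt * self.border_mask
        prompted_text = torch.cat([self.text_prompt, text_embeds], dim=0)
        return prompted_frames, prompted_text
```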

## Model description

The abstract from the paper is the following:

In this paper, we study the problem of temporal video grounding (TVG), which aims to predict the starting/ending time points of moments described by a text sentence within a long untrimmed video. Benefiting from fine-grained 3D visual features, the TVG techniques have achieved remarkable progress in recent years. However, the high complexity of 3D convolutional neural networks (CNNs) makes extracting dense 3D visual features time-consuming, which calls for intensive memory and computing resources. Towards efficient TVG, we propose a novel text-visual prompting (TVP) framework, which incorporates optimized perturbation patterns (that we call ‘prompts’) into both visual inputs and textual features of a TVG model. In sharp contrast to 3D CNNs, we show that TVP allows us to effectively co-train vision encoder and language encoder in a 2D TVG model and improves the performance of cross-modal feature fusion using only low-complexity sparse 2D visual features. Further, we propose a Temporal-Distance IoU (TDIoU) loss for efficient learning of TVG. Experiments on two benchmark datasets, Charades-STA and ActivityNet Captions datasets, empirically show that the proposed TVP significantly boosts the performance of 2D TVG (e.g., 9.79% improvement on Charades-STA and 30.77% improvement on ActivityNet Captions) and achieves 5× inference acceleration over TVG using 3D visual features.
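
The exact TDIoU formulation is given in the paper. Purely as a simplified, hypothetical illustration, a plain temporal IoU between predicted and ground-truth `(start, end)` pairs (assumed here to be normalized to [0, 1]) could be computed like this; the paper's TDIoU additionally accounts for the temporal distance between predicted and ground-truth moments:

```python
import torch


def temporal_iou(pred, target):
    """Plain temporal IoU between predicted and ground-truth (start, end) pairs.

    pred, target: tensors of shape (batch, 2) holding normalized start/end times.
    This is a simplified illustration, not the TDIoU loss defined in the paper.
    """
    inter_start = torch.maximum(pred[:, 0], target[:, 0])
    inter_end = torch.minimum(pred[:, 1], target[:, 1])
    intersection = (inter_end - inter_start).clamp(min=0)
    union = (pred[:, 1] - pred[:, 0]) + (target[:, 1] - target[:, 0]) - intersection
    return intersection / union.clamp(min=1e-6)


# A (1 - IoU)-style term of this kind is one typical ingredient of a grounding objective.
```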

## Intended uses & limitations (TODO)

You can use the raw model for temporal video grounding.

### How to use

Here is how to use this model to get the logits for a given video and text in PyTorch:
```python
import av
import numpy as np
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoProcessor, AutoModel


def pyav_decode(container, sampling_rate, num_frames, clip_idx, num_clips, target_fps):
    """
    Convert the video from its original fps to the target_fps and decode the video with the PyAV decoder.
    Returns:
        frames (list): decoded frames from the video. Returns None if no video stream was found.
        fps (float): the number of frames per second of the video.
    """
    fps = float(container.streams.video[0].average_rate)
    clip_size = sampling_rate * num_frames / target_fps * fps
    delta = max(container.streams.video[0].frames - clip_size, 0)
    start_idx = delta * clip_idx / num_clips
    end_idx = start_idx + clip_size - 1
    timebase = container.streams.video[0].duration / container.streams.video[0].frames
    video_start_pts = int(start_idx * timebase)
    video_end_pts = int(end_idx * timebase)
    stream_name = {"video": 0}
    seek_offset = max(video_start_pts - 1024, 0)
    container.seek(seek_offset, any_frame=False, backward=True, stream=container.streams.video[0])
    frames = {}
    for frame in container.decode(**stream_name):
        if frame.pts < video_start_pts:
            continue
        frames[frame.pts] = frame
        if frame.pts > video_end_pts:
            # Keep one frame past the end of the clip, then stop decoding.
            break
    frames = [frames[pts] for pts in sorted(frames)]

    return frames, fps


def decode(container, sampling_rate, num_frames, clip_idx, num_clips, target_fps):
    """
    Decode the video and perform temporal sampling.
    Args:
        container (container): pyav container.
        sampling_rate (int): frame sampling rate (interval between two sampled frames).
        num_frames (int): number of frames to sample.
        clip_idx (int): if clip_idx is -1, perform random temporal sampling.
            If clip_idx is larger than -1, uniformly split the video into num_clips
            clips and select the clip_idx-th video clip.
        num_clips (int): overall number of clips to uniformly sample from the given video.
        target_fps (int): the input video may have a different fps, convert it to
            the target fps before frame sampling.
    Returns:
        frames (tensor): decoded frames from the video.
    """
    assert clip_idx >= -2, "Not valid clip_idx {}".format(clip_idx)
    frames, fps = pyav_decode(container, sampling_rate, num_frames, clip_idx, num_clips, target_fps)
    clip_size = sampling_rate * num_frames / target_fps * fps
    # Uniformly sample num_frames indices over the clip and clamp them to the decoded range.
    index = torch.linspace(0, clip_size - 1, num_frames)
    index = torch.clamp(index, 0, len(frames) - 1).long().tolist()
    frames = [frames[idx] for idx in index]
    frames = [frame.to_rgb().to_ndarray() for frame in frames]
    frames = torch.from_numpy(np.stack(frames))

    return frames


file = hf_hub_download(repo_id="Intel/tvp_demo", filename="3MSZA.mp4", repo_type="dataset")
decoder_kwargs = dict(
    container=av.open(file, metadata_errors="ignore"),
    sampling_rate=1,
    num_frames=48,
    clip_idx=0,
    num_clips=1,
    target_fps=3,
)
raw_sampled_frms = decode(**decoder_kwargs)
# Reorder from (num_frames, height, width, channels) to (num_frames, channels, height, width).
raw_sampled_frms = raw_sampled_frms.permute(0, 3, 1, 2)

processor = AutoProcessor.from_pretrained("Intel/tvp-base")
data = processor(
    text=["person turn a light on."], videos=list(raw_sampled_frms.numpy()), return_tensors="pt", max_text_length=100
)
model = AutoModel.from_pretrained("Intel/tvp-base")
output = model(**data)

print(output)
```
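
What `print(output)` shows depends on the head the checkpoint is loaded with (hidden states for the bare model, grounding logits for a video-grounding head). Purely as a hypothetical sketch of post-processing, assuming the grounding head predicts a normalized `(start, end)` pair in [0, 1] (the fractions below are placeholders, not real model output), the prediction could be mapped back to seconds like this:

```python
import av

# Hypothetical post-processing sketch; verify the actual output layout for this checkpoint.
with av.open(file, metadata_errors="ignore") as container:
    stream = container.streams.video[0]
    duration_s = float(stream.duration * stream.time_base)  # video length in seconds

start_frac, end_frac = 0.25, 0.60  # placeholder predictions
print(f"predicted span: {start_frac * duration_s:.2f}s - {end_frac * duration_s:.2f}s")
```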

### Limitations and bias

TODO

## Training data

The TVP model was pretrained on the following public dataset:
- [Charades](https://prior.allenai.org/projects/charades)

## Training procedure

### Preprocessing

TODO

### Pretraining

TODO

## Evaluation results

Please refer to [Table 2](https://arxiv.org/pdf/2303.04995.pdf) for TVP's performance on the temporal video grounding task.