Edit model card

You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Log in or Sign Up to review the conditions and access this model content.

VideoMAE

image/jpeg

Paper Colab HF Space HF Hub
arXiv Open In Colab HugginFace badge HugginFace badge

Video masked autoencoders (VideoMAE) are seen as data-efficient learners for self-supervised video pre-training (SSVP). Inspiration was drawn from the recent ImageMAE, and customized video tube masking with an extremely high ratio was proposed. Due to this simple design, video reconstruction is made a more challenging self-supervision task, leading to the extraction of more effective video representations during this pre-training process. Some hightlights of VideoMAE:

  • Masked Video Modeling for Video Pre-Training
  • A Simple, Efficient and Strong Baseline in SSVP
  • High performance, but NO extra data required

This is a unofficial Keras reimplementation of VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training model. The official PyTorch implementation can be found here.

Model Zoo

The pre-trained and fine-tuned models are listed in MODEL_ZOO.md. Following are some hightlights.

Kinetics-400

For Kinetrics-400, VideoMAE is trained around 1600 epoch without any extra data. The following checkpoints are available in both tensorflow SavedModel and h5 format.

Backbone #Frame Top-1 Top-5 Params [FT] MB Params [PT] MB) FLOPs
ViT-S 16x5x3 79.0 93.8 22 24 57G
ViT-B 16x5x3 81.5 95.1 87 94 181G
ViT-L 16x5x3 85.2 96.8 304 343 -
ViT-H 16x5x3 86.6 97.1 632 ? -

?* Official ViT-H backbone of VideoMAE has weight issue in pretrained model, details https://github.com/MCG-NJU/VideoMAE/issues/89. The FLOPs of encoder models (FT) are reported only.

Something-Something V2

For SSv2, VideoMAE is trained around 2400 epoch without any extra data.

Backbone #Frame Top-1 Top-5 Params [FT] MB Params [PT] MB FLOPs
ViT-S 16x2x3 66.8 90.3 22 24 57G
ViT-B 16x2x3 70.8 92.4 86 94 181G

UCF101

For UCF101, VideoMAE is trained around 3200 epoch without any extra data.

Backbone #Frame Top-1 Top-5 Params [FT] MB Params [PT] MB FLOPS
ViT-B 16x5x3 91.3 98.5 86 94 181G
Downloads last month
0
Inference API
Inference API (serverless) does not yet support tf-keras models for this pipeline type.