Video Swin Transformer: VideoSwin



VideoSwin is a pure-transformer video modeling algorithm that attained top accuracy on the major video recognition benchmarks. The authors advocate an inductive bias of locality in video transformers, which leads to a better speed-accuracy trade-off than previous approaches that compute self-attention globally, even with spatial-temporal factorization. The locality of the proposed video architecture is realized by adapting the Swin Transformer designed for the image domain, while continuing to leverage the power of pre-trained image models.
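To make the locality idea concrete, here is a minimal NumPy sketch of splitting a video token grid into non-overlapping 3D windows, so that self-attention can be computed inside each window instead of across all tokens. The function name `window_partition_3d`, the channels-last layout, and the example shapes are assumptions for illustration; they are not taken from this repository's code.

```python
import numpy as np


def window_partition_3d(x, window_size):
    """Split tokens of shape (B, D, H, W, C) into non-overlapping 3D windows.

    Returns an array of shape (B * num_windows, wd * wh * ww, C).
    Assumes D, H and W are divisible by the window size.
    """
    b, d, h, w, c = x.shape
    wd, wh, ww = window_size
    # Expose the window grid and the within-window offsets as separate axes.
    x = x.reshape(b, d // wd, wd, h // wh, wh, w // ww, ww, c)
    # Bring the three window-grid axes together, then the three offset axes.
    x = x.transpose(0, 1, 3, 5, 2, 4, 6, 7)
    # Flatten: one row per window, one entry per token inside that window.
    return x.reshape(-1, wd * wh * ww, c)


# Hypothetical token grid: 1 clip, 8 x 14 x 14 tokens, 96 channels.
tokens = np.zeros((1, 8, 14, 14, 96), dtype=np.float32)
windows = window_partition_3d(tokens, (8, 7, 7))
print(windows.shape)  # (4, 392, 96)
```

With these example shapes, global attention would compare all 1568 tokens with each other, while windowed attention compares only the 392 tokens inside each of the 4 windows.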

This is an unofficial Keras implementation of the Video Swin Transformer. The official PyTorch implementation is based on mmaction2.

Model Zoo

The 3D swin-video checkpoints are listed in MODEL_ZOO.md. Some highlights follow.

Kinetics 400

In the training phase, the Video Swin models are initialized with the pretrained weights of image Swin models. In the Pretrain column, IN refers to ImageNet.
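One common way to reuse image weights in a video model is to "inflate" a 2D patch-embedding kernel along the temporal axis: copy it once per temporal step and rescale so the summed response is unchanged. The sketch below illustrates that idea under stated assumptions. The helper `inflate_patch_embed`, the channels-last kernel layout, and the kernel sizes are hypothetical, and this document does not say how this repository performs the initialization.

```python
import numpy as np


def inflate_patch_embed(kernel_2d, temporal_size=2):
    """Inflate a 2D kernel (kh, kw, cin, cout) to 3D (kt, kh, kw, cin, cout).

    The 2D kernel is repeated along a new temporal axis and divided by the
    number of repeats, so a clip made of identical frames yields the same
    embedding that the 2D kernel gives for a single frame.
    """
    kernel_3d = np.repeat(kernel_2d[np.newaxis, ...], temporal_size, axis=0)
    return kernel_3d / temporal_size


# Hypothetical 4x4 patch embedding: 3 input channels, 96 output channels.
rng = np.random.default_rng(0)
kernel_2d = rng.normal(size=(4, 4, 3, 96)).astype(np.float32)
kernel_3d = inflate_patch_embed(kernel_2d, temporal_size=2)
print(kernel_3d.shape)  # (2, 4, 4, 3, 96)
```

The checkpoints listed here are already trained on video data, so this step matters only when training from image weights, not when loading the released models.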

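| Backbone | Pretrain | Top-1 | Top-5 | #params | FLOPs | config |
| --- | --- | --- | --- | --- | --- | --- |
| Swin-T | IN-1K | 78.8 | 93.6 | 28M | ? | swin-t |
| Swin-S | IN-1K | 80.6 | 94.5 | 50M | ? | swin-s |
| Swin-B | IN-1K | 80.6 | 94.6 | 88M | ? | swin-b |
| Swin-B | IN-22K | 82.7 | 95.5 | 88M | ? | swin-b |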

Kinetics 600

| Backbone | Pretrain | Top-1 | Top-5 | #params | FLOPs | config |
| --- | --- | --- | --- | --- | --- | --- |
| Swin-B | IN-22K | 84.0 | 96.5 | 88M | ? | swin-b |

Something-Something V2

| Backbone | Pretrain | Top-1 | Top-5 | #params | FLOPs | config |
| --- | --- | --- | --- | --- | --- | --- |
| Swin-B | Kinetics 400 | 69.6 | 92.7 | 89M | ? | swin-b |