---
title: Video Vision Transformer on medmnist
emoji: π§ββοΈ
colorFrom: red
colorTo: green
sdk: gradio
app_file: app.py
pinned: false
license: apache-2.0
library_name: keras
---

## Keras Implementation of Video Vision Transformer on MedMNIST

This repo contains the model for the [Keras example on Video Vision Transformer](https://keras.io/examples/vision/vivit/).

## Background Information

This example implements [ViViT: A Video Vision Transformer](https://arxiv.org/abs/2103.15691) by Arnab et al., a pure Transformer-based model for video classification. The authors propose a novel embedding scheme and a number of Transformer variants to model video clips.

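## Datasets

We use the OrganMNIST3D subset (`organmnist3d`) of the [MedMNIST v2: A Large-Scale Lightweight Benchmark for 2D and 3D Biomedical Image Classification](https://medmnist.com/) dataset. Each sample is a 28 × 28 × 28 single-channel volume belonging to one of 11 classes.
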
## Training Parameters

```
import tensorflow as tf  # required for tf.data.AUTOTUNE below

# DATA
DATASET_NAME = "organmnist3d"
BATCH_SIZE = 32
AUTO = tf.data.AUTOTUNE
INPUT_SHAPE = (28, 28, 28, 1)
NUM_CLASSES = 11

# OPTIMIZER
LEARNING_RATE = 1e-4
WEIGHT_DECAY = 1e-5

# TRAINING
EPOCHS = 80

# TUBELET EMBEDDING
PATCH_SIZE = (8, 8, 8)
NUM_PATCHES = (INPUT_SHAPE[0] // PATCH_SIZE[0]) ** 2

# ViViT ARCHITECTURE
LAYER_NORM_EPS = 1e-6
PROJECTION_DIM = 128
NUM_HEADS = 8
NUM_LAYERS = 8
```
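
To make the tubelet settings more concrete, here is a minimal sketch of the sizes they imply. It assumes non-overlapping tubelets (stride equal to `PATCH_SIZE`, no padding) and an attention width of `PROJECTION_DIM // NUM_HEADS` per head. Neither assumption is stated in this README, and `tubelet_grid` is a hypothetical helper written for illustration only.

```python
# Values copied from the training parameters above.
INPUT_SHAPE = (28, 28, 28, 1)  # three volume axes plus one channel
PATCH_SIZE = (8, 8, 8)
PROJECTION_DIM = 128
NUM_HEADS = 8


def tubelet_grid(input_shape, patch_size):
    """Tubelets per axis, assuming non-overlapping patches and no padding.

    Hypothetical helper: voxels that do not fill a whole tubelet are dropped.
    zip() stops after the three volume axes, so the channel axis is ignored.
    """
    return tuple(dim // patch for dim, patch in zip(input_shape, patch_size))


grid = tubelet_grid(INPUT_SHAPE, PATCH_SIZE)
num_tokens = grid[0] * grid[1] * grid[2]

# Voxels along each axis that are not covered by any tubelet.
leftover = tuple(d - g * p for d, g, p in zip(INPUT_SHAPE, grid, PATCH_SIZE))

# Assumed per-head width if the projection is split evenly across heads.
head_dim = PROJECTION_DIM // NUM_HEADS

print(grid)        # (3, 3, 3)
print(num_tokens)  # 27
print(leftover)    # (4, 4, 4)
print(head_dim)    # 16
```

Under these assumptions each volume becomes a sequence of 27 tokens, while the `NUM_PATCHES` expression above evaluates to 9 because it squares the per-axis count instead of cubing it. Whether the model actually reads `NUM_PATCHES` is not stated here, so treat the difference as something to check against the training code.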