shivalikasingh committed · Commit f3b63d1 · 1 Parent(s): f98f640

Update README.md

Files changed (1): README.md (+55 −4)

README.md CHANGED
@@ -7,18 +7,69 @@ tags:
 
 ## Model description
 
- More information needed
 
- ## Intended uses & limitations
 
- More information needed
 
 ## Training and evaluation data
 
- More information needed
 
 ## Training procedure
 
 ### Training hyperparameters
 
 The following hyperparameters were used during training:
 
 
 ## Model description
 
+ This model is intended to be used for the task of classifying videos.
+ A video is an ordered sequence of frames. An individual frame of a video carries spatial information, whereas the sequence of frames carries temporal information.
+
+ In order to capture both the spatial and temporal information present within a video, this model is made up of a hybrid architecture consisting of a Transformer encoder operating on top of CNN feature maps.
+ The CNN helps in capturing the spatial information present in the videos. For this purpose, a pretrained CNN (DenseNet121) is used to generate feature maps for the video frames.
+
+ The temporal information corresponding to the ordering of the video frames can't be captured by the self-attention layers of a Transformer alone, as they are order-agnostic by default. This ordering-related information therefore has to be injected into the model with the help of a positional embedding.
+
+ The positional embeddings are added to the pre-computed CNN feature maps, and the result is fed as input to the Transformer encoder.
+
+ The model has nearly 4.23 million parameters. It works best with large datasets and longer training schedules.
+
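Why self-attention alone can't see frame order, and how adding positional embeddings changes that, can be illustrated with a small NumPy sketch. (The actual model learns its positional embeddings as a trainable Keras layer; the fixed sinusoidal embeddings and identity attention projections below are simplifications for illustration only.)

```python
import numpy as np

def sinusoidal_positions(num_frames, dim):
    """Fixed sinusoidal positional embeddings, used here only to
    illustrate the idea (the model itself learns its embeddings)."""
    pos = np.arange(num_frames)[:, None]            # (frames, 1)
    i = np.arange(dim)[None, :]                     # (1, dim)
    angles = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def self_attention(x):
    """Single-head self-attention with identity projections — enough
    to show that attention by itself is order-agnostic."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

# Toy "CNN feature maps": 8 frames, 16 features each.
rng = np.random.default_rng(0)
feats = rng.normal(size=(8, 16))
perm = rng.permutation(8)

plain = self_attention(feats)
# Without positions, shuffling the frames just shuffles the output rows:
# attention is permutation-equivariant, so frame order carries no signal.
assert np.allclose(self_attention(feats[perm]), plain[perm])

with_pos = self_attention(feats + sinusoidal_positions(8, 16))
shuffled = self_attention(feats[perm] + sinusoidal_positions(8, 16))
# With positional embeddings added, frame order changes the result.
assert not np.allclose(shuffled, with_pos[perm])
```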
+ ## Intended uses
+
+ The model can be used to classify videos belonging to different categories. Currently, the model recognises the following 5 classes:
+
+ | Classes |
+ | :-- |
+ | CricketShot |
+ | PlayingCello |
+ | Punch |
+ | ShavingBeard |
+ | TennisSwing |
 
 
 ## Training and evaluation data
 
+ The dataset used for training the model is a subsampled version of the [UCF101](https://www.crcv.ucf.edu/data/UCF101.php) dataset.
+ UCF101 is an action recognition dataset of realistic action videos collected from YouTube.
+ The original UCF101 dataset has videos of 101 categories; however, the model was trained on a smaller subset of the original dataset consisting of only 5 classes.
+
+ 594 videos were used for training and 224 videos were used for testing.
 
 ## Training procedure
 
+ 1. Data Preparation:
+
+ - The image size was kept at 128x128 to aid faster computation.
+ - Since a video is an ordered sequence of frames, the frames were extracted and stacked into a 3D tensor. Because the number of frames differed from video to video, shorter videos were padded so that all videos had the same frame count. This made it easy to stack the frames into batches.
+ - A pre-trained DenseNet121 model was then used to extract useful features from the extracted video frames.
+
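The padding step above can be sketched as follows. Note that the frame cap (`MAX_FRAMES = 20`) and zero-padding strategy are illustrative assumptions; the card does not state the exact values used in training.

```python
import numpy as np

MAX_FRAMES = 20   # illustrative cap; the actual training setup may differ
IMG_SIZE = 128    # image size stated in the card

def pad_frames(frames):
    """Truncate or zero-pad a (num_frames, H, W, C) clip so that every
    clip in a batch has exactly MAX_FRAMES frames."""
    frames = frames[:MAX_FRAMES]
    pad = MAX_FRAMES - frames.shape[0]
    if pad > 0:
        zeros = np.zeros((pad, *frames.shape[1:]), dtype=frames.dtype)
        frames = np.concatenate([frames, zeros])
    return frames

# A 13-frame clip is padded up to the common frame count.
clip = np.ones((13, IMG_SIZE, IMG_SIZE, 3), dtype=np.float32)
padded = pad_frames(clip)
assert padded.shape == (MAX_FRAMES, IMG_SIZE, IMG_SIZE, 3)
assert padded[13:].sum() == 0.0   # the appended frames are all zeros
```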
+ 2. Building the Transformer-based Model:
+
+ - A positional embedding layer is defined to take the CNN feature maps generated by the DenseNet model and add the ordering (positional) information about the video frames to them.
+ - The Transformer encoder is defined to process the CNN feature maps along with the positional embeddings.
+ - GlobalMaxPooling and Dropout layers, along with a classifier head, are attached to the Transformer encoder to build the final model.
+
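The forward pass described above can be sketched end-to-end in NumPy. All weights and sizes here are random stand-ins (the real model uses trained Keras layers, and Dropout is omitted since it is inactive at inference); the sketch only shows how the shapes flow from per-frame features to 5 class probabilities.

```python
import numpy as np

rng = np.random.default_rng(42)
NUM_FRAMES, FEAT_DIM, NUM_CLASSES = 20, 1024, 5   # illustrative sizes

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# 1) Per-frame CNN feature maps for one clip, plus (random, stand-in)
#    positional embeddings added to inject frame order.
feats = rng.normal(size=(NUM_FRAMES, FEAT_DIM))
pos_emb = rng.normal(size=(NUM_FRAMES, FEAT_DIM)) * 0.02
x = feats + pos_emb

# 2) One self-attention block (identity projections for brevity).
attn = softmax(x @ x.T / np.sqrt(FEAT_DIM)) @ x   # (frames, dim)

# 3) Global max pooling over the frame axis, then a classifier head.
pooled = attn.max(axis=0)                         # (dim,)
W = rng.normal(size=(FEAT_DIM, NUM_CLASSES)) * 0.01
probs = softmax(pooled @ W)                       # (5,) class probabilities

assert probs.shape == (NUM_CLASSES,)
assert np.isclose(probs.sum(), 1.0)
```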
+ 3. Model Training:
+
+ The model is then trained using the following config:
+
+ | Training Config | Value |
+ | :-- | :-- |
+ | Optimizer | Adam |
+ | Loss Function | sparse_categorical_crossentropy |
+ | Metric | Accuracy |
+ | Epochs | 5 |
+
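For reference, the `sparse_categorical_crossentropy` loss named in the table reduces to the mean negative log-probability assigned to the true (integer-encoded) class. A minimal pure-Python version, with made-up probabilities over the 5 classes:

```python
import math

def sparse_categorical_crossentropy(y_true, y_pred):
    """Mean negative log-probability of the true class index,
    matching the loss named in the training config."""
    return -sum(math.log(p[t]) for t, p in zip(y_true, y_pred)) / len(y_true)

# Two samples over the 5 classes; the true classes are 0 and 3.
probs = [[0.7, 0.1, 0.1, 0.05, 0.05],
         [0.1, 0.1, 0.1, 0.6, 0.1]]
loss = sparse_categorical_crossentropy([0, 3], probs)
assert abs(loss - (-(math.log(0.7) + math.log(0.6)) / 2)) < 1e-12
```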
68
+ 4. Model Testing:
69
+
70
+ The model is tested on the test data post training achieving an accuracy of ~90%.
71
+
72
+
73
  ### Training hyperparameters
74
 
75
  The following hyperparameters were used during training: