shivalikasingh committed
Commit dadfe9f · 1 Parent(s): 69ac7ef

Update README.md

Files changed (1):
  1. README.md +18 -12
README.md CHANGED
@@ -7,7 +7,7 @@ tags:

## Model description

- ShiftViT is a variation of the Vision Transformer (ViT) where the attention operation has been swapped with a shifting operation.
+ ShiftViT is a variation of the Vision Transformer (ViT) where the attention operation has been replaced with a shifting operation.

The ShiftViT model was proposed as part of the paper [When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism](https://arxiv.org/abs/2201.10801).
Vision Transformers have lately become very popular for computer vision problems, and many researchers attribute their success to the attention layers. The authors of the ShiftViT paper show, via the ShiftViT model, that ViTs can reach state-of-the-art (SoTA) results even without the attention operation.
@@ -16,38 +16,44 @@ Vision Transformers have lately become very popular for computer vision problems

The architecture for ShiftViT is inspired by the paper [Swin Transformer: Hierarchical Vision Transformer using Shifted Windows](https://arxiv.org/abs/2103.14030).

+ Here the authors propose a modular architecture with 4 stages. Each stage works on its own spatial size, creating a hierarchical architecture.
+
| ![ShiftViT Architecture](https://i.imgur.com/CHU40HX.png) |
| :--: |
| Figure 1: The entire architecture of ShiftViT. [Source](https://arxiv.org/abs/2201.10801) |

- Here the authors propose a modular architecture with 4 stages. Each stage works on its own spatial size, creating a hierarchical architecture.
-
-
Each stage in the ShiftViT architecture comprises a Shift Block, as shown in Fig. 2.

| ![ShiftViT block](https://i.imgur.com/IDe35vo.gif) |
| :--: |
| Figure 2: From the Model to a Shift Block. |

- The Shift Block as shown in Fig. 3, comprises of the following:
+ **The Shift Block, as shown in Fig. 2, comprises the following:**

- Shift Operation
- Linear Normalization
- MLP Layer: stack of densely-connected layers
+ - Shift Operation
+ - Linear Normalization
+ - MLP Layer: a stack of densely-connected layers

+ **How the shift operation works:**
1. Split the channels.
2. Select each of the first four splits, and shift and pad them in their respective directions.
3. After shifting and padding, we concatenate the channels back.
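
To make these three steps concrete, here is a minimal TensorFlow sketch of the shift operation (parameter names such as `shift_fraction` are illustrative assumptions based on the paper's description, not the authors' exact code):

```python
import tensorflow as tf

def shift_channels(x, shift_pixels=1, shift_fraction=0.5):
    """Shift a fraction of the channels by `shift_pixels` in each of the
    four directions (left, right, up, down); remaining channels are untouched.

    x: feature map of shape (batch, height, width, channels).
    """
    channels = x.shape[-1]
    # Step 1: split the channels. Each of the four shifted groups gets an
    # equal share of the shifted fraction; the rest form a fifth group.
    group = int(channels * shift_fraction) // 4
    left, right, up, down, rest = tf.split(
        x, [group, group, group, group, channels - 4 * group], axis=-1
    )

    # Step 2: shift each group by cropping `shift_pixels` off one side and
    # zero-padding the opposite side, keeping the spatial size unchanged.
    p = shift_pixels
    left = tf.pad(left[:, :, p:, :], [[0, 0], [0, 0], [0, p], [0, 0]])
    right = tf.pad(right[:, :, :-p, :], [[0, 0], [0, 0], [p, 0], [0, 0]])
    up = tf.pad(up[:, p:, :, :], [[0, 0], [0, p], [0, 0], [0, 0]])
    down = tf.pad(down[:, :-p, :, :], [[0, 0], [p, 0], [0, 0], [0, 0]])

    # Step 3: concatenate the channel groups back together.
    return tf.concat([left, right, up, down, rest], axis=-1)
```

Note that the operation has no learnable parameters, which is what makes it such a cheap replacement for attention.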

| ![Manim rendered animation for shift operation](https://i.imgur.com/PReeULP.gif) |
| :--: |
- | Figure 4: The TensorFlow style shifting |
+ | Figure 3: The TensorFlow-style shifting |
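
Putting the three components together, a single shift block might look like the following Keras sketch (it reuses the hypothetical `shift_channels` helper above; the `mlp_expansion` factor is an illustrative assumption, not the paper's exact configuration):

```python
import tensorflow as tf
from tensorflow.keras import layers

class ShiftBlock(layers.Layer):
    """A ShiftViT-style block: shift -> LayerNorm -> MLP, with a residual
    connection. A sketch, not the reference implementation."""

    def __init__(self, channels, mlp_expansion=2, **kwargs):
        super().__init__(**kwargs)
        self.norm = layers.LayerNormalization(epsilon=1e-5)
        self.mlp = tf.keras.Sequential([
            layers.Dense(channels * mlp_expansion, activation="gelu"),
            layers.Dense(channels),
        ])

    def call(self, x):
        # The zero-parameter shift replaces the attention operation.
        x = shift_channels(x)
        # Pre-norm MLP branch with a residual connection, as in ViT blocks.
        return x + self.mlp(self.norm(x))
```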

## Intended uses

- The ShiftViT model architecture has been designed to
+ This ShiftViT model is trained for the image classification task.
+
+ However, the ShiftViT architecture can be used for a variety of visual recognition tasks.
+ The authors of the [ShiftViT paper](https://arxiv.org/abs/2201.10801) tested the model on the following tasks:
+ - Image Classification on the ImageNet-1k dataset
+ - Object Detection on the COCO dataset
+ - Semantic Segmentation on the ADE20k dataset
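
As a usage illustration (the repository id below is a placeholder, not a confirmed model id; `from_pretrained_keras` is provided by the `huggingface_hub` library):

```python
import numpy as np
from huggingface_hub import from_pretrained_keras

# Placeholder repo id -- replace with the actual repository of this model.
model = from_pretrained_keras("shivalikasingh/shiftvit")

# Classify one image; the input size and class count depend on the
# training setup, so treat these shapes as illustrative.
image = np.random.rand(1, 32, 32, 3).astype("float32")
logits = model.predict(image)
print("Predicted class:", int(np.argmax(logits, axis=-1)[0]))
```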

## Training and evaluation data

@@ -76,11 +82,11 @@ No. of samples used for training and evaluation are:

1. Data Preparation:

- - Data Augmentation: The augmentation steps include: rescaling, resizing, cropping and horizontal flipping.
+ - Data Augmentation: The augmentation steps used for data preparation include rescaling, resizing, cropping, and horizontal flipping. A sketch of such a pipeline follows below.
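
A sketch of such an augmentation pipeline with Keras preprocessing layers (the image sizes and scale are illustrative assumptions, not the values from this model's training script):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Mirrors the listed steps: rescaling, resizing, cropping, horizontal flip.
augmentation = tf.keras.Sequential([
    layers.Rescaling(1.0 / 255),      # rescale pixel values to [0, 1]
    layers.Resizing(72, 72),          # resize larger than the target crop
    layers.RandomCrop(64, 64),        # random crop to the training size
    layers.RandomFlip("horizontal"),  # random horizontal flip
])
```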

2. Building the ShiftViT Model:

- -
+ - The steps for constructing the ShiftViT model are covered extensively in this [Keras example](https://keras.io/examples/vision/shiftvit/).

3. Model Training:
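
Purely as an illustrative sketch of the training step, tying together the hypothetical `shift_channels` and `ShiftBlock` from above with dummy data (optimizer, sizes, and epochs are assumptions; the linked Keras example holds the real configuration):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Toy single-stage model: patch embedding, one shift block, classifier head.
inputs = layers.Input(shape=(32, 32, 3))
x = layers.Conv2D(32, kernel_size=4, strides=4)(inputs)  # patchify
x = ShiftBlock(32)(x)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(10)(x)
model = tf.keras.Model(inputs, outputs)

model.compile(
    optimizer=tf.keras.optimizers.Adam(1e-3),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

# Dummy data just to demonstrate the training call.
x_train = np.random.rand(64, 32, 32, 3).astype("float32")
y_train = np.random.randint(0, 10, size=(64,))
model.fit(x_train, y_train, epochs=1, batch_size=16)
```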
 
 