Jiawei Yang committed on
Commit 442a9cf · 1 Parent(s): 618c30f

upload models
README.md CHANGED

---
license: mit
language:
- en
library_name: DVT
tags:
- denoising vision transformer
- ViT artifacts
---

# Denoising Vision Transformer (DVT)

## Introduction
We study a crucial yet often overlooked issue inherent to Vision Transformers (ViTs): the feature maps of these models exhibit grid-like artifacts ("Original features" in the teaser), which hurt the performance of ViTs in downstream dense prediction tasks such as semantic segmentation, depth prediction, and object discovery. We trace this issue to the positional embeddings at the input stage. To mitigate it, we propose a two-stage denoising approach, termed Denoising Vision Transformers (DVT). In the first stage, we separate clean features from those contaminated by positional artifacts by enforcing cross-view feature consistency with neural fields on a per-image basis. This per-image optimization extracts artifact-free features from raw ViT outputs, providing clean feature estimates for offline applications. In the second stage, we train a lightweight transformer block to predict clean features from raw ViT outputs, using the derived clean-feature estimates as supervision. DVT does not require re-training existing pre-trained ViTs and is immediately applicable to any Vision Transformer architecture. We evaluate our method on a variety of representative ViTs (DINO, DeiT-III, EVA02, CLIP, DINOv2, DINOv2-reg) and demonstrate that DVT consistently improves existing state-of-the-art general-purpose models on semantic and geometric tasks across multiple datasets (Fig. 1, right, Tabs. 2 to 4). We hope our study encourages a re-evaluation of ViT design, especially regarding the naive use of positional embeddings. Our code and checkpoints are publicly available.
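DVT's actual first stage uses neural fields and cross-view feature consistency, which is far more involved than can be shown here. Purely as a toy illustration of why a position-dependent artifact is separable at all (our own sketch, not the paper's method), the snippet below models each raw feature as image content plus an artifact that depends only on grid position, and isolates the artifact by averaging over images, where the content cancels out:

```python
def estimate_positional_artifact(raw_feats):
    """raw_feats: list of images, each a list of per-position feature values.
    A position-dependent artifact is shared across images, so the mean over
    many images at each grid position isolates it (assuming image content
    averages to zero at every position)."""
    n_images, n_pos = len(raw_feats), len(raw_feats[0])
    return [sum(img[p] for img in raw_feats) / n_images for p in range(n_pos)]

def denoise(raw_feats):
    # subtract the estimated per-position artifact from every image
    artifact = estimate_positional_artifact(raw_feats)
    return [[v - artifact[p] for p, v in enumerate(img)] for img in raw_feats]

# toy data: per-image content sums to zero across images, plus a fixed grid artifact
artifact = [0.5, -1.0, 2.0]
contents = [[1.0, 2.0, -3.0], [-1.0, -2.0, 3.0]]
raw = [[c + a for c, a in zip(img, artifact)] for img in contents]
clean = denoise(raw)  # recovers `contents` exactly in this idealized setting
```

In the real method the "content averages to zero" assumption does not hold, which is why DVT instead fits a per-image decomposition with neural fields.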

## Model Summary
We include four versions of models in this repository:
- `voc_denoised`: single-layer Transformer models trained to denoise the output of the original ViT models, trained on the VOC dataset.
- `voc_distilled`: models distilled from the denoiser on the ImageNet-1k dataset, with all model parameters jointly fine-tuned. The distillation involves three stages:
  1. Stage 1: perform per-image denoising on the VOC dataset.
  2. Stage 2: train the denoiser on the VOC dataset, using the features obtained from the per-image denoising in Stage 1 as supervision.
  3. Stage 3: fine-tune the entire model on the ImageNet-1k dataset, using the outputs of the Stage 2 denoiser as supervision.
- `imgnet_denoised`: the same as `voc_denoised`, but trained on the ImageNet-1k dataset.
- `imgnet_distilled`: the same as `voc_distilled`, but with both the denoiser and the distilled model trained on the ImageNet-1k dataset.

## Performance Summary
- Baseline: the original ViT models.

| Model | VOC_mIoU | VOC_mAcc | ADE_mIoU | ADE_mAcc | NYU_RMSE | NYU_abs_rel | NYU_a1 |
|---|---|---|---|---|---|---|---|
| vit_small_patch14_dinov2.lvd142m | 81.78 | 88.44 | 44.05 | 55.53 | 0.4340 | 0.1331 | 84.49% |
| vit_base_patch14_dinov2.lvd142m | 83.52 | 90.60 | 47.02 | 58.45 | 0.3965 | 0.1197 | 87.59% |
| vit_large_patch14_dinov2.lvd142m | 83.43 | 90.38 | 47.53 | 59.64 | 0.3831 | 0.1145 | 88.89% |
| vit_small_patch14_reg4_dinov2.lvd142m | 80.88 | 88.69 | 44.36 | 55.90 | 0.4328 | 0.1303 | 85.00% |
| vit_base_patch14_reg4_dinov2.lvd142m | 83.48 | 90.95 | 47.73 | 60.17 | 0.3967 | 0.1177 | 87.92% |
| vit_large_patch14_reg4_dinov2.lvd142m | 83.21 | 90.67 | 48.44 | 61.28 | 0.3852 | 0.1139 | 88.53% |
| deit3_base_patch16_224.fb_in1k | 71.03 | 80.67 | 32.84 | 42.79 | 0.5837 | 0.1772 | 73.03% |
| vit_base_patch16_clip_384.laion2b_ft_in12k_in1k | 77.75 | 86.68 | 40.50 | 52.81 | 0.5585 | 0.1678 | 74.30% |
| vit_base_patch16_224.dino | 62.92 | 75.98 | 31.03 | 40.62 | 0.5742 | 0.1694 | 74.55% |
| vit_base_patch16_224.mae | 50.29 | 63.10 | 23.84 | 32.06 | 0.6629 | 0.2275 | 66.24% |
| eva02_base_patch16_clip_224.merged2b | 71.49 | 82.69 | 37.89 | 50.31 | - | - | - |
| vit_base_patch16_384.augreg_in21k_ft_in1k | 73.51 | 83.60 | 36.46 | 48.65 | 0.6360 | 0.1898 | 69.10% |

- DVT (voc_denoised): the denoised models trained on the VOC dataset.

| Model | VOC_mIoU | VOC_mAcc | ADE_mIoU | ADE_mAcc | NYU_RMSE | NYU_abs_rel | NYU_a1 |
|---|---|---|---|---|---|---|---|
| vit_small_patch14_dinov2.lvd142m | 82.78 | 90.69 | 45.14 | 56.35 | 0.4368 | 0.1337 | 84.34% |
| vit_base_patch14_dinov2.lvd142m | 84.92 | 91.74 | 48.54 | 60.21 | 0.3811 | 0.1166 | 88.42% |
| vit_large_patch14_dinov2.lvd142m | 85.25 | 91.69 | 49.80 | 61.98 | 0.3826 | 0.1118 | 89.32% |
| vit_small_patch14_reg4_dinov2.lvd142m | 81.93 | 89.54 | 45.55 | 57.52 | 0.4251 | 0.1292 | 85.01% |
| vit_base_patch14_reg4_dinov2.lvd142m | 84.58 | 91.17 | 49.24 | 61.66 | 0.3898 | 0.1146 | 88.60% |
| vit_large_patch14_reg4_dinov2.lvd142m | 84.37 | 91.42 | 49.19 | 62.21 | 0.3852 | 0.1141 | 88.45% |
| deit3_base_patch16_224.fb_in1k | 73.52 | 83.65 | 33.57 | 43.56 | 0.5817 | 0.1774 | 73.05% |
| vit_base_patch16_clip_384.laion2b_ft_in12k_in1k | 79.50 | 88.43 | 41.33 | 53.54 | 0.5512 | 0.1639 | 75.30% |
| vit_base_patch16_224.dino | 66.41 | 77.75 | 32.45 | 42.42 | 0.5784 | 0.1738 | 73.75% |
| vit_base_patch16_224.mae | 50.65 | 62.90 | 23.25 | 31.03 | 0.6651 | 0.2271 | 65.44% |
| eva02_base_patch16_clip_224.merged2b | 73.76 | 84.50 | 37.99 | 50.40 | 0.6196 | 0.1904 | 69.86% |
| vit_base_patch16_384.augreg_in21k_ft_in1k | 74.82 | 84.40 | 36.75 | 48.82 | 0.6316 | 0.1921 | 69.37% |

- DVT (voc_distilled): the distilled models trained on the VOC dataset.

| Model | VOC_mIoU | VOC_mAcc | ADE_mIoU | ADE_mAcc | NYU_RMSE | NYU_abs_rel | NYU_a1 |
|---|---|---|---|---|---|---|---|
| vit_base_patch14_dinov2.lvd142m | 85.10 | 91.41 | 48.57 | 60.35 | 0.3850 | 0.1207 | 88.25% |
| vit_base_patch14_reg4_dinov2.lvd142m | 84.36 | 90.80 | 49.20 | 61.56 | 0.3838 | 0.1143 | 88.97% |
| deit3_base_patch16_224.fb_in1k | 73.63 | 82.74 | 34.43 | 44.96 | 0.5712 | 0.1747 | 74.00% |
| vit_base_patch16_clip_384.laion2b_ft_in12k_in1k | 79.86 | 88.33 | 42.28 | 54.26 | 0.5253 | 0.1571 | 77.23% |
| vit_base_patch16_224.dino | 66.80 | 78.47 | 32.68 | 42.58 | 0.5750 | 0.1696 | 73.86% |
| vit_base_patch16_224.mae | 51.91 | 64.67 | 23.73 | 31.88 | 0.6733 | 0.2282 | 65.33% |
| eva02_base_patch16_clip_224.merged2b | 75.93 | 85.44 | 40.15 | 52.04 | - | - | - |
| vit_base_patch16_384.augreg_in21k_ft_in1k | 76.26 | 85.14 | 38.62 | 50.61 | 0.5825 | 0.1768 | 73.14% |

- DVT (imgnet_denoised) and DVT (imgnet_distilled): the denoised and distilled models trained on the ImageNet-1k dataset.

| Model | VOC_mIoU | VOC_mAcc | ADE_mIoU | ADE_mAcc | NYU_RMSE | NYU_abs_rel | NYU_a1 |
|---|---|---|---|---|---|---|---|
| vit_base_patch14_dinov2.lvd142m (denoised) | 85.17 | 91.55 | 48.68 | 60.60 | 0.3832 | 0.1152 | 88.50% |
| vit_base_patch14_dinov2.lvd142m (distilled) | 85.33 | 91.48 | 48.85 | 60.47 | 0.3704 | 0.1115 | 89.74% |

A summary of the DINOv2-base model is shown below:

| vit_base_patch14_dinov2.lvd142m | VOC_mIoU | VOC_mAcc | ADE_mIoU | ADE_mAcc | NYU_RMSE | NYU_abs_rel | NYU_a1 |
|---|---|---|---|---|---|---|---|
| baseline | 83.52 | 90.60 | 47.02 | 58.45 | 0.3965 | 0.1197 | 87.59% |
| `voc_denoised` | 84.92 | 91.74 | 48.54 | 60.21 | 0.3811 | 0.1166 | 88.42% |
| `voc_distilled` | 85.10 | 91.41 | 48.57 | 60.35 | 0.3850 | 0.1207 | 88.25% |
| `imgnet_denoised` | 85.17 | 91.55 | 48.68 | 60.60 | 0.3832 | 0.1152 | 88.50% |
| `imgnet_distilled` | 85.33 | 91.48 | 48.85 | 60.47 | 0.3704 | 0.1115 | 89.74% |
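For quick reference, the absolute gains of the best ImageNet-distilled DINOv2-base model over its baseline can be computed directly from the table above (numbers copied from the table; the dictionary names are ours):

```python
# DINOv2-base metrics, copied from the summary table above
baseline = {"VOC_mIoU": 83.52, "ADE_mIoU": 47.02, "NYU_a1": 89.74 - 2.15}
baseline["NYU_a1"] = 87.59  # stated explicitly to avoid derived values
imgnet_distilled = {"VOC_mIoU": 85.33, "ADE_mIoU": 48.85, "NYU_a1": 89.74}

# absolute improvement, in percentage points, per metric
gains = {k: round(imgnet_distilled[k] - baseline[k], 2) for k in baseline}
print(gains)  # {'VOC_mIoU': 1.81, 'ADE_mIoU': 1.83, 'NYU_a1': 2.15}
```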

During our exploration, we found that the settings used for denoiser training and distillation can slightly affect the performance of the final model. For example, whether the `cls` token is included in the denoiser's Transformer feed-forward layer affects depth-estimation performance. Our best model during this exploration achieves around 85.56 mIoU on VOC, 49.02 mIoU on ADE, and 89.98% a1 on NYU.

We do not include that model in the final release, however, because its additional complexity brings no significant improvement.

## Citation

If you find this project useful, please consider citing:

```bibtex
@inproceedings{yang2024denoising,
  title={Denoising vision transformers},
  author={Yang, Jiawei and Luo, Katie Z and Li, Jiefeng and Deng, Congyue and Guibas, Leonidas and Krishnan, Dilip and Weinberger, Kilian Q and Tian, Yonglong and Wang, Yue},
  booktitle={ECCV},
  year={2024}
}
```
imgnet_denoised/vit_base_patch14_dinov2.lvd142m.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:7ee1487757090078c162172df24bdc065c11ec49ccedbebdc95c95f3a5b47213
size 32562183
imgnet_distilled/vit_base_patch14_dinov2.lvd142m.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:71e07a1d58fd5e6f259cd4cd5eaf335ab70c670ddc27a59e9ccc7eade74bf274
size 346390726

voc_denoised/deit3_base_patch16_224.fb_in1k.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:de6367812f2e11da067648b2e3cfab4a63fe136bfb9a27c8996304483e91a94c
size 31502326

voc_denoised/eva02_base_patch16_clip_224.merged2b.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:913e56f3d5e6b1253d89af4134b3d4d3164daa3722f8519f165b82bd26d92dc4
size 31502428

voc_denoised/vit_base_patch14_dinov2.lvd142m.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:57ffc275a0f616fe3547a56ab0111d62e8e8a6e1e13374d29fc0e8c7b922153d
size 32562183

voc_denoised/vit_base_patch14_reg4_dinov2.lvd142m.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:6c6073ce2705aae26f1f4c062d4e9cf87e0c75e388ee68282897744cb53fc467
size 32562268

voc_denoised/vit_base_patch16_224.dino.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:5353ad39bbfccd9010bb44b7b4d021e0a29f138ffa26b89ac238ad1730b72f55
size 31502241

voc_denoised/vit_base_patch16_224.mae.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:6e860f872fcc3aca47d3fc1b5b35f74e5a55708b9d1ebc9d5edc993b3c4d65b4
size 31502224

voc_denoised/vit_base_patch16_384.augreg_in21k_ft_in1k.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:cc4cb3a7d9466f727913e502f5c8436fc3cfea1acc9721a141705ce8458e328c
size 31502513

voc_denoised/vit_base_patch16_clip_384.laion2b_ft_in12k_in1k.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:c7cac7e452c26c89db9cc7e3a9c8c84d5084a288d2fca1b40e04df3248d7227b
size 31502615

voc_denoised/vit_large_patch14_dinov2.lvd142m.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:805f043b726d7feb6bb9b0bc618f34fd8f6f07b283734cf42616c48082d8bf80
size 55997464

voc_denoised/vit_large_patch14_reg4_dinov2.lvd142m.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:521e90dca6883da3493dfd1f6abb333b70ab067a8840886a59e334e4d30ce03c
size 55997549

voc_denoised/vit_small_patch14_dinov2.lvd142m.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:c066cf26c8f7ad78666ee8ebbb4d299223ffcadd1485acc4d590dc710b322eb7
size 9205784

voc_denoised/vit_small_patch14_reg4_dinov2.lvd142m.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:3f24b0212015c87474ec9d71d01d23956c49a4d0766ff1142f119a68d1d10978
size 9205869

voc_distilled/deit3_base_patch16_224.fb_in1k.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:86f22a8fc49aca65533cc5bfceaa58c371c233545473b9dffb201e6de7f52ffa
size 343336980

voc_distilled/eva02_base_patch16_clip_224.merged2b.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:27b225bf4c13b232e0cf305fece880c4f841c79242c6a261c8e413bc1632fac3
size 343579932

voc_distilled/vit_base_patch14_dinov2.lvd142m.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:9bc0d67f659c95b68a0449a98c9a83d8f86de05d741b8e7aee458e576fa2ba16
size 346390726

voc_distilled/vit_base_patch14_reg4_dinov2.lvd142m.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:accfae64bf7bed4c1733851191eb8227a670e45ff7c6cba12bb9bd4a3d1190eb
size 346401179

voc_distilled/vit_base_patch16_224.dino.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:cca5519478605edd9fc311e66238809bb755e392cd17cb80309423f441a75a7b
size 343257498

voc_distilled/vit_base_patch16_224.mae.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:0d03cfd8f00efdd6b3b5db5f18de56d8e6531febd92924e4365d5d0e521f8a5f
size 343257344

voc_distilled/vit_base_patch16_384.augreg_in21k_ft_in1k.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:4c3dcaaae52d79349e8fd8eaef1af4f7fee83bdb0dd7b9ce9e3145efc791e995
size 344427322

voc_distilled/vit_base_patch16_clip_384.laion2b_ft_in12k_in1k.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:3be7a263532992345d0ab3155610c9837dd20971608ccaac649a8186cfdd5f39
size 344431676