Jiawei Yang committed on
Commit 442a9cf · 1 Parent(s): 618c30f

upload models
README.md CHANGED

---
license: mit
language:
- en
library_name: DVT
tags:
- denoising vision transformer
- ViT artifacts
---

# Denoising Vision Transformer (DVT)

## Introduction
We study a crucial yet often overlooked issue inherent to Vision Transformers (ViTs): the feature maps of these models exhibit grid-like artifacts ("Original features" in the teaser), which hurt the performance of ViTs in downstream dense prediction tasks such as semantic segmentation, depth prediction, and object discovery. We trace this issue to the positional embeddings at the input stage. To mitigate it, we propose a two-stage denoising approach, termed Denoising Vision Transformers (DVT). In the first stage, we separate clean features from those contaminated by positional artifacts by enforcing cross-view feature consistency with neural fields on a per-image basis. This per-image optimization extracts artifact-free features from raw ViT outputs, providing clean feature estimates for offline applications. In the second stage, we train a lightweight transformer block to predict clean features from raw ViT outputs, using the derived clean-feature estimates as supervision. DVT does not require re-training existing pre-trained ViTs and is immediately applicable to any Vision Transformer architecture. We evaluate our method on a variety of representative ViTs (DINO, DeiT-III, EVA02, CLIP, DINOv2, DINOv2-reg) and demonstrate that DVT consistently improves existing state-of-the-art general-purpose models on semantic and geometric tasks across multiple datasets (Fig. 1, right, Tabs. 2 to 4). We hope our study encourages a re-evaluation of ViT design, especially regarding the naive use of positional embeddings. Our code and checkpoints are publicly available.
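DVT's actual first stage uses neural fields and cross-view feature consistency, which is far more involved than can be shown here. Purely as a toy illustration of why a position-dependent artifact is separable at all (our own sketch, not the paper's method), the snippet below models each raw feature as image content plus an artifact that depends only on grid position, and isolates the artifact by averaging over images, where the content cancels out:

```python
def estimate_positional_artifact(raw_feats):
    """raw_feats: list of images, each a list of per-position feature values.
    A position-dependent artifact is shared across images, so the mean over
    many images at each grid position isolates it (assuming image content
    averages to zero at every position)."""
    n_images, n_pos = len(raw_feats), len(raw_feats[0])
    return [sum(img[p] for img in raw_feats) / n_images for p in range(n_pos)]

def denoise(raw_feats):
    # subtract the estimated per-position artifact from every image
    artifact = estimate_positional_artifact(raw_feats)
    return [[v - artifact[p] for p, v in enumerate(img)] for img in raw_feats]

# toy data: per-image content sums to zero across images, plus a fixed grid artifact
artifact = [0.5, -1.0, 2.0]
contents = [[1.0, 2.0, -3.0], [-1.0, -2.0, 3.0]]
raw = [[c + a for c, a in zip(img, artifact)] for img in contents]
clean = denoise(raw)  # recovers `contents` exactly in this idealized setting
```

In the real method the "content averages to zero" assumption does not hold, which is why DVT instead fits a per-image decomposition with neural fields.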

## Model Summary
We include four versions of models in this repository:
- `voc_denoised`: single-layer Transformer models trained to denoise the output of the original ViT models, trained on the VOC dataset.
- `voc_distilled`: models distilled from the denoiser on the ImageNet-1k dataset, with all model parameters jointly fine-tuned. The distillation involves three stages:
  1. Stage 1: perform per-image denoising on the VOC dataset.
  2. Stage 2: train the denoiser on the VOC dataset, using the features obtained from the per-image denoising in Stage 1 as supervision.
  3. Stage 3: fine-tune the entire model on the ImageNet-1k dataset, using the outputs of the Stage 2 denoiser as supervision.
- `imgnet_denoised`: the same as `voc_denoised`, but trained on the ImageNet-1k dataset.
- `imgnet_distilled`: the same as `voc_distilled`, but with both the denoiser and the distilled model trained on the ImageNet-1k dataset.

## Performance Summary
- Baseline: the original ViT models.

| Model | VOC_mIoU | VOC_mAcc | ADE_mIoU | ADE_mAcc | NYU_RMSE | NYU_abs_rel | NYU_a1 |
|---|---|---|---|---|---|---|---|
| vit_small_patch14_dinov2.lvd142m | 81.78 | 88.44 | 44.05 | 55.53 | 0.4340 | 0.1331 | 84.49% |
| vit_base_patch14_dinov2.lvd142m | 83.52 | 90.60 | 47.02 | 58.45 | 0.3965 | 0.1197 | 87.59% |
| vit_large_patch14_dinov2.lvd142m | 83.43 | 90.38 | 47.53 | 59.64 | 0.3831 | 0.1145 | 88.89% |
| vit_small_patch14_reg4_dinov2.lvd142m | 80.88 | 88.69 | 44.36 | 55.90 | 0.4328 | 0.1303 | 85.00% |
| vit_base_patch14_reg4_dinov2.lvd142m | 83.48 | 90.95 | 47.73 | 60.17 | 0.3967 | 0.1177 | 87.92% |
| vit_large_patch14_reg4_dinov2.lvd142m | 83.21 | 90.67 | 48.44 | 61.28 | 0.3852 | 0.1139 | 88.53% |
| deit3_base_patch16_224.fb_in1k | 71.03 | 80.67 | 32.84 | 42.79 | 0.5837 | 0.1772 | 73.03% |
| vit_base_patch16_clip_384.laion2b_ft_in12k_in1k | 77.75 | 86.68 | 40.50 | 52.81 | 0.5585 | 0.1678 | 74.30% |
| vit_base_patch16_224.dino | 62.92 | 75.98 | 31.03 | 40.62 | 0.5742 | 0.1694 | 74.55% |
| vit_base_patch16_224.mae | 50.29 | 63.10 | 23.84 | 32.06 | 0.6629 | 0.2275 | 66.24% |
| eva02_base_patch16_clip_224.merged2b | 71.49 | 82.69 | 37.89 | 50.31 | - | - | - |
| vit_base_patch16_384.augreg_in21k_ft_in1k | 73.51 | 83.60 | 36.46 | 48.65 | 0.6360 | 0.1898 | 69.10% |

- DVT (voc_denoised): the denoised models trained on the VOC dataset.

| Model | VOC_mIoU | VOC_mAcc | ADE_mIoU | ADE_mAcc | NYU_RMSE | NYU_abs_rel | NYU_a1 |
|---|---|---|---|---|---|---|---|
| vit_small_patch14_dinov2.lvd142m | 82.78 | 90.69 | 45.14 | 56.35 | 0.4368 | 0.1337 | 84.34% |
| vit_base_patch14_dinov2.lvd142m | 84.92 | 91.74 | 48.54 | 60.21 | 0.3811 | 0.1166 | 88.42% |
| vit_large_patch14_dinov2.lvd142m | 85.25 | 91.69 | 49.80 | 61.98 | 0.3826 | 0.1118 | 89.32% |
| vit_small_patch14_reg4_dinov2.lvd142m | 81.93 | 89.54 | 45.55 | 57.52 | 0.4251 | 0.1292 | 85.01% |
| vit_base_patch14_reg4_dinov2.lvd142m | 84.58 | 91.17 | 49.24 | 61.66 | 0.3898 | 0.1146 | 88.60% |
| vit_large_patch14_reg4_dinov2.lvd142m | 84.37 | 91.42 | 49.19 | 62.21 | 0.3852 | 0.1141 | 88.45% |
| deit3_base_patch16_224.fb_in1k | 73.52 | 83.65 | 33.57 | 43.56 | 0.5817 | 0.1774 | 73.05% |
| vit_base_patch16_clip_384.laion2b_ft_in12k_in1k | 79.50 | 88.43 | 41.33 | 53.54 | 0.5512 | 0.1639 | 75.30% |
| vit_base_patch16_224.dino | 66.41 | 77.75 | 32.45 | 42.42 | 0.5784 | 0.1738 | 73.75% |
| vit_base_patch16_224.mae | 50.65 | 62.90 | 23.25 | 31.03 | 0.6651 | 0.2271 | 65.44% |
| eva02_base_patch16_clip_224.merged2b | 73.76 | 84.50 | 37.99 | 50.40 | 0.6196 | 0.1904 | 69.86% |
| vit_base_patch16_384.augreg_in21k_ft_in1k | 74.82 | 84.40 | 36.75 | 48.82 | 0.6316 | 0.1921 | 69.37% |

- DVT (voc_distilled): the distilled models trained on the VOC dataset.

| Model | VOC_mIoU | VOC_mAcc | ADE_mIoU | ADE_mAcc | NYU_RMSE | NYU_abs_rel | NYU_a1 |
|---|---|---|---|---|---|---|---|
| vit_base_patch14_dinov2.lvd142m | 85.10 | 91.41 | 48.57 | 60.35 | 0.3850 | 0.1207 | 88.25% |
| vit_base_patch14_reg4_dinov2.lvd142m | 84.36 | 90.80 | 49.20 | 61.56 | 0.3838 | 0.1143 | 88.97% |
| deit3_base_patch16_224.fb_in1k | 73.63 | 82.74 | 34.43 | 44.96 | 0.5712 | 0.1747 | 74.00% |
| vit_base_patch16_clip_384.laion2b_ft_in12k_in1k | 79.86 | 88.33 | 42.28 | 54.26 | 0.5253 | 0.1571 | 77.23% |
| vit_base_patch16_224.dino | 66.80 | 78.47 | 32.68 | 42.58 | 0.5750 | 0.1696 | 73.86% |
| vit_base_patch16_224.mae | 51.91 | 64.67 | 23.73 | 31.88 | 0.6733 | 0.2282 | 65.33% |
| eva02_base_patch16_clip_224.merged2b | 75.93 | 85.44 | 40.15 | 52.04 | - | - | - |
| vit_base_patch16_384.augreg_in21k_ft_in1k | 76.26 | 85.14 | 38.62 | 50.61 | 0.5825 | 0.1768 | 73.14% |

- DVT (imgnet_denoised) and DVT (imgnet_distilled): the denoised and distilled models trained on the ImageNet-1k dataset.

| Model | VOC_mIoU | VOC_mAcc | ADE_mIoU | ADE_mAcc | NYU_RMSE | NYU_abs_rel | NYU_a1 |
|---|---|---|---|---|---|---|---|
| vit_base_patch14_dinov2.lvd142m (denoised) | 85.17 | 91.55 | 48.68 | 60.60 | 0.3832 | 0.1152 | 88.50% |
| vit_base_patch14_dinov2.lvd142m (distilled) | 85.33 | 91.48 | 48.85 | 60.47 | 0.3704 | 0.1115 | 89.74% |

A summary of the DINOv2-base model is shown below:

| vit_base_patch14_dinov2.lvd142m | VOC_mIoU | VOC_mAcc | ADE_mIoU | ADE_mAcc | NYU_RMSE | NYU_abs_rel | NYU_a1 |
|---|---|---|---|---|---|---|---|
| baseline | 83.52 | 90.60 | 47.02 | 58.45 | 0.3965 | 0.1197 | 87.59% |
| `voc_denoised` | 84.92 | 91.74 | 48.54 | 60.21 | 0.3811 | 0.1166 | 88.42% |
| `voc_distilled` | 85.10 | 91.41 | 48.57 | 60.35 | 0.3850 | 0.1207 | 88.25% |
| `imgnet_denoised` | 85.17 | 91.55 | 48.68 | 60.60 | 0.3832 | 0.1152 | 88.50% |
| `imgnet_distilled` | 85.33 | 91.48 | 48.85 | 60.47 | 0.3704 | 0.1115 | 89.74% |
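For quick reference, the absolute gains of the best ImageNet-distilled DINOv2-base model over its baseline can be computed directly from the table above (numbers copied from the table; the dictionary names are ours):

```python
# DINOv2-base metrics, copied from the summary table above
baseline = {"VOC_mIoU": 83.52, "ADE_mIoU": 47.02, "NYU_a1": 89.74 - 2.15}
baseline["NYU_a1"] = 87.59  # stated explicitly to avoid derived values
imgnet_distilled = {"VOC_mIoU": 85.33, "ADE_mIoU": 48.85, "NYU_a1": 89.74}

# absolute improvement, in percentage points, per metric
gains = {k: round(imgnet_distilled[k] - baseline[k], 2) for k in baseline}
print(gains)  # {'VOC_mIoU': 1.81, 'ADE_mIoU': 1.83, 'NYU_a1': 2.15}
```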

During our exploration, we found that the settings used for denoiser training and distillation can slightly affect the performance of the final model. For example, whether the `cls` token is included in the denoiser's Transformer feed-forward layer affects depth-estimation performance. Our best model during this exploration achieves around 85.56 mIoU on VOC, 49.02 mIoU on ADE, and 89.98% a1 on NYU.

We do not include that model in the final release, however, because its additional complexity brings no significant improvement.

## Citation

If you find this project useful, please consider citing:

```bibtex
@inproceedings{yang2024denoising,
  title={Denoising vision transformers},
  author={Yang, Jiawei and Luo, Katie Z and Li, Jiefeng and Deng, Congyue and Guibas, Leonidas and Krishnan, Dilip and Weinberger, Kilian Q and Tian, Yonglong and Wang, Yue},
  booktitle={ECCV},
  year={2024}
}
```
imgnet_denoised/vit_base_patch14_dinov2.lvd142m.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:7ee1487757090078c162172df24bdc065c11ec49ccedbebdc95c95f3a5b47213
size 32562183
imgnet_distilled/vit_base_patch14_dinov2.lvd142m.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:71e07a1d58fd5e6f259cd4cd5eaf335ab70c670ddc27a59e9ccc7eade74bf274
size 346390726

voc_denoised/deit3_base_patch16_224.fb_in1k.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:de6367812f2e11da067648b2e3cfab4a63fe136bfb9a27c8996304483e91a94c
size 31502326

voc_denoised/eva02_base_patch16_clip_224.merged2b.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:913e56f3d5e6b1253d89af4134b3d4d3164daa3722f8519f165b82bd26d92dc4
size 31502428

voc_denoised/vit_base_patch14_dinov2.lvd142m.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:57ffc275a0f616fe3547a56ab0111d62e8e8a6e1e13374d29fc0e8c7b922153d
size 32562183

voc_denoised/vit_base_patch14_reg4_dinov2.lvd142m.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:6c6073ce2705aae26f1f4c062d4e9cf87e0c75e388ee68282897744cb53fc467
size 32562268

voc_denoised/vit_base_patch16_224.dino.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:5353ad39bbfccd9010bb44b7b4d021e0a29f138ffa26b89ac238ad1730b72f55
size 31502241

voc_denoised/vit_base_patch16_224.mae.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:6e860f872fcc3aca47d3fc1b5b35f74e5a55708b9d1ebc9d5edc993b3c4d65b4
size 31502224

voc_denoised/vit_base_patch16_384.augreg_in21k_ft_in1k.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:cc4cb3a7d9466f727913e502f5c8436fc3cfea1acc9721a141705ce8458e328c
size 31502513

voc_denoised/vit_base_patch16_clip_384.laion2b_ft_in12k_in1k.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:c7cac7e452c26c89db9cc7e3a9c8c84d5084a288d2fca1b40e04df3248d7227b
size 31502615

voc_denoised/vit_large_patch14_dinov2.lvd142m.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:805f043b726d7feb6bb9b0bc618f34fd8f6f07b283734cf42616c48082d8bf80
size 55997464

voc_denoised/vit_large_patch14_reg4_dinov2.lvd142m.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:521e90dca6883da3493dfd1f6abb333b70ab067a8840886a59e334e4d30ce03c
size 55997549

voc_denoised/vit_small_patch14_dinov2.lvd142m.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:c066cf26c8f7ad78666ee8ebbb4d299223ffcadd1485acc4d590dc710b322eb7
size 9205784

voc_denoised/vit_small_patch14_reg4_dinov2.lvd142m.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:3f24b0212015c87474ec9d71d01d23956c49a4d0766ff1142f119a68d1d10978
size 9205869

voc_distilled/deit3_base_patch16_224.fb_in1k.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:86f22a8fc49aca65533cc5bfceaa58c371c233545473b9dffb201e6de7f52ffa
size 343336980

voc_distilled/eva02_base_patch16_clip_224.merged2b.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:27b225bf4c13b232e0cf305fece880c4f841c79242c6a261c8e413bc1632fac3
size 343579932

voc_distilled/vit_base_patch14_dinov2.lvd142m.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:9bc0d67f659c95b68a0449a98c9a83d8f86de05d741b8e7aee458e576fa2ba16
size 346390726

voc_distilled/vit_base_patch14_reg4_dinov2.lvd142m.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:accfae64bf7bed4c1733851191eb8227a670e45ff7c6cba12bb9bd4a3d1190eb
size 346401179

voc_distilled/vit_base_patch16_224.dino.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:cca5519478605edd9fc311e66238809bb755e392cd17cb80309423f441a75a7b
size 343257498

voc_distilled/vit_base_patch16_224.mae.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:0d03cfd8f00efdd6b3b5db5f18de56d8e6531febd92924e4365d5d0e521f8a5f
size 343257344

voc_distilled/vit_base_patch16_384.augreg_in21k_ft_in1k.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:4c3dcaaae52d79349e8fd8eaef1af4f7fee83bdb0dd7b9ce9e3145efc791e995
size 344427322

voc_distilled/vit_base_patch16_clip_384.laion2b_ft_in12k_in1k.pth ADDED
version https://git-lfs.github.com/spec/v1
oid sha256:3be7a263532992345d0ab3155610c9837dd20971608ccaac649a8186cfdd5f39
size 344431676