SPFormer: Enhancing Vision Transformer with Superpixel Representation
Abstract
In this work, we introduce SPFormer, a novel Vision Transformer enhanced by superpixel representation. Addressing the limitations of the fixed-size, non-adaptive patch partitioning used in traditional Vision Transformers, SPFormer employs superpixels that adapt to the image's content. This approach divides the image into irregular, semantically coherent regions, effectively capturing intricate details, and is applicable at both initial and intermediate feature levels. SPFormer, trainable end-to-end, exhibits superior performance across various benchmarks. Notably, on the challenging ImageNet benchmark it achieves a 1.4% accuracy improvement over DeiT-T and 1.1% over DeiT-S. A standout feature of SPFormer is its inherent explainability: the superpixel structure offers a window into the model's internal processes, providing valuable insights that enhance interpretability. This clarity also improves SPFormer's robustness in challenging scenarios such as image rotations and occlusions, demonstrating its adaptability and resilience.
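The abstract does not detail how superpixel tokens are formed, so the following is only a minimal sketch of one plausible mechanism: pooling a pixel feature map into content-adaptive tokens via a few iterations of soft k-means-style assignment, which keeps the pipeline differentiable and end-to-end trainable as the abstract describes. The function `superpixel_tokens` and its parameters `n_sp` and `n_iters` are hypothetical names chosen for illustration, not the authors' actual API.

```python
import torch
import torch.nn.functional as F

def superpixel_tokens(feats, n_sp, n_iters=3):
    """Pool a pixel feature map into superpixel-like tokens (illustrative sketch).

    feats:   (B, C, H, W) feature map.
    n_sp:    requested number of superpixel tokens (rounded down to a square grid).
    Returns: (B, K, C) tokens and (B, K, H*W) soft pixel-to-token assignments.
    """
    B, C, H, W = feats.shape
    x = feats.flatten(2).transpose(1, 2)               # (B, H*W, C) pixel features

    # Initialize token centers by average-pooling the feature map on a coarse grid.
    g = int(n_sp ** 0.5)
    centers = F.adaptive_avg_pool2d(feats, (g, g))     # (B, C, g, g)
    centers = centers.flatten(2).transpose(1, 2)       # (B, K, C) with K = g * g

    for _ in range(n_iters):
        # Soft assignment: dot-product affinity between pixels and centers,
        # normalized over centers so each pixel distributes mass across tokens.
        logits = torch.einsum('bnc,bkc->bkn', x, centers)   # (B, K, N)
        assign = logits.softmax(dim=1)
        # Update each center as the assignment-weighted mean of pixel features.
        centers = torch.einsum('bkn,bnc->bkc', assign, x)
        centers = centers / assign.sum(-1, keepdim=True).clamp_min(1e-6)

    return centers, assign

# Usage: 49 adaptive tokens from a 56x56 feature map, e.g. an early ViT stage.
tokens, assign = superpixel_tokens(torch.randn(2, 64, 56, 56), n_sp=49)
```

Because every step is differentiable, the soft assignments can be trained jointly with the backbone, and visualizing `assign` per token would yield the kind of irregular, semantically coherent regions the abstract credits for the model's explainability.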