<!--Copyright 2022 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# Dilated Neighborhood Attention Transformer

## Overview
DiNAT was proposed in [Dilated Neighborhood Attention Transformer](https://arxiv.org/abs/2209.15001)
by Ali Hassani and Humphrey Shi.

It extends [NAT](nat) by adding a Dilated Neighborhood Attention pattern to capture global context,
and shows significant performance improvements over it.

The abstract from the paper is the following:
*Transformers are quickly becoming one of the most heavily applied deep learning architectures across modalities,
domains, and tasks. In vision, on top of ongoing efforts into plain transformers, hierarchical transformers have
also gained significant attention, thanks to their performance and easy integration into existing frameworks.
These models typically employ localized attention mechanisms, such as the sliding-window Neighborhood Attention (NA)
or Swin Transformer's Shifted Window Self Attention. While effective at reducing self attention's quadratic complexity,
local attention weakens two of the most desirable properties of self attention: long range inter-dependency modeling,
and global receptive field. In this paper, we introduce Dilated Neighborhood Attention (DiNA), a natural, flexible and
efficient extension to NA that can capture more global context and expand receptive fields exponentially at no
additional cost. NA's local attention and DiNA's sparse global attention complement each other, and therefore we
introduce Dilated Neighborhood Attention Transformer (DiNAT), a new hierarchical vision transformer built upon both.
DiNAT variants enjoy significant improvements over strong baselines such as NAT, Swin, and ConvNeXt.
Our large model is faster and ahead of its Swin counterpart by 1.5% box AP in COCO object detection,
1.3% mask AP in COCO instance segmentation, and 1.1% mIoU in ADE20K semantic segmentation.
Paired with new frameworks, our large variant is the new state of the art panoptic segmentation model on COCO (58.2 PQ)
and ADE20K (48.5 PQ), and instance segmentation model on Cityscapes (44.5 AP) and ADE20K (35.4 AP) (no extra data).
It also matches the state of the art specialized semantic segmentation models on ADE20K (58.2 mIoU),
and ranks second on Cityscapes (84.5 mIoU) (no extra data).*
Tips:
- One can use the [`AutoImageProcessor`] API to prepare images for the model.
- DiNAT can be used as a *backbone*. When `output_hidden_states = True`,
it will output both `hidden_states` and `reshaped_hidden_states`. The `reshaped_hidden_states` have a shape of `(batch_size, num_channels, height, width)` rather than `(batch_size, height, width, num_channels)`. See the sketch below for an example.
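
The snippet below is a minimal sketch of both tips: it prepares an image with [`AutoImageProcessor`] and runs it through [`DinatModel`] with `output_hidden_states=True` to inspect the per-stage feature maps. The checkpoint name `shi-labs/dinat-mini-in1k-224` and the example image URL are assumptions; substitute any DiNAT checkpoint from the Hub. NATTEN must be installed for the model to run (see the notes below).

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, DinatModel

# Assumed checkpoint; any DiNAT checkpoint from the Hub should work.
checkpoint = "shi-labs/dinat-mini-in1k-224"

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained(checkpoint)
model = DinatModel.from_pretrained(checkpoint)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states are (batch_size, height, width, num_channels);
# reshaped_hidden_states are (batch_size, num_channels, height, width).
for hs, rhs in zip(outputs.hidden_states, outputs.reshaped_hidden_states):
    print(hs.shape, rhs.shape)
```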
Notes:
- DiNAT depends on [NATTEN](https://github.com/SHI-Labs/NATTEN/)'s implementation of Neighborhood Attention and Dilated Neighborhood Attention.
You can install it with pre-built wheels for Linux by referring to [shi-labs.com/natten](https://shi-labs.com/natten), or build it on your system by running `pip install natten`.
Note that the latter will likely take time to compile. NATTEN does not support Windows devices yet.
- Only a patch size of 4 is supported at the moment.
<img
src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/dilated-neighborhood-attention-pattern.jpg"
alt="drawing" width="600"/>

<small> Neighborhood Attention with different dilation values.
Taken from the <a href="https://arxiv.org/abs/2209.15001">original paper</a>.</small>
This model was contributed by [Ali Hassani](https://huggingface.co/alihassanijr).
The original code can be found [here](https://github.com/SHI-Labs/Neighborhood-Attention-Transformer).
## Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with DiNAT.

<PipelineTag pipeline="image-classification"/>

- [`DinatForImageClassification`] is supported by this [example script](https://github.com/huggingface/transformers/tree/main/examples/pytorch/image-classification) and [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/image_classification.ipynb). A minimal inference sketch follows this list.
- See also: [Image classification task guide](../tasks/image_classification)
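
The following is a minimal image classification sketch, assuming the `shi-labs/dinat-mini-in1k-224` checkpoint (an ImageNet-1k model) and a sample COCO image URL; both are illustrative and can be swapped for any other DiNAT classification checkpoint and image.

```python
import requests
import torch
from PIL import Image
from transformers import AutoImageProcessor, DinatForImageClassification

# Assumed ImageNet-1k checkpoint; substitute any DiNAT classification checkpoint.
checkpoint = "shi-labs/dinat-mini-in1k-224"

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = AutoImageProcessor.from_pretrained(checkpoint)
model = DinatForImageClassification.from_pretrained(checkpoint)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Map the highest-scoring logit back to a human-readable label.
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])
```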
If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.
## DinatConfig

[[autodoc]] DinatConfig

## DinatModel

[[autodoc]] DinatModel
    - forward

## DinatForImageClassification

[[autodoc]] DinatForImageClassification
    - forward