facebook/data2vec-vision-base

+---
+license: apache-2.0
+tags:
+- image-classification
+- vision
+datasets:
+- imagenet
+- imagenet-1k
+---
+# Data2Vec-Vision (base-sized model, pre-trained only)
+BEiT model pre-trained in a self-supervised fashion on ImageNet-1k (1.2 million images, 1,000 classes) at resolution 224x224. It was introduced in the paper [data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language](https://arxiv.org/abs/2202.03555) by Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli, and first released in [this repository](https://github.com/facebookresearch/data2vec_vision/tree/main/beit).
+Disclaimer: The Facebook team releasing this model did not write a model card for it, so this model card has been written by the Hugging Face team.
+## Pre-Training method
+![model image](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/data2vec.png)
+For more information, please take a look at the [official paper](https://arxiv.org/abs/2202.03555).
+## Abstract
+*While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind. To get us closer to general self-supervised learning, we present data2vec, a framework that uses the same learning method for either speech, NLP or computer vision. The core idea is to predict latent representations of the full input data based on a masked view of the input in a self-distillation setup using a standard Transformer architecture. Instead of predicting modality-specific targets such as words, visual tokens or units of human speech which are local in nature, data2vec predicts contextualized latent representations that contain information from the entire input. Experiments on the major benchmarks of speech recognition, image classification, and natural language understanding demonstrate a new state of the art or competitive performance to predominant approaches.*
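The core idea in the abstract (a student that sees a masked input and regresses onto representations a teacher computes from the full input) can be illustrated with a small, dependency-free sketch. Everything here is an assumption made for illustration: the function names are hypothetical, representations are reduced to lists of floats, and the use of an exponential-moving-average teacher with a smooth L1 loss follows our reading of the paper rather than anything stated in this card. It is not the released training code.

```python
# Illustrative sketch only: plain-Python stand-ins for tensors and networks.
# All names are hypothetical; see the original codebase for the real training loop.


def ema_update(teacher, student, tau):
    """Move each teacher weight toward the student: t <- tau * t + (1 - tau) * s."""
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]


def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 distance between two scalars: quadratic below beta, linear above."""
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


def masked_regression_loss(student_out, teacher_out, mask, beta=1.0):
    """Average the loss over masked positions only.

    student_out: predictions computed from the masked input
    teacher_out: targets computed from the full, unmasked input
    mask:        True where the student's input was masked
    """
    losses = [
        smooth_l1(p, t, beta)
        for p, t, m in zip(student_out, teacher_out, mask)
        if m
    ]
    return sum(losses) / len(losses) if losses else 0.0
```

In this sketch only the masked positions contribute to the loss, and the teacher is never trained directly; it simply trails the student through `ema_update`.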
+## Intended uses & limitations
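+Because this checkpoint is pre-trained only, the raw model is mainly useful as a backbone: you can extract image features from it or fine-tune it for a downstream task such as image classification. See the [model hub](https://huggingface.co/models?other=data2vec-vision) to look for fine-tuned versions on a task that interests you.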
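A minimal sketch of loading this checkpoint with the `transformers` library to extract features is shown below. It assumes `torch`, `transformers`, and an image library such as Pillow are installed, and that the repository ships a preprocessing config that `AutoImageProcessor` can load. The function name `extract_features` is hypothetical and chosen for illustration.

```python
MODEL_ID = "facebook/data2vec-vision-base"


def extract_features(image, model_id=MODEL_ID):
    """Return the last hidden states for a single PIL image.

    Imports are deferred so the heavy dependencies are only required
    when the function is actually called.
    """
    import torch
    from transformers import AutoImageProcessor, Data2VecVisionModel

    # Download (or load from cache) the preprocessing config and the weights.
    processor = AutoImageProcessor.from_pretrained(model_id)
    model = Data2VecVisionModel.from_pretrained(model_id)
    model.eval()

    # Resize and normalize the image, then run a forward pass without gradients.
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state
```

A call would look like `extract_features(Image.open("path/to/image.jpg"))`, where the path is a placeholder. For repeated calls it would be more efficient to load the processor and model once and reuse them.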
+## Training data
+The BEiT model was pre-trained on [ImageNet-1k](http://www.image-net.org/), a dataset consisting of 1.2 million images in 1,000 classes.
+## Training procedure
+### Preprocessing
+The exact details of preprocessing of images during training/validation can be found [here](https://github.com/microsoft/unilm/blob/master/beit/datasets.py).
+Images are resized/rescaled to the same resolution (224x224) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).
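As a worked example of the normalization step, the sketch below applies the stated mean and standard deviation to a single RGB pixel. It assumes that "rescaled" means dividing 8-bit values by 255 before normalizing; the helper name `normalize_pixel` is hypothetical.

```python
MEAN = (0.5, 0.5, 0.5)
STD = (0.5, 0.5, 0.5)


def normalize_pixel(rgb, mean=MEAN, std=STD):
    """Rescale an 8-bit RGB pixel to [0, 1], then normalize each channel."""
    # Assumption: rescaling is a division by 255, applied before normalization.
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, mean, std))


# Black maps to -1.0 and white to 1.0 in every channel.
print(normalize_pixel((0, 0, 0)))        # (-1.0, -1.0, -1.0)
print(normalize_pixel((255, 255, 255)))  # (1.0, 1.0, 1.0)
```

With a mean and standard deviation of 0.5, the normalization maps the [0, 1] range onto [-1, 1].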
+### Pretraining
+For all pre-training hyperparameters, see the [data2vec paper](https://arxiv.org/abs/2202.03555), the [BEiT paper](https://arxiv.org/abs/2106.08254), and the [original codebase](https://github.com/facebookresearch/data2vec_vision/tree/main/beit).
+## Evaluation results
+For evaluation results on image classification benchmarks, see Table 1 of the original paper. Note that for fine-tuning, the best results are obtained with a higher resolution. Larger model sizes generally yield better performance.
+## BibTeX entry and citation info
+```bibtex
+@misc{https://doi.org/10.48550/arxiv.2202.03555,
+  doi = {10.48550/ARXIV.2202.03555},
+  url = {https://arxiv.org/abs/2202.03555},
+  author = {Baevski, Alexei and Hsu, Wei-Ning and Xu, Qiantong and Babu, Arun and Gu, Jiatao and Auli, Michael},
+  keywords = {Machine Learning (cs.LG), FOS: Computer and information sciences},
+  title = {data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language},
+  publisher = {arXiv},
+  year = {2022},
+  copyright = {arXiv.org perpetual, non-exclusive license}
+}
+```