|
--- |
|
license: apache-2.0 |
|
tags: |
|
- image-classification |
|
- vision |
|
datasets: |
|
- imagenet |
|
- imagenet-1k |
|
--- |
|
|
|
# Data2Vec-Vision (base-sized model, fine-tuned on ImageNet-1k) |
|
|
|
Data2Vec-Vision model (BEiT-style architecture) pre-trained in a self-supervised fashion and fine-tuned on ImageNet-1k (1.2 million images, 1,000 classes) at resolution 224x224. It was introduced in the paper [data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language](https://arxiv.org/abs/2202.03555) by Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu and Michael Auli, and first released in [this repository](https://github.com/facebookresearch/data2vec_vision/tree/main/beit).
|
|
|
Disclaimer: The Facebook team releasing this model did not write a model card for it, so this model card has been written by the Hugging Face team.
|
|
|
## Pre-Training method |
|
|
|
![model image](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/data2vec.png) |
|
|
|
For more information, please take a look at the [official paper](https://arxiv.org/abs/2202.03555). |
|
|
|
## Abstract |
|
|
|
*While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind. To get us closer to general self-supervised learning, we present data2vec, a framework that uses the same learning method for either speech, NLP or computer vision. The core idea is to predict latent representations of the full input data based on a masked view of the input in a self-distillation setup using a standard Transformer architecture. Instead of predicting modality-specific targets such as words, visual tokens or units of human speech which are local in nature, data2vec predicts contextualized latent representations that contain information from the entire input. Experiments on the major benchmarks of speech recognition, image classification, and natural language understanding demonstrate a new state of the art or competitive performance to predominant approaches.*
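As a rough illustration of the objective described above, the sketch below shows the masked-prediction / self-distillation idea in PyTorch-style pseudocode. It is a conceptual sketch only: the encoder interface, masking, target layer averaging and loss details are simplified assumptions and do not reproduce the official data2vec implementation.

```python
import torch
import torch.nn.functional as F

def data2vec_style_step(student, teacher, images, mask, top_k=6, ema_decay=0.999):
    """Conceptual sketch of one data2vec-style training step (not the official code).

    `student` and `teacher` are assumed to be Transformer encoders returning a list
    of per-layer hidden states of shape (batch, patches, dim); `mask` is a boolean
    tensor marking the patch positions hidden from the student.
    """
    # Teacher sees the *full* input and provides contextualized regression targets.
    with torch.no_grad():
        teacher_layers = teacher(images)                        # list of hidden states
        targets = torch.stack(teacher_layers[-top_k:]).mean(0)  # average of top-K layers

    # Student only sees the masked view of the same input.
    student_out = student(images, mask=mask)[-1]                # final-layer representations

    # Regress the teacher's latent representations at the masked positions.
    loss = F.smooth_l1_loss(student_out[mask], targets[mask])

    # Teacher weights track the student via an exponential moving average.
    with torch.no_grad():
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(ema_decay).add_(p_s, alpha=1.0 - ema_decay)

    return loss
```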
|
|
|
## Intended uses & limitations |
|
|
|
You can use the raw model for image classification. See the [model hub](https://huggingface.co/models?search=data2vec-vision) to look for |
|
fine-tuned versions on a task that interests you. |
|
|
|
### How to use |
|
|
|
Here is how to use this model to classify an image from the COCO 2017 dataset into one of the 1,000 ImageNet classes:
|
|
|
```python |
|
from transformers import BeitFeatureExtractor, Data2VecVisionForImageClassification |
|
from PIL import Image |
|
import requests |
|
url = 'http://images.cocodataset.org/val2017/000000039769.jpg' |
|
image = Image.open(requests.get(url, stream=True).raw) |
|
feature_extractor = BeitFeatureExtractor.from_pretrained('facebook/data2vec-vision-base-ft1k') |
|
model = Data2VecVisionForImageClassification.from_pretrained('facebook/data2vec-vision-base-ft1k') |
|
inputs = feature_extractor(images=image, return_tensors="pt") |
|
outputs = model(**inputs) |
|
logits = outputs.logits |
|
# model predicts one of the 1000 ImageNet classes |
|
predicted_class_idx = logits.argmax(-1).item() |
|
print("Predicted class:", model.config.id2label[predicted_class_idx]) |
|
``` |
|
|
|
Currently, both the feature extractor and model support PyTorch. |
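Alternatively, the same classification can be done in a couple of lines with the `pipeline` API. This is a convenience sketch and assumes a recent `transformers` release with the image-classification pipeline available:

```python
from transformers import pipeline

# The image-classification pipeline wraps the feature extractor and model shown above.
classifier = pipeline("image-classification", model="facebook/data2vec-vision-base-ft1k")

# Accepts a URL, a local path, or a PIL image; prints the top predicted label and score.
print(classifier("http://images.cocodataset.org/val2017/000000039769.jpg")[0])
```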
|
|
|
## Training data |
|
|
|
The Data2Vec-Vision model was pre-trained and fine-tuned on [ImageNet-1k](http://www.image-net.org/), a dataset consisting of 1.2 million images and 1,000 classes.
|
|
|
## Training procedure |
|
|
|
### Preprocessing |
|
|
|
The exact details of preprocessing of images during training/validation can be found [here](https://github.com/microsoft/unilm/blob/master/beit/datasets.py). |
|
|
|
Images are resized/rescaled to the same resolution (224x224) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5). |
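The same preprocessing can be approximated with plain torchvision transforms, as in the minimal sketch below (reusing the `image` loaded in the usage example above). This is only an approximation based on the values quoted here; in practice `BeitFeatureExtractor.from_pretrained(...)` applies the correct settings for you.

```python
from torchvision import transforms

# Approximates the model's preprocessing: 224x224 resize, then per-channel
# normalization with mean 0.5 and std 0.5 (mapping pixel values to [-1, 1]).
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

pixel_values = preprocess(image).unsqueeze(0)  # shape: (1, 3, 224, 224)
```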
|
|
|
### Pretraining |
|
|
|
For all pre-training related hyperparameters, we refer to the [original paper](https://arxiv.org/abs/2202.03555) and the [original codebase](https://github.com/facebookresearch/data2vec_vision/tree/main/beit).
|
|
|
## Evaluation results |
|
|
|
For evaluation results on several image classification benchmarks, we refer to Table 1 of the original paper. Note that for fine-tuning, the best results are obtained with a higher resolution, and larger model sizes generally yield better performance.
|
|
|
We evaluated the model on `ImageNet1K` and obtained a top-1 accuracy of **83.97**, while the original paper reports a top-1 accuracy of 84.2.

If you want to reproduce our evaluation process, you can use [this Colab notebook](https://colab.research.google.com/drive/1Tse8Rfv-QhapMEMzauxUqnAQyXUgnTLK?usp=sharing).
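If you prefer to script the evaluation yourself, the sketch below computes top-1 accuracy with a standard PyTorch loop. It assumes you have a local copy of the ImageNet-1k validation set (not distributed with this model) laid out for `torchvision.datasets.ImageFolder`, with synset-ID folder names so that the alphabetical class indices match the model's label mapping; the path is a placeholder.

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets
from transformers import BeitFeatureExtractor, Data2VecVisionForImageClassification

device = "cuda" if torch.cuda.is_available() else "cpu"
feature_extractor = BeitFeatureExtractor.from_pretrained("facebook/data2vec-vision-base-ft1k")
model = Data2VecVisionForImageClassification.from_pretrained(
    "facebook/data2vec-vision-base-ft1k"
).to(device).eval()

# Assumes ImageNet-1k validation images organized as val/<synset_id>/<image>.JPEG
dataset = datasets.ImageFolder("path/to/imagenet/val")
loader = DataLoader(dataset, batch_size=32, collate_fn=lambda batch: list(zip(*batch)))

correct = total = 0
with torch.no_grad():
    for images, labels in loader:
        inputs = feature_extractor(images=list(images), return_tensors="pt").to(device)
        preds = model(**inputs).logits.argmax(-1).cpu()
        correct += (preds == torch.tensor(labels)).sum().item()
        total += len(labels)

print(f"Top-1 accuracy: {100 * correct / total:.2f}%")
```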
|
|
|
### BibTeX entry and citation info |
|
|
|
```bibtex |
|
@misc{https://doi.org/10.48550/arxiv.2202.03555, |
|
doi = {10.48550/ARXIV.2202.03555}, |
|
url = {https://arxiv.org/abs/2202.03555}, |
|
author = {Baevski, Alexei and Hsu, Wei-Ning and Xu, Qiantong and Babu, Arun and Gu, Jiatao and Auli, Michael}, |
|
keywords = {Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences}, |
|
title = {data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language}, |
|
publisher = {arXiv}, |
|
year = {2022}, |
|
copyright = {arXiv.org perpetual, non-exclusive license} |
|
} |
|
``` |