---
library_name: aim
pipeline_tag: image-classification
license: other
license_name: apple-sample-code-license
license_link: LICENSE
datasets:
- imagenet-1k
metrics:
- accuracy
tags:
- large-scale-vision-models
- pytorch
- mlx
- jax
- vision
- ssl
- pre-training
- DFN
---
# AIM: Autoregressive Image Models
*Alaaeldin El-Nouby, Michal Klein, Shuangfei Zhai, Miguel Angel Bautista, Alexander Toshev, Vaishaal Shankar, Joshua M Susskind, and Armand Joulin*
This software project accompanies the research paper, *Scalable Pre-training of Large Autoregressive Image Models*.
We introduce AIM, a collection of vision models pre-trained with an autoregressive generative objective. We show that autoregressive pre-training of image features exhibits scaling properties similar to those of its textual counterpart (i.e., Large Language Models). Specifically, we highlight two findings:
- the model capacity can be trivially scaled to billions of parameters, and
- AIM effectively leverages large collections of uncurated image data.
## Installation
Please install PyTorch using the official installation instructions. Afterward, install the package as:

```
pip install git+https://git@github.com/apple/ml-aim.git
```
We also offer MLX backend support for research and experimentation on Apple silicon. To enable MLX support, simply run:

```
pip install mlx
```
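As a quick optional sanity check (not part of the official instructions), you can confirm that MLX imports correctly and reports its default device on Apple silicon:

```
python -c "import mlx.core as mx; print(mx.default_device())"
```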
## Usage
Below we provide an example of usage in PyTorch:

```python
from PIL import Image

from aim.utils import load_pretrained
from aim.torch.data import val_transforms

img = Image.open(...)
model = load_pretrained("aim-600M-2B-imgs", backend="torch")
transform = val_transforms()

inp = transform(img).unsqueeze(0)
logits, _ = model(inp)
```
and in MLX:
```python
from PIL import Image
import mlx.core as mx

from aim.utils import load_pretrained
from aim.torch.data import val_transforms

img = Image.open(...)
model = load_pretrained("aim-600M-2B-imgs", backend="mlx")
transform = val_transforms()

inp = transform(img).unsqueeze(0)
inp = mx.array(inp.numpy())
logits, _ = model(inp)
```
and in JAX:
```python
from PIL import Image
import jax.numpy as jnp

from aim.utils import load_pretrained
from aim.torch.data import val_transforms

img = Image.open(...)
model, params = load_pretrained("aim-600M-2B-imgs", backend="jax")
transform = val_transforms()

inp = transform(img).unsqueeze(0)
inp = jnp.array(inp)
(logits, _), _ = model.apply(params, inp, mutable=['batch_stats'])
```
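The loaded models return classification logits, so standard post-processing applies. Below is a minimal sketch in PyTorch, continuing from the PyTorch example above and assuming the head is an ImageNet-1k probe; the same softmax/top-k idea carries over to the MLX and JAX outputs:

```python
import torch

# `logits` comes from the PyTorch example above, shape (1, num_classes).
probs = torch.softmax(logits, dim=-1)
top5 = torch.topk(probs, k=5, dim=-1)
print(top5.indices.tolist())  # top-5 predicted class indices
print(top5.values.tolist())   # corresponding probabilities
```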
## Usage with the Hugging Face Hub

The pre-trained models can also be used via the Hugging Face Hub as follows:
```python
from PIL import Image

from aim.torch.models import AIMForImageClassification
from aim.torch.data import val_transforms

img = Image.open(...)
model = AIMForImageClassification.from_pretrained("apple/aim-7B")
transform = val_transforms()

inp = transform(img).unsqueeze(0)
logits, features = model(inp)
```
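For multiple images, the same transform can be applied per image and the results stacked into a batch before the forward pass. A minimal sketch (the image paths are placeholders, and the `.convert("RGB")` call is an assumption for handling non-RGB inputs):

```python
import torch
from PIL import Image

from aim.torch.models import AIMForImageClassification
from aim.torch.data import val_transforms

model = AIMForImageClassification.from_pretrained("apple/aim-7B")
transform = val_transforms()

# Placeholder paths; stack per-image tensors into a single (N, C, H, W) batch.
paths = ["img1.jpg", "img2.jpg"]
batch = torch.stack([transform(Image.open(p).convert("RGB")) for p in paths])

logits, features = model(batch)
preds = logits.argmax(dim=-1)  # one predicted class index per image
```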
## Pre-trained backbones
The following table contains pre-trained backbones used in our paper.
model | #params | attn probe top-1 (best layer) | backbone, SHA256
---|---|---|---
AIM-0.6B | 0.6B | 79.4% | link, `0d6f6b8f`
AIM-1B | 1B | 82.3% | link, `d254ecd3`
AIM-3B | 3B | 83.3% | link, `8475ce4e`
AIM-7B | 7B | 84.0% | link, `184ed94c`
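The SHA256 prefixes above can be used to sanity-check a downloaded checkpoint. A minimal sketch using the Python standard library (the `sha256_prefix` helper, the checkpoint path, and the expected prefix are placeholders for illustration):

```python
import hashlib

def sha256_prefix(path: str, length: int = 8) -> str:
    """Hash a file in chunks and return the first `length` hex characters."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()[:length]

# Compare against the prefix listed in the table (e.g. 184ed94c for AIM-7B).
assert sha256_prefix("/path/to/backbone_ckpt.pth") == "184ed94c"
```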
## Pre-trained attention heads
The table below contains the classification results on the ImageNet-1k validation set.
model | top-1 IN-1k (last layer) | top-1 IN-1k (best layer) | attention head, SHA256 (last layer) | attention head, SHA256 (best layer)
---|---|---|---|---
AIM-0.6B | 78.5% | 79.4% | link, `5ce5a341` | link, `ebd45c05`
AIM-1B | 80.6% | 82.3% | link, `db3be2ad` | link, `f1ed7852`
AIM-3B | 82.2% | 83.3% | link, `5c057b30` | link, `ad380e16`
AIM-7B | 82.4% | 84.0% | link, `1e5c99ba` | link, `73ecd732`
## Reproducing the IN-1k classification results
The command below reproduces the attention probe results on the ImageNet-1k validation set. We run the evaluation using 1 node with 8 GPUs:
```
torchrun --standalone --nnodes=1 --nproc-per-node=8 main_attnprobe.py \
  --model=aim-7B \
  --batch-size=64 \
  --data-path=/path/to/imagenet \
  --probe-layers=last \
  --backbone-ckpt-path=/path/to/backbone_ckpt.pth \
  --head-ckpt-path=/path/to/head_ckpt.pth
```
By default, we probe the last 6 layers. To change this, simply pass `--probe-layers=best`.
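For example, the full command with that flag changed would look as follows (a sketch; the checkpoint paths are placeholders, and the head checkpoint should presumably point to one of the best-layer heads listed above):

```
torchrun --standalone --nnodes=1 --nproc-per-node=8 main_attnprobe.py \
  --model=aim-7B \
  --batch-size=64 \
  --data-path=/path/to/imagenet \
  --probe-layers=best \
  --backbone-ckpt-path=/path/to/backbone_ckpt.pth \
  --head-ckpt-path=/path/to/head_ckpt.pth
```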