--- |
|
library_name: aim |
|
pipeline_tag: image-classification |
|
license: other |
|
license_name: apple-sample-code-license |
|
license_link: LICENSE |
|
datasets: |
|
- imagenet-1k |
|
metrics: |
|
- accuracy |
|
tags: |
|
- large-scale-vision-models |
|
- pytorch |
|
- mlx |
|
- jax |
|
- vision |
|
- ssl |
|
- pre-training |
|
- DFN |
|
--- |
|
# AIM: Autoregressive Image Models |
|
|
|
*Alaaeldin El-Nouby, Michal Klein, Shuangfei Zhai, Miguel Angel Bautista, Alexander Toshev, Vaishaal Shankar, Joshua M Susskind, and Armand Joulin*
|
|
|
|
|
This software project accompanies the research paper, *Scalable Pre-training of Large Autoregressive Image Models*.
|
|
|
We introduce **AIM**, a collection of vision models pre-trained with an autoregressive generative objective.
We show that autoregressive pre-training of image features exhibits scaling properties similar to those of its
textual counterpart (i.e., Large Language Models). Specifically, we highlight two findings:
1. the model capacity can be trivially scaled to billions of parameters, and
2. AIM effectively leverages large collections of uncurated image data.
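
For intuition, the autoregressive objective amounts to predicting each image patch from the patches that precede it in a fixed ordering. The snippet below is only a conceptual sketch of that idea (using a plain mean-squared-error regression target for illustration); it is not the training code released with this project.

```python
import torch
import torch.nn.functional as F

def autoregressive_patch_loss(predictions: torch.Tensor, patches: torch.Tensor) -> torch.Tensor:
    """Conceptual sketch: score how well patch k+1 is predicted from patches 1..k.

    Both tensors are (batch, num_patches, patch_dim); predictions[:, k] is the
    model's guess for the patch at position k + 1.
    """
    return F.mse_loss(predictions[:, :-1], patches[:, 1:])
```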
|
|
|
## Installation |
|
Please install PyTorch using the official [installation instructions](https://pytorch.org/get-started/locally/). |
|
Afterward, install the package as: |
|
```commandline
pip install git+https://git@github.com/apple/ml-aim.git
```
|
We also offer [MLX](https://github.com/ml-explore/mlx) backend support for research and experimentation on Apple silicon. |
|
To enable MLX support, simply run: |
|
```commandline
pip install mlx
```
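
To quickly confirm that both the package and the MLX backend are importable, a minimal check such as the following can be run (this is just a sketch, not part of the project):

```python
# Sanity check: the ml-aim package and the optional MLX backend both import.
from aim.utils import load_pretrained  # noqa: F401
import mlx.core as mx

print(mx.array([1.0, 2.0]) + 1)  # prints an MLX array if the backend is working
```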
|
|
|
## Usage |
|
Below we provide an example of usage in [PyTorch](https://pytorch.org/): |
|
```python
from PIL import Image

from aim.utils import load_pretrained
from aim.torch.data import val_transforms

img = Image.open(...)
model = load_pretrained("aim-600M-2B-imgs", backend="torch")
transform = val_transforms()

inp = transform(img).unsqueeze(0)
logits, _ = model(inp)
```
|
|
|
<details> |
|
<summary>and in <a href="https://ml-explore.github.io/mlx/">MLX</a></summary>
|
|
|
```python
from PIL import Image
import mlx.core as mx

from aim.utils import load_pretrained
from aim.torch.data import val_transforms

img = Image.open(...)
model = load_pretrained("aim-600M-2B-imgs", backend="mlx")
transform = val_transforms()

inp = transform(img).unsqueeze(0)
inp = mx.array(inp.numpy())
logits, _ = model(inp)
```
|
</details> |
|
|
|
<details> |
|
<summary>and in <a href="https://jax.readthedocs.io/">JAX</a></summary>
|
|
|
```python
from PIL import Image
import jax.numpy as jnp

from aim.utils import load_pretrained
from aim.torch.data import val_transforms

img = Image.open(...)
model, params = load_pretrained("aim-600M-2B-imgs", backend="jax")
transform = val_transforms()

inp = transform(img).unsqueeze(0)
inp = jnp.array(inp)
(logits, _), _ = model.apply(params, inp, mutable=['batch_stats'])
```
|
</details> |
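
Whichever backend is used, the first return value contains the class logits (assuming the loaded head is the ImageNet-1k classification probe, as in the examples above). As a small follow-up sketch, here is one way to read off a top-5 prediction in PyTorch; this is ordinary PyTorch, not part of the AIM package:

```python
import torch

# Convert the logits from the PyTorch example above into a top-5 prediction.
probs = torch.softmax(logits, dim=-1)
top5_prob, top5_idx = probs.topk(5, dim=-1)
print(top5_idx[0].tolist(), top5_prob[0].tolist())
```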
|
|
|
|
|
## Usage via Hugging Face Hub

The pre-trained models can also be loaded directly from the [Hugging Face Hub](https://huggingface.co/collections/apple/aim-65aa3ce948c718a574f09eb7) as follows:
|
|
|
```python
from PIL import Image

from aim.torch.models import AIMForImageClassification
from aim.torch.data import val_transforms

img = Image.open(...)
model = AIMForImageClassification.from_pretrained("apple/aim-7B")
transform = val_transforms()

inp = transform(img).unsqueeze(0)
logits, features = model(inp)
```
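
The second return value, `features`, can be reused for downstream tasks. The sketch below pools them into a single embedding per image; the assumption that `features` is a `(batch, num_patches, dim)` tensor is ours, so adapt it if the model returns a different structure (e.g. a list of per-layer features):

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch: average-pool the patch features into one image embedding.
with torch.no_grad():
    image_embedding = features.mean(dim=1)        # pool over the patch dimension
    image_embedding = F.normalize(image_embedding, dim=-1)
print(image_embedding.shape)
```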
|
|
|
### Pre-trained backbones |
|
|
|
The following table contains the pre-trained backbones used in our paper.
|
|
|
<table style="margin: auto"> |
|
<thead> |
|
<tr> |
|
<th>model</th> |
|
<th>#params</th> |
|
<th>attn (best layer)</th> |
|
<th>backbone, SHA256</th> |
|
</tr> |
|
</thead> |
|
<tbody> |
|
<tr> |
|
<td>AIM-0.6B</td> |
|
<td>0.6B</td> |
|
<td>79.4%</td> |
|
<td><a href="https://huggingface.co/apple/AIM/resolve/main/aim_600m_2bimgs_attnprobe_backbone.pth">link</a>, 0d6f6b8f</td> |
|
</tr> |
|
<tr> |
|
<td>AIM-1B</td> |
|
<td>1B</td> |
|
<td>82.3%</td> |
|
<td><a href="https://huggingface.co/apple/AIM/resolve/main/aim_1b_5bimgs_attnprobe_backbone.pth">link</a>, d254ecd3</td> |
|
</tr> |
|
<tr> |
|
<td>AIM-3B</td> |
|
<td>3B</td> |
|
<td>83.3%</td> |
|
<td><a href="https://huggingface.co/apple/AIM/resolve/main/aim_3b_5bimgs_attnprobe_backbone.pth">link</a>, 8475ce4e</td> |
|
</tr> |
|
<tr> |
|
<td>AIM-7B</td> |
|
<td>7B</td> |
|
<td>84.0%</td> |
|
<td><a href="https://huggingface.co/apple/AIM/resolve/main/aim_7b_5bimgs_attnprobe_backbone.pth">link</a>, 184ed94c</td> |
|
</tr> |
|
</tbody> |
|
</table> |
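
The SHA256 column lists the leading characters of each checkpoint's digest. A downloaded file can be checked against it with a few lines of standard Python (the local file name below is only an example):

```python
import hashlib

# Compare a downloaded checkpoint against the SHA256 prefix listed in the table.
with open("aim_7b_5bimgs_attnprobe_backbone.pth", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
print(digest)  # for the AIM-7B backbone, this is expected to start with 184ed94c
```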
|
|
|
### Pre-trained attention heads |
|
|
|
The table below contains the classification results on the ImageNet-1k validation set.
|
|
|
<table style="margin: auto"> |
|
<thead> |
|
<tr> |
|
<th rowspan="2">model</th> |
|
<th colspan="2">top-1 IN-1k</th> |
|
<th colspan="2">attention head, SHA256</th> |
|
</tr> |
|
<tr> |
|
<th>last layer</th> |
|
<th>best layer</th> |
|
<th>last layer</th> |
|
<th>best layer</th> |
|
</tr> |
|
</thead> |
|
|
|
<tbody> |
|
<tr> |
|
<td>AIM-0.6B</td> |
|
<td>78.5%</td> |
|
<td>79.4%</td> |
|
<td><a href="https://huggingface.co/apple/AIM/resolve/main/aim_600m_2bimgs_attnprobe_head_last_layers.pth">link</a>, 5ce5a341</td> |
|
<td><a href="https://huggingface.co/apple/AIM/resolve/main/aim_600m_2bimgs_attnprobe_head_best_layers.pth">link</a>, ebd45c05</td> |
|
</tr> |
|
<tr> |
|
<td>AIM-1B</td> |
|
<td>80.6%</td> |
|
<td>82.3%</td> |
|
<td><a href="https://huggingface.co/apple/AIM/resolve/main/aim_1b_5bimgs_attnprobe_head_last_layers.pth">link</a>, db3be2ad</td> |
|
<td><a href="https://huggingface.co/apple/AIM/resolve/main/aim_1b_5bimgs_attnprobe_head_best_layers.pth">link</a>, f1ed7852</td> |
|
</tr> |
|
<tr> |
|
<td>AIM-3B</td> |
|
<td>82.2%</td> |
|
<td>83.3%</td> |
|
<td><a href="https://huggingface.co/apple/AIM/resolve/main/aim_3b_5bimgs_attnprobe_head_last_layers.pth">link</a>, 5c057b30</td> |
|
<td><a href="https://huggingface.co/apple/AIM/resolve/main/aim_3b_5bimgs_attnprobe_head_best_layers.pth">link</a>, ad380e16</td> |
|
</tr> |
|
<tr> |
|
<td>AIM-7B</td> |
|
<td>82.4%</td> |
|
<td>84.0%</td> |
|
<td><a href="https://huggingface.co/apple/AIM/resolve/main/aim_7b_5bimgs_attnprobe_head_last_layers.pth">link</a>, 1e5c99ba</td> |
|
<td><a href="https://huggingface.co/apple/AIM/resolve/main/aim_7b_5bimgs_attnprobe_head_best_layers.pth">link</a>, 73ecd732</td> |
|
</tr> |
|
</tbody> |
|
</table> |
|
|
|
## Reproducing the IN-1k classification results |
|
The commands below reproduce the [attention probe results](#pre-trained-attention-heads) on the ImageNet-1k
validation set. We run the evaluation using a single node with 8 GPUs:
|
```commandline
torchrun --standalone --nnodes=1 --nproc-per-node=8 main_attnprobe.py \
  --model=aim-7B \
  --batch-size=64 \
  --data-path=/path/to/imagenet \
  --probe-layers=last \
  --backbone-ckpt-path=/path/to/backbone_ckpt.pth \
  --head-ckpt-path=/path/to/head_ckpt.pth
```
|
By default, we probe the last 6 layers. To change this, simply pass `--probe-layers=best`. |
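
For example, a best-layer evaluation of AIM-7B might look like the command below; the head checkpoint path is a placeholder for the corresponding `*_attnprobe_head_best_layers.pth` file from the table above:

```commandline
torchrun --standalone --nnodes=1 --nproc-per-node=8 main_attnprobe.py \
  --model=aim-7B \
  --batch-size=64 \
  --data-path=/path/to/imagenet \
  --probe-layers=best \
  --backbone-ckpt-path=/path/to/backbone_ckpt.pth \
  --head-ckpt-path=/path/to/head_best_layers_ckpt.pth
```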