# Image Captioners Are Scalable Vision Learners Too
*by Michael Tschannen, Manoj Kumar, Andreas Steiner, Xiaohua Zhai, Neil Houlsby, Lucas Beyer* [[arxiv]](https://arxiv.org/abs/2306.07915)

This directory contains a config for training a CapPa model from scratch.
Note that most models in the paper were trained on a proprietary dataset
(WebLI), but similar results can be obtained by training on [LAION](https://laion.ai/).
By default, this config trains on COCO captions, since that dataset is readily
available in [TFDS](https://www.tensorflow.org/datasets) without manual steps.
This is not meant to produce a meaningful model, but it lets you run the
config out of the box. To train capable models, update the config with a
TFDS-wrapped variant of your favorite image/text dataset, for example as
sketched below.
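
As a minimal sketch of such a change, assuming the usual big_vision config
layout (`config.input.data` holding a TFDS dataset name and split); the
dataset name below is a placeholder, not a real TFDS dataset:

```
# Sketch only: inside get_config() in pretrain.py, point the input pipeline
# at your own TFDS-wrapped image/text dataset. Field names follow the common
# big_vision config layout and may differ in your checkout.
config.input.data = dict(
    name='my_image_text_dataset',  # placeholder: your TFDS dataset name
    split='train',
)
```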
After setting up `big_vision` as described in the [main README](https://github.com/google-research/big_vision#cloud-tpu-vm-setup), training can be launched as follows:
```
python -m big_vision.trainers.proj.cappa.generative \
    --config big_vision/configs/proj/cappa/pretrain.py \
    --workdir gs://$GS_BUCKET_NAME/big_vision/`date '+%m-%d_%H%M'`
```
To run the Cap baseline (autoregressive captioning without parallel prediction),
set `config.model.masked_pred_prob = 0.0`.
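Assuming the standard `ml_collections` config flags used by the big_vision
trainers, this can also be overridden on the command line by appending
`--config.model.masked_pred_prob=0.0` to the launch command above.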
### Citation
```
@inproceedings{tschannen2023image,
  title={Image Captioners Are Scalable Vision Learners Too},
  author={Tschannen, Michael and Kumar, Manoj and Steiner, Andreas and Zhai, Xiaohua and Houlsby, Neil and Beyer, Lucas},
  booktitle={Neural Information Processing Systems (NeurIPS)},
  year={2023}
}
```