|
 |
|
 |
|
 |
|
|
|
**NOTE: For the most part, you will find a newer version of this code at [models/research/slim](https://github.com/tensorflow/models/tree/master/research/slim).** In particular: |
|
|
|
* `inception_train.py` and `imagenet_train.py` should no longer be used. The slim editions for running on multiple GPUs are the current best examples. |
|
* `inception_distributed_train.py` and `imagenet_distributed_train.py` are still valid examples of distributed training. |
|
|
|
For performance benchmarking, please see https://www.tensorflow.org/performance/benchmarks. |
|
|
|
--- |
|
|
|
# Inception in TensorFlow |
|
|
|
[ImageNet](http://www.image-net.org/) is a common academic data set in machine |
|
learning for training an image recognition system. Code in this directory |
|
demonstrates how to use TensorFlow to train and evaluate a type of convolutional |
|
neural network (CNN) on this academic data set. In particular, we demonstrate |
|
how to train the Inception v3 architecture as specified in: |
|
|
|
_Rethinking the Inception Architecture for Computer Vision_ |
|
|
|
Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew |
|
Wojna |
|
|
|
http://arxiv.org/abs/1512.00567 |
|
|
|
This network achieves 21.2% top-1 and 5.6% top-5 error for single frame |
|
evaluation with a computational cost of 5 billion multiply-adds per inference |
|
while using fewer than 25 million parameters. Below is a visualization of the
|
model architecture. |
|
|
|
 |
|
|
|
## Description of Code |
|
|
|
The code base provides three core binaries for: |
|
|
|
* Training an Inception v3 network from scratch across multiple GPUs and/or |
|
multiple machines using the ImageNet 2012 Challenge training data set. |
|
* Evaluating an Inception v3 network using the ImageNet 2012 Challenge |
|
validation data set. |
|
* Retraining an Inception v3 network on a novel task and back-propagating the |
|
errors to fine tune the network weights. |
|
|
|
The training procedure employs synchronous stochastic gradient descent across |
|
multiple GPUs. The user may specify the number of GPUs they wish to harness. The |
|
synchronous training performs *batch-splitting* by dividing a given batch across |
|
multiple GPUs. |
|
|
|
The training setup is nearly identical to the section [Training a Model Using
Multiple GPU Cards](https://www.tensorflow.org/tutorials/deep_cnn/index.html#launching_and_training_the_model_on_multiple_gpu_cards),
where we have substituted the CIFAR-10 model architecture with Inception v3. The
primary differences from that setup are:
|
|
|
* Calculate and update the batch-norm statistics during training so that they
may be substituted in during evaluation (a minimal sketch of this appears at
the end of this section).
|
* Specify the model architecture using a (still experimental) higher level |
|
language called TensorFlow-Slim. |
|
|
|
For more details about TensorFlow-Slim, please see the [Slim README](inception/slim/README.md). Please note that this higher-level language is still |
|
*experimental* and the API may change over time depending on usage and |
|
subsequent research. |
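To make the first point above concrete, here is a minimal sketch of batch normalization with moving statistics. It is not the code base's actual implementation (that lives in TensorFlow-Slim) and the decay value is an assumption; it only illustrates how per-batch statistics are folded into moving averages during training and substituted in at evaluation time.

```python
import tensorflow as tf  # assumes a TensorFlow 1.x-style API, as used by this code base


def batch_norm(x, is_training, decay=0.999, epsilon=0.001):
    """Illustrative batch norm: batch statistics in training, moving averages in eval."""
    depth = x.get_shape().as_list()[-1]
    beta = tf.get_variable('beta', [depth], initializer=tf.zeros_initializer())
    gamma = tf.get_variable('gamma', [depth], initializer=tf.ones_initializer())
    moving_mean = tf.get_variable('moving_mean', [depth], trainable=False,
                                  initializer=tf.zeros_initializer())
    moving_var = tf.get_variable('moving_variance', [depth], trainable=False,
                                 initializer=tf.ones_initializer())
    if is_training:
        # Normalize with the current batch statistics and fold them into the
        # moving averages that will be used at evaluation time.
        mean, var = tf.nn.moments(x, axes=[0, 1, 2])
        update_mean = tf.assign(moving_mean, decay * moving_mean + (1 - decay) * mean)
        update_var = tf.assign(moving_var, decay * moving_var + (1 - decay) * var)
        with tf.control_dependencies([update_mean, update_var]):
            return tf.nn.batch_normalization(x, mean, var, beta, gamma, epsilon)
    # Evaluation: substitute the accumulated moving statistics.
    return tf.nn.batch_normalization(x, moving_mean, moving_var, beta, gamma, epsilon)
```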
|
|
|
## Getting Started |
|
|
|
Before you run the training script for the first time, you will need to download |
|
and convert the ImageNet data to native TFRecord format. The TFRecord format |
|
consists of a set of sharded files where each entry is a serialized `tf.Example` |
|
proto. Each `tf.Example` proto contains the ImageNet image (JPEG encoded) as |
|
well as metadata such as label and bounding box information. See |
|
[`parse_example_proto`](inception/image_processing.py) for details. |
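For illustration, decoding one such record might look like the following sketch. The feature keys shown here are representative assumptions; the authoritative keys and full metadata handling live in `parse_example_proto` in [`image_processing.py`](inception/image_processing.py).

```python
import tensorflow as tf  # assumes a TensorFlow 1.x-style API


def decode_record(serialized_example):
    """Sketch: decode the JPEG image and label from one serialized tf.Example."""
    features = tf.parse_single_example(
        serialized_example,
        features={
            'image/encoded': tf.FixedLenFeature([], tf.string),
            'image/class/label': tf.FixedLenFeature([], tf.int64),
        })
    image = tf.image.decode_jpeg(features['image/encoded'], channels=3)
    label = tf.cast(features['image/class/label'], tf.int32)
    return image, label
```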
|
|
|
We provide a single [script](inception/data/download_and_preprocess_imagenet.sh) for |
|
downloading and converting ImageNet data to TFRecord format. Downloading and |
|
preprocessing the data may take several hours (up to half a day) depending on |
|
your network and computer speed. Please be patient. |
|
|
|
To begin, you will need to sign up for an account with [ImageNet](http://image-net.org) to gain access to the data. Look for the sign up page, |
|
create an account and request an access key to download the data. |
|
|
|
After you have `USERNAME` and `PASSWORD`, you are ready to run our script. Make |
|
sure that your hard disk has at least 500 GB of free space for downloading and |
|
storing the data. Here we select `DATA_DIR=$HOME/imagenet-data` as such a |
|
location but feel free to edit accordingly. |
|
|
|
When you run the below script, please enter *USERNAME* and *PASSWORD* when |
|
prompted. This will occur at the very beginning. Once these values are entered, |
|
you will not need to interact with the script again. |
|
|
|
```shell |
|
# location of where to place the ImageNet data |
|
DATA_DIR=$HOME/imagenet-data |
|
|
|
# build the preprocessing script. |
|
cd tensorflow-models/inception |
|
bazel build //inception:download_and_preprocess_imagenet |
|
|
|
# run it |
|
bazel-bin/inception/download_and_preprocess_imagenet "${DATA_DIR}" |
|
``` |
|
|
|
The final line of the script's output should read:
|
|
|
```shell |
|
2016-02-17 14:30:17.287989: Finished writing all 1281167 images in data set. |
|
``` |
|
|
|
When the script finishes, you will find 1024 training files and 128 validation |
|
files in the `DATA_DIR`. The files will match the patterns |
|
`train-?????-of-01024` and `validation-?????-of-00128`, respectively. |
|
|
|
[Congratulations!](https://www.youtube.com/watch?v=9bZkp7q19f0) You are now |
|
ready to train or evaluate with the ImageNet data set. |
|
|
|
## How to Train from Scratch |
|
|
|
**WARNING** Training an Inception v3 network from scratch is a computationally |
|
intensive task and, depending on your compute setup, may take several days or even
|
weeks. |
|
|
|
*Before proceeding* please read the [Convolutional Neural Networks](https://www.tensorflow.org/tutorials/deep_cnn/index.html) tutorial; in |
|
particular, focus on [Training a Model Using Multiple GPU Cards](https://www.tensorflow.org/tutorials/deep_cnn/index.html#launching_and_training_the_model_on_multiple_gpu_cards). The model training method is nearly identical to that described in the |
|
CIFAR-10 multi-GPU model training. Briefly, the model training |
|
|
|
* Places an individual model replica on each GPU. |
|
* Splits the batch across the GPUs. |
|
* Updates model parameters synchronously by waiting for all GPUs to finish |
|
processing a batch of data. |
|
|
|
The training procedure is encapsulated by this diagram of how operations and |
|
variables are placed on CPU and GPUs respectively. |
|
|
|
<div style="width:40%; margin:auto; margin-bottom:10px; margin-top:20px;"> |
|
<img style="width:100%" src="https://www.tensorflow.org/images/Parallelism.png"> |
|
</div> |
|
|
|
Each tower computes the gradients for a portion of the batch and the gradients |
|
are combined and averaged across the multiple towers in order to provide a |
|
single update of the Variables stored on the CPU. |
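A minimal sketch of this pattern follows. It is not the actual `inception_train.py` code; `inference_fn`, `loss_fn`, and `optimizer` are placeholders standing in for the Inception model, its loss, and the configured optimizer.

```python
import tensorflow as tf  # assumes a TensorFlow 1.x-style API


def multi_gpu_train_op(images, labels, inference_fn, loss_fn, optimizer, num_gpus):
    """Sketch: split the batch across GPU towers and apply the averaged gradient."""
    image_splits = tf.split(images, num_gpus, axis=0)
    label_splits = tf.split(labels, num_gpus, axis=0)
    tower_grads = []
    for i in range(num_gpus):
        with tf.device('/gpu:%d' % i):
            # Reuse the same variables (stored on the CPU) for every tower.
            with tf.variable_scope(tf.get_variable_scope(), reuse=(i > 0)):
                logits = inference_fn(image_splits[i])
                loss = loss_fn(logits, label_splits[i])
                tower_grads.append(optimizer.compute_gradients(loss))
    # Average each variable's gradient across the towers and apply one update.
    averaged = []
    for grads_and_vars in zip(*tower_grads):
        grad = tf.reduce_mean(tf.stack([g for g, _ in grads_and_vars]), axis=0)
        averaged.append((grad, grads_and_vars[0][1]))
    return optimizer.apply_gradients(averaged)
```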
|
|
|
A crucial aspect of training a network of this size is *training speed* in terms |
|
of wall-clock time. The training speed is dictated by many factors -- most |
|
importantly the batch size and the learning rate schedule. Both of these |
|
parameters are heavily coupled to the hardware setup.
|
|
|
The batch size is a difficult parameter to tune, as it requires balancing the
memory demands of the model, the memory available on the GPU, and the speed of
computation. Generally speaking, employing larger batch sizes leads to more
efficient computation and potentially more efficient training steps.
|
|
|
We have tested several hardware setups for training this model from scratch but |
|
we emphasize that, depending on your hardware setup, you may need to adapt the
|
batch size and learning rate schedule. |
|
|
|
Please see the comments in `inception_train.py` for a few learning rate
plans based on selected hardware setups.
|
|
|
To train this model, you simply need to specify the following: |
|
|
|
```shell |
|
# Build the model. Note that TensorFlow must already be installed and working,
# as this command will not build TensorFlow itself.
|
cd tensorflow-models/inception |
|
bazel build //inception:imagenet_train |
|
|
|
# run it |
|
bazel-bin/inception/imagenet_train --num_gpus=1 --batch_size=32 --train_dir=/tmp/imagenet_train --data_dir=/tmp/imagenet_data |
|
``` |
|
|
|
The model reads in the ImageNet training data from `--data_dir`. If you followed |
|
the instructions in [Getting Started](#getting-started), then set |
|
`--data_dir="${DATA_DIR}"`. The script assumes that there exists a set of |
|
sharded TFRecord files containing the ImageNet data. If you have not created |
|
TFRecord files, please refer to [Getting Started](#getting-started).
|
|
|
Here is the output of the above command line when running on a Tesla K40c: |
|
|
|
```shell |
|
2016-03-07 12:24:59.922898: step 0, loss = 13.11 (5.3 examples/sec; 6.064 sec/batch) |
|
2016-03-07 12:25:55.206783: step 10, loss = 13.71 (9.4 examples/sec; 3.394 sec/batch) |
|
2016-03-07 12:26:28.905231: step 20, loss = 14.81 (9.5 examples/sec; 3.380 sec/batch) |
|
2016-03-07 12:27:02.699719: step 30, loss = 14.45 (9.5 examples/sec; 3.378 sec/batch) |
|
2016-03-07 12:27:36.515699: step 40, loss = 13.98 (9.5 examples/sec; 3.376 sec/batch) |
|
2016-03-07 12:28:10.220956: step 50, loss = 13.92 (9.6 examples/sec; 3.327 sec/batch) |
|
2016-03-07 12:28:43.658223: step 60, loss = 13.28 (9.6 examples/sec; 3.350 sec/batch) |
|
... |
|
``` |
|
|
|
In this example, a log entry is printed every 10 steps and each line includes the
|
total loss (starts around 13.0-14.0) and the speed of processing in terms of |
|
throughput (examples / sec) and batch speed (sec/batch). |
|
|
|
The number of GPU devices is specified by `--num_gpus` (which defaults to 1). |
|
Specifying `--num_gpus` greater than 1 splits the batch evenly across the
|
GPU cards. |
|
|
|
```shell |
|
# Build the model. Note that TensorFlow must already be installed and working,
# as this command will not build TensorFlow itself.
|
cd tensorflow-models/inception |
|
bazel build //inception:imagenet_train |
|
|
|
# run it |
|
bazel-bin/inception/imagenet_train --num_gpus=2 --batch_size=64 --train_dir=/tmp/imagenet_train |
|
``` |
|
|
|
This model splits the batch of 64 images across 2 GPUs and calculates the |
|
average gradient by waiting for both GPUs to finish calculating the gradients |
|
from their respective data (See diagram above). Generally speaking, using larger |
|
numbers of GPUs leads to higher throughput as well as the opportunity to use |
|
larger batch sizes. In turn, larger batch sizes imply better estimates of the |
|
gradient, enabling the use of higher learning rates. In summary, using more
GPUs simply results in faster training.
|
|
|
Note that the batch size is a difficult parameter to tune as it requires
|
balancing memory demands of the model, memory available on the GPU and speed of |
|
computation. Generally speaking, employing larger batch sizes leads to more |
|
efficient computation and potentially more efficient training steps. |
|
|
|
Note that there is considerable noise in the loss function on individual steps |
|
in the previous log. Because of this noise, it is difficult to discern how well |
|
a model is learning. The solution is to launch TensorBoard, pointing it at the
directory containing the events log.
|
|
|
```shell |
|
tensorboard --logdir=/tmp/imagenet_train |
|
``` |
|
|
|
TensorBoard has access to the many Summaries produced by the model that describe |
|
a multitude of statistics tracking the model behavior and the quality of the
learned model. In particular, TensorBoard tracks an exponentially smoothed
|
version of the loss. In practice, it is far easier to judge how well a model |
|
learns by monitoring the smoothed version of the loss. |
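As a rough illustration of the mechanism (the decay value, summary names, and log directory below are assumptions, not the code base's actual choices), a raw and a smoothed loss can be exported like this:

```python
import tensorflow as tf  # assumes a TensorFlow 1.x-style API

# Illustrative only: export the raw loss and a smoothed copy for TensorBoard.
loss = tf.reduce_mean(tf.random_normal([32]))        # stand-in for the training loss
ema = tf.train.ExponentialMovingAverage(decay=0.9)   # decay value is an assumption
maintain_avg = ema.apply([loss])
tf.summary.scalar('raw_loss', loss)
tf.summary.scalar('smoothed_loss', ema.average(loss))
summary_op = tf.summary.merge_all()

with tf.Session() as sess:
    writer = tf.summary.FileWriter('/tmp/imagenet_train', sess.graph)
    sess.run(tf.global_variables_initializer())
    for step in range(100):
        _, summary = sess.run([maintain_avg, summary_op])
        writer.add_summary(summary, step)
```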
|
|
|
## How to Train from Scratch in a Distributed Setting |
|
|
|
**NOTE** Distributed TensorFlow requires version 0.8 or later. |
|
|
|
Distributed TensorFlow lets us use multiple machines to train a model faster. |
|
This is quite different from training with multiple GPU towers on a single
machine, where all parameter and gradient computation happens in the same place. We
|
coordinate the computation across multiple machines by employing a centralized |
|
repository for parameters that maintains a unified, single copy of model |
|
parameters. Each individual machine sends gradient updates to the centralized |
|
parameter repository which coordinates these updates and sends back updated |
|
parameters to the individual machines running the model training. |
|
|
|
We term each machine that runs a copy of the training a `worker` or `replica`. |
|
We term each machine that maintains model parameters a `ps`, short for |
|
`parameter server`. Note that we might have more than one machine acting as a |
|
`ps` as the model parameters may be sharded across multiple machines. |
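In TensorFlow terms, each of these machines runs a task inside a cluster. Below is a hedged sketch of how a task might declare the cluster and its own role; the hostnames and the variable shown are placeholders, and the real code lives in `inception_distributed_train.py`.

```python
import tensorflow as tf  # assumes a TensorFlow 1.x-style API

# Placeholder hostnames matching the example commands below.
cluster = tf.train.ClusterSpec({
    'ps': ['ps0.example.com:2222'],
    'worker': ['worker0.example.com:2222', 'worker1.example.com:2222'],
})

# Every task starts a server announcing its role (job_name) and index (task_id).
server = tf.train.Server(cluster, job_name='worker', task_index=0)

# A ps task would simply block on server.join(); a worker builds the graph with
# its variables placed on the ps tasks via a device setter.
with tf.device(tf.train.replica_device_setter(
        worker_device='/job:worker/task:0', cluster=cluster)):
    global_step = tf.get_variable('global_step', [], dtype=tf.int64,
                                  initializer=tf.zeros_initializer(),
                                  trainable=False)
    # ... build the Inception model here ...
```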
|
|
|
Variables may be updated with synchronous or asynchronous gradient updates. One |
|
may construct an [`Optimizer`](https://www.tensorflow.org/api_docs/python/train.html#optimizers) in TensorFlow
|
that constructs the necessary graph for either case diagrammed below from the |
|
TensorFlow [Whitepaper](http://download.tensorflow.org/paper/whitepaper2015.pdf): |
|
|
|
<div style="width:40%; margin:auto; margin-bottom:10px; margin-top:20px;"> |
|
<img style="width:100%" |
|
src="https://www.tensorflow.org/images/tensorflow_figure7.png"> |
|
</div> |
|
|
|
In [a recent paper](https://arxiv.org/abs/1604.00981), synchronous gradient |
|
updates have been demonstrated to reach higher accuracy in a shorter amount of time.
|
In this distributed Inception example we employ synchronous gradient updates. |
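Concretely, synchronous updates are typically expressed by wrapping the base optimizer. The sketch below is illustrative only (the learning-rate value is an assumption); the real configuration lives in `inception_distributed_train.py`.

```python
import tensorflow as tf  # assumes a TensorFlow 1.x-style API

num_workers = 2  # matches the two-worker example below

# Base optimizer; the learning rate here is a placeholder value.
base_opt = tf.train.RMSPropOptimizer(learning_rate=0.045, decay=0.9,
                                     momentum=0.9, epsilon=1.0)

# Aggregate gradients from all replicas before applying a single update to the
# shared parameters that live on the ps tasks.
opt = tf.train.SyncReplicasOptimizer(base_opt,
                                     replicas_to_aggregate=num_workers,
                                     total_num_replicas=num_workers)
# train_op = opt.minimize(total_loss, global_step=global_step)
```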
|
|
|
Note that in this example each replica has a single tower that uses one GPU. |
|
|
|
The command-line flags `worker_hosts` and `ps_hosts` specify available servers. |
|
The same binary will be used for both the `worker` jobs and the `ps` jobs. |
|
The command-line flag `job_name` specifies what role a task plays, and
`task_id` identifies which instance of that job it is running. Several things to note here:
|
|
|
* The numbers of `ps` and `worker` tasks are inferred from the lists of hosts |
|
specified in the flags. The `task_id` should be within the range `[0, |
|
num_ps_tasks)` for `ps` tasks and `[0, num_worker_tasks)` for `worker` |
|
tasks. |
|
* `ps` and `worker` tasks can run on the same machine, as long as that machine |
|
has sufficient resources to handle both tasks. Note that the `ps` task does |
|
not benefit from a GPU, so it should not attempt to use one (see below). |
|
* Multiple `worker` tasks can run on the same machine with multiple GPUs so |
|
machine_A with 2 GPUs may have 2 workers while machine_B with 1 GPU just has |
|
1 worker. |
|
* The default learning rate schedule works well for a wide range of number of |
|
replicas [25, 50, 100] but feel free to tune it for even better results. |
|
* The command line of both `ps` and `worker` tasks should include the complete |
|
list of `ps_hosts` and `worker_hosts`. |
|
* There is a chief `worker` among all workers which defaults to `worker` 0. |
|
The chief will be in charge of initializing all the parameters, writing out |
|
the summaries and the checkpoint. The checkpoint and summary will be in the |
|
`train_dir` of the host for `worker` 0. |
|
* Each worker processes a `batch_size` number of examples but each gradient
update is computed from all replicas. Hence, the effective batch size of
this model is `batch_size * num_workers`.
|
|
|
```shell |
|
# Build the model. Note that TensorFlow must already be installed and working,
# as this command will not build TensorFlow itself.
|
cd tensorflow-models/inception |
|
bazel build //inception:imagenet_distributed_train |
|
|
|
# To start worker 0, go to the worker0 host and run the following (Note that |
|
# task_id should be in the range [0, num_worker_tasks): |
|
bazel-bin/inception/imagenet_distributed_train \ |
|
--batch_size=32 \ |
|
--data_dir=$HOME/imagenet-data \ |
|
--job_name='worker' \ |
|
--task_id=0 \ |
|
--ps_hosts='ps0.example.com:2222' \ |
|
--worker_hosts='worker0.example.com:2222,worker1.example.com:2222' |
|
|
|
# To start worker 1, go to the worker1 host and run the following (Note that |
|
# task_id should be in the range [0, num_worker_tasks): |
|
bazel-bin/inception/imagenet_distributed_train \ |
|
--batch_size=32 \ |
|
--data_dir=$HOME/imagenet-data \ |
|
--job_name='worker' \ |
|
--task_id=1 \ |
|
--ps_hosts='ps0.example.com:2222' \ |
|
--worker_hosts='worker0.example.com:2222,worker1.example.com:2222' |
|
|
|
# To start the parameter server (ps), go to the ps host and run the following (Note |
|
# that task_id should be in the range [0, num_ps_tasks): |
|
bazel-bin/inception/imagenet_distributed_train \ |
|
--job_name='ps' \ |
|
--task_id=0 \ |
|
--ps_hosts='ps0.example.com:2222' \ |
|
--worker_hosts='worker0.example.com:2222,worker1.example.com:2222' |
|
``` |
|
|
|
If you have installed a GPU-compatible version of TensorFlow, the `ps` will also
try to allocate GPU memory even though it does not benefit from it. This could
potentially crash a worker running on the same machine by leaving it with little
to no GPU memory to allocate. To avoid this, prepend the previous command for
starting the `ps` with `CUDA_VISIBLE_DEVICES=''`:
|
|
|
```shell |
|
CUDA_VISIBLE_DEVICES='' bazel-bin/inception/imagenet_distributed_train \ |
|
--job_name='ps' \ |
|
--task_id=0 \ |
|
--ps_hosts='ps0.example.com:2222' \ |
|
--worker_hosts='worker0.example.com:2222,worker1.example.com:2222' |
|
``` |
|
|
|
If you have run everything correctly, you should see a log in each `worker` job |
|
that looks like the following. Note that the training speed varies depending on your
|
hardware and the first several steps could take much longer. |
|
|
|
```shell |
|
INFO:tensorflow:PS hosts are: ['ps0.example.com:2222', 'ps1.example.com:2222'] |
|
INFO:tensorflow:Worker hosts are: ['worker0.example.com:2222', 'worker1.example.com:2222'] |
|
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job ps -> {ps0.example.com:2222, ps1.example.com:2222} |
|
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job worker -> {localhost:2222, worker1.example.com:2222} |
|
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:202] Started server with target: grpc://localhost:2222 |
|
INFO:tensorflow:Created variable global_step:0 with shape () and init <function zeros_initializer at 0x7f6aa014b140> |
|
|
|
... |
|
|
|
INFO:tensorflow:Created variable logits/logits/biases:0 with shape (1001,) and init <function _initializer at 0x7f6a77f3cf50> |
|
INFO:tensorflow:SyncReplicas enabled: replicas_to_aggregate=2; total_num_replicas=2 |
|
INFO:tensorflow:2016-04-13 01:56:26.405639 Supervisor |
|
INFO:tensorflow:Started 2 queues for processing input data. |
|
INFO:tensorflow:global_step/sec: 0 |
|
INFO:tensorflow:Worker 0: 2016-04-13 01:58:40.342404: step 0, loss = 12.97(0.0 examples/sec; 65.428 sec/batch) |
|
INFO:tensorflow:global_step/sec: 0.0172907 |
|
... |
|
``` |
|
|
|
and a log in each `ps` job that looks like the following: |
|
|
|
```shell |
|
INFO:tensorflow:PS hosts are: ['ps0.example.com:2222', 'ps1.example.com:2222'] |
|
INFO:tensorflow:Worker hosts are: ['worker0.example.com:2222', 'worker1.example.com:2222'] |
|
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job ps -> {localhost:2222, ps1.example.com:2222} |
|
I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:206] Initialize HostPortsGrpcChannelCache for job worker -> {worker0.example.com:2222, worker1.example.com:2222} |
|
I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:202] Started server with target: grpc://localhost:2222 |
|
``` |
|
|
|
If you compiled TensorFlow (v1.1-rc3 or later) with VERBS support and you have the
required hardware and InfiniBand verbs software stack, you can specify
`--protocol='grpc+verbs'` to use Verbs RDMA for tensor passing between workers and
parameter servers. The `--protocol` flag must be added to all tasks (both `ps` and
`worker`). The default protocol is gRPC.
|
|
|
|
|
[Congratulations!](https://www.youtube.com/watch?v=9bZkp7q19f0) You are now |
|
training Inception in a distributed manner. |
|
|
|
## How to Evaluate |
|
|
|
Evaluating an Inception v3 model on the ImageNet 2012 validation data set |
|
requires running a separate binary. |
|
|
|
The evaluation procedure is nearly identical to [Evaluating a Model](https://www.tensorflow.org/tutorials/deep_cnn/index.html#evaluating_a_model) |
|
described in the [Convolutional Neural Network](https://www.tensorflow.org/tutorials/deep_cnn/index.html) tutorial. |
|
|
|
**WARNING** Be careful not to run the evaluation and training binary on the same |
|
GPU or else you might run out of memory. Consider running the evaluation on a |
|
separate GPU if available or suspending the training binary while running the |
|
evaluation on the same GPU. |
|
|
|
Briefly, one can evaluate the model by running: |
|
|
|
```shell |
|
# Build the model. Note that TensorFlow must already be installed and working,
# as this command will not build TensorFlow itself.
|
cd tensorflow-models/inception |
|
bazel build //inception:imagenet_eval |
|
|
|
# run it |
|
bazel-bin/inception/imagenet_eval --checkpoint_dir=/tmp/imagenet_train --eval_dir=/tmp/imagenet_eval |
|
``` |
|
|
|
Note that we point `--checkpoint_dir` to the location of the checkpoints saved |
|
by `inception_train.py` above. Running the above command results in the |
|
following output: |
|
|
|
```shell |
|
2016-02-17 22:32:50.391206: precision @ 1 = 0.735 |
|
... |
|
``` |
|
|
|
The script calculates the precision @ 1 over the entire validation data |
|
periodically. The precision @ 1 measures how often the highest scoring
|
prediction from the model matched the ImageNet label -- in this case, 73.5%. If |
|
you wish to run the eval just once and not periodically, append the `--run_once` |
|
option. |
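For reference, precision @ 1 is simply the fraction of examples whose highest-scoring logit matches the label. Below is a toy sketch of that computation, separate from the actual `imagenet_eval.py` code.

```python
import tensorflow as tf  # assumes a TensorFlow 1.x-style API

# Toy batch: logits has shape [batch_size, num_classes], labels has shape [batch_size].
logits = tf.constant([[0.1, 2.0, 0.3],
                      [1.5, 0.2, 0.1]])
labels = tf.constant([1, 2])

correct = tf.nn.in_top_k(logits, labels, k=1)               # per-example correctness
precision_at_1 = tf.reduce_mean(tf.cast(correct, tf.float32))

with tf.Session() as sess:
    print(sess.run(precision_at_1))  # 0.5 for this toy batch
```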
|
|
|
Much like the training script, `imagenet_eval.py` also exports summaries that |
|
may be visualized in TensorBoard. These summaries calculate additional |
|
statistics on the predictions (e.g. recall @ 5) as well as monitor the |
|
statistics of the model activations and weights during evaluation. |
|
|
|
## How to Fine-Tune a Pre-Trained Model on a New Task |
|
|
|
### Getting Started |
|
|
|
Much like training the ImageNet model, we must first convert a new data set to
the sharded TFRecord format, in which each entry is a serialized `tf.Example` proto.
|
|
|
We have provided a script demonstrating how to do this for a small data set of
|
a few thousand flower images spread across 5 labels: |
|
|
|
```shell |
|
daisy, dandelion, roses, sunflowers, tulips |
|
``` |
|
|
|
There is a single automated script that downloads the data set and converts it |
|
to the TFRecord format. Much like the ImageNet data set, each record in the |
|
TFRecord format is a serialized `tf.Example` proto whose entries include a |
|
JPEG-encoded string and an integer label. Please see [`parse_example_proto`](inception/image_processing.py) for details. |
|
|
|
The script takes only a few minutes to run, depending on your network connection
speed for downloading and processing the images. You will need about 200MB of free
disk space. Here we select `FLOWERS_DATA_DIR=/tmp/flowers-data/` as such a location
|
but feel free to edit accordingly. |
|
|
|
```shell |
|
# location of where to place the flowers data |
|
FLOWERS_DATA_DIR=/tmp/flowers-data/ |
|
|
|
# build the preprocessing script. |
|
cd tensorflow-models/inception |
|
bazel build //inception:download_and_preprocess_flowers |
|
|
|
# run it |
|
bazel-bin/inception/download_and_preprocess_flowers "${FLOWERS_DATA_DIR}" |
|
``` |
|
|
|
If the script runs successfully, the final line of the terminal output should |
|
look like: |
|
|
|
```shell |
|
2016-02-24 20:42:25.067551: Finished writing all 3170 images in data set. |
|
``` |
|
|
|
When the script finishes, you will find 2 shards each for the training and validation
files in the `FLOWERS_DATA_DIR`. The files will match the patterns `train-?????-of-00002`
|
and `validation-?????-of-00002`, respectively. |
|
|
|
**NOTE** If you wish to prepare a custom image data set for transfer learning, |
|
you will need to invoke [`build_image_data.py`](inception/data/build_image_data.py) on |
|
your custom data set. Please see the associated options and assumptions behind |
|
this script by reading the comments section of [`build_image_data.py`](inception/data/build_image_data.py). Also, if your custom data has a different |
|
number of examples or classes, you need to change the appropriate values in |
|
[`imagenet_data.py`](inception/imagenet_data.py). |
|
|
|
The second piece you will need is a trained Inception v3 image model. You have
the option of either training one yourself (see [How to Train from Scratch](#how-to-train-from-scratch) for details) or downloading a pre-trained
model like so:
|
|
|
```shell |
|
# location of where to place the Inception v3 model |
|
INCEPTION_MODEL_DIR=$HOME/inception-v3-model |
|
mkdir -p ${INCEPTION_MODEL_DIR} |
|
cd ${INCEPTION_MODEL_DIR} |
|
|
|
# download the Inception v3 model |
|
curl -O http://download.tensorflow.org/models/image/imagenet/inception-v3-2016-03-01.tar.gz |
|
tar xzf inception-v3-2016-03-01.tar.gz |
|
|
|
# this will create a directory called inception-v3 which contains the following files. |
|
> ls inception-v3 |
|
README.txt |
|
checkpoint |
|
model.ckpt-157585 |
|
``` |
|
|
|
[Congratulations!](https://www.youtube.com/watch?v=9bZkp7q19f0) You are now |
|
ready to fine-tune your pre-trained Inception v3 model with the flower data set. |
|
|
|
### How to Retrain a Trained Model on the Flowers Data |
|
|
|
We are now ready to fine-tune a pre-trained Inception-v3 model on the flowers |
|
data set. This requires two distinct changes to our training procedure: |
|
|
|
1. Build the exact same model as before, except that we change the number of
labels in the final classification layer.
|
|
|
2. Restore all weights from the pre-trained Inception-v3 except for the final |
|
classification layer; this will get randomly initialized instead. |
|
|
|
We can perform these two operations by specifying two flags: |
|
`--pretrained_model_checkpoint_path` and `--fine_tune`. The first flag is a |
|
string that points to the path of a pre-trained Inception-v3 model. If this flag |
|
is specified, it will load the entire model from the checkpoint before the |
|
script begins training. |
|
|
|
The second flag `--fine_tune` is a boolean that indicates whether the last |
|
classification layer should be randomly initialized or restored. You may set |
|
this flag to false if you wish to continue training a pre-trained model from a |
|
checkpoint. If you set this flag to true, you can train a new classification |
|
layer from scratch. |
|
|
|
In order to understand how `--fine_tune` works, please see the discussion on |
|
`Variables` in the TensorFlow-Slim [`README.md`](inception/slim/README.md). |
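As a rough sketch of what restoring everything except the final layer amounts to (the variable names below, including the `'logits'` scope, are placeholders for illustration; the real names come from the Slim model definition):

```python
import tensorflow as tf  # assumes a TensorFlow 1.x-style API

# Stand-in variables; in the real script these come from building the Inception graph.
conv_weights = tf.get_variable('conv0/weights', [3, 3, 3, 32])
logits_weights = tf.get_variable('logits/weights', [2048, 5])

# Restore every variable except those in the final classification layer,
# which keep their random initialization.
all_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES)
restore_vars = [v for v in all_vars if not v.op.name.startswith('logits')]
saver = tf.train.Saver(restore_vars)

# with tf.Session() as sess:
#     sess.run(tf.global_variables_initializer())
#     saver.restore(sess, 'inception-v3/model.ckpt-157585')
```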
|
|
|
Putting this all together, you can retrain a pre-trained Inception-v3 model on
|
the flowers data set with the following command. |
|
|
|
```shell |
|
# Build the model. Note that TensorFlow must already be installed and working,
# as this command will not build TensorFlow itself.
|
cd tensorflow-models/inception |
|
bazel build //inception:flowers_train |
|
|
|
# Path to the downloaded Inception-v3 model. |
|
MODEL_PATH="${INCEPTION_MODEL_DIR}/inception-v3/model.ckpt-157585" |
|
|
|
# Directory where the flowers data resides. |
|
FLOWERS_DATA_DIR=/tmp/flowers-data/ |
|
|
|
# Directory where to save the checkpoint and events files. |
|
TRAIN_DIR=/tmp/flowers_train/ |
|
|
|
# Run the fine-tuning on the flowers data set starting from the pre-trained |
|
# Imagenet-v3 model. |
|
bazel-bin/inception/flowers_train \ |
|
--train_dir="${TRAIN_DIR}" \ |
|
--data_dir="${FLOWERS_DATA_DIR}" \ |
|
--pretrained_model_checkpoint_path="${MODEL_PATH}" \ |
|
--fine_tune=True \ |
|
--initial_learning_rate=0.001 \ |
|
--input_queue_memory_factor=1 |
|
``` |
|
|
|
We have added a few extra options to the training procedure. |
|
|
|
* Fine-tuning a model on a separate data set requires significantly lowering the
|
initial learning rate. We set the initial learning rate to 0.001. |
|
* The flowers data set is quite small so we shrink the size of the shuffling |
|
queue of examples. See [Adjusting Memory Demands](#adjusting-memory-demands) |
|
for more details. |
|
|
|
The training script only reports the loss. To evaluate the quality of the
|
fine-tuned model, you will need to run `flowers_eval`: |
|
|
|
```shell |
|
# Build the model. Note that TensorFlow must already be installed and working,
# as this command will not build TensorFlow itself.
|
cd tensorflow-models/inception |
|
bazel build //inception:flowers_eval |
|
|
|
# Directory where we saved the fine-tuned checkpoint and events files. |
|
TRAIN_DIR=/tmp/flowers_train/ |
|
|
|
# Directory where the flowers data resides. |
|
FLOWERS_DATA_DIR=/tmp/flowers-data/ |
|
|
|
# Directory where to save the evaluation events files. |
|
EVAL_DIR=/tmp/flowers_eval/ |
|
|
|
# Evaluate the fine-tuned model on a hold-out of the flower data set. |
|
bazel-bin/inception/flowers_eval \ |
|
--eval_dir="${EVAL_DIR}" \ |
|
--data_dir="${FLOWERS_DATA_DIR}" \ |
|
--subset=validation \ |
|
--num_examples=500 \ |
|
--checkpoint_dir="${TRAIN_DIR}" \ |
|
--input_queue_memory_factor=1 \ |
|
--run_once |
|
``` |
|
|
|
We find that the evaluation arrives at roughly 93.4% precision@1 after the model |
|
has been running for 2000 steps. |
|
|
|
```shell |
|
Successfully loaded model from /tmp/flowers/model.ckpt-1999 at step=1999. |
|
2016-03-01 16:52:51.761219: starting evaluation on (validation). |
|
2016-03-01 16:53:05.450419: [20 batches out of 20] (36.5 examples/sec; 0.684sec/batch) |
|
2016-03-01 16:53:05.450471: precision @ 1 = 0.9340 recall @ 5 = 0.9960 [500 examples] |
|
``` |
|
|
|
## How to Construct a New Dataset for Retraining |
|
|
|
One can use the existing scripts supplied with this model to build a new dataset |
|
for training or fine-tuning. The main script to employ is |
|
[`build_image_data.py`](inception/data/build_image_data.py). Briefly, this script takes a |
|
structured directory of images and converts it to a sharded `TFRecord` that can |
|
be read by the Inception model. |
|
|
|
In particular, you will need to create a directory of training images that |
|
reside within `$TRAIN_DIR` and `$VALIDATION_DIR` arranged as such: |
|
|
|
```shell |
|
$TRAIN_DIR/dog/image0.jpeg |
|
$TRAIN_DIR/dog/image1.jpg |
|
$TRAIN_DIR/dog/image2.png |
|
... |
|
$TRAIN_DIR/cat/weird-image.jpeg |
|
$TRAIN_DIR/cat/my-image.jpeg |
|
$TRAIN_DIR/cat/my-image.JPG |
|
... |
|
$VALIDATION_DIR/dog/imageA.jpeg |
|
$VALIDATION_DIR/dog/imageB.jpg |
|
$VALIDATION_DIR/dog/imageC.png |
|
... |
|
$VALIDATION_DIR/cat/weird-image.PNG |
|
$VALIDATION_DIR/cat/that-image.jpg |
|
$VALIDATION_DIR/cat/cat.JPG |
|
... |
|
``` |
|
**NOTE**: This script will append an extra background class indexed at 0, so |
|
your class labels will range from 0 to `num_labels`. Using the example above, the
|
corresponding class labels generated from `build_image_data.py` will be as |
|
follows: |
|
```shell |
|
0 |
|
1 dog |
|
2 cat |
|
``` |
|
|
|
Each sub-directory in `$TRAIN_DIR` and `$VALIDATION_DIR` corresponds to a unique |
|
label for the images that reside within that sub-directory. The images may be |
|
JPEG or PNG images. We do not currently support other image types.
|
|
|
Once the data is arranged in this directory structure, we can run |
|
`build_image_data.py` on the data to generate the sharded `TFRecord` dataset. |
|
Each entry of the `TFRecord` is a serialized `tf.Example` protocol buffer. A |
|
complete list of information contained in the `tf.Example` is described in the |
|
comments of `build_image_data.py`. |
|
|
|
To run `build_image_data.py`, you can run the following command line: |
|
|
|
```shell |
|
# location to where to save the TFRecord data. |
|
OUTPUT_DIRECTORY=$HOME/my-custom-data/ |
|
|
|
# build the preprocessing script. |
|
cd tensorflow-models/inception |
|
bazel build //inception:build_image_data |
|
|
|
# convert the data. |
|
bazel-bin/inception/build_image_data \ |
|
--train_directory="${TRAIN_DIR}" \ |
|
--validation_directory="${VALIDATION_DIR}" \ |
|
--output_directory="${OUTPUT_DIRECTORY}" \ |
|
--labels_file="${LABELS_FILE}" \ |
|
--train_shards=128 \ |
|
--validation_shards=24 \ |
|
--num_threads=8 |
|
``` |
|
|
|
where `$OUTPUT_DIRECTORY` is the location of the sharded `TFRecords`. The
`$LABELS_FILE` is a text file read by the script that provides
a list of all of the labels. For instance, in the case of the flowers data set, the
`$LABELS_FILE` contained the following data:
|
|
|
```shell |
|
daisy |
|
dandelion |
|
roses |
|
sunflowers |
|
tulips |
|
``` |
|
|
|
Note that each row of the labels file corresponds to an entry in the final
classifier in the model. That is, `daisy` corresponds to the classifier for
entry `1`; `dandelion` is entry `2`, etc. We skip label `0` as a background
class.
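A tiny sketch of the mapping this implies (plain Python, independent of the scripts themselves):

```python
# Contents of the flowers $LABELS_FILE, one class name per line.
names = ['daisy', 'dandelion', 'roses', 'sunflowers', 'tulips']

# Index 0 is reserved for the background class appended by build_image_data.py,
# so the first row of the file becomes class 1, the second class 2, and so on.
label_to_index = {name: i + 1 for i, name in enumerate(names)}
print(label_to_index)  # {'daisy': 1, 'dandelion': 2, 'roses': 3, 'sunflowers': 4, 'tulips': 5}
```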
|
|
|
Running this script produces files that look like the following:
|
|
|
```shell |
|
$TRAIN_DIR/train-00000-of-00128 |
|
$TRAIN_DIR/train-00001-of-00128 |
|
... |
|
$TRAIN_DIR/train-00127-of-00128 |
|
|
|
and |
|
|
|
$VALIDATION_DIR/validation-00000-of-00024 |
|
$VALIDATION_DIR/validation-00001-of-00024 |
|
... |
|
$VALIDATION_DIR/validation-00023-of-00024 |
|
``` |
|
|
|
where 128 and 24 are the number of shards specified for each dataset, |
|
respectively. Generally speaking, we aim to select the number of shards such
|
that roughly 1024 images reside in each shard. Once this data set is built, you |
|
are ready to train or fine-tune an Inception model on this data set. |
|
|
|
Note: if you are piggybacking on the flowers retraining scripts, be sure to
|
update `num_classes()` and `num_examples_per_epoch()` in `flowers_data.py` |
|
to correspond with your data. |
|
|
|
## Practical Considerations for Training a Model |
|
|
|
The model architecture and training procedure are heavily dependent on the
|
hardware used to train the model. If you wish to train or fine-tune this model |
|
on your machine **you will need to adjust and empirically determine a good set |
|
of training hyper-parameters for your setup**. What follows are some general |
|
considerations for novices. |
|
|
|
### Finding Good Hyperparameters |
|
|
|
Roughly 5-10 hyper-parameters govern the speed at which a network is trained. In |
|
addition to `--batch_size` and `--num_gpus`, there are several constants defined |
|
in [inception_train.py](inception/inception_train.py) which dictate the learning |
|
schedule. |
|
|
|
```shell |
|
RMSPROP_DECAY = 0.9 # Decay term for RMSProp. |
|
MOMENTUM = 0.9 # Momentum in RMSProp. |
|
RMSPROP_EPSILON = 1.0 # Epsilon term for RMSProp. |
|
INITIAL_LEARNING_RATE = 0.1 # Initial learning rate. |
|
NUM_EPOCHS_PER_DECAY = 30.0 # Epochs after which learning rate decays. |
|
LEARNING_RATE_DECAY_FACTOR = 0.16 # Learning rate decay factor. |
|
``` |
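These constants feed a standard exponentially decaying learning-rate schedule. The sketch below shows how they might combine; the batch size and batches-per-epoch arithmetic are illustrative placeholders, and the authoritative code is in `inception_train.py`.

```python
import tensorflow as tf  # assumes a TensorFlow 1.x-style API

INITIAL_LEARNING_RATE = 0.1
NUM_EPOCHS_PER_DECAY = 30.0
LEARNING_RATE_DECAY_FACTOR = 0.16
RMSPROP_DECAY = 0.9
MOMENTUM = 0.9
RMSPROP_EPSILON = 1.0

# Illustrative figures: 1,281,167 training images and a batch size of 32.
num_batches_per_epoch = 1281167 // 32
decay_steps = int(num_batches_per_epoch * NUM_EPOCHS_PER_DECAY)

global_step = tf.Variable(0, trainable=False, name='global_step')
lr = tf.train.exponential_decay(INITIAL_LEARNING_RATE, global_step,
                                decay_steps, LEARNING_RATE_DECAY_FACTOR,
                                staircase=True)
opt = tf.train.RMSPropOptimizer(lr, decay=RMSPROP_DECAY,
                                momentum=MOMENTUM, epsilon=RMSPROP_EPSILON)
```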
|
|
|
There are many papers that discuss the various tricks and trade-offs associated |
|
with training a model with stochastic gradient descent. For those new to the |
|
field, some great references are: |
|
|
|
* Y Bengio, [Practical recommendations for gradient-based training of deep |
|
architectures](http://arxiv.org/abs/1206.5533) |
|
* I Goodfellow, Y Bengio and A Courville, [Deep Learning](http://www.deeplearningbook.org/)
|
|
|
What follows is a summary of some general advice for identifying appropriate |
|
model hyper-parameters in the context of this particular model training setup. |
|
Namely, this library provides *synchronous* updates to model parameters based on |
|
batch-splitting the model across multiple GPUs. |
|
|
|
* Higher learning rates lead to faster training. Too high a learning rate
|
leads to instability and will cause model parameters to diverge to infinity |
|
or NaN. |
|
|
|
* Larger batch sizes lead to higher quality estimates of the gradient and |
|
permit training the model with higher learning rates. |
|
|
|
* Often the GPU memory is a bottleneck that prevents employing larger batch |
|
sizes. Employing more GPUs allows one to use larger batch sizes because |
|
this model splits the batch across the GPUs. |
|
|
|
**NOTE** If one wishes to train this model with *asynchronous* gradient updates, |
|
one will need to substantially alter this model and new considerations need to |
|
be factored into hyperparameter tuning. See [Large Scale Distributed Deep |
|
Networks](http://research.google.com/archive/large_deep_networks_nips2012.html) |
|
for a discussion in this domain. |
|
|
|
### Adjusting Memory Demands |
|
|
|
Training this model has large memory demands in terms of the CPU and GPU. Let's |
|
discuss each item in turn. |
|
|
|
GPU memory is relatively small compared to CPU memory. Two items dictate the |
|
amount of GPU memory employed -- model architecture and batch size. Assuming |
|
that you keep the model architecture fixed, the sole parameter governing the GPU |
|
demand is the batch size. A good rule of thumb is to employ as large a
batch size as will fit on the GPU.
|
|
|
If you run out of GPU memory, either lower the `--batch_size` or employ more |
|
GPUs on your desktop. The model performs batch-splitting across GPUs, thus N |
|
GPUs can handle N times the batch size of 1 GPU. |
|
|
|
The model requires a large amount of CPU memory as well. We have tuned the model |
|
to employ about 20GB of CPU memory. Thus, having access to about 40GB of CPU
|
memory would be ideal. |
|
|
|
If that is not possible, you can tune down the memory demands of the model via |
|
lowering `--input_queue_memory_factor`. Images are preprocessed asynchronously |
|
with respect to the main training across `--num_preprocess_threads` threads. The |
|
preprocessed images are stored in a shuffling queue from which each GPU performs a
|
dequeue operation in order to receive a `batch_size` worth of images. |
|
|
|
In order to guarantee good shuffling across the data, we maintain a large |
|
shuffling queue of 1024 x `input_queue_memory_factor` images. For the current |
|
model architecture, this corresponds to about 4GB of CPU memory. You may lower |
|
`input_queue_memory_factor` in order to decrease the memory footprint. Keep in |
|
mind though that lowering this value drastically may result in a model with |
|
slightly lower predictive accuracy when training from scratch. Please see |
|
comments in [`image_processing.py`](inception/image_processing.py) for more details. |
|
|
|
## Troubleshooting |
|
|
|
#### The model runs out of CPU memory. |
|
|
|
In lieu of buying more CPU memory, an easy fix is to decrease |
|
`--input_queue_memory_factor`. See [Adjusting Memory Demands](#adjusting-memory-demands). |
|
|
|
#### The model runs out of GPU memory. |
|
|
|
The data is not able to fit on the GPU card. The simplest solution is to |
|
decrease the batch size of the model. Otherwise, you will need to think about a |
|
more sophisticated method for specifying the training which cuts up the model |
|
across multiple `session.run()` calls or partitions the model across multiple |
|
GPUs. See [Using GPUs](https://www.tensorflow.org/how_tos/using_gpu/index.html) |
|
and [Adjusting Memory Demands](#adjusting-memory-demands) for more information. |
|
|
|
#### The model training results in NaN's. |
|
|
|
The learning rate of the model is too high. Turn down your learning rate. |
|
|
|
#### I wish to train a model with a different image size. |
|
|
|
The simplest solution is to artificially resize your images to `299x299` pixels. |
|
See the [Images](https://www.tensorflow.org/api_docs/python/image.html) section for
|
many resizing, cropping and padding methods. Note that the entire model |
|
architecture is predicated on a `299x299` image, thus if you wish to change the |
|
input image size, then you may need to redesign the entire model architecture. |
|
|
|
#### What hardware specification are these hyper-parameters targeted for? |
|
|
|
We targeted a desktop with 128GB of CPU RAM connected to 8 NVIDIA Tesla K40 GPU
cards, but we have run this on desktops with 32GB of CPU RAM and 1 NVIDIA Tesla
|
K40. You can get a sense of the various training configurations we tested by |
|
reading the comments in [`inception_train.py`](inception/inception_train.py). |
|
|
|
#### How do I continue training from a checkpoint in distributed setting? |
|
|
|
You only need to make sure that the checkpoint is in a location that can be |
|
reached by all of the `ps` tasks. By specifying the checkpoint location with |
|
`--train_dir`, the `ps` servers will load the checkpoint before commencing
|
training. |
|
|