MelodyFlow / docs /TRAINING.md
Gael Le Lan
Initial commit
9d0d223
# AudioCraft training pipelines
AudioCraft training pipelines are built on top of PyTorch as our core deep learning library
and [Flashy](https://github.com/facebookresearch/flashy) as our training pipeline design library,
and [Dora](https://github.com/facebookresearch/dora) as our experiment manager.
AudioCraft training pipelines are designed to be research and experiment-friendly.
## Environment setup
For the base installation, follow the instructions from the [README.md](../README.md).
Below are some additional instructions for setting up the environment to train new models.
### Team and cluster configuration
In order to support multiple teams and clusters, AudioCraft uses an environment configuration.
The team configuration allows to specify cluster-specific configurations (e.g. SLURM configuration),
or convenient mapping of paths between the supported environments.
Each team can have a yaml file under the [configuration folder](../config). To select a team set the
`AUDIOCRAFT_TEAM` environment variable to a valid team name (e.g. `labs` or `default`):
```shell
conda env config vars set AUDIOCRAFT_TEAM=default
```
Alternatively, you can add it to your `.bashrc`:
```shell
export AUDIOCRAFT_TEAM=default
```
If not defined, the environment will default to the `default` team.
The cluster is automatically detected, but it is also possible to override it by setting
the `AUDIOCRAFT_CLUSTER` environment variable.
Based on this team and cluster, the environment is then configured with:
* The dora experiment outputs directory.
* The available slurm partitions: categorized by global and team.
* A shared reference directory: In order to facilitate sharing research models while remaining
agnostic to the used compute cluster, we created the `//reference` symbol that can be used in
YAML config to point to a defined reference folder containing shared checkpoints
(e.g. baselines, models for evaluation...).
**Important:** The default output dir for trained models and checkpoints is under `/tmp/`. This is suitable
only for quick testing. If you are doing anything serious you MUST edit the file `default.yaml` and
properly set the `dora_dir` entries.
#### Overriding environment configurations
You can set the following environment variables to bypass the team's environment configuration:
* `AUDIOCRAFT_CONFIG`: absolute path to a team config yaml file.
* `AUDIOCRAFT_DORA_DIR`: absolute path to a custom dora directory.
* `AUDIOCRAFT_REFERENCE_DIR`: absolute path to the shared reference directory.
## Training pipelines
Each task supported in AudioCraft has its own training pipeline and dedicated solver.
Learn more about solvers and key designs around AudioCraft training pipeline below.
Please refer to the documentation of each task and model for specific information on a given task.
### Solvers
The core training component in AudioCraft is the solver. A solver holds the definition
of how to solve a given task: It implements the training pipeline logic, combining the datasets,
model, optimization criterion and components and the full training loop. We refer the reader
to [Flashy](https://github.com/facebookresearch/flashy) for core principles around solvers.
AudioCraft proposes an initial solver, the `StandardSolver` that is used as the base implementation
for downstream solvers. This standard solver provides a nice base management of logging,
checkpoints loading/saving, xp restoration, etc. on top of the base Flashy implementation.
In AudioCraft, we made the assumption that all tasks are following the same set of stages:
train, valid, evaluate and generation, each relying on a dedicated dataset.
Each solver is responsible for defining the task to solve and the associated stages
of the training loop in order to leave the full ownership of the training pipeline
to the researchers. This includes loading the datasets, building the model and
optimisation components, registering them and defining the execution of each stage.
To create a new solver for a given task, one should extend the StandardSolver
and define each stage of the training loop. One can further customise its own solver
starting from scratch instead of inheriting from the standard solver.
```python
from . import base
from .. import optim
class MyNewSolver(base.StandardSolver):
def __init__(self, cfg: omegaconf.DictConfig):
super().__init__(cfg)
# one can add custom attributes to the solver
self.criterion = torch.nn.L1Loss()
def best_metric(self):
# here optionally specify which metric to use to keep track of best state
return 'loss'
def build_model(self):
# here you can instantiate your models and optimization related objects
# this method will be called by the StandardSolver init method
self.model = ...
# the self.cfg attribute contains the raw configuration
self.optimizer = optim.build_optimizer(self.model.parameters(), self.cfg.optim)
# don't forget to register the states you'd like to include in your checkpoints!
self.register_stateful('model', 'optimizer')
# keep the model best state based on the best value achieved at validation for the given best_metric
self.register_best('model')
# if you want to add EMA around the model
self.register_ema('model')
def build_dataloaders(self):
# here you can instantiate your dataloaders
# this method will be called by the StandardSolver init method
self.dataloaders = ...
...
# For both train and valid stages, the StandardSolver relies on
# a share common_train_valid implementation that is in charge of
# accessing the appropriate loader, iterate over the data up to
# the specified number of updates_per_epoch, run the ``run_step``
# function that you need to implement to specify the behavior
# and finally update the EMA and collect the metrics properly.
@abstractmethod
def run_step(self, idx: int, batch: tp.Any, metrics: dict):
"""Perform one training or valid step on a given batch.
"""
... # provide your implementation of the solver over a batch
def train(self):
"""Train stage.
"""
return self.common_train_valid('train')
def valid(self):
"""Valid stage.
"""
return self.common_train_valid('valid')
@abstractmethod
def evaluate(self):
"""Evaluate stage.
"""
... # provide your implementation here!
@abstractmethod
def generate(self):
"""Generate stage.
"""
... # provide your implementation here!
```
### About Epochs
AudioCraft Solvers uses the concept of Epoch. One epoch doesn't necessarily mean one pass over the entire
dataset, but instead represent the smallest amount of computation that we want to work with before checkpointing.
Typically, we find that having an Epoch time around 30min is ideal both in terms of safety (checkpointing often enough)
and getting updates often enough. One Epoch is at least a `train` stage that lasts for `optim.updates_per_epoch` (2000 by default),
and a `valid` stage. You can control how long the valid stage takes with `dataset.valid.num_samples`.
Other stages (`evaluate`, `generate`) will only happen every X epochs, as given by `evaluate.every` and `generate.every`).
### Models
In AudioCraft, a model is a container object that wraps one or more torch modules together
with potential processing logic to use in a solver. For example, a model would wrap an encoder module,
a quantisation bottleneck module, a decoder and some tensor processing logic. Each of the previous components
can be considered as a small « model unit » on its own but the container model is a practical component
to manipulate and train a set of modules together.
### Datasets
See the [dedicated documentation on datasets](./DATASETS.md).
### Metrics
See the [dedicated documentation on metrics](./METRICS.md).
### Conditioners
AudioCraft language models can be conditioned in various ways and the codebase offers a modular implementation
of different conditioners that can be potentially combined together.
Learn more in the [dedicated documentation on conditioning](./CONDITIONING.md).
### Configuration
AudioCraft's configuration is defined in yaml files and the framework relies on
[hydra](https://hydra.cc/docs/intro/) and [omegaconf](https://omegaconf.readthedocs.io/) to parse
and manipulate the configuration through Dora.
##### :warning: Important considerations around configurations
Our configuration management relies on Hydra and the concept of group configs to structure
and compose configurations. Updating the root default configuration files will then have
an impact on all solvers and tasks.
**One should never change the default configuration files. Instead they should use Hydra config groups in order to store custom configuration.**
Once this configuration is created and used for running experiments, you should not edit it anymore.
Note that as we are using Dora as our experiment manager, all our experiment tracking is based on
signatures computed from delta between configurations.
**One must therefore ensure backward compatibility of the configuration at all time.**
See [Dora's README](https://github.com/facebookresearch/dora) and the
[section below introduction Dora](#running-experiments-with-dora).
##### Configuration structure
The configuration is organized in config groups:
* `conditioner`: default values for conditioning modules.
* `dset`: contains all data source related information (paths to manifest files
and metadata for a given dataset).
* `model`: contains configuration for each model defined in AudioCraft and configurations
for different variants of models.
* `solver`: contains the default configuration for each solver as well as configuration
for each solver task, combining all the above components.
* `teams`: contains the cluster configuration per teams. See environment setup for more details.
The `config.yaml` file is the main configuration that composes the above groups
and contains default configuration for AudioCraft.
##### Solver's core configuration structure
The core configuration structure shared across solver is available in `solvers/default.yaml`.
##### Other configuration modules
AudioCraft configuration contains the different setups we used for our research and publications.
## Running experiments with Dora
### Launching jobs
Try launching jobs for different tasks locally with dora run:
```shell
# run compression task with lightweight encodec
dora run solver=compression/debug
```
Most of the time, the jobs are launched through dora grids, for example:
```shell
# run compression task through debug grid
dora grid compression.debug
```
Learn more about running experiments with Dora below.
### A small introduction to Dora
[Dora](https://github.com/facebookresearch/dora) is the experiment manager tool used in AudioCraft.
Check out the README to learn how Dora works. Here is a quick summary of what to know:
* An XP is a unique set of hyper-parameters with a given signature. The signature is a hash
of those hyper-parameters. We always refer to an XP with its signature, e.g. 9357e12e. We will see
after that one can retrieve the hyper-params and re-rerun it in a single command.
* In fact, the hash is defined as a delta between the base config and the one obtained
with the config overrides you passed from the command line. This means you must never change
the `conf/**.yaml` files directly, except for editing things like paths. Changing the default values
in the config files means the XP signature won't reflect that change, and wrong checkpoints might be reused.
I know, this is annoying, but the reason is that otherwise, any change to the config file would mean
that all XPs ran so far would see their signature change.
#### Dora commands
```shell
dora info -f 81de367c # this will show the hyper-parameter used by a specific XP.
# Be careful some overrides might present twice, and the right most one
# will give you the right value for it.
dora run -d -f 81de367c # run an XP with the hyper-parameters from XP 81de367c.
# `-d` is for distributed, it will use all available GPUs.
dora run -d -f 81de367c dataset.batch_size=32 # start from the config of XP 81de367c but change some hyper-params.
# This will give you a new XP with a new signature (e.g. 3fe9c332).
dora info -f SIG -t # will tail the log (if the XP has scheduled).
# if you need to access the logs of the process for rank > 0, in particular because a crash didn't happen in the main
# process, then use `dora info -f SIG` to get the main log name (finished into something like `/5037674_0_0_log.out`)
# and worker K can be accessed as `/5037674_0_{K}_log.out`.
# This is only for scheduled jobs, for local distributed runs with `-d`, then you should go into the XP folder,
# and look for `worker_{K}.log` logs.
```
An XP runs from a specific folder based on its signature, under the
`<cluster_specific_path>/<user>/experiments/audiocraft/outputs/` folder.
You can safely interrupt a training and resume it, it will reuse any existing checkpoint,
as it will reuse the same folder. If you made some change to the code and need to ignore
a previous checkpoint you can use `dora run --clear [RUN ARGS]`.
If you have a Slurm cluster, you can also use the dora grid command, e.g.
```shell
# Run a dummy grid located at `audiocraft/grids/my_grid_folder/my_grid_name.py`
dora grid my_grid_folder.my_grid_name
# The following will simply display the grid and also initialize the Dora experiments database.
# You can then simply refer to a config using its signature (e.g. as `dora run -f SIG`).
dora grid my_grid_folder.my_grid_name --dry_run --init
```
Please refer to the [Dora documentation](https://github.com/facebookresearch/dora) for more information.
#### Clearing up past experiments
```shell
# This will cancel all the XPs and delete their folder and checkpoints.
# It will then reschedule them starting from scratch.
dora grid my_grid_folder.my_grid_name --clear
# The following will delete the folder and checkpoint for a single XP,
# and then run it afresh.
dora run [-f BASE_SIG] [ARGS] --clear
```