Configuration
We use Hydra for all of our configuration, and we use hierarchical configuration groups to keep everything organised.
In particular, we have the following configuration headings, with the base `ppo` config looking like:
```yaml
defaults:
  - env: entity
  - env_size: s
  - learning:
      - ppo-base
      - ppo-rnn
  - misc: misc
  - eval: s
  - eval_env_size: s
  - train_levels: random
  - model:
      - model-base
      - model-transformer
  - _self_

seed: 0
```
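Each entry under `defaults` selects a preset from one of the configuration groups described below. Whole groups, as well as individual keys, can be overridden from the command line using standard Hydra syntax; as a minimal sketch (the particular values here are illustrative, using the presets documented below):

```bash
# Swap the env and env_size presets and change the seed (illustrative values)
python3 experiments/ppo.py env=symbolic env_size=m seed=1
```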
Configuration Headings
Env
This controls the environment to be used.
Preset Options
We provide two options in `configs/env`, namely `entity` and `symbolic`; each of these can be used by running `python3 experiments/ppo.py env=entity` or `python3 experiments/ppo.py env=symbolic`. If you wish to customise the options further, you can add any of the following subkeys (e.g. by running `python3 experiments/ppo.py env=symbolic env.dense_reward_scale=0.0`):
Individual Subkeys
- `env.env_name`: The name of the environment, which controls the observation and action space.
- `env.dense_reward_scale`: How large the dense reward scale is; set this to zero to disable dense rewards.
- `env.frame_skip`: The number of frames to skip; setting this to 2 (the default) seems to perform better.
Env Size
This controls the maximum number of shapes present in the simulation. This trades off speed against representational power: small environments run much faster, but some complex environments require a large number of shapes. See `configs/env_size`.
Preset Options
- `s`: The small preset
- `m`: The medium preset
- `l`: The large preset
- `custom`: Allows the use of a custom environment size loaded from a JSON file (see here for more).
Individual Subkeys
- `num_polygons`: How many polygons
- `num_circles`: How many circles
- `num_joints`: How many joints
- `num_thrusters`: How many thrusters
- `env_size_name`: "s", "m" or "l"
- `num_motor_bindings`: How many different joint bindings there are, i.e. how many distinct actions are associated with joints. All joints with the same binding will have the same action applied to them.
- `num_thruster_bindings`: How many different thruster bindings there are
- `env_size_type`: "predefined" or "custom"
- `custom_path`: Only for `env_size_type=custom`; controls the JSON file to load the custom environment size from.
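For example, a run with a custom environment size might be launched as in the sketch below; the JSON path is purely hypothetical, and the `env_size.` prefix assumes these subkeys are namespaced under `env_size`, analogous to the `env.dense_reward_scale` example above:

```bash
# Use a custom environment size loaded from a JSON file (hypothetical path)
python3 experiments/ppo.py env_size=custom env_size.custom_path=path/to/my_env_size.json
```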
Learning
This controls the agent's learning; see `configs/learning`.
Preset Options
- `ppo-base`: This has all of the base PPO parameters and is used by all methods
- `ppo-rnn`: This has the PureJaxRL settings for some of PPO's hyperparameters (mainly `num_steps` is different)
- `ppo-sfl`: This has the SFL-specific value of `num_steps`
- `ppo-ued`: This has the JAXUED-specific `num_steps` and `outer_rollout_steps`
Individual Subkeys
- `lr`: Learning rate
- `anneal_lr`: Whether to anneal the LR
- `warmup_lr`: Whether to warm up the LR
- `peak_lr`: If warming up, the peak LR
- `initial_lr`: If warming up, the initial LR
- `warmup_frac`: If warming up, the warmup fraction of training time
- `max_grad_norm`: Maximum grad norm
- `total_timesteps`: How many total environment interactions to run
- `num_train_envs`: Number of parallel environments to run simultaneously
- `num_minibatches`: Number of minibatches for PPO learning
- `gamma`: Discount factor
- `update_epochs`: PPO update epochs
- `clip_eps`: PPO clipping epsilon
- `gae_lambda`: PPO lambda for GAE
- `ent_coef`: Entropy loss coefficient
- `vf_coef`: Value function loss coefficient
- `permute_state_during_training`: If true, the state is permuted on every reset.
- `filter_levels`: If true, and we are training on random levels, this filters out levels that can be solved by a no-op
- `level_filter_n_steps`: How many steps to allocate to the no-op policy for filtering
- `level_filter_sample_ratio`: How many more levels to sample than required (ideally `level_filter_sample_ratio` is more than the fraction that will be filtered out).
- `num_steps`: PPO rollout length
- `outer_rollout_steps`: How many learning steps to do for e.g. PLR for each rollout (see the Craftax paper for a more in-depth explanation).
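Individual learning hyperparameters can be overridden in the same way as the `env` keys; the sketch below uses purely illustrative values and assumes the keys are namespaced under `learning.` (drop the prefix if your config packages them at the top level):

```bash
# Override a few PPO hyperparameters (illustrative values, not recommendations)
python3 experiments/ppo.py learning.lr=1e-4 learning.gamma=0.995 learning.num_train_envs=1024
```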
Misc
A plethora of miscellaneous options are grouped under the `misc` category. There is only one preset option, `configs/misc/misc.yaml`.
Individual Subkeys
- `group`: Wandb group ("auto" usually works well)
- `group_auto_prefix`: If using `group=auto`, this is a user-defined prefix
- `save_path`: Where to save checkpoints to
- `use_wandb`: Whether to log to wandb
- `save_policy`: Whether to save the policy
- `wandb_project`: Wandb project
- `wandb_entity`: Wandb entity; leave as `null` to use your default one
- `wandb_mode`: Wandb mode
- `video_frequency`: How often to log videos (they are quite large)
- `load_from_checkpoint`: Wandb artifact path to load from
- `load_only_params`: Whether to load just the network parameters or the entire train state.
- `checkpoint_save_freq`: How often to log checkpoints
- `checkpoint_human_numbers`: Whether the checkpoints should have human-readable timestep numbers
- `load_legacy_checkpoint`: Do not use
- `load_train_levels_legacy`: Do not use
- `economical_saving`: If true, only saves a few important checkpoints for space conservation purposes.
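As a sketch, resuming from a saved checkpoint with wandb logging disabled might look like the following; the artifact path is a placeholder, and the `misc.` prefix assumes these keys are namespaced under `misc`:

```bash
# Load only the network parameters from a wandb artifact and disable wandb logging (placeholder path)
python3 experiments/ppo.py misc.load_from_checkpoint='<wandb artifact path>' misc.load_only_params=true misc.use_wandb=false
```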
Eval
This option (see `configs/eval`) controls how evaluation works and which levels are used.
Preset Options
- `s`: Eval on the `s` hand-designed levels located in `worlds/s`
- `m`: Eval on the `m` hand-designed levels located in `worlds/m`
- `l`: Eval on the `l` hand-designed levels located in `worlds/l`
- `eval_all`: Eval on all of the hand-designed eval levels
- `eval_auto`: If `train_levels` is not random, evaluate on the training levels.
- `mujoco`: Eval on the recreations of the MuJoCo tasks.
- `eval_general`: General option if you are planning on overwriting most options.
Individual Subkeys
- `eval_levels`: List of eval levels, or the string "auto"
- `eval_num_attempts`: How many times to eval on the same level
- `eval_freq`: How often to evaluate
- `EVAL_ON_SAMPLED`: If true, in `plr.py` and `sfl.py`, evaluates on a fixed set of randomly-generated levels
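For instance, evaluating more frequently and averaging over more attempts per level could be sketched as below (illustrative values, assuming the keys are namespaced under `eval.`):

```bash
# Evaluate more often and average over more attempts per level (illustrative values)
python3 experiments/ppo.py eval.eval_freq=512 eval.eval_num_attempts=10
```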
Eval Env Size
This controls the size of the evaluation environment. It is crucial that this matches the size of the evaluation levels.
Preset Options
- `s`: Same as the `env_size` option.
- `m`: Same as the `env_size` option.
- `l`: Same as the `env_size` option.
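For example, when evaluating on the `m` hand-designed levels, the evaluation environment size should be set to match:

```bash
# Match the evaluation environment size to the evaluation levels
python3 experiments/ppo.py eval=m eval_env_size=m
```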
Train Levels
Which levels to train on.
Preset Options
- `s`: All of the `s` holdout levels
- `m`: All of the `m` holdout levels
- `l`: All of the `l` holdout levels
- `train_all`: All of the levels from all 3 holdout sets
- `mujoco`: All of the MuJoCo recreation levels.
- `random`: Train on random levels
Individual Subkeys
- `train_level_mode`: "random" or "list"
- `train_level_distribution`: If `train_level_mode=random`, this controls which distribution to use. By default `distribution_v3`.
- `train_levels_list`: This is a list of levels to train on.
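For example, training on the MuJoCo recreation levels and evaluating on those same levels (via the `eval_auto` preset) could be sketched as:

```bash
# Train on the mujoco recreation levels and evaluate on the training levels
python3 experiments/ppo.py train_levels=mujoco eval=eval_auto
```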
Model
This controls the model architecture and options associated with that.
Preset Options
We use both of the following:
- `model-base`
- `model-entity`
Individual Subkeys
- `fc_layer_depth`: How many layers in the FC model
- `fc_layer_width`: How wide each FC layer is
- `activation`: NN activation
- `recurrent_model`: Whether or not to use recurrence
The following are only relevant when using `env=entity`:
- `transformer_depth`: How many transformer layers to use
- `transformer_size`: How large the KQV vectors are
- `transformer_encoder_size`: How large the initial embeddings are
- `num_heads`: How many attention heads; must be a multiple of 4 and divide `transformer_size` evenly.
- `full_attention_mask`: If true, all heads use the full attention mask
- `aggregate_mode`: `dummy_and_mean` works well.
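Model options can be overridden like any other key; the sketch below uses illustrative values and assumes the keys are namespaced under `model.`:

```bash
# A wider FC model and a deeper transformer (illustrative values)
python3 experiments/ppo.py model.fc_layer_width=256 model.transformer_depth=4
```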
UED
Options pertaining to UED (i.e., when using the scripts `plr.py` or `sfl.py`).
Preset Options
- `sfl`
- `plr`
- `accel`
Individual Subkeys
See the individual files for the configuration options used. For SFL, we have:
- `sampled_envs_ratio`: How many environments are from the SFL buffer and how many are randomly generated
- `batch_size`: How many levels to evaluate learnability on per batch
- `num_batches`: How many batches to run when choosing the most learnable levels
- `rollout_steps`: How many steps to roll out for when doing the learnability calculation.
- `num_to_save`: How many levels to save in the learnability buffer
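As a sketch, assuming `sfl.py` lives alongside `ppo.py` in `experiments/` and that these options are namespaced under `ued.`, an SFL run overriding a couple of them might look like:

```bash
# Run SFL with a different buffer/random split and buffer size (illustrative values, assumed key prefix)
python3 experiments/sfl.py ued.sampled_envs_ratio=0.5 ued.num_to_save=1000
```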