arxiv:2110.12628

Recurrent Off-policy Baselines for Memory-based Continuous Control

Published on Oct 25, 2021

Abstract

When the environment is partially observable (PO), a deep reinforcement learning (RL) agent must learn a suitable temporal representation of the entire history in addition to a strategy for control. This problem is not novel, and both model-free and model-based algorithms have been proposed for it. However, inspired by recent success in model-free image-based RL, we noticed the absence of a model-free baseline for history-based RL that (1) uses the full history and (2) incorporates recent advances in off-policy continuous control. Therefore, we implement recurrent versions of DDPG, TD3, and SAC (RDPG, RTD3, and RSAC) in this work, evaluate them on short-term and long-term PO domains, and investigate key design choices. Our experiments show that RDPG and RTD3 can surprisingly fail on some domains and that RSAC is the most reliable, reaching near-optimal performance on nearly all domains. However, one task that requires systematic exploration still proved difficult, even for RSAC. These results show that model-free RL can learn good temporal representations using only reward signals; the primary difficulties appear to be computational cost and exploration. To facilitate future research, we have made our PyTorch implementation publicly available at https://github.com/zhihanyang2022/off-policy-continuous-control.
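
To illustrate what a "recurrent version" of an off-policy actor can look like, here is a minimal PyTorch sketch, not the authors' implementation: an LSTM summarizes the full observation history and a squashed-Gaussian head produces continuous actions, in the style of RSAC described above. The class name, hidden size, clamping range, and example dimensions are illustrative assumptions; see the linked repository for the actual code.

```python
# Minimal sketch (not the authors' code) of a recurrent SAC-style actor:
# an LSTM summarizes the observation history, and a squashed-Gaussian head
# outputs continuous actions. All sizes below are illustrative assumptions.

import torch
import torch.nn as nn


class RecurrentGaussianActor(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden_dim: int = 256):
        super().__init__()
        # LSTM turns the observation history into a fixed-size summary.
        self.summarizer = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        self.mean_head = nn.Linear(hidden_dim, act_dim)
        self.log_std_head = nn.Linear(hidden_dim, act_dim)

    def forward(self, obs_seq: torch.Tensor, hidden=None):
        # obs_seq: (batch, time, obs_dim); `hidden` carries the recurrent
        # state across steps when acting online.
        summary, hidden = self.summarizer(obs_seq, hidden)
        mean = self.mean_head(summary)
        log_std = self.log_std_head(summary).clamp(-20, 2)
        dist = torch.distributions.Normal(mean, log_std.exp())
        # Reparameterized sample, squashed to [-1, 1] as in SAC.
        pre_tanh = dist.rsample()
        action = torch.tanh(pre_tanh)
        # Log-probability with the tanh change-of-variables correction.
        log_prob = dist.log_prob(pre_tanh) - torch.log(1 - action.pow(2) + 1e-6)
        return action, log_prob.sum(-1, keepdim=True), hidden


# Example: a batch of 4 histories of length 10 with 17-dim observations.
if __name__ == "__main__":
    actor = RecurrentGaussianActor(obs_dim=17, act_dim=6)
    obs_history = torch.randn(4, 10, 17)
    action, log_prob, hidden = actor(obs_history)
    print(action.shape, log_prob.shape)  # (4, 10, 6) and (4, 10, 1)
```

In this setup the critic would take the same history summary (or its own recurrent encoder) together with the action, and training would sample whole trajectories, or fixed-length history segments, from the replay buffer rather than single transitions.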
