# **Abstract**
In 2013, DeepMind published a paper called "Playing Atari
with Deep Reinforcement Learning" introducing their algorithm called
Deep Q-Network (DQN), which revolutionized the field of reinforcement
learning. For the first time, Deep Learning and Q-learning were brought
together, with impressive results: agents trained with deep reinforcement
learning on Atari games performed at or above human-level expertise on
many of the games they were trained on.
A Deep Q-Network uses a deep neural network to estimate the q-value
of each action, allowing the policy to select the action with the
maximum q-value. Using a deep neural network to approximate q-values
proved immensely superior to q-table look-ups and widened the
applicability of Q-learning to more complex reinforcement learning
environments.
While revolutionary, the original version of DQN had a few problems,
especially its slow and inefficient learning process. Over the past 9
years, several improved versions of DQN have become popular. This project
is an attempt to study the effectiveness of a few of these DQN flavors,
understand the problems they solve, and compare their performance in the
same reinforcement learning environment.
# Deep Q-Networks and their flavors
- **Vanilla DQN**
The vanilla (original) DQN uses two neural networks: the **online**
network and the **target** network. The online network is the main
neural network that the agent uses to select the best action for a
given state. The target network is usually a copy of the online
network and is used to get the "target" q-values for each action in a
particular state. That is, since we don’t have actual ground truths
for future q-values during the learning phase, the q-values from the
target network are used as labels to optimize the online network.
The target network calculates the target q-values using the
following Bellman equation:
\[\begin{aligned}
Q(s_t, a_t) = r_{t+1} + \gamma \max_{a_{t+1} \in A} Q(s_{t+1}, a_{t+1})
\end{aligned}\]
where,
\(Q(s_t, a_t)\) = the target q-value (ground truth) for a past
experience in the replay memory
\(r_{t+1}\) = the reward obtained for taking the chosen action in
that particular experience
\(\gamma\) = the discount factor for future rewards
\(Q(s_{t+1}, a_{t+1})\) = the q-value of the best action (based on
the policy) for the next state of that particular experience
- **Double DQN**
One of the problems with vanilla DQN is the way it calculates its
target values (ground truth). We can see from the Bellman equation
above that the target network uses the **max** q-value directly in
the equation. This almost always overestimates the q-value, because
the **max** function introduces maximization bias into our estimates:
max returns the largest value even when that value is an outlier,
thus skewing our estimates.
The Double DQN solves this problem by changing the original
algorithm as follows:
1. Instead of using the **max** function, first use the online
network to estimate the best action for the next state
2. Calculate target q-values for the next state for each possible
action using the target network
3. From the q-values calculated by the target network, use the
q-value of the action chosen in step 1.
This can be represented by the following equation:
\[\begin{aligned}
Q(s_t, a_t) = r_{t+1} + \gamma \, Q_{target}(s_{t+1}, a'_{t+1})
\end{aligned}\]
where,
\[\begin{aligned}
a'_{t+1} = \arg\max_{a} Q_{online}(s_{t+1}, a)
\end{aligned}\]
- **Dueling DQN**
The Dueling DQN algorithm improves upon the original DQN by changing
the architecture of the neural network used in Deep Q-learning. It
splits the last layer of the DQN into two parts, a **value stream**
and an **advantage stream**, whose outputs are combined in an
aggregating layer that produces the final q-values. One of the main
problems with the original DQN algorithm is that the q-values of the
different actions are often very close, so selecting the action with
the max q-value may not always be the best choice. The Dueling DQN
attempts to mitigate this by using the advantage, a measure of how
much better an action is compared to the other actions in a given
state. The value stream, on the other hand, learns how good or bad it
is to be in a specific state, e.g. moving straight towards an
obstacle in a racing game, or being in the path of a projectile in
Space Invaders. Separating the estimate into value and advantage
streams, instead of predicting a single q-value directly, helps the
network generalize better.
![image](./docs/dueling.png)
Fig: The Dueling DQN architecture (Image taken from the original
paper by Wang et al.)
The q-value in a Dueling DQN architecture is given by
\[\begin{aligned}
Q(s_t, a_t) = V(s_t) + A(s_t, a_t)
\end{aligned}\] where,
\(V(s_t)\) = the value of the current state (how advantageous it is
to be in that state)
\(A(s_t, a_t)\) = the advantage of taking action \(a_t\) in that state
In practice, Wang et al. subtract the mean advantage over all actions
from \(A(s_t, a_t)\) so that the value and advantage terms are
identifiable. (A short code sketch of all three flavors follows this
list.)
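To make the differences concrete, here is a minimal PyTorch sketch of how the target q-values could be computed for the vanilla and Double DQN variants, and how a dueling head could be structured. This is not the project's actual code; the names `dqn_target`, `double_dqn_target`, and `DuelingHead` are illustrative.

```python
import torch
import torch.nn as nn


class DuelingHead(nn.Module):
    """Illustrative dueling head: shared features split into value and advantage streams."""

    def __init__(self, feature_dim: int, n_actions: int):
        super().__init__()
        self.value_stream = nn.Sequential(nn.Linear(feature_dim, 128), nn.ReLU(), nn.Linear(128, 1))
        self.advantage_stream = nn.Sequential(nn.Linear(feature_dim, 128), nn.ReLU(), nn.Linear(128, n_actions))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        value = self.value_stream(features)          # (batch, 1)
        advantage = self.advantage_stream(features)  # (batch, n_actions)
        # Subtract the mean advantage so the value/advantage split is identifiable (Wang et al.)
        return value + advantage - advantage.mean(dim=1, keepdim=True)


@torch.no_grad()
def dqn_target(reward, next_state, done, gamma, target_net):
    """Vanilla DQN target: r + gamma * max_a Q_target(s', a). reward/done are (batch,) float tensors."""
    next_q = target_net(next_state).max(dim=1).values
    return reward + gamma * next_q * (1.0 - done)


@torch.no_grad()
def double_dqn_target(reward, next_state, done, gamma, online_net, target_net):
    """Double DQN target: the online net picks the action, the target net evaluates it."""
    best_action = online_net(next_state).argmax(dim=1, keepdim=True)
    next_q = target_net(next_state).gather(1, best_action).squeeze(1)
    return reward + gamma * next_q * (1.0 - done)
```

The `(1.0 - done)` factor zeroes out the bootstrapped term for terminal transitions, a standard detail that the equations above leave implicit.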
# About the project
My original goal for the project was to train an agent using DQN to
play **Airstriker Genesis**, a space shooting game, and then evaluate
the same agent’s performance on another similar game called
**Starpilot**. Unfortunately, I was unable to train a decent enough
agent on the first game, which made it meaningless to evaluate its
performance on yet another game.
Because I still want to do the original project some time in the
future, I thought it would be better to prepare by first learning
in-depth about how Deep Q-Networks work, what their shortcomings are
and how they can be improved. For this reason, and because of time
constraints, I changed my project for this class to a comparison of
various DQN versions.
# Dataset
I used the excellent [Gym](https://github.com/openai/gym) library to
run my environments. A total of 9 agents were trained: 1 on
Airstriker Genesis, 4 on Starpilot and 4 on Lunar Lander.
| **Game** | **Observation Space** | **Action Space** |
| :--- | :--- | :--- |
| Airstriker Genesis | RGB values of each pixel of the game screen (255, 255, 3) | Discrete(12), one per button on the console's controller. Since only three of those buttons are used in the game, the action space was reduced to 3 during training (Left, Right, Fire). |
| Starpilot | RGB values of each pixel of the game screen (64, 64, 3) | Discrete(15), one per button combo (Left, Right, Up, Down, Up + Right, Up + Left, Down + Right, Down + Left, W, A, S, D, Q, E, Do nothing) |
| Lunar Lander | 8-dimensional vector: (X-coordinate, Y-coordinate, Linear velocity in X, Linear velocity in Y, Angle, Angular velocity, Boolean (Leg 1 in contact with ground), Boolean (Leg 2 in contact with ground)) | Discrete(4) (Do nothing, Fire left engine, Fire main engine, Fire right engine) |
**Environment/Libraries**:
Miniconda, Python 3.9, Gym, PyTorch, NumPy and TensorBoard on my
personal MacBook Pro (M1)
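For reference, the observation and action spaces in the table can be inspected directly through the Gym API. A quick example with Lunar Lander (this assumes the Box2D extras for gym are installed):

```python
import gym

env = gym.make("LunarLander-v2")
print(env.observation_space)  # Box(8,): the 8-dimensional state vector described above
print(env.action_space)       # Discrete(4): do nothing / left engine / main engine / right engine
```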
# ML Methodology
Each agent was trained using DQN or one of its flavors. All agents
for a particular game were trained with the same hyperparameters;
only the underlying algorithm differed. The following metrics were
used to evaluate each agent (a sketch of how such running averages
can be logged is shown after the list):
- **Epsilon value over each episode**: the exploration rate at the
end of each episode.
- **Average q-value for the last 100 episodes**: the average q-value
(for the action chosen) over the last 100 episodes.
- **Average length for the last 100 episodes**: the average number of
steps taken per episode.
- **Average loss for the last 100 episodes**: the average loss during
learning over the last 100 episodes (a Huber loss was used).
- **Average reward for the last 100 episodes**: the average reward
the agent accumulated over the last 100 episodes.
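As an illustration of how these running averages could be tracked, here is a hedged sketch (a hypothetical `EpisodeLogger` helper, not the project's actual logging code) using a fixed-length `deque` and TensorBoard's `SummaryWriter`:

```python
from collections import deque

import numpy as np
from torch.utils.tensorboard import SummaryWriter


class EpisodeLogger:
    """Tracks 'average X over the last 100 episodes' style metrics and writes them to TensorBoard."""

    def __init__(self, logdir: str, window: int = 100):
        self.writer = SummaryWriter(logdir)
        self.rewards = deque(maxlen=window)
        self.lengths = deque(maxlen=window)
        self.losses = deque(maxlen=window)
        self.q_values = deque(maxlen=window)

    def log_episode(self, episode: int, reward: float, length: int,
                    mean_loss: float, mean_q: float, epsilon: float):
        self.rewards.append(reward)
        self.lengths.append(length)
        self.losses.append(mean_loss)
        self.q_values.append(mean_q)
        self.writer.add_scalar("epsilon", epsilon, episode)
        self.writer.add_scalar("avg_reward_100", np.mean(self.rewards), episode)
        self.writer.add_scalar("avg_length_100", np.mean(self.lengths), episode)
        self.writer.add_scalar("avg_loss_100", np.mean(self.losses), episode)
        self.writer.add_scalar("avg_q_100", np.mean(self.q_values), episode)
```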
## Preprocessing
For the Airstriker and the Starpilot games:
1. Changed each frame to grayscale
Since the color shouldn’t matter to the agent, I decided to
change the RGB image to grayscale
2. Changed observation space shape from (height, width, channels)
to (channels, height, width) to make it compatible with
PyTorch
PyTorch uses a different layout than the direct output of the gym
environment, so I had to reshape each observation to match
PyTorch’s channels-first scheme (this took me a very long time to
figure out, but I had an "Aha\!" moment when I remembered you
saying something similar in class).
3. Frame stacking
Instead of processing 1 frame at a time, process the 4 most
recent frames together, because a single frame does not give the
agent enough information (such as motion) to decide what action
to take.
For Lunar Lander, since the rewards change very drastically (sudden
+100, -100, +200 values), I experimented with reward clipping
(clipping the rewards to the \[-1, 1\] range), but this didn’t seem
to make much difference in my agent’s performance. (A sketch of these
preprocessing steps is shown below.)
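Below is a minimal sketch of how these preprocessing steps could be implemented with Gym's built-in wrappers plus a small custom wrapper for the channel transpose. The wrapper choices and signatures assume a reasonably recent version of gym and are not taken from the project's code.

```python
import gym
import numpy as np
from gym.wrappers import FrameStack, GrayScaleObservation, TransformReward


class ChannelsFirst(gym.ObservationWrapper):
    """Step 2: transpose image observations from (H, W, C) to (C, H, W), the layout PyTorch expects."""

    def __init__(self, env):
        super().__init__(env)
        h, w, c = env.observation_space.shape
        self.observation_space = gym.spaces.Box(low=0, high=255, shape=(c, h, w), dtype=np.uint8)

    def observation(self, obs):
        return np.transpose(obs, (2, 0, 1))


def make_image_env(env_id: str) -> gym.Env:
    # e.g. env_id = "procgen:procgen-starpilot-v0" (requires the procgen package)
    env = gym.make(env_id)
    env = GrayScaleObservation(env, keep_dim=True)  # step 1: (H, W, 3) -> (H, W, 1)
    env = ChannelsFirst(env)                        # step 2: (H, W, 1) -> (1, H, W)
    env = FrameStack(env, num_stack=4)              # step 3: stack the 4 most recent frames
    # Note: observations now come out as (4, 1, H, W); reshape to (4, H, W) before the conv net.
    return env


def make_lander_env() -> gym.Env:
    # Optional reward-clipping experiment for Lunar Lander
    env = gym.make("LunarLander-v2")
    return TransformReward(env, lambda r: float(np.clip(r, -1.0, 1.0)))
```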
# Results
- **Airstriker Genesis**
The loss went down until about 5200 episodes but stopped decreasing
after that. Consequently, the average reward the agent accumulated
over the last 100 episodes pretty much plateaued after about 5000
episodes. On closer inspection, I noticed that my exploration rate at
the end of the 7000th episode was still about 0.65, which means the
agent was taking random actions more than half of the time. In
hindsight, I should have trained longer, at least until the epsilon
value (exploration rate) had fully decayed to 5%.
![image](./docs/air1.png) ![image](./docs/air2.png) ![image](./docs/air3.png)
- **Starpilot**
I trained DQN, Double DQN, Dueling DQN and Dueling Double DQN
versions for this game to compare the different algorithms.
From the graph of mean q-values, we can tell that the vanilla DQN
versions indeed give high q-values while their Double DQN
counterparts give lower values, which makes me think that my
implementation of the Double DQN algorithm was OK. I had expected the
Double and Dueling versions to start accumulating higher rewards much
earlier, but since the average rewards were nearly the same for all
the agents, I could not see any stark differences between the
performance of each agent.
![image](./docs/star1.png)
![image](./docs/star2.png)
| | |
| :------------------ | :------------------ |
| ![image](./docs/star3.png) | ![image](./docs/star4.png) |
- **Lunar Lander**
Since I did not gain much insight from the agents in the Starpilot
game, I thought I might not have been training long enough. So I
tried training the same agents on Lunar Lander, which is a
comparatively simpler game with a smaller observation space and one
that a DQN algorithm should be able to converge on fairly quickly
(based on comments by others in the RL community).
![image](./docs/lunar1.png)
![image](./docs/lunar2.png)
| | |
| :------------------- | :------------------- |
| ![image](./docs/lunar3.png) | ![image](./docs/lunar4.png) |
The results for this were interesting. Although I did not find any
vast difference between the different variations of the DQN
algorithm, I found that the performance of my agents suddenly got
worse at around 300 episodes. While researching why this may have
happened, I learned that DQN agents can suffer from **catastrophic
forgetting**, i.e. after training extensively, the network suddenly
forgets what it has learned in the past and starts performing worse.
Initially, I thought this might have been the case, but since I
hadn’t trained for very long, and because all models started
performing worse at almost exactly the same episode number, I think
this is more likely a problem with my code or with one of the
hyperparameters I used.
Upon checking what the agent was doing in the actual game, I found
that it was playing it very safe: constantly hovering in the air and
never attempting to land the spaceship (the goal of the agent is to
land within the yellow flags). I thought penalizing the agent for
taking too many steps per episode might help, but that didn’t work
either.
![image](./docs/check.png)
# Problems Faced
Here are a few of the problems that I faced while training my agents:
- Understanding the various hyperparameters in the algorithm. DQN has
a lot of moving parts, and tuning each parameter was a difficult
task. There were about 8 different hyperparameters (some correlated)
that impacted the agent’s training performance. I struggled with
understanding how each parameter affected the agent and with figuring
out how to find good values for them, and I ended up tuning them by
trial and error.
- I got stuck for a long time figuring out why my convolutional layer
was not working. I didn’t realize that PyTorch expects the channel
dimension first, and because of that I was passing huge numbers like
255 (the height of the image) as the input channel count of a Conv2d
layer (a small sketch of the fix follows this list).
- I struggled with knowing how long is long enough to conclude that a
model is not working. I trained a model on Airstriker Genesis for 14
hours only to realize later that I had set a parameter incorrectly
and had to retrain from scratch.
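To illustrate the channel-ordering pitfall mentioned above, here is a small hedged sketch (not the project's code): `nn.Conv2d` interprets its input as (batch, channels, height, width), so `in_channels` must be the number of color channels or stacked frames, not the image height.

```python
import torch
import torch.nn as nn

# Observations arrive from the environment as (batch, height, width, channels), e.g. (32, 255, 255, 3)
obs = torch.randint(0, 256, (32, 255, 255, 3), dtype=torch.uint8)

# Wrong: treating the image height (255) as the channel count
# conv = nn.Conv2d(in_channels=255, out_channels=32, kernel_size=8, stride=4)

# Right: move channels to dimension 1 and use the real channel count (3 here, or 4 stacked frames)
x = obs.permute(0, 3, 1, 2).float() / 255.0
conv = nn.Conv2d(in_channels=3, out_channels=32, kernel_size=8, stride=4)
print(conv(x).shape)  # torch.Size([32, 32, 62, 62])
```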
# What Next?
Although I didn’t get a final working agent for any of the games I
tried, I feel like I have learned a lot about reinforcement learning,
especially about Deep Q-learning. I plan to improve upon this further,
and hopefully get an agent to go far into at least one of the games.
Next time, I will start by debugging my current code to see if I have
any implementation mistakes. Then I will train the agents for a lot
longer than I did this time and see if that works. While learning about
the different flavors of DQN, I also learned a little about NoisyNet
DQN, Rainbow DQN and Prioritized Experience Replay. I couldn’t
implement these for this project, but I would like to try them out some
time soon.
# Lessons Learned
- Reinforcement learning is a very challenging problem. It takes a
substantially large amount of time to train, it is hard to debug, and
it is very difficult to tune the hyperparameters just right. It
differs a lot from supervised learning in that there are no actual
labels, which makes optimization very difficult.
- I tried training agents on Airstriker Genesis and the procgen
Starpilot game using just the CPU, but this took a very long time.
This is understandable because the inputs are images, and using a GPU
would obviously have been better. Next time, I will definitely try
using a GPU to make training faster.
- Upon being faced with the problem of my agent not learning, I went
into research mode and got to learn a lot about DQN and its improved
versions. I am not a master of the algorithms yet (I have yet to get
an agent to perform well in the game), but I feel like I understand
how each version works.
- Rather than just following someone’s tutorial, also reading the
actual papers for that particular algorithm helped me understand the
algorithm better and code it.
- Doing this project reinforced that I love the concept of
reinforcement learning. It has made me even more interested in
exploring the field further and learning more.
# References / Resources
- [Reinforcement Learning (DQN) Tutorial, Adam
Paszke](https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html)
- [Train a mario-playing RL agent, Yuansong Feng, Suraj Subramanian,
Howard Wang, Steven
Guo](https://pytorch.org/tutorials/intermediate/mario_rl_tutorial.html)
- [About Double DQN, Dueling
DQN](https://horomary.hatenablog.com/entry/2021/02/06/013412)
- [Dueling Network Architectures for Deep Reinforcement Learning (Wang
et al., 2015)](https://arxiv.org/abs/1511.06581)
*(Final source code for the project can be found*
[*here*](https://github.com/00ber/ml-reinforcement-learning)*)*.