---
title: CS581 Final Project Demo - Dynamic Programming & Monte-Carlo RL Methods
emoji: 🧠
colorFrom: yellow
colorTo: orange
sdk: gradio
app_file: demo.py
fullWidth: true
pinned: true
---

# CS581 Final Project - Dynamic Programming & Monte-Carlo RL Methods

Authors: Andrei Cozma and Landon Harris

Traces the evolution of Reinforcement Learning methods from pure Dynamic-Programming-based approaches to Monte-Carlo methods, and compares them through the lens of Bellman optimization.

[Google Slides](https://docs.google.com/presentation/d/1v4WwBQKoPnGiyCMXgUs-pCCJ8IwZqM3thUf-Ky00eTQ/edit?usp=sharing)

# 1. Requirements

Python 3.6+ with the following major dependencies:

- Gymnasium: <https://pypi.org/project/gymnasium/>
- WandB: <https://pypi.org/project/wandb/> (for logging)
- Gradio: <https://pypi.org/project/gradio/> (for demo web app)

Install all the dependencies using `pip`:

```bash
❯ pip3 install -r requirements.txt
```
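
A minimal `requirements.txt` covering just the three major dependencies above would look like the following (the project's actual file may pin versions and list additional packages):

```
gymnasium
wandb
gradio
```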

# 2. Interactive Demo

HuggingFace Space: [acozma/CS581-Algos-Demo](https://huggingface.co/spaces/acozma/CS581-Algos-Demo)

Launch the Gradio demo web app locally:

```bash
❯ python3 demo.py
Running on local URL:  http://127.0.0.1:7860
```

<img src="./assets/gradio_demo.png" height="600" />

# 3. Agents

## 3.1. Dynamic-Programming Agent

TODO
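
In the meantime, the sketch below shows the kind of Bellman-optimality sweep a tabular DP agent performs via value iteration. It is illustrative only, not the project's `DPAgent` implementation; function names and defaults are assumptions.

```python
# A minimal value-iteration sketch (illustrative; not the project's DPAgent).
import numpy as np
import gymnasium as gym

def value_iteration(env_name="FrozenLake-v1", gamma=0.99, theta=1e-8):
    env = gym.make(env_name)
    # Gymnasium's toy-text environments expose the tabular model:
    # P[s][a] -> list of (prob, next_state, reward, terminated) tuples.
    P = env.unwrapped.P
    n_s, n_a = env.observation_space.n, env.action_space.n
    V = np.zeros(n_s)

    def q_values(s):
        # One-step lookahead: expected return of each action under V.
        return [sum(p * (r + gamma * V[ns] * (not done))
                    for p, ns, r, done in P[s][a]) for a in range(n_a)]

    while True:
        delta = 0.0
        for s in range(n_s):
            best = max(q_values(s))  # Bellman optimality backup
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:  # stop once sweeps no longer change V
            break

    # Extract the greedy policy from the converged value function.
    policy = np.array([int(np.argmax(q_values(s))) for s in range(n_s)])
    return V, policy
```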

## 3.2. Monte-Carlo Agent

This is an implementation of an on-policy Monte-Carlo agent that solves several toy problems from the Gymnasium library (the maintained fork of OpenAI Gym).  

The agent starts with a randomly initialized epsilon-greedy policy and uses either the first-visit or the every-visit Monte-Carlo update to learn the optimal policy. Training is performed with the soft (epsilon-greedy) policy, while testing uses the resulting greedy policy.
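
For illustration, here is a minimal sketch of on-policy first-visit Monte-Carlo control in this style. Names, defaults, and structure are assumptions for the example, not the project's `MCAgent` implementation.

```python
# On-policy first-visit Monte-Carlo control (illustrative sketch).
import numpy as np
import gymnasium as gym

def mc_control(env_name="CliffWalking-v0", n_episodes=2500,
               gamma=0.99, epsilon=0.4, max_steps=200):
    env = gym.make(env_name)
    n_s, n_a = env.observation_space.n, env.action_space.n
    Q = np.zeros((n_s, n_a))       # action-value estimates
    counts = np.zeros((n_s, n_a))  # visit counts for incremental averaging

    for _ in range(n_episodes):
        # Roll out one episode under the soft (epsilon-greedy) policy.
        episode = []
        state, _ = env.reset()
        for _ in range(max_steps):
            if np.random.rand() < epsilon:
                action = env.action_space.sample()
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state
            if terminated or truncated:
                break

        # First-visit update: only the earliest occurrence of each (s, a)
        # pair in the episode contributes to the running mean of returns.
        first_idx = {}
        for i, (s, a, _) in enumerate(episode):
            first_idx.setdefault((s, a), i)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r  # discounted return from step t
            if first_idx[(s, a)] == t:
                counts[s, a] += 1
                Q[s, a] += (G - Q[s, a]) / counts[s, a]

    return np.argmax(Q, axis=1)  # greedy policy used at test time
```

The every-visit variant drops the `first_idx` check and updates `Q[s, a]` at every occurrence of the pair along the episode.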

### Parameter Testing Results

**CliffWalking-v0**  

<table>
  <tr>
    <td><img src="./plots/MC/MCAgent_CliffWalking-v0_gammas.png"/></td>
    <td><img src="./plots/MC/MCAgent_CliffWalking-v0_epsilons.png"/></td>
  </tr>
</table>

**FrozenLake-v1**  
<table>
  <tr>
    <td><img src="./plots/MC/MCAgent_FrozenLake-v1_gammas.png"/></td>
    <td><img src="./plots/MC/MCAgent_FrozenLake-v1_epsilons.png"/></td>
  </tr>
</table>

**Taxi-v3**  
<table>
  <tr>
    <td><img src="./plots/MC/MCAgent_Taxi-v3_gammas.png"/></td>
    <td><img src="./plots/MC/MCAgent_Taxi-v3_epsilons.png"/></td>
  </tr>
</table>

# 4. Run Script Usage

```bash
# Training: Policy will be saved as a `.npy` file.
❯ python3 run.py --agent "MCAgent" --train

# Testing: Use the `--test` flag with the path to the policy file.
❯ python3 run.py --agent "MCAgent" --test "./policies/[saved_policy_file].npy" --render_mode human

❯ python3 run.py --help
usage: run.py [-h] [--train] [--test TEST] [--n_train_episodes N_TRAIN_EPISODES] [--n_test_episodes N_TEST_EPISODES] [--test_every TEST_EVERY] [--max_steps MAX_STEPS] --agent {MCAgent,DPAgent} [--gamma GAMMA] [--epsilon EPSILON] [--update_type {first_visit,every_visit}]
              [--env {CliffWalking-v0,FrozenLake-v1,Taxi-v3}] [--seed SEED] [--size SIZE] [--render_mode RENDER_MODE] [--save_dir SAVE_DIR] [--no_save] [--run_name_suffix RUN_NAME_SUFFIX] [--wandb_project WANDB_PROJECT] [--wandb_job_type WANDB_JOB_TYPE]

options:
  -h, --help            show this help message and exit
  --train               Use this flag to train the agent.
  --test TEST           Use this flag to test the agent. Provide the path to the policy file.
  --n_train_episodes N_TRAIN_EPISODES
                        The number of episodes to train for. (default: 2500)
  --n_test_episodes N_TEST_EPISODES
                        The number of episodes to test for. (default: 100)
  --test_every TEST_EVERY
                        During training, test the agent every n episodes. (default: 100)
  --max_steps MAX_STEPS
                        The maximum number of steps per episode before the episode is forced to end. (default: 200)
  --agent {MCAgent,DPAgent}
                        The agent to use. Currently supports one of: ['MCAgent', 'DPAgent']
  --gamma GAMMA         The value for the discount factor to use. (default: 0.99)
  --epsilon EPSILON     The value for the epsilon-greedy policy to use. (default: 0.4)
  --update_type {first_visit,every_visit}
                        The type of update to use. Only supported by Monte-Carlo agent. (default: first_visit)
  --env {CliffWalking-v0,FrozenLake-v1,Taxi-v3}
                        The Gymnasium environment to use. (default: CliffWalking-v0)
  --seed SEED           The seed to use when generating the FrozenLake environment. If not provided, a random seed is used. (default: None)
  --size SIZE           The size to use when generating the FrozenLake environment. (default: 8)
  --render_mode RENDER_MODE
                        Render mode passed to the gym.make() function. Use 'human' to render the environment. (default: None)
  --save_dir SAVE_DIR   The directory to save the policy to. (default: policies)
  --no_save             Use this flag to disable saving the policy.
  --run_name_suffix RUN_NAME_SUFFIX
                        Run name suffix for logging and policy checkpointing. (default: None)
  --wandb_project WANDB_PROJECT
                        WandB project name for logging. If not provided, no logging is done. (default: None)
  --wandb_job_type WANDB_JOB_TYPE
                        WandB job type for logging. (default: train)
```
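
For example, a training run on FrozenLake-v1 with every-visit updates and WandB logging might look like this (the WandB project name is illustrative):

```bash
❯ python3 run.py --agent "MCAgent" --env "FrozenLake-v1" --update_type "every_visit" \
    --n_train_episodes 5000 --wandb_project "cs581-experiments" --train
```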