Commit e24c7c0 · Andrei Cozma · Updates
Parent(s): ed9cf21
README.md
CHANGED
@@ -8,68 +8,76 @@ Evolution of Reinforcement Learning methods from pure Dynamic Programming-based
 
 - Python 3
 - Gymnasium: <https://pypi.org/project/gymnasium/>
 - WandB: <https://pypi.org/project/wandb/>
+- Gradio: <https://pypi.org/project/gradio/>
 
+## Interactive Demo
+
+TODO
 
+## Dynamic-Programming Agent
 
+TODO
 
+### Usage
 
 ```bash
+TODO
 ```
 
+## Monte-Carlo Agent
 
+The agent starts with a randomly initialized epsilon-greedy policy and uses either the first-visit or every-visit Monte-Carlo update method to learn the optimal policy.
 
+Primarily tested on the [Cliff Walking](https://gymnasium.farama.org/environments/toy_text/cliff_walking/) toy environment.
 
 ```bash
+# Training: Policy will be saved as a `.npy` file.
+python3 MonteCarloAgent.py --train
 
+# Testing: Use the `--test` flag with the path to the policy file.
 python3 MonteCarloAgent.py --test policy_mc_CliffWalking-v0_e2000_s500_g0.99_e0.1.npy --render_mode human
 ```
 
+### Usage
 
-```
-usage: MonteCarloAgent.py [-h] [--train] [--test TEST] [--n_train_episodes N_TRAIN_EPISODES] [--n_test_episodes N_TEST_EPISODES] [--test_every TEST_EVERY] [--max_steps MAX_STEPS] [--
-                          [--render_mode RENDER_MODE] [--wandb_project WANDB_PROJECT] [--wandb_group WANDB_GROUP]
+```bash
+usage: MonteCarloAgent.py [-h] [--train] [--test TEST] [--n_train_episodes N_TRAIN_EPISODES] [--n_test_episodes N_TEST_EPISODES] [--test_every TEST_EVERY] [--max_steps MAX_STEPS] [--update_type {first_visit,every_visit}]
+                          [--save_dir SAVE_DIR] [--no_save] [--gamma GAMMA] [--epsilon EPSILON] [--env ENV] [--render_mode RENDER_MODE] [--wandb_project WANDB_PROJECT] [--wandb_group WANDB_GROUP]
+                          [--wandb_job_type WANDB_JOB_TYPE] [--wandb_run_name_suffix WANDB_RUN_NAME_SUFFIX]
 
 options:
   -h, --help            show this help message and exit
   --train               Use this flag to train the agent.
   --test TEST           Use this flag to test the agent. Provide the path to the policy file.
   --n_train_episodes N_TRAIN_EPISODES
-                        The number of episodes to train for.
+                        The number of episodes to train for. (default: 2000)
   --n_test_episodes N_TEST_EPISODES
-                        The number of episodes to test for.
+                        The number of episodes to test for. (default: 100)
   --test_every TEST_EVERY
-                        During training, test the agent every n episodes.
+                        During training, test the agent every n episodes. (default: 100)
   --max_steps MAX_STEPS
-                        The maximum number of steps per episode before the episode is forced to end.
+                        The maximum number of steps per episode before the episode is forced to end. (default: 500)
+  --update_type {first_visit,every_visit}
+                        The type of update to use. (default: first_visit)
+  --save_dir SAVE_DIR   The directory to save the policy to. (default: policies)
+  --no_save             Use this flag to disable saving the policy.
+  --gamma GAMMA         The value for the discount factor to use. (default: 0.99)
+  --epsilon EPSILON     The value for the epsilon-greedy policy to use. (default: 0.1)
+  --env ENV             The Gymnasium environment to use. (default: CliffWalking-v0)
   --render_mode RENDER_MODE
+                        Render mode passed to the gym.make() function. Use 'human' to render the environment. (default: None)
   --wandb_project WANDB_PROJECT
-                        WandB project name for logging. If not provided, no logging is done.
+                        WandB project name for logging. If not provided, no logging is done. (default: None)
   --wandb_group WANDB_GROUP
                         WandB group name for logging. (default: monte-carlo)
   --wandb_job_type WANDB_JOB_TYPE
                         WandB job type for logging. (default: train)
+  --wandb_run_name_suffix WANDB_RUN_NAME_SUFFIX
+                        WandB run name suffix for logging. (default: None)
 ```
 
 ## Presentation Guide
 
 1. Title Slide: list the title of your talk along with your name
 
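For context on the first-visit vs. every-visit distinction the updated README describes, here is a minimal sketch of the tabular Monte-Carlo update, assuming `Q` and `counts` are `(n_states, n_actions)` arrays and `episode_hist` is a list of `(state, action, reward)` tuples as in demo.py; this is an illustration, not the repository's actual `MonteCarloAgent` code.

```python
import numpy as np

def mc_update(Q, counts, episode_hist, gamma=0.99, update_type="first_visit"):
    """Update tabular action-values Q in place from one finished episode."""
    G = 0.0
    # Walk the episode backwards, accumulating the discounted return G.
    for t in reversed(range(len(episode_hist))):
        state, action, reward = episode_hist[t]
        G = gamma * G + reward
        # First-visit: only update at the earliest occurrence of (state, action).
        is_first = (state, action) not in [(s, a) for s, a, _ in episode_hist[:t]]
        if update_type == "every_visit" or is_first:
            counts[state, action] += 1
            # Incremental average of all returns observed for this pair.
            Q[state, action] += (G - Q[state, action]) / counts[state, action]

# Example on CliffWalking-v0's 48 states x 4 actions:
Q, counts = np.zeros((48, 4)), np.zeros((48, 4))
mc_update(Q, counts, [(36, 0, -1.0), (24, 1, -1.0), (25, 1, -1.0)])
```

The epsilon-greedy policy is then rebuilt from `Q` after each episode: act greedily with probability `1 - epsilon` (the `--epsilon 0.1` default above), uniformly at random otherwise.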
demo.py
CHANGED
@@ -1,6 +1,5 @@
 import os
 import time
-from matplotlib import interactive
 import numpy as np
 import gradio as gr
 from MonteCarloAgent import MonteCarloAgent
@@ -35,9 +34,10 @@ action_map = {
 }
 
 # Global variables to allow changing it on the fly
-live_render_fps =
+live_render_fps = 5
 live_epsilon = 0.0
 live_paused = False
+live_steps_forward = None
 
 
 def change_render_fps(x):
@@ -54,23 +54,44 @@ def change_epsilon(x):
 
 def change_paused(x):
     print("Changing paused:", x)
+    val_map = {
+        "▶️ Resume": False,
+        "⏸️ Pause": True,
+    }
+    val_map_inv = {v: k for k, v in val_map.items()}
     global live_paused
-    live_paused = x
-
-    return gr.update(value=
+    live_paused = val_map[x]
+    next_val = val_map_inv[not live_paused]
+    return gr.update(value=next_val), gr.update(interactive=live_paused)
+
+
+def onclick_btn_forward():
+    print("Step forward")
+    global live_steps_forward
+    if live_steps_forward is None:
+        live_steps_forward = 0
+    live_steps_forward += 1
 
 
 def run(policy_fname, n_test_episodes, max_steps, render_fps, epsilon):
-    global live_render_fps, live_epsilon
+    global live_render_fps, live_epsilon, live_paused, live_steps_forward
     live_render_fps = render_fps
     live_epsilon = epsilon
+    print("=" * 80)
     print("Running...")
+    print(f"- policy_fname: {policy_fname}")
     print(f"- n_test_episodes: {n_test_episodes}")
     print(f"- max_steps: {max_steps}")
     print(f"- render_fps: {live_render_fps}")
+    print(f"- epsilon: {live_epsilon}")
 
     policy_path = os.path.join(policies_folder, policy_fname)
     props = policy_fname.split("_")
+
+    if len(props) < 2:
+        yield None, None, None, None, None, None, None, None, None, None, "🚫 Please select a valid policy file."
+        return
+
     agent_type, env_name = props[0], props[1]
 
     agent = agent_map[agent_type](env_name, render_mode="rgb_array")
@@ -82,7 +103,9 @@ def run(policy_fname, n_test_episodes, max_steps, render_fps, epsilon):
     episodes_solved = 0
 
     def ep_str(episode):
-        return
+        return (
+            f"{episode} / {n_test_episodes} ({(episode) / n_test_episodes * 100:.2f}%)"
+        )
 
     def step_str(step):
         return f"{step + 1}"
@@ -93,8 +116,13 @@ def run(policy_fname, n_test_episodes, max_steps, render_fps, epsilon):
             max_steps=max_steps, render=True, override_epsilon=True
         )
     ):
+        if live_steps_forward is not None:
+            if live_steps_forward > 0:
+                live_steps_forward -= 1
+
+            if live_steps_forward == 0:
+                live_steps_forward = None
+                live_paused = True
 
         state, action, reward = episode_hist[-1]
         curr_policy = agent.Pi[state]
@@ -165,6 +193,14 @@ def run(policy_fname, n_test_episodes, max_steps, render_fps, epsilon):
 
         time.sleep(1 / live_render_fps)
 
+        while live_paused and live_steps_forward is None:
+            yield agent_type, env_name, rgb_array, policy_viz, ep_str(
+                episode + 1
+            ), ep_str(episodes_solved), step_str(
+                step
+            ), state, action, reward, "Paused..."
+            time.sleep(1 / live_render_fps)
+
         if solved:
             episodes_solved += 1
 
@@ -247,16 +283,22 @@ with gr.Blocks(title="CS581 Demo") as demo:
 
     with gr.Row():
         btn_pause = gr.components.Button("⏸️ Pause", interactive=True)
+        btn_forward = gr.components.Button("⏩ Step", interactive=False)
+
     btn_pause.click(
         fn=change_paused,
        inputs=[btn_pause],
-        outputs=[btn_pause],
+        outputs=[btn_pause, btn_forward],
+    )
+
+    btn_forward.click(
+        fn=onclick_btn_forward,
     )
 
     out_msg = gr.components.Textbox(
         value=""
         if all_policies
-        else "
+        else "ERROR: No policies found! Please train an agent first or add a policy to the policies folder.",
         label="Status Message",
     )
 
@@ -284,5 +326,5 @@ with gr.Blocks(title="CS581 Demo") as demo:
         ],
     )
 
-demo.queue(concurrency_count=
+demo.queue(concurrency_count=2)
 demo.launch()
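The pause/step-forward machinery added above reduces to one pattern: `run()` is a generator whose yields Gradio streams to the UI (hence the queued app), and the buttons flip module-level flags that the generator polls between frames. Below is a stripped-down sketch of that pattern, assuming a Gradio 3.x install where `queue(concurrency_count=...)` is available; the widget layout is illustrative, not the actual demo.

```python
import time
import gradio as gr

live_paused = False        # flipped by the Pause/Resume button
live_steps_forward = None  # number of single steps queued while paused

def change_paused(label):
    # The button's current label carries the state, as in demo.py.
    global live_paused
    live_paused = label == "⏸️ Pause"
    return gr.update(value="▶️ Resume" if live_paused else "⏸️ Pause")

def step_forward():
    global live_steps_forward
    live_steps_forward = (live_steps_forward or 0) + 1

def run():
    global live_paused, live_steps_forward
    for step in range(50):
        # Consume one queued single-step, then fall back to paused.
        if live_steps_forward is not None:
            live_steps_forward -= 1
            if live_steps_forward == 0:
                live_steps_forward = None
                live_paused = True
        while live_paused and live_steps_forward is None:
            time.sleep(0.1)  # busy-wait until resumed or stepped
        yield f"step {step}"  # streamed live because the app is queued
        time.sleep(0.2)

with gr.Blocks() as demo:
    out = gr.Textbox(label="Step")
    with gr.Row():
        btn_run = gr.Button("Run")
        btn_pause = gr.Button("⏸️ Pause")
        btn_step = gr.Button("⏩ Step")
    btn_run.click(fn=run, outputs=[out])
    btn_pause.click(fn=change_paused, inputs=[btn_pause], outputs=[btn_pause])
    btn_step.click(fn=step_forward)

# concurrency_count=2 mirrors the commit: without a second queue worker, the
# pause click would wait behind the still-running generator.
demo.queue(concurrency_count=2)
demo.launch()
```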