Spaces:
Runtime error
Runtime error
# Random Walks: Decision Tree Example | |
This example uses the Toy Problem described in [Decision Transformer (Lili Chen | |
et al. 2021)](https://arxiv.org/abs/2106.01345). | |
## Game Description | |
The task is to find the shortest path on a directed graph. The reward is based | |
on how optimal the path is compared to the shortest possible (bounded in [0, | |
1]). | |
Note this is different to the paper, which gave rewards of -1 for every | |
turn not at the goal state, and 0 at the goal state. Here the model instead | |
receives its reward at the end of the full trajectory, based on how optimal it | |
is compared to the minimum number of steps to reach the goal state (bounded in | |
[0, 1]). | |
Paths are represented as strings of letters, with each letter corresponding to a | |
node in the graph. | |
## Training | |
![Graph Example](graph-example.png) | |
Source: Decision Transformer (Lili Chen et al. 2021) | |
For PPO, a language model was fine-tuned to predict the next token in a sequence | |
of returns-to-go (sum of future rewards), states and actions. It was trained | |
only on random walk data. | |
ILQL by contrast learns from the samples directly. | |