


# Deep Bayesian Bandits Library
This library corresponds to the *[Deep Bayesian Bandits Showdown: An Empirical
Comparison of Bayesian Deep Networks for Thompson
Sampling](https://arxiv.org/abs/1802.09127)* paper, published in
[ICLR](https://iclr.cc/) 2018. We provide a benchmark to test decision-making
algorithms for contextual bandits. In particular, the current library implements
a variety of algorithms (many of them based on approximate Bayesian Neural
Networks and Thompson sampling), and a number of real and synthetic data
problems exhibiting a diverse set of properties.
It is a Python library that uses [TensorFlow](https://www.tensorflow.org/).
We encourage contributors to add new approximate Bayesian Neural Networks or,
more generally, contextual bandits algorithms to the library. Also, we would
like to extend the data sources over time, so we warmly encourage contributions
on this front too!
Please use the following when citing the code or the paper:
```
@article{riquelme2018deep, title={Deep Bayesian Bandits Showdown: An Empirical
Comparison of Bayesian Deep Networks for Thompson Sampling},
author={Riquelme, Carlos and Tucker, George and Snoek, Jasper},
journal={International Conference on Learning Representations, ICLR.}, year={2018}}
```
**Contact**. This repository is maintained by [Carlos Riquelme](http://rikel.me) ([rikel](https://github.com/rikel)). Feel free to reach out directly at [[email protected]](mailto:[email protected]) with any questions or comments.
We first briefly introduce contextual bandits and Thompson sampling, enumerate
the implemented algorithms, and describe the available data sources. Then, we
provide a simple complete example illustrating how to use the library.
## Contextual Bandits
Contextual bandits are a rich decision-making framework where an algorithm has
to choose among a set of *k* actions at every time step *t*, after observing
a context (or side-information) denoted by *X<sub>t</sub>*. The general pseudocode for
the process if we use algorithm **A** is as follows:
```
At time t = 1, ..., T:
1. Observe new context: X_t
2. Choose action: a_t = A.action(X_t)
3. Observe reward: r_t
4. Update internal state of the algorithm: A.update((X_t, a_t, r_t))
```
The goal is to maximize the total sum of rewards: ∑<sub>t</sub> r<sub>t</sub>.
For example, each *X<sub>t</sub>* could encode the properties of a specific user (and
the time or day), and we may have to choose an ad, discount coupon, treatment,
hyper-parameters, or version of a website to show or provide to the user.
Hopefully, over time, we will learn how to match each type of user to the most
beneficial personalized action under some metric (the reward).
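To make the action/update interface in the pseudocode concrete, here is a toy stand-alone implementation of the algorithm side (an epsilon-greedy baseline with one ridge regression per action). This is an illustrative sketch, not part of the library; the class and all names are hypothetical.

```python
import numpy as np

class EpsilonGreedyBandit:
    """Toy contextual bandit algorithm exposing the A.action / A.update
    interface from the pseudocode: epsilon-greedy over per-action
    ridge-regression reward models."""

    def __init__(self, num_actions, context_dim, epsilon=0.1):
        self.epsilon = epsilon
        self.num_actions = num_actions
        # One linear reward model per action.
        self.weights = np.zeros((num_actions, context_dim))
        self.data = [[] for _ in range(num_actions)]  # (context, reward) pairs

    def action(self, context):
        # With probability epsilon, explore a random action.
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.num_actions)
        # Otherwise act greedily on the current reward estimates.
        return int(np.argmax(self.weights @ context))

    def update(self, context, action, reward):
        # Store the new datapoint and refit the chosen action's model.
        self.data[action].append((context, reward))
        X = np.array([c for c, _ in self.data[action]])
        y = np.array([r for _, r in self.data[action]])
        d = X.shape[1]
        # Ridge regression: (X'X + I)^-1 X'y.
        self.weights[action] = np.linalg.solve(X.T @ X + np.eye(d), X.T @ y)
```

The library's algorithms follow the same observe-context, choose-action, observe-reward, update pattern shown in the pseudocode above.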
## Thompson Sampling
Thompson Sampling is a meta-algorithm that chooses an action for the contextual
bandit in a statistically efficient manner, simultaneously finding the best arm
while attempting to incur low cost. Informally speaking, we assume the expected
reward is given by some function
**E**[r<sub>t</sub> | X<sub>t</sub>, a<sub>t</sub>] = f(X<sub>t</sub>, a<sub>t</sub>).
Unfortunately, function **f** is unknown, as otherwise we could just choose the
action with highest expected value:
a<sub>t</sub><sup>*</sup> = arg max<sub>i</sub> f(X<sub>t</sub>, a<sub>i</sub>).
The idea behind Thompson Sampling is based on keeping a posterior distribution
π<sub>t</sub> over functions in some family f ∈ F after observing the first
*t-1* datapoints. Then, at time *t*, we sample one potential explanation of
the underlying process: f<sub>t</sub> ∼ π<sub>t</sub>, and act optimally (i.e., greedily)
*according to f<sub>t</sub>*. In other words, we choose
a<sub>t</sub> = arg max<sub>i</sub> f<sub>t</sub>(X<sub>t</sub>, a<sub>i</sub>).
Finally, we update our posterior distribution with the new collected
datapoint (X<sub>t</sub>, a<sub>t</sub>, r<sub>t</sub>).
The main issue is that keeping an updated posterior π<sub>t</sub> (or, even,
sampling from it) is often intractable for highly parameterized models like deep
neural networks. The algorithms we list in the next section provide tractable
*approximations* that can be used in combination with Thompson Sampling to solve
the contextual bandit problem.
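For intuition, the sample-then-act-greedily idea is easiest to see in the context-free Bernoulli case, where the posterior over each arm's mean reward is a conjugate Beta distribution. This is a toy sketch for illustration only, unrelated to the library's code:

```python
import numpy as np

def thompson_sampling_bernoulli(true_probs, num_steps, seed=0):
    """Context-free Thompson Sampling with Beta(1, 1) priors on each arm."""
    rng = np.random.default_rng(seed)
    k = len(true_probs)
    successes = np.ones(k)  # Beta posterior alpha per arm
    failures = np.ones(k)   # Beta posterior beta per arm
    total_reward = 0
    for _ in range(num_steps):
        # Sample one plausible model f_t from the posterior pi_t ...
        sampled_means = rng.beta(successes, failures)
        # ... and act greedily (optimally) according to f_t.
        arm = int(np.argmax(sampled_means))
        reward = rng.random() < true_probs[arm]
        total_reward += reward
        # Conjugate posterior update with the new datapoint.
        if reward:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return total_reward, successes, failures
```

As the posterior concentrates, the sampled models increasingly agree with the true best arm, so exploration fades naturally over time.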
## Algorithms
The Deep Bayesian Bandits library includes the following algorithms (see the
[paper](https://arxiv.org/abs/1802.09127) for further details):
1. **Linear Algorithms**. As a powerful baseline, we provide linear algorithms.
In particular, we focus on the exact Bayesian linear regression
    implementation, though it is easy to derive the greedy OLS version (possibly
    with epsilon-greedy exploration). The algorithm is implemented in
*linear_full_posterior_sampling.py*, and it is instantiated as follows:
```
linear_full = LinearFullPosteriorSampling('MyLinearTS', my_hparams)
```
2. **Neural Linear**. We introduce an algorithm we call Neural Linear, which
operates by learning a neural network to map contexts to rewards for each
    action while, simultaneously, updating a Bayesian linear regression on
the last layer (i.e., the one that maps the final representation **z** to
the rewards **r**). Thompson Sampling samples the linear parameters
β<sub>i</sub> for each action *i*, but keeps the network that computes the
representation. Then, both parts (network and Bayesian linear regression)
are updated, possibly at different frequencies. The algorithm is implemented
in *neural_linear_sampling.py*, and we create an algorithm instance like
this:
```
neural_linear = NeuralLinearPosteriorSampling('MyNLinear', my_hparams)
```
3. **Neural Greedy**. Another standard benchmark is to train a neural network
that maps contexts to rewards, and at each time *t* just acts greedily
according to the current model. In particular, this approach does *not*
explicitly use Thompson Sampling. However, due to stochastic gradient
    descent, there is still some randomness in its output. It is
    straightforward to add epsilon-greedy exploration to choose random
actions with probability ε ∈ (0, 1). The algorithm is
implemented in *neural_bandit_model.py*, and it is used together with
*PosteriorBNNSampling* (defined in *posterior_bnn_sampling.py*) by calling:
```
neural_greedy = PosteriorBNNSampling('MyNGreedy', my_hparams, 'RMSProp')
```
4. **Stochastic Variational Inference**, Bayes by Backpropagation. We implement
a Bayesian neural network by modeling each individual weight posterior as a
univariate Gaussian distribution: w<sub>ij</sub> ∼ N(μ<sub>ij</sub>, σ<sub>ij</sub><sup>2</sup>).
Thompson sampling then samples a network at each time step
    by sampling each weight independently. The variational approach consists of
    maximizing the ELBO (variational lower bound), a tractable proxy for the
    likelihood of the observed data, to fit the values of μ<sub>ij</sub>, σ<sub>ij</sub><sup>2</sup>
for every *i, j*.
See [Weight Uncertainty in Neural
Networks](https://arxiv.org/abs/1505.05424).
The BNN algorithm is implemented in *variational_neural_bandit_model.py*,
and it is used together with *PosteriorBNNSampling* (defined in
*posterior_bnn_sampling.py*) by calling:
```
bbb = PosteriorBNNSampling('myBBB', my_hparams, 'Variational')
```
5. **Expectation-Propagation**, Black-box alpha-divergence minimization.
    The family of expectation-propagation algorithms is based on the message
    passing framework. They iteratively approximate the posterior by updating a
single approximation factor (or site) at a time, which usually corresponds
to the likelihood of one data point. We focus on methods that directly
optimize the global EP objective via stochastic gradient descent, as, for
    instance, Power EP. For further details, see the original paper below.
See [Black-box alpha-divergence
Minimization](https://arxiv.org/abs/1511.03243).
We create an instance of the algorithm like this:
```
bb_adiv = PosteriorBNNSampling('MyEP', my_hparams, 'AlphaDiv')
```
6. **Dropout**. Dropout is a training technique where the output of each neuron
is independently zeroed out with probability *p* at each forward pass.
Once the network has been trained, dropout can still be used to obtain a
distribution of predictions for a specific input. Following the best action
with respect to the random dropout prediction can be interpreted as an
implicit form of Thompson sampling. The code for dropout is the same as for
Neural Greedy (see above), but we need to set two hyper-parameters:
*use_dropout=True* and *keep_prob=p* where *p* takes the desired value in
(0, 1). Then:
```
dropout = PosteriorBNNSampling('MyDropout', my_hparams, 'RMSProp')
```
7. **Monte Carlo Methods**. To be added soon.
8. **Bootstrapped Networks**. This algorithm simultaneously trains **q** neural
    networks in parallel, based on different datasets D<sub>1</sub>, ..., D<sub>q</sub>.
    These datasets are collected by adding each new datapoint
    (X<sub>t</sub>, a<sub>t</sub>, r<sub>t</sub>) to each dataset *D<sub>i</sub>* independently with
    probability p ∈ (0, 1]. Therefore, the main hyperparameters of the
algorithm are **(q, p)**. In order to choose an action for a new context,
one of the **q** networks is first selected with uniform probability (i.e.,
*1/q*). Then, the best action according to the *selected* network is
played.
See [Deep Exploration via Bootstrapped
DQN](https://arxiv.org/abs/1602.04621).
The algorithm is implemented in *bootstrapped_bnn_sampling.py*, and we
instantiate it as (where *my_hparams* contains both **q** and **p**):
```
bootstrap = BootstrappedBNNSampling('MyBoot', my_hparams)
```
9. **Parameter-Noise**. Another approach to approximate a distribution over
    neural networks (or, more generally, models) that map contexts to rewards
    consists of randomly perturbing a point estimate trained by Stochastic
Gradient Descent on the data. The Parameter-Noise algorithm uses a heuristic
to control the amount of noise σ<sub>t</sub><sup>2</sup> it adds independently to the
parameters representing a neural network: θ<sub>t</sub><sup>'</sup> = θ<sub>t</sub> + ε where
ε ∼ N(0, σ<sub>t</sub><sup>2</sup> Id).
After using θ<sub>t</sub><sup>'</sup> for decision making, the following SGD
training steps start again from θ<sub>t</sub>. The key hyperparameters to set
are those controlling the noise heuristic.
See [Parameter Space Noise for
Exploration](https://arxiv.org/abs/1706.01905).
The algorithm is implemented in *parameter_noise_sampling.py*, and we create
an instance by calling:
```
parameter_noise = ParameterNoiseSampling('MyParamNoise', my_hparams)
```
10. **Gaussian Processes**. Another standard benchmark is the Gaussian Process;
see *Gaussian Processes for Machine Learning* by Rasmussen and Williams for
an introduction. To model the expected reward of different actions, we fit a
multitask GP.
See [Multi-task Gaussian Process
Prediction](http://papers.nips.cc/paper/3189-multi-task-gaussian-process-prediction.pdf).
Our implementation is provided in *multitask_gp.py*, and it is instantiated
as follows:
```
gp = PosteriorBNNSampling('MyMultitaskGP', my_hparams, 'GP')
```
In the code snippet at the bottom, we show how to instantiate some of these
algorithms, how to run the contextual bandit simulator, and how to display the
high-level results.
## Data
In the paper we use two types of contextual datasets: synthetic and based on
real-world data.
We provide functions that sample problems from those datasets. In the case of
real-world data, you first need to download the raw datasets, and pass the path
to the functions. Links for the datasets are provided below.
### Synthetic Datasets
Synthetic datasets are contained in the *synthetic_data_sampler.py* file. In
particular, it includes:
1. **Linear data**. Provides a number of linear arms, and Gaussian contexts.
2. **Sparse linear data**. Provides a number of sparse linear arms, and
Gaussian contexts.
3. **Wheel bandit data**. Provides sampled data from the wheel bandit data, see
[Section 5.4](https://arxiv.org/abs/1802.09127) in the paper.
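For illustration, a linear problem of this kind can be generated in a few lines of NumPy. This is a simplified stand-alone sketch, not the code in *synthetic_data_sampler.py*; the function and variable names are hypothetical:

```python
import numpy as np

def sample_linear_bandit_data(num_contexts, context_dim, num_actions,
                              noise_std=0.1, seed=0):
    """Gaussian contexts; each arm's expected reward is linear in the
    context, observed with Gaussian noise. Returns the dataset together
    with the optimal rewards and actions for later regret computation."""
    rng = np.random.default_rng(seed)
    # One normalized linear parameter vector per arm.
    betas = rng.normal(size=(context_dim, num_actions))
    betas /= np.linalg.norm(betas, axis=0)
    # Gaussian contexts, one per row.
    contexts = rng.normal(size=(num_contexts, context_dim))
    mean_rewards = contexts @ betas
    rewards = mean_rewards + noise_std * rng.normal(size=mean_rewards.shape)
    # The optimal action maximizes the noiseless expected reward.
    opt_actions = np.argmax(mean_rewards, axis=1)
    opt_rewards = mean_rewards[np.arange(num_contexts), opt_actions]
    # Pack contexts and per-arm rewards side by side, one row per time step.
    dataset = np.hstack((contexts, rewards))
    return dataset, opt_rewards, opt_actions
```

The sparse linear variant differs only in zeroing out most coordinates of each arm's parameter vector.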
### Real-World Datasets
Real-world data generating functions are contained in the *data_sampler.py*
file.
In particular, it includes:
1. **Mushroom data**. Each incoming context represents a different type of
mushroom, and the actions are eat or no-eat. Eating an edible mushroom
provides positive reward, while eating a poisonous one provides positive
reward with probability *p*, and a large negative reward with probability
*1-p*. All the rewards, and the value of *p* are customizable. The
[dataset](https://archive.ics.uci.edu/ml/datasets/mushroom) is part of the
UCI repository, and the bandit problem was proposed in Blundell et al.
(2015). Data is available [here](https://storage.googleapis.com/bandits_datasets/mushroom.data)
or alternatively [here](https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/),
use the *agaricus-lepiota.data* file.
2. **Stock data**. We created the Financial Dataset by pulling the stock prices
of *d = 21* publicly traded companies in NYSE and Nasdaq, for the last 14
years (*n = 3713*). For each day, the context was the price difference
between the beginning and end of the session for each stock. We
synthetically created the arms to be a linear combination of the contexts,
representing *k = 8* different potential portfolios. Data is available
[here](https://storage.googleapis.com/bandits_datasets/raw_stock_contexts).
3. **Jester data**. We create a recommendation system bandit problem as
follows. The Jester Dataset (Goldberg et al., 2001) provides continuous
ratings in *[-10, 10]* for 100 jokes from a total of 73421 users. We find
a *complete* subset of *n = 19181* users rating all 40 jokes. Following
Riquelme et al. (2017), we take *d = 32* of the ratings as the context of
the user, and *k = 8* as the arms. The agent recommends one joke, and
obtains the reward corresponding to the rating of the user for the selected
joke. Data is available [here](https://storage.googleapis.com/bandits_datasets/jester_data_40jokes_19181users.npy).
4. **Statlog data**. The Shuttle Statlog Dataset (Asuncion & Newman, 2007)
provides the value of *d = 9* indicators during a space shuttle flight,
and the goal is to predict the state of the radiator subsystem of the
shuttle. There are *k = 7* possible states, and if the agent selects the
right state, then reward 1 is generated. Otherwise, the agent obtains no
reward (*r = 0*). The most interesting aspect of the dataset is that one
action is the optimal one in 80% of the cases, and some algorithms may
commit to this action instead of further exploring. In this case, the number
of contexts is *n = 43500*. Data is available [here](https://storage.googleapis.com/bandits_datasets/shuttle.trn) or alternatively
[here](https://archive.ics.uci.edu/ml/datasets/Statlog+\(Shuttle\)), use
*shuttle.trn* file.
5. **Adult data**. The Adult Dataset (Kohavi, 1996; Asuncion & Newman, 2007)
comprises personal information from the US Census Bureau database, and the
standard prediction task is to determine if a person makes over 50K a year
or not. However, we consider the *k = 14* different occupations as
feasible actions, based on *d = 94* covariates (many of them binarized).
As in previous datasets, the agent obtains a reward of 1 for making the
right prediction, and 0 otherwise. The total number of contexts is *n =
45222*. Data is available [here](https://storage.googleapis.com/bandits_datasets/adult.full) or alternatively
[here](https://archive.ics.uci.edu/ml/datasets/adult), use *adult.data*
file.
6. **Census data**. The US Census (1990) Dataset (Asuncion & Newman, 2007)
contains a number of personal features (age, native language, education...)
which we summarize in *d = 389* covariates, including binary dummy
variables for categorical features. Our goal again is to predict the
occupation of the individual among *k = 9* classes. The agent obtains
reward 1 for making the right prediction, and 0 otherwise. Data is available
[here](https://storage.googleapis.com/bandits_datasets/USCensus1990.data.txt) or alternatively [here](https://archive.ics.uci.edu/ml/datasets/US+Census+Data+\(1990\)), use
*USCensus1990.data.txt* file.
7. **Covertype data**. The Covertype Dataset (Asuncion & Newman, 2007)
classifies the cover type of northern Colorado forest areas in *k = 7*
classes, based on *d = 54* features, including elevation, slope, aspect,
and soil type. Again, the agent obtains reward 1 if the correct class is
selected, and 0 otherwise. Data is available [here](https://storage.googleapis.com/bandits_datasets/covtype.data) or alternatively
[here](https://archive.ics.uci.edu/ml/datasets/covertype), use
*covtype.data* file.
In datasets 4-7, each feature of the dataset is normalized first.
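As an illustration of the mushroom reward structure described in item 1, here is a hedged sketch; the reward values below are just examples (in the library, both the rewards and *p* are customizable), and this helper function is hypothetical, not the code in *data_sampler.py*:

```python
import numpy as np

def mushroom_reward(is_edible, action, rng, p=0.5,
                    r_eat_good=5.0, r_eat_bad=-35.0):
    """Reward for the mushroom bandit. action: 0 = no-eat, 1 = eat.
    Eating an edible mushroom pays r_eat_good; eating a poisonous one
    pays r_eat_good with probability p and r_eat_bad otherwise."""
    if action == 0:
        return 0.0  # not eating is always safe
    if is_edible:
        return r_eat_good
    # Poisonous mushroom: stochastic outcome.
    return r_eat_good if rng.random() < p else r_eat_bad
```

With these example values, eating a poisonous mushroom has negative expected reward, so a well-calibrated algorithm should learn to eat only (predicted) edible ones.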
## Usage: Basic Example
This library requires TensorFlow, NumPy, and Pandas.
The file *example_main.py* provides a complete example of how to use the
library. We run the code:
```
python example_main.py
```
**Do not forget** to configure the paths to the data files at the top of *example_main.py*.
For example, we can run the Mushroom bandit for 2000 contexts on a few
algorithms as follows:
```
# Problem parameters
num_contexts = 2000
# Choose data source among:
# {linear, sparse_linear, mushroom, financial, jester,
# statlog, adult, covertype, census, wheel}
data_type = 'mushroom'
# Create dataset
sampled_vals = sample_data(data_type, num_contexts)
dataset, opt_rewards, opt_actions, num_actions, context_dim = sampled_vals
# Define hyperparameters and algorithms
hparams_linear = tf.contrib.training.HParams(num_actions=num_actions,
context_dim=context_dim,
a0=6,
b0=6,
lambda_prior=0.25,
initial_pulls=2)
hparams_dropout = tf.contrib.training.HParams(num_actions=num_actions,
context_dim=context_dim,
init_scale=0.3,
activation=tf.nn.relu,
layer_sizes=[50],
batch_size=512,
activate_decay=True,
initial_lr=0.1,
max_grad_norm=5.0,
show_training=False,
freq_summary=1000,
buffer_s=-1,
initial_pulls=2,
optimizer='RMS',
reset_lr=True,
lr_decay_rate=0.5,
training_freq=50,
training_epochs=100,
keep_prob=0.80,
use_dropout=True)
### Create hyper-parameter configurations for other algorithms
[...]
algos = [
UniformSampling('Uniform Sampling', hparams),
PosteriorBNNSampling('Dropout', hparams_dropout, 'RMSProp'),
PosteriorBNNSampling('BBB', hparams_bbb, 'Variational'),
NeuralLinearPosteriorSampling('NeuralLinear', hparams_nlinear),
LinearFullPosteriorSampling('LinFullPost', hparams_linear),
BootstrappedBNNSampling('BootRMS', hparams_boot),
ParameterNoiseSampling('ParamNoise', hparams_pnoise),
]
# Run contextual bandit problem
t_init = time.time()
results = run_contextual_bandit(context_dim, num_actions, dataset, algos)
_, h_rewards = results
# Display results
display_results(algos, opt_rewards, opt_actions, h_rewards, t_init, data_type)
```
The previous code leads to final results that look like:
```
---------------------------------------------------
---------------------------------------------------
mushroom bandit completed after 69.8401839733 seconds.
---------------------------------------------------
0) LinFullPost | total reward = 4365.0.
1) NeuralLinear | total reward = 4110.0.
2) Dropout | total reward = 3430.0.
3) ParamNoise | total reward = 3270.0.
4) BootRMS | total reward = 3050.0.
5) BBB | total reward = 2505.0.
6) Uniform Sampling | total reward = -4930.0.
---------------------------------------------------
Optimal total reward = 5235.
Frequency of optimal actions (action, frequency):
[[0, 953], [1, 1047]]
---------------------------------------------------
---------------------------------------------------
```
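The gap shown above between the optimal total reward and each algorithm's total reward is the cumulative regret. It can be computed from *opt_rewards* and the reward history; this sketch assumes *h_rewards* stacks one column of collected rewards per algorithm, which may differ from the library's actual layout:

```python
import numpy as np

def cumulative_regret(opt_rewards, h_rewards):
    """Cumulative regret per algorithm: the optimal total reward minus the
    reward each algorithm actually collected. Assumes h_rewards has shape
    (num_contexts, num_algorithms)."""
    opt_total = np.sum(opt_rewards)
    return opt_total - np.sum(h_rewards, axis=0)
```

For instance, LinFullPost's regret in the run above would be 5235 - 4365 = 870.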