Weber’s proof of Gittins Index Theorem
February 10, 2018 in Operations research, Uncategorized | Tags: dynamic programming, multiarmed bandit
Eran considers the following as important: dynamic programming, multiarmed bandit, Operations research, Uncategorized
After presenting Richard Weber’s remarkable proof of Gittins’ index theorem in my dynamic optimization class, I claimed that the best way to make sure that you understand a proof is to identify where the assumptions of the theorem are used. Here is the proof again, slightly modified from Weber’s paper, followed by the question I gave in class.
First, an arm or a bandit process is given by a countable state space $S$, a transition function $q(\cdot \mid s)$ and a payoff function $r$. The interpretation is that at every period, when the arm is at state $s$, playing it gives a reward $r(s)$ and the arm’s state changes according to $q(\cdot \mid s)$.
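To make the definition concrete, here is a minimal Python sketch of an arm. It assumes a finite state space with states numbered from 0 (the post allows any countable state space), and the names `Arm`, `rewards`, `transitions` and `play` are hypothetical, chosen only for illustration.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Arm:
    """A bandit process with finitely many states 0..n-1 (illustrative helper)."""
    rewards: List[float]            # rewards[s]: payoff for playing the arm at state s
    transitions: List[List[float]]  # transitions[s][t]: probability of moving from s to t
    state: int = 0                  # current state of the arm

    def play(self, rng: random.Random) -> float:
        """Collect the reward of the current state, then move to a random next state."""
        reward = self.rewards[self.state]
        next_states = range(len(self.rewards))
        self.state = rng.choices(next_states, weights=self.transitions[self.state])[0]
        return reward


# A two-state arm: state 0 pays 1 and may decay into the absorbing, worthless state 1.
arm = Arm(rewards=[1.0, 0.0], transitions=[[0.5, 0.5], [0.0, 1.0]])
first_reward = arm.play(random.Random(0))
```

An arm that is not played keeps its `state` untouched, which is all the multi-armed problem needs from it.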
In the multi-armed bandit problem, at every period you choose an arm to play. The states of the arms you didn’t choose remain fixed. Your goal is to maximize expected total discounted rewards. Gittins’ theorem says that for each arm there exists a function on its state space, called the Gittins Index (GI from now on), such that, in a multi-armed problem, the optimal strategy is to play at each period the arm whose current state has the largest GI. In fancy words, the theorem establishes that the choice of which arm to play at each period satisfies Independence of Irrelevant Alternatives: Suppose there are three arms $A$, $B$, $C$ whose current states are $a$, $b$, $c$. If you were going to start by playing $A$ if only $A$ and $B$ were available, then you should not start with $B$ when $A$, $B$, $C$ are available.
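The strategy described by the theorem is easy to write down once the indices are known. The sketch below takes the index of every state as given input (it does not compute it), truncates the infinite discounted sum at a finite `horizon`, and breaks ties in favour of the first arm. These are simplifying assumptions of the example, and the function name and argument layout are hypothetical.

```python
import random


def run_index_policy(rewards, transitions, indices, states, beta, horizon, rng):
    """Simulate an index strategy and return the discounted rewards it collects.

    rewards[i][s], transitions[i][s][t] and indices[i][s] describe arm i,
    states[i] is the current state of arm i, and beta is the discount factor.
    """
    states = list(states)  # copy, so the caller's list is not modified
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        # Play the arm whose current state has the largest index.
        i = max(range(len(states)), key=lambda j: indices[j][states[j]])
        s = states[i]
        total += discount * rewards[i][s]
        # Only the played arm moves; the other arms stay where they are.
        states[i] = rng.choices(range(len(rewards[i])), weights=transitions[i][s])[0]
        discount *= beta
    return total


# Arm 0 pays 1 once and then becomes worthless; arm 1 pays 0.5 forever.
rewards = [[1.0, 0.0], [0.5]]
transitions = [[[0.0, 1.0], [0.0, 1.0]], [[1.0]]]
indices = [[1.0, 0.0], [0.5]]  # index values supplied by hand for this toy example
total = run_index_policy(rewards, transitions, indices, [0, 0], 0.5, 3, random.Random(0))
```

In this toy example the strategy plays arm 0 once and then switches to arm 1 for the remaining periods.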
The proof proceeds in several steps:
1. Define the Gittins Index at state $s$ to be the amount $\gamma$ such that, if the casino charges $\gamma$ every time you play the arm, then both playing and not playing are optimal actions at the state $s$. We need to prove that there exists a unique such $\gamma$. This is not completely obvious, but can be shown by appealing to standard dynamic programming arguments.
2. Assume that you enter a casino with a single arm at some state $s$ with GI $\gamma$. Assume also that the casino charges $\gamma$ every time you play the arm. At every period, you can play, or quit playing, or take a break. From step 1, it follows that regardless of your strategy, the casino always gets a nonnegative expected net payoff, and if you play optimally then the expected net payoff to the casino (and therefore also to you) is zero. For this reason $\gamma$ (the GI of the initial state) is called the fair charge. Here, playing optimally means that you either do not play at all, or start playing and continue to play every period until the arm reaches a state with GI strictly smaller than $\gamma$, in which case you must quit. It is important that as long as the arm is at a state with GI strictly greater than $\gamma$ you continue playing. If you need to take a restroom break you must wait until the arm reaches a state with GI $\gamma$.
3. Continuing with a single arm, assume now that the casino announces a new policy: at every period, if the arm reaches a state with GI that is strictly smaller than the GI of all previous states, then the charge for playing the arm drops to the new GI. We call these new (random) charges the prevailing charges. Again, the casino always gets a nonnegative expected net payoff, and if you play optimally then the expected net payoff is zero. Here, playing optimally means that you either do not play at all, or start playing and continue to play forever. You can quit or take a bathroom break only at periods in which the prevailing charge equals the GI of the current state.
4. Consider now the multi-armed problem, and assume again that in order to play an arm you have to pay its current prevailing charge as defined in step 3. Then again, regardless of how you play, the casino gets a nonnegative expected net payoff (since by step 3 this is the case for every arm separately), and you can still get an expected net payoff of zero if you play optimally. Playing optimally means that you either do not play or start playing. If you start playing you can quit, take a break, or switch to another arm only in periods in which the prevailing charge of the arm you are currently playing equals the GI of its current state.
5. Forget for a moment about your profits and assume that what you care about is maximizing payments to the casino (I don’t mean net payoff, I mean just the charges that the casino receives from your playing). Since the sequence of prevailing charges of every arm is weakly decreasing, and since the discount factor makes the casino prefer higher payments early, the Gittins strategy (the one in which you play at each period the arm with the highest current GI, which by definition of the prevailing charge is also the arm with the highest current prevailing charge) is the one that maximizes the casino’s payments. In fact, this would be the case even if you knew the realization of the sequence of charges in advance.
6. The Gittins strategy is one of the optimal strategies from step 4. Therefore, its expected net payoff is zero.
7. Therefore, for every strategy $\sigma$,

   $$\text{Rewards from } \sigma \;\le\; \text{Charges from } \sigma \;\le\; \text{Charges from Gittins strategy}$$

   (the first inequality is step 4 and the second is step 5), and, by step 6,

   $$\text{Charges from Gittins strategy} = \text{Rewards from Gittins strategy}.$$
8. Therefore, the Gittins strategy attains the maximal possible expected total discounted rewards.
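The definition of the index in the first step and the prevailing charges of the third step can both be computed for a small arm. The sketch below is one possible numerical approach and not part of Weber’s proof: it assumes a finite state space, solves the "pay a charge per play, or quit for zero" problem by value iteration, and bisects on the charge until playing and not playing are both optimal. All function names, the number of sweeps and the tolerance are hypothetical choices.

```python
def stopping_values(rewards, transitions, charge, beta, sweeps=500):
    """Value of the single-arm game where each play costs `charge` and quitting pays 0.

    Iterates V(s) = max(0, r(s) - charge + beta * sum_t q(t|s) V(t)).
    """
    n = len(rewards)
    v = [0.0] * n
    for _ in range(sweeps):
        v = [max(0.0, rewards[s] - charge
                 + beta * sum(transitions[s][t] * v[t] for t in range(n)))
             for s in range(n)]
    return v


def gittins_index(rewards, transitions, s, beta, tol=1e-9):
    """Bisect on the charge until playing at state s is worth exactly zero."""
    n = len(rewards)
    lo, hi = min(rewards), max(rewards)  # the fair charge lies between these bounds
    while hi - lo > tol:
        charge = (lo + hi) / 2
        v = stopping_values(rewards, transitions, charge, beta)
        play = rewards[s] - charge + beta * sum(transitions[s][t] * v[t] for t in range(n))
        if play > 0:
            lo = charge  # playing is strictly better than quitting: charge too low
        else:
            hi = charge  # playing is not strictly better: charge is high enough
    return (lo + hi) / 2


def prevailing_charges(indices, path):
    """Prevailing charge in each period: the smallest index seen so far along `path`."""
    charges, lowest = [], float("inf")
    for s in path:
        lowest = min(lowest, indices[s])
        charges.append(lowest)
    return charges


# State 0 pays nothing and leads to state 1, which pays 1 forever.
rewards = [0.0, 1.0]
transitions = [[0.0, 1.0], [0.0, 1.0]]
indices = [gittins_index(rewards, transitions, s, beta=0.5) for s in range(2)]
charges = prevailing_charges(indices, path=[1, 0, 1])
```

In this example the index of state 0 is the discount factor itself: paying a charge in a period with no reward is only worthwhile because of the discounted stream of rewards that follows.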
Now, here is the question. Suppose that instead of arms we have dynamic optimization problems, each given by a state space, an action space, a transition function, and a payoff function. Let’s call them projects. The difference between a project and an arm is that when you decide to work on a project you also decide which action to take, and the current reward and next state depend on the current state and on your action. Now read the proof again with projects in mind. Every time I said “play the arm”, what I meant is “work on the project and choose the optimal action”. We can still define an “index”, as in the first step: the unique charge $\gamma$ such that, if you need to pay $\gamma$ every period you work on the project (using one of the actions), then both not working and working with some action are optimal. The conclusion is not true for the projects problem though. At which step does the argument break down?