Monte-Carlo Planning II
Alan Fern

Outline
- Preliminaries: Markov Decision Processes
- What is Monte-Carlo Planning?
- Uniform Monte-Carlo
  - Single State Case (UniformBandit)
  - Policy rollout
  - Approximate Policy Iteration
  - Sparse Sampling
- Adaptive Monte-Carlo
  - Single State Case (UCB Bandit)
  - UCT Monte-Carlo Tree Search

Sparse Sampling
- Rollout does not guarantee optimality or near-optimality
- Neither does approximate policy iteration in general
- Can we develop Monte-Carlo methods that give us near-optimal policies?
  - With computation that does NOT depend on the number of states!
  - This was an open problem until the late 1990s
- In deterministic games and search problems it is common to build a look-ahead tree at a state to select the best action
  - Can we generalize this to general stochastic MDPs?
- Sparse Sampling is one such algorithm
  - Strong theoretical guarantees of near-optimality

Online Planning with Look-Ahead Trees
- At each state we encounter in the environment, build a look-ahead tree of depth h and use it to estimate the optimal Q-value of each action
  - Select the action with the highest Q-value estimate
- s = current state
- Repeat:
  - T = BuildLookAheadTree(s)   ;; sparse sampling or UCT
                                ;; tree provides Q-value estimates for the root actions
  - a = BestRootAction(T)       ;; action with the best Q-value
  - Execute action a in the environment
  - s = the resulting state

Sparse Sampling
- Again focus on finite horizons
  - Arbitrarily good approximation for a large enough horizon h
- h-horizon optimal Q-function: the value of taking a in s and then following π* for h-1 steps
  - Q*(s,a,h) = E[R(s,a) + β V*(T(s,a),h-1)]
- Key identities (Bellman's equations):
  - V*(s,h) = max_a Q*(s,a,h)
  - π*(x) = argmax_a Q*(x,a,h)
- Sparse sampling estimates Q-values by building a sparse expectimax tree

Sparse Sampling: Two Views
- Will present two views of the algorithm
  - The first is perhaps easier to digest
  - The second is more generalizable and can leverage advances in bandit algorithms
- 1. Approximation to the full expectimax tree
- 2. Recursive bandit algorithm

Expectimax Tree
- Key definitions:
  - V*(s,h) = max_a Q*(s,a,h)
  - Q*(s,a,h) = E[R(s,a) + β V*(T(s,a),h-1)]
- Expand the definitions recursively to compute V*(s,h):

    V*(s,h) = max_a1 Q*(s,a1,h)
            = max_a1 E[R(s,a1) + β V*(T(s,a1),h-1)]
            = max_a1 E[R(s,a1) + β max_a2 E[R(T(s,a1),a2) + β V*(T(T(s,a1),a2),h-2)]]
            = ...

- Can view this expansion as an expectimax tree
  - Each expectation is really a weighted sum over states

Exact Expectimax Tree for V*(s,H)
[Figure: expectimax tree for V*(s,H) and Q*(s,a,H), alternating max nodes (k actions) and expectation nodes (n states) down to horizon H, giving (kn)^H leaves]
- Compute the root V* and Q* via a recursive procedure
- The cost depends on the size of the state space. Bad!

Sparse Sampling Tree
[Figure: the same tree with each expectation node replaced by an average over a fixed number of sampled next states (the sampling width), expanded only to a fixed sampling depth]
- Replace each expectation with an average over w samples
  - w will typically be much smaller than n
- With sampling width w the tree has (kw)^H leaves instead of (kn)^H

Sparse Sampling [Kearns et al., 2002]
- The Sparse Sampling algorithm computes the root value via depth-first expansion
- It returns a value estimate V*(s,h) of state s and an estimated optimal action a* (a runnable sketch of this recursion follows below)

  SparseSampleTree(s, h, w)
    For each action a available in s:
      Q*(s,a,h) = 0
      For i = 1 to w:
        Simulate taking a in s, resulting in state si and reward ri
        [V*(si,h-1), a*] = SparseSampleTree(si, h-1, w)
        Q*(s,a,h) = Q*(s,a,h) + ri + β V*(si,h-1)
      Q*(s,a,h) = Q*(s,a,h) / w     ;; estimate of Q*(s,a,h)
    V*(s,h) = max_a Q*(s,a,h)       ;; estimate of V*(s,h)
    a* = argmax_a Q*(s,a,h)
    Return [V*(s,h), a*]
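As a concrete illustration of the recursion above, here is a minimal Python sketch of Sparse Sampling. It assumes a generative model simulate(s, a) that returns a sampled (next_state, reward) pair and a function actions(s) listing the available actions; these names are illustrative and not part of the original pseudocode.

# Minimal sketch of Sparse Sampling.
# Assumptions: `simulate(s, a)` is a generative model returning a sampled
# (next_state, reward) pair, and `actions(s)` returns the actions available
# in s; both names are illustrative.
def sparse_sample_tree(s, h, w, actions, simulate, beta=1.0):
    """Return (estimated V*(s,h), estimated best action)."""
    if h == 0:                      # base case: V*(s, 0) = 0
        return 0.0, None
    q = {}                          # Q*(s, a, h) estimates
    for a in actions(s):
        total = 0.0
        for _ in range(w):          # sampling width: w simulated outcomes per action
            s_next, r = simulate(s, a)
            v_next, _ = sparse_sample_tree(s_next, h - 1, w, actions, simulate, beta)
            total += r + beta * v_next
        q[a] = total / w            # average over the w samples
    best = max(q, key=q.get)        # argmax action
    return q[best], best

Used online, one would call sparse_sample_tree(current_state, h, w, actions, simulate) at each step and execute the returned action, as in the planning loop above. The number of simulator calls is on the order of (kw)^h, independent of the number of states.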
Sparse Sampling (Cont'd)
- For a given desired accuracy, how large should the sampling width and depth be?
  - Answered by Kearns, Mansour, and Ng (1999)
- Good news: the analysis gives values of w and H that achieve a policy arbitrarily close to optimal
  - The values are independent of the state-space size!
  - This was the first near-optimal general MDP planning algorithm whose runtime did not depend on the size of the state space
- Bad news: the theoretical values are typically still intractably large, and also exponential in H
  - Exponential in H is the best we can do in general
- In practice: use a small H and a heuristic evaluation at the leaves

Sparse Sampling: Two Views (recap)
- 1. Approximation to the full expectimax tree (just covered)
- 2. Recursive bandit algorithm (covered next; more generalizable, and can leverage advances in bandit algorithms)

Bandit View of Expectimax Tree
[Figure: the same expectimax tree for V*(s,H) and Q*(s,a,H), alternating max and expectation nodes down to horizon H, with (kn)^H leaves]
- Each max node in the tree is just a bandit problem
- I.e., we must choose the action with the highest Q*(s,a,h); approximate this choice with a bandit algorithm

Bandit View Assuming we Know V*
[Figure: state s as a k-armed bandit with arms a1, ..., ak; pulling arm ai draws one sample of SimQ*(s,ai,h)]
- Simulator for one arm pull:

    SimQ*(s,a,h)
      s' = T(s,a)
      r = R(s,a)
      Return r + β V*(s',h-1)

  i.e., SimQ*(s,ai,h) = R(s,ai) + β V*(T(s,ai),h-1)
- The expected value of SimQ*(s,a,h) is Q*(s,a,h)
- Use UniformBandit to select an approximately optimal action

But we don't know V*
- To compute SimQ*(s,a,h) we need V*(s',h-1) for any s'
- Use the recursive identity (Bellman's equation): V*(s,h-1) = max_a Q*(s,a,h-1)
- Idea: recursively estimate V*(s,h-1) by running an (h-1)-horizon bandit based on SimQ* (see the sketch below)
  - The bandit returns the estimated value of its best action, rather than just returning the best action
- Base case: V*(s,0) = 0
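A minimal sketch of this recursive bandit view, under the same illustrative assumptions as before (simulate(s, a) returns a sampled (next_state, reward) pair, actions(s) lists the arms). The arm-selection rule is left as a parameter, defaulting to uniform round-robin pulls, so that an adaptive rule could be swapped in later:

# Recursive bandit sketch: at horizon h, treat the actions in s as bandit arms.
# Each pull of arm a draws one sample of SimQ*(s,a,h), where the value of the
# sampled next state is itself estimated by a recursive (h-1)-horizon bandit.
# `simulate` and `actions` are assumed helpers, as in the previous sketch.
def bandit_value(s, h, num_pulls, actions, simulate, beta=1.0, select_arm=None):
    """Return (estimated V*(s,h), estimated best action)."""
    if h == 0:                               # base case: V*(s, 0) = 0
        return 0.0, None
    arms = list(actions(s))                  # assumes num_pulls >= len(arms)
    totals = {a: 0.0 for a in arms}
    counts = {a: 0 for a in arms}
    for t in range(num_pulls):
        # UniformBandit: round-robin over the arms unless another rule is given.
        a = arms[t % len(arms)] if select_arm is None else select_arm(totals, counts, t)
        s_next, r = simulate(s, a)           # one sample of SimQ*(s, a, h)
        v_next, _ = bandit_value(s_next, h - 1, num_pulls,
                                 actions, simulate, beta, select_arm)
        totals[a] += r + beta * v_next
        counts[a] += 1
    means = {a: totals[a] / counts[a] for a in arms if counts[a] > 0}
    best = max(means, key=means.get)
    return means[best], best                 # value of the best arm, not just the arm

With uniform pulls and num_pulls equal to w times the number of actions, this is the same recursion as the sparse sampling sketch above; the bandit view simply makes the arm-selection rule explicit so that a different bandit algorithm can be plugged in.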