Monte-Carlo Planning II
Alan Fern

Outline
- Preliminaries: Markov Decision Processes
- What is Monte-Carlo Planning?
- Uniform Monte-Carlo
  - Single State Case (UniformBandit)
  - Policy rollout
  - Approximate Policy Iteration
  - Sparse Sampling
- Adaptive Monte-Carlo
  - Single State Case (UCB Bandit)
  - UCT Monte-Carlo Tree Search

Sparse Sampling
- Rollout does not guarantee optimality or near-optimality
- Neither does approximate policy iteration in general
- Can we develop Monte-Carlo methods that give us near-optimal policies?
  - With computation that does NOT depend on the number of states!
  - This was an open problem until the late 1990s
- In deterministic games and search problems it is common to build a look-ahead tree at a state to select the best action
  - Can we generalize this to general stochastic MDPs?
- Sparse Sampling is one such algorithm
  - Strong theoretical guarantees of near-optimality

Online Planning with Look-Ahead Trees
- At each state we encounter in the environment, build a look-ahead tree of depth h and use it to estimate the optimal Q-value of each action
  - Select the action with the highest Q-value estimate
- s = current state
- Repeat:
  - T = BuildLookAheadTree(s)   ;; sparse sampling or UCT
                                ;; tree provides Q-value estimates for the root actions
  - a = BestRootAction(T)       ;; action with the best Q-value
  - Execute action a in the environment
  - s = the resulting state

Sparse Sampling
- Again focus on finite horizons
  - Arbitrarily good approximation for a large enough horizon h
- h-horizon optimal Q-function: the value of taking a in s and then following π* for h-1 steps
  - Q*(s,a,h) = E[R(s,a) + β V*(T(s,a),h-1)]
- Key identities (Bellman's equations):
  - V*(s,h) = max_a Q*(s,a,h)
  - π*(x) = argmax_a Q*(x,a,h)
- Sparse sampling estimates Q-values by building a sparse expectimax tree

Sparse Sampling: Two Views
- Will present two views of the algorithm
  - The first is perhaps easier to digest
  - The second is more generalizable and can leverage advances in bandit algorithms
- 1. Approximation to the full expectimax tree
- 2. Recursive bandit algorithm

Expectimax Tree
- Key definitions:
  - V*(s,h) = max_a Q*(s,a,h)
  - Q*(s,a,h) = E[R(s,a) + β V*(T(s,a),h-1)]
- Expand the definitions recursively to compute V*(s,h):

    V*(s,h) = max_a1 Q*(s,a1,h)
            = max_a1 E[R(s,a1) + β V*(T(s,a1),h-1)]
            = max_a1 E[R(s,a1) + β max_a2 E[R(T(s,a1),a2) + β V*(T(T(s,a1),a2),h-2)]]
            = ...

- Can view this expansion as an expectimax tree
  - Each expectation is really a weighted sum over states

Exact Expectimax Tree for V*(s,H)
[Figure: expectimax tree for V*(s,H) and Q*(s,a,H), alternating max nodes (k actions) and expectation nodes (n states) down to horizon H, giving (kn)^H leaves]
- Compute the root V* and Q* via a recursive procedure
- The cost depends on the size of the state space. Bad!

Sparse Sampling Tree
[Figure: the same tree with each expectation node replaced by an average over a fixed number of sampled next states (the sampling width), expanded only to a fixed sampling depth]
- Replace each expectation with an average over w samples
  - w will typically be much smaller than n
- With sampling width w the tree has (kw)^H leaves instead of (kn)^H

Sparse Sampling [Kearns et al., 2002]
- The Sparse Sampling algorithm computes the root value via depth-first expansion
- It returns a value estimate V*(s,h) of state s and an estimated optimal action a* (a runnable sketch of this recursion follows below)

  SparseSampleTree(s, h, w)
    For each action a available in s:
      Q*(s,a,h) = 0
      For i = 1 to w:
        Simulate taking a in s, resulting in state si and reward ri
        [V*(si,h-1), a*] = SparseSampleTree(si, h-1, w)
        Q*(s,a,h) = Q*(s,a,h) + ri + β V*(si,h-1)
      Q*(s,a,h) = Q*(s,a,h) / w     ;; estimate of Q*(s,a,h)
    V*(s,h) = max_a Q*(s,a,h)       ;; estimate of V*(s,h)
    a* = argmax_a Q*(s,a,h)
    Return [V*(s,h), a*]
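As a concrete illustration of the recursion above, here is a minimal Python sketch of Sparse Sampling. It assumes a generative model simulate(s, a) that returns a sampled (next_state, reward) pair and a function actions(s) listing the available actions; these names are illustrative and not part of the original pseudocode.

# Minimal sketch of Sparse Sampling.
# Assumptions: `simulate(s, a)` is a generative model returning a sampled
# (next_state, reward) pair, and `actions(s)` returns the actions available
# in s; both names are illustrative.
def sparse_sample_tree(s, h, w, actions, simulate, beta=1.0):
    """Return (estimated V*(s,h), estimated best action)."""
    if h == 0:                      # base case: V*(s, 0) = 0
        return 0.0, None
    q = {}                          # Q*(s, a, h) estimates
    for a in actions(s):
        total = 0.0
        for _ in range(w):          # sampling width: w simulated outcomes per action
            s_next, r = simulate(s, a)
            v_next, _ = sparse_sample_tree(s_next, h - 1, w, actions, simulate, beta)
            total += r + beta * v_next
        q[a] = total / w            # average over the w samples
    best = max(q, key=q.get)        # argmax action
    return q[best], best

Used online, one would call sparse_sample_tree(current_state, h, w, actions, simulate) at each step and execute the returned action, as in the planning loop above. The number of simulator calls is on the order of (kw)^h, independent of the number of states.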
Sparse Sampling (Cont'd)
- For a given desired accuracy, how large should the sampling width and depth be?
  - Answered by Kearns, Mansour, and Ng (1999)
- Good news: the analysis gives values of w and H that achieve a policy arbitrarily close to optimal
  - The values are independent of the state-space size!
  - This was the first near-optimal general MDP planning algorithm whose runtime did not depend on the size of the state space
- Bad news: the theoretical values are typically still intractably large, and also exponential in H
  - Exponential in H is the best we can do in general
- In practice: use a small H and a heuristic evaluation at the leaves

Sparse Sampling: Two Views (recap)
- 1. Approximation to the full expectimax tree (just covered)
- 2. Recursive bandit algorithm (covered next; more generalizable, and can leverage advances in bandit algorithms)

Bandit View of Expectimax Tree
[Figure: the same expectimax tree for V*(s,H) and Q*(s,a,H), alternating max and expectation nodes down to horizon H, with (kn)^H leaves]
- Each max node in the tree is just a bandit problem
- I.e., we must choose the action with the highest Q*(s,a,h); approximate this choice with a bandit algorithm

Bandit View Assuming we Know V*
[Figure: state s as a k-armed bandit with arms a1, ..., ak; pulling arm ai draws one sample of SimQ*(s,ai,h)]
- Simulator for one arm pull:

    SimQ*(s,a,h)
      s' = T(s,a)
      r = R(s,a)
      Return r + β V*(s',h-1)

  i.e., SimQ*(s,ai,h) = R(s,ai) + β V*(T(s,ai),h-1)
- The expected value of SimQ*(s,a,h) is Q*(s,a,h)
- Use UniformBandit to select an approximately optimal action

But we don't know V*
- To compute SimQ*(s,a,h) we need V*(s',h-1) for any s'
- Use the recursive identity (Bellman's equation): V*(s,h-1) = max_a Q*(s,a,h-1)
- Idea: recursively estimate V*(s,h-1) by running an (h-1)-horizon bandit based on SimQ* (see the sketch below)
  - The bandit returns the estimated value of its best action, rather than just returning the best action
- Base case: V*(s,0) = 0
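A minimal sketch of this recursive bandit view, under the same illustrative assumptions as before (simulate(s, a) returns a sampled (next_state, reward) pair, actions(s) lists the arms). The arm-selection rule is left as a parameter, defaulting to uniform round-robin pulls, so that an adaptive rule could be swapped in later:

# Recursive bandit sketch: at horizon h, treat the actions in s as bandit arms.
# Each pull of arm a draws one sample of SimQ*(s,a,h), where the value of the
# sampled next state is itself estimated by a recursive (h-1)-horizon bandit.
# `simulate` and `actions` are assumed helpers, as in the previous sketch.
def bandit_value(s, h, num_pulls, actions, simulate, beta=1.0, select_arm=None):
    """Return (estimated V*(s,h), estimated best action)."""
    if h == 0:                               # base case: V*(s, 0) = 0
        return 0.0, None
    arms = list(actions(s))                  # assumes num_pulls >= len(arms)
    totals = {a: 0.0 for a in arms}
    counts = {a: 0 for a in arms}
    for t in range(num_pulls):
        # UniformBandit: round-robin over the arms unless another rule is given.
        a = arms[t % len(arms)] if select_arm is None else select_arm(totals, counts, t)
        s_next, r = simulate(s, a)           # one sample of SimQ*(s, a, h)
        v_next, _ = bandit_value(s_next, h - 1, num_pulls,
                                 actions, simulate, beta, select_arm)
        totals[a] += r + beta * v_next
        counts[a] += 1
    means = {a: totals[a] / counts[a] for a in arms if counts[a] > 0}
    best = max(means, key=means.get)
    return means[best], best                 # value of the best arm, not just the arm

With uniform pulls and num_pulls equal to w times the number of actions, this is the same recursion as the sparse sampling sketch above; the bandit view simply makes the arm-selection rule explicit so that a different bandit algorithm can be plugged in.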