Monte-Carlo Planning: Introduction and Bandit Basics
Alan Fern

Slide 2: Large Worlds
- So far we have considered basic model-based planning algorithms.
- Model-based planning assumes that an MDP model is available.
- The methods we have learned so far are at least polynomial-time in the number of states and actions.
- They are therefore difficult to apply to large state and action spaces (though this is a rich research area).
- We will consider various methods for overcoming this issue.

Slide 3: Approaches for Large Worlds
Planning with compact MDP representations:
1. Define a language for compactly describing an MDP.
   - The MDP is exponentially larger than its description.
   - E.g., via Dynamic Bayesian Networks.
2. Design a planning algorithm that works directly with that language.
- Scalability is still an issue.
- It can be difficult to encode the problem you care about in a given language.
- Studied in the last part of the course.

Slide 4: Approaches for Large Worlds
Reinforcement learning with function approximation:
1. Have a learning agent interact directly with the environment.
2. Learn a compact description of the policy or value function.
- Often works quite well for large problems.
- Does not fully exploit a simulator of the environment when one is available.
- We will study reinforcement learning later in the course.

Slide 5: Approaches for Large Worlds: Monte-Carlo Planning
- Often a simulator of a planning domain is available, or can be learned from data.
[Images: Klondike Solitaire; Fire & Emergency Response]

Slide 6: Large Worlds: Monte-Carlo Approach
- Often a simulator of a planning domain is available, or can be learned from data.
- Monte-Carlo planning: compute a good policy for an MDP by interacting with an MDP simulator.
[Figure: the planner sends actions to a world simulator (standing in for the real world) and receives back the next state and reward.]

Slide 7: Example Domains with Simulators
- Traffic simulators
- Robotics simulators
- Military campaign simulators
- Computer network simulators
- Emergency planning simulators (large-scale disaster and municipal)
- Sports domains (Madden Football)
- Board games / video games (Go, RTS)
In many cases Monte-Carlo techniques yield state-of-the-art performance, even in domains where model-based planners are applicable.
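Slide 6's idea of computing a policy by interacting with a simulator can be made concrete with a small sketch. Everything below (the two-state toy simulator, the function names, and the episode-averaging evaluator) is an illustrative assumption, not part of the slides; it only shows the kind of interaction loop that Monte-Carlo planners build on.

```python
import random

# Hypothetical two-state toy simulator (names and dynamics are illustrative).
def sample_initial_state():
    """Stochastic initial-state function I: here, always start in state 0."""
    return 0

def simulate_step(state, action):
    """Stochastic transition/reward simulator: returns (next_state, reward)."""
    next_state = random.choice([0, 1])
    reward = 1.0 if (state == 1 and action == "stay") else 0.0
    return next_state, reward

def monte_carlo_policy_value(policy, horizon=20, num_episodes=1000):
    """Estimate a policy's expected total reward by averaging simulated episodes."""
    total = 0.0
    for _ in range(num_episodes):
        state = sample_initial_state()
        for _ in range(horizon):
            action = policy(state)
            state, reward = simulate_step(state, action)
            total += reward
    return total / num_episodes

# Example: evaluate the (fixed) policy that always chooses "stay".
print(monte_carlo_policy_value(lambda state: "stay"))
```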
Slide 8: MDP: Simulation-Based Representation
A simulation-based representation gives S, A, R, T, I:
- A finite state set S (|S| = n, generally very large).
- A finite action set A (|A| = m, assumed to be of reasonable size).
- A stochastic, real-valued, bounded reward function R(s,a) = r.
  - Stochastically returns a reward r given inputs s and a.
- A stochastic transition function T(s,a) = s' (i.e., a simulator).
  - Stochastically returns a state s' given inputs s and a.
  - The probability of returning s' is dictated by Pr(s' | s,a) of the MDP.
- A stochastic initial-state function I.
  - Stochastically returns a state according to the initial state distribution.
These stochastic functions can be implemented in any programming language!

Slide 9: Outline
- Uniform Monte-Carlo
  - Single-state case (UniformBandit)
  - Policy rollout
  - Approximate Policy Iteration
  - Sparse Sampling
- Adaptive Monte-Carlo
  - Single-state case (UCB Bandit)
  - UCT Monte-Carlo Tree Search

Slide 10: Single State Monte-Carlo Planning
- Suppose the MDP has a single state s and k actions.
- Goal: figure out which action has the best expected reward.
- We can sample rewards of actions using calls to the simulator.
- Sampling action a_i is like pulling a slot machine arm with random payoff function R(s, a_i).
[Figure: state s with arms a1, a2, ..., ak and stochastic payoffs R(s,a1), R(s,a2), ..., R(s,ak).]
This is the Multi-Armed Bandit Problem.

Slide 11: PAC Bandit Objective: Informal
Probably Approximately Correct (PAC):
- Select an arm that probably (with high probability) has approximately the best expected reward.
- Use as few simulator calls (arm pulls) as possible to guarantee this.

Slide 12: PAC Bandit Algorithms
Let k be the number of arms, let Rmax be an upper bound on reward, and let R* = max_i E[R(s, a_i)] (i.e., R* is the expected reward of the best arm).

Definition (Efficient PAC Bandit Algorithm): An algorithm ALG is an efficient PAC bandit algorithm iff for any multi-armed bandit problem, for any 0 < ε < 1 and any 0 < δ < 1, ALG pulls a number of arms that is polynomial in 1/ε, 1/δ, k, and Rmax, and returns an arm index j such that with probability at least 1 - δ, E[R(s, a_j)] ≥ R* - ε.

Such an algorithm is efficient in terms of the number of arm pulls, and is probably (with probability 1 - δ) approximately correct (it picks an arm with expected reward within ε of optimal).

Slide 13: UniformBandit Algorithm
(NaiveBandit from [Even-Dar et al., 2002])
1. Pull each arm w times (uniform pulling).
2. Return the arm with the best average reward.
[Figure: arms a1, ..., ak with sampled rewards r11 r12 ... r1w, r21 r22 ... r2w, ..., rk1 rk2 ... rkw.]
Can we make this an efficient PAC bandit algorithm?

Slide 14: Aside: Additive Chernoff Bound
- Let R be a random variable with maximum absolute value Z, and let r_i, i = 1, ..., w, be i.i.d. samples of R.
- The Chernoff bound bounds the probability that the average of the r_i is far from E[R].

Chernoff bound:
  Pr( E[R] - (1/w) Σ_{i=1..w} r_i ≥ ε ) ≤ exp( -w (ε/Z)² )

Equivalently, with probability at least 1 - δ:
  E[R] - (1/w) Σ_{i=1..w} r_i ≤ Z √( (1/w) ln(1/δ) )

Slide 15: Aside: Coin Flip Example
- Suppose we have a coin with probability of heads equal to p.
- Let X be a random variable that is 1 when a flip comes up heads and 0 otherwise, so E[X] = p; the Chernoff bound then tells us how many flips are needed to estimate p accurately with high confidence.
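As a worked instance of the equivalent form of the bound above (the specific numbers here are illustrative, not from the slides): for a 0/1 coin flip we can take Z = 1, so requiring Z √( (1/w) ln(1/δ) ) ≤ ε gives w ≥ (Z/ε)² ln(1/δ). To estimate p to within ε = 0.1 with probability at least 1 - δ = 0.95, this needs w ≥ 100 · ln(20) ≈ 300 flips.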
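To make the UniformBandit procedure from slide 13 concrete, here is a minimal Python sketch. The pull(arm) simulator call, the Bernoulli test arms, and the particular per-arm sample count w (a Chernoff-style choice with a union bound over arms; its constants are an assumption here, not the exact bound the slides derive) are all illustrative.

```python
import math
import random

def uniform_bandit(pull, k, epsilon, delta, r_max):
    """UniformBandit / NaiveBandit sketch: pull every arm w times and return
    the index of the arm with the best average reward.
    `pull(arm)` is a hypothetical simulator call returning one reward sample."""
    # Chernoff-style per-arm sample count (illustrative constants):
    # make each arm's average accurate to within epsilon/2 with high probability,
    # then union-bound over the k arms.
    w = math.ceil((2 * r_max / epsilon) ** 2 * math.log(2 * k / delta))
    averages = [sum(pull(arm) for _ in range(w)) / w for arm in range(k)]
    return max(range(k), key=lambda arm: averages[arm])

# Example: three Bernoulli arms with success probabilities 0.2, 0.5, 0.8.
probs = [0.2, 0.5, 0.8]
best = uniform_bandit(lambda arm: float(random.random() < probs[arm]),
                      k=3, epsilon=0.1, delta=0.05, r_max=1.0)
print("selected arm:", best)
```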