
OSU CS 533 - Introduction and Bandit Basics

Contents (slide titles): Monte-Carlo Planning: Introduction and Bandit Basics; Large Worlds; Approaches for Large Worlds; Approaches for Large Worlds: Monte-Carlo Planning; Large Worlds: Monte-Carlo Approach; Example Domains with Simulators; MDP: Simulation-Based Representation; Outline; Single State Monte-Carlo Planning; PAC Bandit Objective: Informal; PAC Bandit Algorithms; UniformBandit Algorithm; Aside: Additive Chernoff Bound; Aside: Coin Flip Example; UniformBandit PAC Bound; # Simulator Calls for UniformBandit; Policy Improvement via Monte-Carlo; Recall: Policy Improvement Theorem; Policy Improvement via Bandits; Q-value Estimation; Policy Rollout Algorithm; Policy Rollout: # of Simulator Calls; Policy Rollout: Quality; Multi-Stage Rollout; Rollout Summary; Example: Rollout for Solitaire [Yan et al., NIPS'04]; Approximate Policy Iteration: Main Idea; Return to Policy Iteration; Approximate Policy Iteration; Generating Rollout Trajectories; API for Inverted Pendulum; API for Stacking Blocks; Experimental Results; Summary of API; Another Useful Technique: Policy Switching; Policy Switching; Policy Switching: # of Simulator Calls; Policy Switching: Quality; Policy Switching Summary

Slide 1: Monte-Carlo Planning: Introduction and Bandit Basics
Alan Fern

Slide 2: Large Worlds
- We have considered basic model-based planning algorithms
- Model-based planning assumes the MDP model is available
- The methods we have learned so far are at least polynomial-time in the number of states and actions
- They are difficult to apply to large state and action spaces (though this is a rich research area)
- We will consider various methods for overcoming this issue

Slide 3: Approaches for Large Worlds
Planning with compact MDP representations:
1. Define a language for compactly describing an MDP
   - The MDP is exponentially larger than its description
   - e.g., via Dynamic Bayesian Networks
2. Design a planning algorithm that works directly with that language
- Scalability is still an issue
- It can be difficult to encode the problem you care about in a given language
- Studied in the last part of the course

Slide 4: Approaches for Large Worlds
Reinforcement learning with function approximation:
1. Have a learning agent interact directly with the environment
2. Learn a compact description of the policy or value function
- Often works quite well for large problems
- Doesn't fully exploit a simulator of the environment when one is available
- We will study reinforcement learning later in the course

Slide 5: Approaches for Large Worlds: Monte-Carlo Planning
- Often a simulator of a planning domain is available, or can be learned from data
- Example domains: Klondike Solitaire, Fire & Emergency Response

Slide 6: Large Worlds: Monte-Carlo Approach
- Often a simulator of a planning domain is available, or can be learned from data
- Monte-Carlo Planning: compute a good policy for an MDP by interacting with an MDP simulator (see the sketch following Slide 7)
- (Diagram: the planner sends an action to a world simulator / real world and receives a state + reward in return.)

Slide 7: Example Domains with Simulators
- Traffic simulators
- Robotics simulators
- Military campaign simulators
- Computer network simulators
- Emergency planning simulators (large-scale disaster and municipal)
- Sports domains (Madden Football)
- Board games / video games (Go / RTS)
In many cases Monte-Carlo techniques yield state-of-the-art performance, even in domains where model-based planners are applicable.
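Slides 5-7 frame Monte-Carlo planning as working only through a simulator: the planner sends an action and gets back a next state and a reward. The snippet below is a minimal illustrative sketch of that interaction loop, not code from the slides; the `ToySimulator` class, its `step` method, and `random_policy` are hypothetical names invented for the example.

```python
import random

class ToySimulator:
    """Hypothetical stand-in for a domain simulator (traffic, solitaire, ...):
    it consumes an action and stochastically returns (next_state, reward)."""

    def __init__(self):
        self.state = 0  # toy initial state

    def step(self, action):
        # Toy stochastic dynamics and reward, standing in for the real domain.
        next_state = self.state + action + random.choice([-1, 0, 1])
        reward = random.gauss(float(next_state), 1.0)
        self.state = next_state
        return next_state, reward

def random_policy(state, actions):
    """Placeholder policy; Monte-Carlo planning aims to do better than this."""
    return random.choice(actions)

# Monte-Carlo planning setting: interact with the simulator only,
# never with an explicit transition/reward model.
sim = ToySimulator()
actions = [0, 1, 2]
state, total_reward = sim.state, 0.0
for t in range(10):
    a = random_policy(state, actions)
    state, r = sim.step(a)
    total_reward += r
print("return over 10 simulated steps:", total_reward)
```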
Slide 8: MDP: Simulation-Based Representation
A simulation-based representation gives S, A, R, T, I:
- finite state set S (|S| = n is generally very large)
- finite action set A (|A| = m, assumed to be of reasonable size)
- stochastic, real-valued, bounded reward function R(s,a) = r
  - stochastically returns a reward r given inputs s and a
- stochastic transition function T(s,a) = s' (i.e. a simulator)
  - stochastically returns a state s' given inputs s and a
  - the probability of returning s' is dictated by Pr(s' | s,a) of the MDP
- stochastic initial state function I
  - stochastically returns a state according to an initial state distribution
These stochastic functions can be implemented in any language!

Slide 9: Outline
- Uniform Monte-Carlo
  - Single State Case (Uniform Bandit)
  - Policy rollout
  - Approximate Policy Iteration
  - Sparse Sampling
- Adaptive Monte-Carlo
  - Single State Case (UCB Bandit)
  - UCT Monte-Carlo Tree Search

Slide 10: Single State Monte-Carlo Planning
- Suppose the MDP has a single state s and k actions
- Goal: figure out which action has the best expected reward
- We can sample rewards of actions using calls to the simulator
- Sampling action a_i is like pulling a slot machine arm with random payoff function R(s, a_i)
- This is the Multi-Armed Bandit Problem (arms a_1, a_2, ..., a_k with payoffs R(s,a_1), R(s,a_2), ..., R(s,a_k))

Slide 11: PAC Bandit Objective: Informal
- Probably Approximately Correct (PAC): select an arm that probably (with high probability) has approximately the best expected reward
- Use as few simulator calls (arm pulls) as possible to guarantee this

Slide 12: PAC Bandit Algorithms
Let k be the number of arms, Rmax an upper bound on reward, and R^* = \max_i E[R(s,a_i)] (i.e. R^* is the value of the best arm in expectation).

Definition (Efficient PAC Bandit Algorithm): An algorithm ALG is an efficient PAC bandit algorithm iff for any multi-armed bandit problem, for any 0 < \epsilon < 1 and any 0 < \delta < 1, ALG pulls a number of arms that is polynomial in 1/\epsilon, 1/\delta, k, and Rmax, and returns an arm index j such that with probability at least 1 - \delta,
    E[R(s,a_j)] \ge R^* - \epsilon.

Such an algorithm is efficient in terms of the number of arm pulls, and is probably (with probability 1 - \delta) approximately correct (it picks an arm with expected reward within \epsilon of optimal).

Slide 13: UniformBandit Algorithm
NaiveBandit from [Even-Dar et al., 2002]:
1. Pull each arm w times (uniform pulling).
2. Return the arm with the best average reward.
Can we make this an efficient PAC bandit algorithm? (See the sketch following Slide 15.)
(Diagram: arms a_1, ..., a_k with sampled rewards r_11, r_12, ..., r_1w; r_21, r_22, ..., r_2w; ...; r_k1, r_k2, ..., r_kw.)

Slide 14: Aside: Additive Chernoff Bound
- Let R be a random variable with maximum absolute value Z, and let r_i, i = 1, ..., w, be i.i.d. samples of R.
- The Chernoff bound bounds the probability that the average of the r_i is far from E[R].

Chernoff Bound:
    \Pr\Big( E[R] - \tfrac{1}{w}\sum_{i=1}^{w} r_i \ge \epsilon \Big) \le \exp\big( -w (\epsilon/Z)^2 \big)

Equivalently, with probability at least 1 - \delta we have that
    E[R] - \tfrac{1}{w}\sum_{i=1}^{w} r_i \le Z \sqrt{ \tfrac{1}{w} \ln \tfrac{1}{\delta} }

Slide 15: Aside: Coin Flip Example
- Suppose we have a coin with probability of heads equal to p.
- Let X be a random ...
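To make Slides 13-14 concrete, here is a small illustrative sketch (not taken from the slides) of the UniformBandit / NaiveBandit procedure: pull each arm w times and return the arm with the best empirical average. The per-arm pull count w used below is one sufficient choice suggested by the additive Chernoff bound (in its two-sided form) together with a union bound over the k arms; the function name `uniform_bandit` and the toy Bernoulli arms are hypothetical.

```python
import math
import random

def uniform_bandit(arms, epsilon, delta, r_max):
    """UniformBandit / NaiveBandit sketch [Even-Dar et al., 2002]:
    pull every arm w times, return the index of the arm with the best
    average reward.

    `arms` is a list of zero-argument functions; calling arms[i]() simulates
    one pull of arm i, i.e. one stochastic sample of R(s, a_i).
    The setting of w below is an assumption for this sketch (Chernoff bound
    plus a union bound), not a formula quoted from the slides.
    """
    k = len(arms)
    # Choose w so each empirical mean is within epsilon/2 of its true mean
    # with probability at least 1 - delta/k.
    w = math.ceil((2.0 * r_max / epsilon) ** 2 * math.log(2.0 * k / delta))
    averages = []
    for pull in arms:
        total = sum(pull() for _ in range(w))  # w uniform pulls of this arm
        averages.append(total / w)
    best = max(range(k), key=lambda i: averages[i])
    return best, w

# Toy usage: three Bernoulli-reward arms with unknown means.
arms = [lambda: float(random.random() < 0.2),
        lambda: float(random.random() < 0.5),
        lambda: float(random.random() < 0.7)]
best_arm, pulls_per_arm = uniform_bandit(arms, epsilon=0.1, delta=0.05, r_max=1.0)
print("best arm:", best_arm, "pulls per arm:", pulls_per_arm)
```

With this choice of w, each empirical mean lies within \epsilon/2 of its true mean with probability at least 1 - \delta/k, so with probability at least 1 - \delta the returned arm's expected reward is within \epsilon of R^*, matching the PAC definition on Slide 12.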

