CS 188: Artificial Intelligence
Spring 2006
Lecture 22: Reinforcement Learning II
4/13/2006
Dan Klein, UC Berkeley

Today
  - Reminder: P3 lab Friday, 2-4pm, 275 Soda
  - Reinforcement learning
      - Temporal-difference learning
      - Q-learning
      - Function approximation

Recap: Passive Learning
  - Learning about an unknown MDP
  - Simplified task
      - You don’t know the transitions T(s,a,s’)
      - You don’t know the rewards R(s)
      - You DO know the policy π(s)
      - Goal: learn the state values (and maybe the model)
  - Last time: try to learn T, R and then solve as a known MDP

Model-Free Learning
  - Big idea: why bother learning T?
      - Update each time we experience a transition
      - Frequent outcomes will contribute more updates (over time)
  - Temporal difference learning (TD)
      - Policy still fixed!
      - Move values toward value of whatever successor occurs

Example: Passive TD
  Two observed episodes, each step listed as state, reward, action:

  Episode 1              Episode 2
  (1,1)   -1  up         (1,1)   -1  up
  (1,2)   -1  up         (1,2)   -1  up
  (1,2)   -1  up         (1,3)   -1  right
  (1,3)   -1  right      (2,3)   -1  right
  (2,3)   -1  right      (3,3)   -1  right
  (3,3)   -1  right      (3,2)   -1  up
  (3,2)   -1  up         (4,2) -100
  (3,3)   -1  right
  (4,3) +100

  Take γ = 1, α = 0.1

(Greedy) Active Learning
  - In general, want to learn the optimal policy
  - Idea:
      - Learn an initial model of the environment
      - Solve for the optimal policy for this model (value or policy iteration)
      - Refine model through experience and repeat

Example: Greedy Active Learning
  - Imagine we find the lower path to the good exit first
  - Some states will never be visited following this policy from (1,1)
  - We’ll keep re-using this policy because following it never visits the regions of the model we need to learn the optimal policy

What Went Wrong?
  - Problem with following the optimal policy for the current model: never learn about better regions of the space
  - Fundamental tradeoff: exploration vs. exploitation
      - Exploration: must take actions with suboptimal estimates to discover new rewards and increase eventual utility
      - Exploitation: once the true optimal policy is learned, exploration reduces utility
      - Systems must explore in the beginning and exploit in the limit

Q-Functions
  - Alternate way to learn: utilities for state-action pairs rather than states
  - AKA Q-functions

Learning Q-Functions: MDPs
  - Just like Bellman updates for state values:
      - For fixed policy π: [equation missing in source]
      - For optimal policy: [equation missing in source]
  - Main advantage of Q-functions over values U is that you don’t need a model for learning or action selection!

Q-Learning
  - Model-free TD learning with Q-functions: [equation missing in source]

Example [DEMOS]

Exploration / Exploitation
  - Several schemes for forcing exploration
      - Simplest: random actions
      - Every time step, flip a coin
      - With probability ε, act randomly
      - With probability 1-ε, act according to current policy
  - Problems with random actions?
      - Will take a non-optimal long route to reduce the risk that stems from exploration actions!
      - Solution: lower ε over time

Exploration Functions
  - When to explore
      - Random actions: explore a fixed amount
      - Better idea: explore areas whose badness is not (yet) established
  - Exploration function
      - Takes a value estimate and a count, and returns an optimistic utility, e.g. [formula missing in source]

Function Approximation
  - Problem: too slow to learn each state’s utility one by one
  - Solution: what we learn about one state should generalize to similar states
      - Very much like supervised learning
      - If states are treated entirely independently, we can only learn on very small state spaces

Discretization
  - Can put states into buckets of various sizes
      - E.g. can have all angles between 0 and 5 degrees share the same Q estimate
      - Buckets too fine ⇒ takes a long time to learn
      - Buckets too coarse ⇒ learn suboptimal, often jerky control
  - Real systems that use discretization usually require clever bucketing schemes
      - Adaptive sizes
      - Tile coding
  [DEMOS]

Linear Value Functions
  - Another option: values are linear functions of features of states (or action-state pairs)
  - Good if you can describe states well using a few features (e.g. for game playing board evaluations)
  - Now we only have to learn a few weights rather than a value for each state
  [Figure: grid of example state values ranging from 0.60 to 0.95]

TD Updates for Linear Values
  - Can use TD learning with linear values
  - (Actually it’s just like the perceptron!)
  - Old Q-learning update: [equation missing in source]
  - Simply update weights of features in
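
The idea of running TD learning on a linear Q-function can be sketched in Python. The update formulas on the slide did not survive extraction, so this is only an illustration under stated assumptions: it uses the commonly taught form Q(s,a) = Σ_i w_i f_i(s,a) with the weight update w_i ← w_i + α [r + γ max_a' Q(s',a') − Q(s,a)] f_i(s,a). The function names (`q_value`, `linear_q_update`) and the feature names (`bias`, `dist`) are hypothetical and do not come from the lecture.

```python
from typing import Dict, Iterable

Features = Dict[str, float]  # hypothetical representation: feature name -> value


def q_value(weights: Dict[str, float], features: Features) -> float:
    """Linear Q estimate: the dot product of weights and features."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def linear_q_update(
    weights: Dict[str, float],
    features: Features,
    reward: float,
    next_features: Iterable[Features],
    alpha: float = 0.1,
    gamma: float = 1.0,
) -> float:
    """Apply one TD step to the weights in place and return the TD error.

    next_features holds one feature dict per action available in the
    successor state; an empty iterable is treated as a terminal successor
    whose value is 0 (an assumption of this sketch).
    """
    q_sa = q_value(weights, features)
    best_next = max((q_value(weights, f) for f in next_features), default=0.0)
    error = reward + gamma * best_next - q_sa
    # Each weight moves in proportion to its feature's activation,
    # which is the perceptron-like part of the update.
    for name, value in features.items():
        weights[name] = weights.get(name, 0.0) + alpha * error * value
    return error


# Usage: one update from all-zero weights on a transition into a terminal state.
weights: Dict[str, float] = {}
feats = {"bias": 1.0, "dist": 0.5}
td_error = linear_q_update(weights, feats, reward=10.0, next_features=[])
print(td_error, weights, q_value(weights, feats))
```

Only features that are active in the visited state-action pair have their weights changed, so one experience shifts the estimate for every other state that shares those features. That is the generalization effect the function approximation slides ask for.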
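
Returning to the passive TD example from earlier in the lecture, the two observed episodes can be replayed in code with γ = 1 and α = 0.1. This is a hedged sketch, not the lecture's own worked solution. It assumes the usual TD update U(s) ← U(s) + α [R(s) + γ U(s') − U(s)], that all estimates start at 0, and that an exit state's value is fixed to its exit reward. The names `passive_td`, `EPISODE_1`, and `EPISODE_2` are introduced here for illustration.

```python
from collections import defaultdict

# Each step is (state, reward received in that state); the last step is the exit.
EPISODE_1 = [((1, 1), -1), ((1, 2), -1), ((1, 2), -1), ((1, 3), -1), ((2, 3), -1),
             ((3, 3), -1), ((3, 2), -1), ((3, 3), -1), ((4, 3), 100)]
EPISODE_2 = [((1, 1), -1), ((1, 2), -1), ((1, 3), -1), ((2, 3), -1),
             ((3, 3), -1), ((3, 2), -1), ((4, 2), -100)]


def passive_td(episodes, alpha=0.1, gamma=1.0):
    """Estimate state values under the fixed policy from observed transitions."""
    U = defaultdict(float)  # assumption: every estimate starts at 0
    for episode in episodes:
        exit_state, exit_reward = episode[-1]
        U[exit_state] = float(exit_reward)  # assumption: exit value is its reward
        # Walk consecutive pairs (s, s') and nudge U(s) toward R(s) + gamma * U(s').
        for (s, r), (s_next, _) in zip(episode, episode[1:]):
            U[s] += alpha * (r + gamma * U[s_next] - U[s])
    return dict(U)


U = passive_td([EPISODE_1, EPISODE_2])
for state in sorted(U):
    print(state, round(U[state], 3))
```

Under these assumptions, after both episodes U(3,3) is about 8.72 and U(3,2) is about -10.2. Each estimate moves only a fraction α toward whichever successor actually occurred, so one bad outcome from (3,2) pulls its value down sharply but does not overwrite what was learned at (3,3).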