Berkeley COMPSCI 188 - Lecture 22: Reinforcement Learning II - D2064430


School name University of California, Berkeley

Course Compsci 188- Introduction to Artificial Intelligence

Pages 9

This preview shows page 1-2-3 out of 9 pages.


CS 188: Artificial Intelligence
Spring 2006
Lecture 22: Reinforcement Learning II
4/13/2006
Dan Klein – UC Berkeley

Today
- Reminder: P3 lab Friday, 2-4pm, 275 Soda
- Reinforcement learning
  - Temporal-difference learning
  - Q-learning
  - Function approximation

Recap: Passive Learning
- Learning about an unknown MDP
- Simplified task
  - You don’t know the transitions T(s,a,s’)
  - You don’t know the rewards R(s)
  - You DO know the policy π(s)
  - Goal: learn the state values (and maybe the model)
- Last time: try to learn T, R and then solve as a known MDP

Model-Free Learning
- Big idea: why bother learning T?
  - Update each time we experience a transition
  - Frequent outcomes will contribute more updates (over time)
- Temporal difference learning (TD)
  - Policy still fixed!
  - Move values toward value of whatever successor occurs

Example: Passive TD
Each entry is: state, reward, action taken.
  Episode 1          Episode 2
  (1,1) -1 up        (1,1) -1 up
  (1,2) -1 up        (1,2) -1 up
  (1,2) -1 up        (1,3) -1 right
  (1,3) -1 right     (2,3) -1 right
  (2,3) -1 right     (3,3) -1 right
  (3,3) -1 right     (3,2) -1 up
  (3,2) -1 up        (4,2) -100
  (3,3) -1 right
  (4,3) +100
Take γ = 1, α = 0.1

(Greedy) Active Learning
- In general, want to learn the optimal policy
- Idea:
  - Learn an initial model of the environment:
  - Solve for the optimal policy for this model (value or policy iteration)
  - Refine model through experience and repeat

Example: Greedy Active Learning
- Imagine we find the lower path to the good exit first
- Some states will never be visited following this policy from (1,1)
- We’ll keep re-using this policy because following it never collects the regions of the model we need to learn the optimal policy

What Went Wrong?
- Problem with following optimal policy for current model:
  - Never learn about better regions of the space
- Fundamental tradeoff: exploration vs. exploitation
  - Exploration: must take actions with suboptimal estimates to discover new rewards and increase eventual utility
  - Exploitation: once the true optimal policy is learned, exploration reduces utility
  - Systems must explore in the beginning and exploit in the limit

Q-Functions
- Alternate way to learn:
  - Utilities for state-action pairs rather than states
  - AKA Q-functions

Learning Q-Functions: MDPs
- Just like Bellman updates for state values:
  - For fixed policy π
  - For optimal policy
- Main advantage of Q-functions over values U is that you don’t need a model for learning or action selection!

Q-Learning
- Model free, TD learning with Q-functions:

Example
[DEMOS]

Exploration / Exploitation
- Several schemes for forcing exploration
  - Simplest: random actions
  - Every time step, flip a coin
  - With probability ε, act randomly
  - With probability 1-ε, act according to current policy
- Problems with random actions?
  - Will take a non-optimal long route to reduce risk which stems from exploration actions!
  - Solution: lower ε over time

Exploration Functions
- When to explore
  - Random actions: explore a fixed amount
  - Better idea: explore areas whose badness is not (yet) established
- Exploration function
  - Takes a value estimate and a count, and returns an optimistic utility, e.g.

Function Approximation
- Problem: too slow to learn each state’s utility one by one
- Solution: what we learn about one state should generalize to similar states
  - Very much like supervised learning
  - If states are treated entirely independently, we can only learn on very small state spaces

Discretization
- Can put states into buckets of various sizes
  - E.g. can have all angles between 0 and 5 degrees share the same Q estimate
  - Buckets too fine ⇒ takes a long time to learn
  - Buckets too coarse ⇒ learn suboptimal, often jerky control
- Real systems that use discretization usually require clever bucketing schemes
  - Adaptive sizes
  - Tile coding
[DEMOS]

Linear Value Functions
- Another option: values are linear functions of features of states (or action-state pairs)
- Good if you can describe states well using a few features (e.g. for game playing board evaluations)
- Now we only have to learn a few weights rather than a value for each state
[Figure: grid of state values ranging from 0.60 to 0.95]

TD Updates for Linear Values
- Can use TD learning with linear values
  - (Actually it’s just like the perceptron!)
  - Old Q-learning update:
  - Simply update weights of features in
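
Returning to the Q-Learning and Exploration / Exploitation slides: the preview does not preserve the update equation, so the following is only a minimal sketch of the standard tabular form, in which Q(s,a) is moved a fraction α toward the observed reward plus γ times the best Q-value at the successor state. The function names `epsilon_greedy` and `q_update`, the state labels "s0" and "s1", and the use of a per-transition reward argument are assumptions made for illustration and do not come from the lecture; γ = 1 and α = 0.1 are borrowed from the passive TD example.

```python
import random
from collections import defaultdict


def epsilon_greedy(Q, state, actions, epsilon, rng=random):
    """With probability epsilon act randomly; otherwise act greedily on Q.

    Hypothetical helper: Q maps (state, action) pairs to value estimates.
    """
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])


def q_update(Q, s, a, reward, s_next, actions,
             alpha=0.1, gamma=1.0, terminal=False):
    """One tabular Q-learning step (assumed standard form):

    Q(s,a) <- Q(s,a) + alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a))

    No transition model T is consulted; only the experienced sample is used.
    """
    best_next = 0.0 if terminal else max(Q[(s_next, b)] for b in actions)
    target = reward + gamma * best_next
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]


# Tiny illustration with made-up states; unseen Q-values default to 0.
actions = ["up", "right"]
Q = defaultdict(float)

first = q_update(Q, "s0", "right", reward=-1, s_next="s1", actions=actions)
Q[("s1", "up")] = 10.0  # pretend a good value was learned at the successor
second = q_update(Q, "s0", "right", reward=-1, s_next="s1", actions=actions)

# With epsilon = 0 the choice is purely greedy with respect to Q.
greedy_action = epsilon_greedy(Q, "s0", actions, epsilon=0.0)
print(first, second, greedy_action)
```

Lowering ε over time, as the slide suggests, would amount to passing a decreasing `epsilon` on successive calls; the schedule itself is left open here because the lecture does not specify one.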
