CS 287: Advanced Robotics
Fall 2009
Lecture 15: LSTD, LSPI, RLSTD, imitation learning
Pieter Abbeel, UC Berkeley EECS

TD(0) with linear function approximation guarantees

- Stochastic approximation of the following operations:
  - Back-up:
      (T^\pi V)(s) = \sum_{s'} T(s, \pi(s), s') [R(s, \pi(s), s') + \gamma V(s')]
  - Weighted linear regression:
      \min_\theta \sum_s D(s) ((T^\pi V)(s) - \theta^\top \phi(s))^2
- Batch version (for large state spaces):
  - Let {(s, a, s')} have been sampled according to D.
  - Iterate:
    - Back-up for each sampled (s, a, s'):
        V(s) \leftarrow R(s, a, s') + \gamma V(s') = R(s, a, s') + \gamma \theta^{(old)\top} \phi(s')
    - Perform the regression:
        \min_\theta \sum_{(s,a,s')} (V(s) - \theta^\top \phi(s))^2
        = \min_\theta \sum_{(s,a,s')} (R(s, a, s') + \gamma \theta^{(old)\top} \phi(s') - \theta^\top \phi(s))^2
- Iterate:
    \theta^{(new)} = \arg\min_\theta \sum_{(s,a,s')} (R(s, a, s') + \gamma \theta^{(old)\top} \phi(s') - \theta^\top \phi(s))^2
- Can we find the fixed point directly? Rewrite the least-squares problem in matrix notation:
    \theta^{(new)} = \arg\min_\theta \| R + \gamma \Phi' \theta^{(old)} - \Phi \theta \|_2^2
  with solution
    \theta^{(new)} = (\Phi^\top \Phi)^{-1} \Phi^\top (R + \gamma \Phi' \theta^{(old)})
- Fixed point? Set \theta^{(new)} = \theta^{(old)} = \theta:
    \theta = (\Phi^\top \Phi)^{-1} \Phi^\top (R + \gamma \Phi' \theta)
    (\Phi^\top \Phi) \theta = \Phi^\top (R + \gamma \Phi' \theta)
    (\Phi^\top \Phi - \gamma \Phi^\top \Phi') \theta = \Phi^\top R
    \theta = (\Phi^\top \Phi - \gamma \Phi^\top \Phi')^{-1} \Phi^\top R

LSTD(0)

- Collect state-action-state triples (s_i, a_i, s'_i) according to a policy \pi.
- Build the matrices:
    \Phi = [\phi(s_1)\ \phi(s_2)\ \cdots\ \phi(s_m)]^\top,
    \Phi' = [\phi(s'_1)\ \phi(s'_2)\ \cdots\ \phi(s'_m)]^\top,
    R = [R(s_1, a_1, s'_1)\ R(s_2, a_2, s'_2)\ \cdots\ R(s_m, a_m, s'_m)]^\top
- Find an approximation of the value function:
    V^\pi(s) \approx \theta^\top \phi(s) for \theta = (\Phi^\top \Phi - \gamma \Phi^\top \Phi')^{-1} \Phi^\top R
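The LSTD(0) construction can be sketched in a few lines of NumPy. The small chain MDP, its policy, and the hand-picked features below are assumptions made purely for illustration (they are not from the lecture); only the last two statements implement the \theta = (\Phi^\top\Phi - \gamma\Phi^\top\Phi')^{-1}\Phi^\top R formula above.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
n_states = 5

def step(s):
    # Assumed toy dynamics: move right w.p. 0.7, left w.p. 0.3;
    # reward 1 on reaching the right end, which then resets to state 0.
    s_next = min(s + 1, n_states - 1) if rng.random() < 0.7 else max(s - 1, 0)
    r = 1.0 if s_next == n_states - 1 else 0.0
    if s_next == n_states - 1:
        s_next = 0
    return s_next, r

def phi(s):
    # Hand-picked 2-dim features: normalized state index plus a bias term.
    return np.array([s / (n_states - 1), 1.0])

# Collect (s, a, s') triples under the fixed policy and build Phi, Phi', R.
Phi, Phi_next, R = [], [], []
s = 0
for _ in range(5000):
    s_next, r = step(s)
    Phi.append(phi(s)); Phi_next.append(phi(s_next)); R.append(r)
    s = s_next
Phi, Phi_next, R = np.array(Phi), np.array(Phi_next), np.array(R)

# LSTD(0): theta = (Phi^T Phi - gamma Phi^T Phi')^{-1} Phi^T R
A = Phi.T @ Phi - gamma * Phi.T @ Phi_next
b = Phi.T @ R
theta = np.linalg.solve(A, b)
print(theta)
```

Note that the samples are used only through the feature matrices; nothing in the solve step depends on the size of the state space, which is the point of the linear architecture.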
LSTD(0) in policy iteration

- Iterate:
  - Collect state-action-state triples (s_i, a_i, s'_i) according to the current policy \pi.
  - Use LSTD(0) to compute V^\pi.
- Tweaks:
  - Can re-use triples (s_i, a_i, s'_i) from previous policies as long as they are consistent with the current policy.
  - Can redo the derivation with Q-functions rather than V.
  - In the case of stochastic policies, can weight the contribution of a triple according to Prob(a_i | s_i) under the current policy.
- Doing all three results in "Least-Squares Policy Iteration" (Lagoudakis and Parr, 2003).

LSTD(0) --- batch vs. incremental updates

- Collect state-action-state triples (s_i, a_i, s'_i) according to a policy \pi and build the matrices:
    \Phi_m = [\phi(s_1)\ \cdots\ \phi(s_m)]^\top, \quad \Phi'_m = [\phi(s'_1)\ \cdots\ \phi(s'_m)]^\top, \quad R_m = [R(s_1, a_1, s'_1)\ \cdots\ R(s_m, a_m, s'_m)]^\top
- Find an approximation of the value function:
    V^\pi(s) \approx \theta_m^\top \phi(s) for \theta_m = (\Phi_m^\top (\Phi_m - \gamma \Phi'_m))^{-1} \Phi_m^\top R_m
- One more datapoint "m+1":
    \theta_{m+1} = (\Phi_m^\top (\Phi_m - \gamma \Phi'_m) + \phi_{m+1} (\phi_{m+1} - \gamma \phi'_{m+1})^\top)^{-1} (\Phi_m^\top R_m + \phi_{m+1} r_{m+1})
- Sherman-Morrison formula:
    (A + u v^\top)^{-1} = A^{-1} - \frac{A^{-1} u v^\top A^{-1}}{1 + v^\top A^{-1} u}

RLSTD

- Recursively compute the approximation of the value function by leveraging the Sherman-Morrison formula. Let
    A_m^{-1} = (\Phi_m^\top (\Phi_m - \gamma \Phi'_m))^{-1}, \quad b_m = \Phi_m^\top R_m, \quad \theta_m = A_m^{-1} b_m
- One more datapoint "m+1":
    A_{m+1}^{-1} = A_m^{-1} - \frac{A_m^{-1} \phi_{m+1} (\phi_{m+1} - \gamma \phi'_{m+1})^\top A_m^{-1}}{1 + (\phi_{m+1} - \gamma \phi'_{m+1})^\top A_m^{-1} \phi_{m+1}}
    b_{m+1} = b_m + \phi_{m+1} r_{m+1}
- Note: there exist orthogonal-matrix techniques to do the same thing in a numerically more stable fashion (essentially: keep track of the QR decomposition of A_m).

RLSTD: for non-linear function approximators?

- RLSTD with linear function approximation with a Gaussian prior on \theta gives a Kalman filter.
- Can be applied to the non-linear setting too: simply linearize the non-linear function approximator around the current estimate of \theta. This is not globally optimal, but likely still better than "naïve" gradient descent (adding a prior gives an extended Kalman filter).

Recursive Least Squares (1)-(3) [From: Boyd, EE263]
(Three background slides on recursive least squares, reproduced from Boyd's EE263 notes.)
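The RLSTD recursion can be checked numerically against the batch LSTD(0) formula. The sketch below uses random feature vectors and rewards (an assumption made purely for illustration): it maintains A_m^{-1} and b_m with the Sherman-Morrison update above and compares the resulting \theta with the one-shot batch solution.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, k, m = 0.95, 3, 200

# Illustrative data: random feature vectors phi(s_i), phi(s'_i) and rewards r_i.
Phi = rng.normal(size=(m, k))
Phi_next = rng.normal(size=(m, k))
R = rng.normal(size=m)

# Batch LSTD(0): theta = (Phi^T (Phi - gamma Phi'))^{-1} Phi^T R
theta_batch = np.linalg.solve(Phi.T @ (Phi - gamma * Phi_next), Phi.T @ R)

# RLSTD: initialize A^{-1}, b from the first n0 samples (so A is invertible),
# then fold in each remaining sample with a rank-one Sherman-Morrison update.
n0 = 20
A_inv = np.linalg.inv(Phi[:n0].T @ (Phi[:n0] - gamma * Phi_next[:n0]))
b = Phi[:n0].T @ R[:n0]
for i in range(n0, m):
    u = Phi[i]                         # phi_{m+1}
    v = Phi[i] - gamma * Phi_next[i]   # phi_{m+1} - gamma * phi'_{m+1}
    A_inv -= np.outer(A_inv @ u, v @ A_inv) / (1.0 + v @ A_inv @ u)
    b += Phi[i] * R[i]
theta_rlstd = A_inv @ b

print(np.max(np.abs(theta_batch - theta_rlstd)))  # small: the two agree
```

Each recursive step costs O(k^2) instead of the O(k^3) of re-solving from scratch, which is what makes the incremental form attractive as data streams in.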
TD methods recap

- Model-free RL: learn V, Q directly from experience:
  - TD(λ), SARSA(λ): on-policy updates.
  - Q-learning: off-policy updates.
- Large MDPs: include function approximation.
  - Some guarantees for linear function approximation.
- Batch version: no need to tweak various constants.
- The same solution can be obtained incrementally by using recursive updates! This is generally true for least-squares-type systems.

Applications of TD methods

- Backgammon.
- Standard RL testbeds (all in simulation): cartpole balancing, acrobot swing-up, gridworld (Assignment #2), bicycle riding, Tetris (Assignment #2).
- As part of actor-critic methods (= policy gradient + TD).
- Fine-tuning / learning some robotics tasks.
- Many financial institutions use some form of linear TD for pricing of options.

RL: our learning status

- Small MDPs: VI, PI, GPI, LP.
- Large MDPs:
  - Value iteration + function approximation: iterate Bellman back-up, project, ...
  - TD methods: TD, SARSA, Q-learning with function approximation; simplicity and limited storage can be a convenience.
  - LSTD, LSPI, RLSTD: built upon in and compared to in many current RL papers; a main current direction is feature selection.
- You should be able to read/understand many RL papers.
- Which important ideas are we missing (and will I try to cover between today and the next 3-5 lectures)?
  - Imitation learning: learn from observing an expert.
  - Linear programming with function approximation and constraint sampling: guarantees; generally applicable idea of constraint sampling.
  - Policy gradient, actor-critic (= TD + policy gradient in one): fine-tuning policies through running trials on a real system; robotic success stories.
  - Partial observability: POMDPs.
  - Hierarchical methods: incorporate your knowledge to enable scaling to larger systems.
  - Reward shaping: can we choose reward functions such as to enable faster learning?
  - Exploration vs. exploitation: how/when should we explore?
  - Stochastic approximation: basic intuition behind how/when sampled versions work.

Imitation learning

- If an expert is available, could use an expert trace s_1, a_1, s_2, a_2, s_3, a_3, ... to learn "something" from the expert.
- Behavioral cloning: use supervised learning to directly learn a policy \pi: S \to A.
  - No model of the system dynamics required.
  - No MDP / optimal control solution algorithm required.
- Inverse reinforcement learning: learn the reward function. Often most
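The behavioral-cloning idea above reduces imitation to ordinary supervised learning on the expert trace. The sketch below is a minimal illustration under assumed data: a made-up linear expert generates (state, action) pairs, and a linear policy is fit to them by least squares. Any regressor or classifier could stand in for the least-squares fit, and note that no dynamics model or MDP solver appears anywhere.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical expert trace: states s_t in R^4 and actions a_t in R,
# generated by a linear expert policy (unknown to the learner) plus noise.
w_expert = np.array([0.5, -1.0, 2.0, 0.3])
states = rng.normal(size=(1000, 4))                          # s_1, ..., s_T
actions = states @ w_expert + 0.01 * rng.normal(size=1000)   # a_t ~ pi*(s_t)

# Behavioral cloning: supervised regression from states to expert actions.
w_clone, *_ = np.linalg.lstsq(states, actions, rcond=None)

def policy(s):
    # Learned policy: directly maps a state to an action.
    return s @ w_clone

print(w_clone)  # close to w_expert
```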