CS 287: Advanced Robotics
Fall 2009
Lecture 15: LSTD, LSPI, RLSTD, imitation learning
Pieter Abbeel, UC Berkeley EECS

TD(0) with linear function approximation guarantees

- Stochastic approximation of the following operations:
  - Back-up:
      (T^\pi V)(s) = \sum_{s'} T(s, \pi(s), s') [R(s, \pi(s), s') + \gamma V(s')]
  - Weighted linear regression:
      \min_\theta \sum_s D(s) ((T^\pi V)(s) - \theta^\top \phi(s))^2
- Batch version (for large state spaces):
  - Let {(s, a, s')} have been sampled according to D.
  - Iterate:
    - Back-up for each sampled (s, a, s'):
        V(s) \leftarrow R(s, a, s') + \gamma V(s') = R(s, a, s') + \gamma \theta^{(old)\top} \phi(s')
    - Perform the regression:
        \min_\theta \sum_{(s,a,s')} (V(s) - \theta^\top \phi(s))^2
        = \min_\theta \sum_{(s,a,s')} (R(s, a, s') + \gamma \theta^{(old)\top} \phi(s') - \theta^\top \phi(s))^2
- Iterate:
    \theta^{(new)} = \arg\min_\theta \sum_{(s,a,s')} (R(s, a, s') + \gamma \theta^{(old)\top} \phi(s') - \theta^\top \phi(s))^2
- Can we find the fixed point directly? Rewrite the least-squares problem in matrix notation:
    \theta^{(new)} = \arg\min_\theta \| R + \gamma \Phi' \theta^{(old)} - \Phi \theta \|_2^2
  with solution
    \theta^{(new)} = (\Phi^\top \Phi)^{-1} \Phi^\top (R + \gamma \Phi' \theta^{(old)})
- Fixed point? Set \theta^{(new)} = \theta^{(old)} = \theta:
    \theta = (\Phi^\top \Phi)^{-1} \Phi^\top (R + \gamma \Phi' \theta)
    (\Phi^\top \Phi) \theta = \Phi^\top (R + \gamma \Phi' \theta)
    (\Phi^\top \Phi - \gamma \Phi^\top \Phi') \theta = \Phi^\top R
    \theta = (\Phi^\top \Phi - \gamma \Phi^\top \Phi')^{-1} \Phi^\top R

LSTD(0)

- Collect state-action-state triples (s_i, a_i, s'_i) according to a policy \pi.
- Build the matrices:
    \Phi = [\phi(s_1)\ \phi(s_2)\ \cdots\ \phi(s_m)]^\top,
    \Phi' = [\phi(s'_1)\ \phi(s'_2)\ \cdots\ \phi(s'_m)]^\top,
    R = [R(s_1, a_1, s'_1)\ R(s_2, a_2, s'_2)\ \cdots\ R(s_m, a_m, s'_m)]^\top
- Find an approximation of the value function:
    V^\pi(s) \approx \theta^\top \phi(s) for \theta = (\Phi^\top \Phi - \gamma \Phi^\top \Phi')^{-1} \Phi^\top R
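The LSTD(0) construction can be sketched in a few lines of NumPy. The small chain MDP, its policy, and the hand-picked features below are assumptions made purely for illustration (they are not from the lecture); only the last two statements implement the \theta = (\Phi^\top\Phi - \gamma\Phi^\top\Phi')^{-1}\Phi^\top R formula above.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
n_states = 5

def step(s):
    # Assumed toy dynamics: move right w.p. 0.7, left w.p. 0.3;
    # reward 1 on reaching the right end, which then resets to state 0.
    s_next = min(s + 1, n_states - 1) if rng.random() < 0.7 else max(s - 1, 0)
    r = 1.0 if s_next == n_states - 1 else 0.0
    if s_next == n_states - 1:
        s_next = 0
    return s_next, r

def phi(s):
    # Hand-picked 2-dim features: normalized state index plus a bias term.
    return np.array([s / (n_states - 1), 1.0])

# Collect (s, a, s') triples under the fixed policy and build Phi, Phi', R.
Phi, Phi_next, R = [], [], []
s = 0
for _ in range(5000):
    s_next, r = step(s)
    Phi.append(phi(s)); Phi_next.append(phi(s_next)); R.append(r)
    s = s_next
Phi, Phi_next, R = np.array(Phi), np.array(Phi_next), np.array(R)

# LSTD(0): theta = (Phi^T Phi - gamma Phi^T Phi')^{-1} Phi^T R
A = Phi.T @ Phi - gamma * Phi.T @ Phi_next
b = Phi.T @ R
theta = np.linalg.solve(A, b)
print(theta)
```

Note that the samples are used only through the feature matrices; nothing in the solve step depends on the size of the state space, which is the point of the linear architecture.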
LSTD(0) in policy iteration

- Iterate:
  - Collect state-action-state triples (s_i, a_i, s'_i) according to the current policy \pi.
  - Use LSTD(0) to compute V^\pi.
- Tweaks:
  - Can re-use triples (s_i, a_i, s'_i) from previous policies as long as they are consistent with the current policy.
  - Can redo the derivation with Q-functions rather than V.
  - In the case of stochastic policies, can weight the contribution of a triple according to Prob(a_i | s_i) under the current policy.
- Doing all three results in "Least-Squares Policy Iteration" (Lagoudakis and Parr, 2003).

LSTD(0) --- batch vs. incremental updates

- Collect state-action-state triples (s_i, a_i, s'_i) according to a policy \pi and build the matrices:
    \Phi_m = [\phi(s_1)\ \cdots\ \phi(s_m)]^\top, \quad \Phi'_m = [\phi(s'_1)\ \cdots\ \phi(s'_m)]^\top, \quad R_m = [R(s_1, a_1, s'_1)\ \cdots\ R(s_m, a_m, s'_m)]^\top
- Find an approximation of the value function:
    V^\pi(s) \approx \theta_m^\top \phi(s) for \theta_m = (\Phi_m^\top (\Phi_m - \gamma \Phi'_m))^{-1} \Phi_m^\top R_m
- One more datapoint "m+1":
    \theta_{m+1} = (\Phi_m^\top (\Phi_m - \gamma \Phi'_m) + \phi_{m+1} (\phi_{m+1} - \gamma \phi'_{m+1})^\top)^{-1} (\Phi_m^\top R_m + \phi_{m+1} r_{m+1})
- Sherman-Morrison formula:
    (A + u v^\top)^{-1} = A^{-1} - \frac{A^{-1} u v^\top A^{-1}}{1 + v^\top A^{-1} u}

RLSTD

- Recursively compute the approximation of the value function by leveraging the Sherman-Morrison formula. Let
    A_m^{-1} = (\Phi_m^\top (\Phi_m - \gamma \Phi'_m))^{-1}, \quad b_m = \Phi_m^\top R_m, \quad \theta_m = A_m^{-1} b_m
- One more datapoint "m+1":
    A_{m+1}^{-1} = A_m^{-1} - \frac{A_m^{-1} \phi_{m+1} (\phi_{m+1} - \gamma \phi'_{m+1})^\top A_m^{-1}}{1 + (\phi_{m+1} - \gamma \phi'_{m+1})^\top A_m^{-1} \phi_{m+1}}
    b_{m+1} = b_m + \phi_{m+1} r_{m+1}
- Note: there exist orthogonal-matrix techniques to do the same thing in a numerically more stable fashion (essentially: keep track of the QR decomposition of A_m).

RLSTD: for non-linear function approximators?

- RLSTD with linear function approximation with a Gaussian prior on \theta gives a Kalman filter.
- Can be applied to the non-linear setting too: simply linearize the non-linear function approximator around the current estimate of \theta. This is not globally optimal, but likely still better than "naïve" gradient descent (adding a prior gives an extended Kalman filter).

Recursive Least Squares (1)-(3) [From: Boyd, EE263]
(Three background slides on recursive least squares, reproduced from Boyd's EE263 notes.)
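The RLSTD recursion can be checked numerically against the batch LSTD(0) formula. The sketch below uses random feature vectors and rewards (an assumption made purely for illustration): it maintains A_m^{-1} and b_m with the Sherman-Morrison update above and compares the resulting \theta with the one-shot batch solution.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, k, m = 0.95, 3, 200

# Illustrative data: random feature vectors phi(s_i), phi(s'_i) and rewards r_i.
Phi = rng.normal(size=(m, k))
Phi_next = rng.normal(size=(m, k))
R = rng.normal(size=m)

# Batch LSTD(0): theta = (Phi^T (Phi - gamma Phi'))^{-1} Phi^T R
theta_batch = np.linalg.solve(Phi.T @ (Phi - gamma * Phi_next), Phi.T @ R)

# RLSTD: initialize A^{-1}, b from the first n0 samples (so A is invertible),
# then fold in each remaining sample with a rank-one Sherman-Morrison update.
n0 = 20
A_inv = np.linalg.inv(Phi[:n0].T @ (Phi[:n0] - gamma * Phi_next[:n0]))
b = Phi[:n0].T @ R[:n0]
for i in range(n0, m):
    u = Phi[i]                         # phi_{m+1}
    v = Phi[i] - gamma * Phi_next[i]   # phi_{m+1} - gamma * phi'_{m+1}
    A_inv -= np.outer(A_inv @ u, v @ A_inv) / (1.0 + v @ A_inv @ u)
    b += Phi[i] * R[i]
theta_rlstd = A_inv @ b

print(np.max(np.abs(theta_batch - theta_rlstd)))  # small: the two agree
```

Each recursive step costs O(k^2) instead of the O(k^3) of re-solving from scratch, which is what makes the incremental form attractive as data streams in.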
TD methods recap

- Model-free RL: learn V, Q directly from experience:
  - TD(λ), SARSA(λ): on-policy updates.
  - Q-learning: off-policy updates.
- Large MDPs: include function approximation.
  - Some guarantees for linear function approximation.
- Batch version: no need to tweak various constants.
- The same solution can be obtained incrementally by using recursive updates! This is generally true for least-squares-type systems.

Applications of TD methods

- Backgammon.
- Standard RL testbeds (all in simulation): cartpole balancing, acrobot swing-up, gridworld (Assignment #2), bicycle riding, Tetris (Assignment #2).
- As part of actor-critic methods (= policy gradient + TD).
- Fine-tuning / learning some robotics tasks.
- Many financial institutions use some form of linear TD for pricing of options.

RL: our learning status

- Small MDPs: VI, PI, GPI, LP.
- Large MDPs:
  - Value iteration + function approximation: iterate Bellman back-up, project, ...
  - TD methods: TD, SARSA, Q-learning with function approximation; simplicity and limited storage can be a convenience.
  - LSTD, LSPI, RLSTD: built upon in and compared to in many current RL papers; a main current direction is feature selection.
- You should be able to read/understand many RL papers.
- Which important ideas are we missing (and will I try to cover between today and the next 3-5 lectures)?
  - Imitation learning: learn from observing an expert.
  - Linear programming with function approximation and constraint sampling: guarantees; generally applicable idea of constraint sampling.
  - Policy gradient, actor-critic (= TD + policy gradient in one): fine-tuning policies through running trials on a real system; robotic success stories.
  - Partial observability: POMDPs.
  - Hierarchical methods: incorporate your knowledge to enable scaling to larger systems.
  - Reward shaping: can we choose reward functions such as to enable faster learning?
  - Exploration vs. exploitation: how/when should we explore?
  - Stochastic approximation: basic intuition behind how/when sampled versions work.

Imitation learning

- If an expert is available, could use an expert trace s_1, a_1, s_2, a_2, s_3, a_3, ... to learn "something" from the expert.
- Behavioral cloning: use supervised learning to directly learn a policy \pi: S \to A.
  - No model of the system dynamics required.
  - No MDP / optimal control solution algorithm required.
- Inverse reinforcement learning: learn the reward function. Often most
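The behavioral-cloning idea above reduces imitation to ordinary supervised learning on the expert trace. The sketch below is a minimal illustration under assumed data: a made-up linear expert generates (state, action) pairs, and a linear policy is fit to them by least squares. Any regressor or classifier could stand in for the least-squares fit, and note that no dynamics model or MDP solver appears anywhere.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical expert trace: states s_t in R^4 and actions a_t in R,
# generated by a linear expert policy (unknown to the learner) plus noise.
w_expert = np.array([0.5, -1.0, 2.0, 0.3])
states = rng.normal(size=(1000, 4))                          # s_1, ..., s_T
actions = states @ w_expert + 0.01 * rng.normal(size=1000)   # a_t ~ pi*(s_t)

# Behavioral cloning: supervised regression from states to expert actions.
w_clone, *_ = np.linalg.lstsq(states, actions, rcond=None)

def policy(s):
    # Learned policy: directly maps a state to an action.
    return s @ w_clone

print(w_clone)  # close to w_expert
```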