CS294-40 Learning for Robotics and Control    Lecture 1 - 8/28/2008

Introduction

Lecturer: Pieter Abbeel    Scribe: Pieter Abbeel

1 Lecture outline

• Class logistics.
• Slideshow and movies on current autonomous robotics, on the algorithms they use, and on future directions.
• Markov decision processes.

2 Markov decision processes (MDPs)

2.1 Definition

A (discounted infinite-horizon) Markov decision process (MDP) is a tuple (S, A, T, γ, D, R). Here

1. S is the set of possible states for the system;
2. A is the set of possible actions;
3. T represents the (typically stochastic) system dynamics;
4. γ ∈ [0, 1) is the discount factor;
5. D is the initial-state distribution, from which the start state s_0 is drawn;
6. R : S → ℝ is the reward function.

Acting in a Markov decision process results in a sequence of states and actions s_0, a_0, s_1, a_1, s_2, .... A policy π is a sequence of mappings (μ_0, μ_1, μ_2, ...), where, at time t, the mapping μ_t(·) determines the action a_t = μ_t(s_t) to take when in state s_t.

The objective is to find policies that maximize the expected sum of rewards accumulated over time. In particular, a policy π is good if its utility

    U(π) = E[ Σ_{t=0}^{∞} γ^t R(s_t) | π ]

is high.

To represent the system dynamics, we can use the state-transition distribution notation

    s_{t+1} ∼ P_{sa}(· | s_t, a_t).

We will also often use the following notation:

    s_{t+1} = F(s_t, a_t, w_t).

Here F is a deterministic function, and w_t is a random disturbance.

2.2 Examples

2.2.1 Car

One (approximate) way to model the state of a car is to use the following six state variables: northing (n), easting (e), north velocity (ṅ), east velocity (ė), heading (θ), angular rate (θ̇).
Hence the state space is S = ℝ^6. The actions (or control inputs) are (i) steering angle, (ii) throttle, (iii) brake. The disturbances capture both environmental perturbations and unmodeled aspects of the car dynamics.

We could have the following dynamics model s_{t+1} = F(s_t, a_t, w_t):

    n_{t+1} = n_t + ṅ_t Δt,
    e_{t+1} = e_t + ė_t Δt,
    θ_{t+1} = θ_t + θ̇_t Δt,
    ṅ_{t+1} = f_n(ṅ_t, ė_t, θ̇_t, a_t, w_t),
    ė_{t+1} = f_e(ṅ_t, ė_t, θ̇_t, a_t, w_t),
    θ̇_{t+1} = f_θ(ṅ_t, ė_t, θ̇_t, a_t, w_t).

The reward function could be R(s_t) = 1{in goal region} − 100 · 1{in collision}. Here 1{·} is an indicator function, taking the value "1" when its argument is true, and "0" otherwise. The functions f_n, f_e, f_θ are deterministic functions modeling the car's velocity and heading-rate dynamics.
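The car model above can be sketched in code. In this sketch, the position and heading updates follow the Euler-integration equations from the notes, but the velocity updates standing in for f_n, f_e, f_θ (a crude point-mass model accelerating along the heading) and the time step `DT` are illustrative assumptions, not the lecture's actual functions:

```python
import math
import random

DT = 0.1  # time step Delta-t (assumed value)

def car_step(s, a, w=None):
    """One step s_{t+1} = F(s_t, a_t, w_t) of the six-state car model.

    State s = (n, e, ndot, edot, theta, thetadot); action
    a = (steer, throttle, brake). The velocity updates below are
    hypothetical stand-ins for f_n, f_e, f_theta.
    """
    n, e, ndot, edot, theta, thetadot = s
    steer, throttle, brake = a
    if w is None:  # disturbance w_t: small Gaussian noise (assumed)
        w = [random.gauss(0.0, 0.05) for _ in range(3)]

    # Position and heading integrate the current rates (as in the notes).
    n_new = n + ndot * DT
    e_new = e + edot * DT
    theta_new = theta + thetadot * DT

    # Hypothetical velocity dynamics: accelerate along the heading,
    # decelerate with brake, turn rate driven by the steering angle.
    speed = math.hypot(ndot, edot)
    speed_new = max(0.0, speed + (throttle - brake) * DT)
    ndot_new = speed_new * math.cos(theta) + w[0]
    edot_new = speed_new * math.sin(theta) + w[1]
    thetadot_new = speed_new * steer + w[2]

    return (n_new, e_new, ndot_new, edot_new, theta_new, thetadot_new)

def reward(s, in_goal, in_collision):
    # R(s) = 1{in goal region} - 100 * 1{in collision}
    return (1.0 if in_goal(s) else 0.0) - (100.0 if in_collision(s) else 0.0)
```

Note that the position update uses the *old* velocities, so a car starting at rest does not move on the first step even under full throttle; the throttle only changes the velocity components.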
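Given any simulator of the form s_{t+1} = F(s_t, a_t, w_t), the utility U(π) = E[Σ_t γ^t R(s_t) | π] from Section 2.1 can be estimated by Monte Carlo: sample rollouts, accumulate discounted rewards, and average. A minimal sketch, using a hypothetical two-state toy MDP (not from the lecture) for illustration; the infinite-horizon sum is truncated at a finite horizon, which is sound because the tail is bounded by γ^horizon · max|R|:

```python
import random

def estimate_utility(sample_start, step, reward, policy, gamma=0.95,
                     horizon=200, num_rollouts=500):
    """Monte Carlo estimate of U(pi) = E[sum_t gamma^t R(s_t) | pi]."""
    total = 0.0
    for _ in range(num_rollouts):
        s = sample_start()                 # s_0 ~ D
        ret, discount = 0.0, 1.0
        for t in range(horizon):
            ret += discount * reward(s)    # accumulate gamma^t R(s_t)
            a = policy(s, t)               # a_t = mu_t(s_t)
            s = step(s, a)                 # s_{t+1} = F(s_t, a_t, w_t)
            discount *= gamma
        total += ret
    return total / num_rollouts

# Hypothetical toy MDP: states {0, 1}, reward 1 in state 1. Action
# "stay" keeps the current state with probability 0.9; "switch" flips
# the state deterministically.
def sample_start():
    return 0

def step(s, a):
    if a == "stay":
        return s if random.random() < 0.9 else 1 - s
    return 1 - s

toy_reward = lambda s: float(s == 1)
toy_policy = lambda s, t: "stay" if s == 1 else "switch"
```

For this toy chain the estimate should land near 17: the probability of being in state 1 settles around 10/11, so U(π) ≈ (10/11) · γ/(1−γ) ≈ 17.3 for γ = 0.95.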