Berkeley COMPSCI 287 - Lecture Notes

"MODULARITY, POLYRHYTHMS, AND WHAT ROBOTICS AND CONTROL MAY YET LEARN FROM THE BRAIN"
Jean-Jacques Slotine, Nonlinear Systems Laboratory, MIT
Thursday, Nov 5th, 4:00 p.m., 3110 Etcheverry Hall

ABSTRACT
Although neurons as computational elements are 7 orders of magnitude slower than their artificial counterparts, the primate brain grossly outperforms robotic algorithms in all but the most structured tasks. Parallelism alone is a poor explanation, and much recent functional modelling of the central nervous system focuses on its modular, heavily feedback-based computational architecture, the result of accumulation of subsystems throughout evolution. We discuss this architecture from a global functionality point of view, and show why evolution is likely to favor certain types of aggregate stability. We then study synchronization as a model of computations at different scales in the brain, such as pattern matching, restoration, priming, temporal binding of sensory data, and mirror neuron response. We derive a simple condition for a general dynamical system to globally converge to a regime where diverse groups of fully synchronized elements coexist, and show accordingly how patterns can be transiently selected and controlled by a very small number of inputs or connections. We also quantify how synchronization mechanisms can protect general nonlinear systems from noise. Applications to some classical questions in robotics, control, and systems neuroscience are discussed. The development makes extensive use of nonlinear contraction theory, a comparatively recent analysis tool whose main features will be briefly reviewed.

CS 287: Advanced Robotics (Fall 2009)
Lecture 19: Actor-Critic / Policy gradient for learning to walk in 20 minutes; Natural gradient
Pieter Abbeel, UC Berkeley EECS

Case study: learning bipedal walking

- Dynamic gait: a bipedal walking gait is considered dynamic if the ground projection of the center of mass leaves the convex hull of the ground contact points during some portion of the walking cycle. (A minimal support-polygon check is sketched after the passive-dynamic-walker notes below.)
- Why hard? Achieving stable dynamic walking on a bipedal robot is a difficult control problem because bipeds can only control the trajectory of their center of mass through the unilateral, intermittent, uncertain force contacts with the ground.
- Contrast: "fully actuated walking".

Passive dynamic walkers

- The energy lost to friction and to collisions when the swing leg returns to the ground is balanced by the gradual conversion of potential energy into kinetic energy as the walker moves down the slope.
- Can we actuate them to walk on flat terrain?
- John E. Wilson. Walking toy. Technical report, United States Patent Office, October 15, 1936.
- Tad McGeer. Passive dynamic walking. International Journal of Robotics Research, 9(2):62-82, April 1990.
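The static/dynamic distinction in the gait definition above can be made concrete with a small geometric test. The sketch below is not from the lecture; the function names, foot geometry, and tolerance are illustrative choices. It checks whether the ground projection of the center of mass lies inside the convex hull of the ground contact points; a gait is dynamic if this check fails during part of the walking cycle.

```python
def convex_hull_2d(points):
    """Andrew's monotone chain: convex hull of 2D points, returned in CCW order."""
    pts = sorted(set(map(tuple, points)))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def statically_stable(com_xy, contact_points_xy, tol=1e-9):
    """True if the ground projection of the CoM lies inside (or on the edge of)
    the support polygon, i.e. the convex hull of the ground contact points.
    A dynamic gait is one where this returns False during part of the cycle."""
    hull = convex_hull_2d(contact_points_xy)
    if len(hull) < 3:  # degenerate support: single point or a line of contacts
        return False
    x, y = com_xy
    for (x1, y1), (x2, y2) in zip(hull, hull[1:] + hull[:1]):
        # for a CCW polygon, the point must lie to the left of (or on) every edge
        if (x2 - x1) * (y - y1) - (y2 - y1) * (x - x1) < -tol:
            return False
    return True

# Hypothetical example: double support on two 10 cm x 6 cm feet, 20 cm apart (meters).
feet = [(0.00, 0.00), (0.10, 0.00), (0.10, 0.06), (0.00, 0.06),
        (0.00, 0.20), (0.10, 0.20), (0.10, 0.26), (0.00, 0.26)]
print(statically_stable((0.05, 0.13), feet))   # True: CoM over the support polygon
print(statically_stable((0.20, 0.13), feet))   # False: CoM has left it (dynamic phase)
```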
Learning to walk in 20 minutes --- Tedrake, Zhang, Seung 2005

Robot (figure callouts): 44 cm tall; passive hip joint [1 DOF]; 2 x 2 (roll, pitch) position-controlled servo motors [4 DOF]; arms coupled to the opposite leg to reduce yaw moment; freely swinging load [1 DOF]. Natural gait down a 0.03 radian ramp: 0.8 Hz, 6.5 cm steps.

- 9 DOFs: 6 internal DOFs, plus 3 DOFs for the robot's orientation (the robot is always assumed to be in contact with the ground at a single point; the absolute (x, y) position is ignored).
- q: vector of joint angles; u: control vector (4D); d(t): time-varying vector of random disturbances.
- Dynamics:
  \ddot{q} = f(q, \dot{q}, u, d(t))
- Discrete footstep-to-footstep dynamics: consider the state at touchdown of the robot's left leg,
  F_\pi(x', x) = P(\hat{x}_{n+1} = x' \mid \hat{x}_n = x; \pi)
- Stochasticity is due to sensor noise and the disturbances d(t).

Reinforcement learning formulation

- Goal: stabilize the limit cycle trajectory that the passive robot follows when walking down the ramp, making it invariant to slope.
- Reward function:
  R(x(n)) = -\frac{1}{2} \| x(n) - x^* \|_2^2
  where x^* is taken from the gait of the walker down a slope of 0.03 radians.
- Action space: at the beginning of each step cycle (= when a foot touches down) we choose an action in the discrete-time RL formulation. Our action choice is a feedback control policy to be deployed during the step; in this particular example it is a column vector w. Choosing this action means that throughout the following step cycle, the following continuous-time feedback controls will be exerted:
  u(t) = \sum_i w_i \phi_i(\hat{x}(t)) = w^\top \phi(\hat{x}(t))
- Goal: find the (constant) action choice w which maximizes the expected sum of rewards.

Policy class

- To apply the likelihood ratio gradient method, we need to define a stochastic policy class. A natural choice is to sample our action vector w from a Gaussian, w \sim N(\theta, \sigma^2 I), which gives us
  \pi_\theta(w \mid x) = \frac{1}{(2\pi)^{d/2} \sigma^d} \exp\left( -\frac{1}{2\sigma^2} (w - \theta)^\top (w - \theta) \right)
  [Note: the policy does not depend on x; this is the case because the actions we consider are feedback policies themselves!]
- The policy optimization becomes optimizing the mean of this Gaussian. [In other papers people have also included the optimization of the variance parameter.]

Policy update

- Likelihood ratio based gradient estimate from a single trace of H footsteps:
  \hat{g} = \sum_{n=0}^{H-1} \nabla_\theta \log \pi_\theta(w(n) \mid \hat{x}(n)) \left( \sum_{k=n}^{H-1} R(\hat{x}(k)) - b \right)
  Since \log \pi_\theta(w \mid \hat{x}) = \text{const} - \frac{1}{2\sigma^2} (w - \theta)^\top (w - \theta), we have
  \nabla_\theta \log \pi_\theta(w \mid \hat{x}) = \frac{1}{\sigma^2} (w - \theta)
- Rather than waiting until the horizon H is reached, we can perform the updates online as follows (here \eta_\theta is a step-size parameter and b(n) is the amount of baseline we allocate to time n; see the next slide):
  e(n) = e(n-1) + \frac{1}{\sigma^2} (w(n) - \theta(n))
  \theta(n+1) = \theta(n) + \eta_\theta \, e(n) \, (R(\hat{x}(n)) - b(n))
- To reduce variance, we can discount the eligibilities:
  e(n) = \gamma e(n-1) + \frac{1}{\sigma^2} (w(n) - \theta(n))

Choosing the baseline b(n)

- A good choice for the baseline is one that corresponds to an estimate of the expected reward we should have obtained under the current policy.
- Assuming we have an estimate \hat{V} of the value function under the current policy, we can build such a baseline as
  b(n) = \hat{V}(\hat{x}(n)) - \gamma \hat{V}(\hat{x}(n+1))
  With this choice, R(\hat{x}(n)) - b(n) is exactly the TD error \delta(n) defined below.
- To estimate \hat{V} we can use TD(0) with function approximation. Using linear value function approximation, we have
  \hat{V}(\hat{x}) = \sum_i v_i \psi_i(\hat{x})
  which gives the following update equations to learn \hat{V} with TD(0):
  \delta(n) = R(\hat{x}(n)) + \gamma \hat{V}(\hat{x}(n+1)) - \hat{V}(\hat{x}(n))
  v(n+1) = v(n) + \eta_v \, \delta(n) \, \psi(\hat{x}(n))

The complete actor critic learning algorithm

- Before each foot step, sample the feedback control policy parameters w(n) from N(\theta(n), \sigma^2 I).
- During the foot step, execute the continuous-time controls u(t) = w(n)^\top \phi(\hat{x}(t)).
- At touchdown, apply the critic (TD(0)) and actor (eligibility-based) updates given above.
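To tie the pieces together, here is a minimal Python sketch of the per-footstep actor-critic loop, paraphrasing the update equations above; it is not the authors' implementation. The simulator interface step_once, the feature map psi, and all hyperparameter values are hypothetical placeholders, and the continuous-time controller u(t) = w^T phi(x_hat(t)) is assumed to be applied inside step_once.

```python
import numpy as np

def actor_critic_walk(step_once, psi, x0, dim_w, dim_v,
                      sigma=0.1, gamma=0.9, eta_theta=1e-3, eta_v=1e-2,
                      num_footsteps=2000, seed=0):
    """Footstep-to-footstep actor-critic, following the update rules above.

    step_once(x, w) -> (x_next, r): hypothetical simulator call that applies the
        feedback controller u(t) = w^T phi(x_hat(t)) for one step cycle starting
        at touchdown state x, and returns the next touchdown state together with
        the reward r = R(x) = -0.5 * ||x - x*||_2^2.
    psi(x): features for the linear value function V_hat(x) = v^T psi(x).
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(dim_w)      # actor parameters: mean of the Gaussian policy
    v = np.zeros(dim_v)          # critic parameters (linear TD(0))
    e = np.zeros(dim_w)          # discounted eligibility trace
    x = x0
    for _ in range(num_footsteps):
        # sample this footstep's feedback-policy parameters: w ~ N(theta, sigma^2 I)
        w = theta + sigma * rng.standard_normal(dim_w)
        x_next, r = step_once(x, w)      # one step cycle under u(t) = w^T phi(x_hat(t))

        # critic: TD(0) with linear function approximation
        delta = r + gamma * v @ psi(x_next) - v @ psi(x)
        v = v + eta_v * delta * psi(x)

        # actor: discounted eligibility e(n) = gamma*e(n-1) + (w - theta)/sigma^2,
        # with baseline b(n) = V_hat(x(n)) - gamma*V_hat(x(n+1)), so that
        # R(x(n)) - b(n) equals the TD error delta(n)
        e = gamma * e + (w - theta) / sigma**2
        theta = theta + eta_theta * e * delta

        x = x_next
    return theta, v
```

The same TD error delta drives both the critic and the actor updates here, which is a direct consequence of choosing the baseline b(n) = V_hat(x(n)) - gamma * V_hat(x(n+1)); the slides reuse gamma as both the TD discount and the eligibility discount, and the sketch keeps that convention.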

