Hierarchy, Behavior, and Off-policy Learning
Rich Sutton
University of Alberta

Outline
- a "micro-scale model" of cognition, in which abstractions play no role in producing behavior
- abstraction in state and time can be supported by options, but off-policy learning is required
- a new actor-critic-advantage algorithm for off-policy learning

Is hierarchical behavior a "user illusion"?
- is it something we use to explain our behavior, to others and to ourselves, but not what our brains are really doing?
- or is it a real phenomenon involved in every muscle we twitch?
- there is no current option
- no goal stack
- no hierarchical execution
- no execution of high-level anything, ever
- all execution is at a very low level (say 100 Hz)

Working hypothesis: Hierarchy and abstraction play no role in producing behavior

Some definitions:
- action = lowest-level action, at 100 Hz
- observation = lowest-level sensation, at 100 Hz
- state = some representation/memory of the state of the world, updated at 100 Hz
- policy = the mapping from state to action used to produce behavior, at 100 Hz

[Diagram: the agent loop. On every step, at 100 Hz, the STATE UPDATE maps STATE_t, the observation, and the action to STATE_t+1, and the POLICY maps state to action.]

Abstractions are used only for changing the policy
- by learning
- by planning
- abstractions are also used in the design of the state representation
- but in the end, to produce behavior, there is just a low-level policy

Outline (revisited)
- a "micro-scale model" of cognition, in which abstractions play no role in producing behavior
- abstraction in state and time can be supported by options, but off-policy learning is required
- a new actor-critic-advantage algorithm for off-policy learning

Definitions re: options
- option = a way of behaving that terminates when one of a set of states is reached
  - defined entirely in low-level terms (100 Hz)
  - actions are a special case of options
- option outcome = how the option terminates
  - what state? how much reward along the way?

Option models as world knowledge
- option model = a mapping from states to predicted outcomes for some option
- each option model is a tiny abstract model of the world
  - if I tried to, could I open the door?
  - if I dropped this, would it make a sound?
  - if I waited, would this talk ever end?
  - if I tried to sit, would I fall on the floor?

Option models as state representations
- option models are predictions of option outcomes
- such predictions can make great abstract state variables
  - if I opened the box, what would I see?
  - if I shifted lanes, would I hit another car?
  - if I rang Joe's room, would he answer?
- option models are PSRs (Littman et al., 2002) on steroids

[Diagram: a predictive state representation updated on every step, at 100 Hz, from the observation and action. One state variable A predicts another, B, after option O (a prediction of a prediction); reward and the value function are predicted as well.]

Everything is still running at 100 Hz
- all options, option models, and predictive state representations can be learned off-policy, in parallel, at 100 Hz
- even planning can run at 100 Hz
  - Dyna strategy: plan by learning on simulated transitions
  - use option models to generate transitions from the beginning to the end of options
  - long/variable time spans are reduced to single steps
- a parallel machine, running at the smallest time scale, yet always learning and thinking about large-scale behavior and abstract states (a code sketch of options, option models, and a Dyna-style planning step appears at the end of these notes)

[Diagram: "a thousand points of light, each doing off-policy learning about its own option". A circuit diagram for an off-policy learning algorithm, with continuous-time equations for the update process, the learning rule (weight update), the local TD error, and the eligibility traces, driven by the divergence from the recognizer, the termination condition, the target, and the node's prediction.]

All this presumes we can do off-policy, intra-option learning with function approximation
- off-policy learning = learning about one policy while following another
  - we must learn off-policy in order to learn efficiently about options
  - you can only behave one way, but you want to learn about many different ways of behaving
- intra-option learning = learning about an overall option while only doing per-time-step operations
- function approximation = generalizing across states

Do we know how to do off-policy learning?
- there are known, sound, off-policy learning methods
  - based on importance sampling (Precup et al., 2000)
  - based on averagers (Gordon, 1995)
- but they learn much more slowly than seems necessary
- and they are not "elegant"

RL Algorithm Space
[Diagram: three desiderata: TD learning, linear function approximation, and off-policy learning (as in Q-learning and options). We need all 3, but we can only get 2 at a time: linear TD(lambda) is stable under on-policy training (Tsitsiklis & Van Roy 1997; Tadic 2000), but adding off-policy training can diverge, "Boom!" (Baird 1995; Gordon 1995; NDP 1996).]

Trajectories are good
- under on-policy training, learning occurs along whole trajectories
- each region of state space is exited the same number of times as it is entered
- each region's estimated value is corrected as many times as it is used
[Diagram: whole trajectories through state space]

- under off-policy training, learning occurs along segments of broken trajectories
- regions of state space may be entered more times than they are exited
- a region's estimated value may be used many times as a target, yet rarely be updated itself
- with function approximation, its estimate can get severely out of whack
[Diagrams: broken trajectories in state space. Where the behavior diverges from the target policy, the marked region is backed up from, but never backed up to; a further figure illustrates the value with respect to the target policy and the "advantage".]

The OPACA algorithm
- OPACA = Off-Policy Actor-Critic-Advantage algorithm
- advantages: A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)
- the advantages are estimated using a third independent linear function approximator (in addition to those for the actor and the critic)

Actor:
    \pi(s, a, \theta) = \frac{e^{\theta^\top \phi(s,a)}}{\sum_b e^{\theta^\top \phi(s,b)}}, \qquad \theta, \phi \in \mathbb{R}^m    (1)

Critic:
    V_t(s) = v_t^\top f(s), \qquad v, f \in \mathbb{R}^n    (2)

Advantages:
    A_t(s, a) = w_t^\top \psi(s, a), \qquad w \in \mathbb{R}^m    (3)

using the actor-compatible feature vectors:
    \psi(s, a) = \phi(s, a) - \sum_b \pi(s, b)\, \phi(s, b)    (4)

Rationale:
    V^\pi(s) + A^\pi(s, a) = E\{ r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s, a_t = a \}    (5)

thus
    0 = E\{ r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s, a_t = a \} - V^\pi(s) - A^\pi(s, a)    (6)

leading to the advantage-based TD error:
    \delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t) - A_t(s_t, a_t)    (7)
             = r_{t+1} + \gamma v_t^\top f(s_{t+1}) - v_t^\top f(s_t) - w_t^\top \psi(s_t, a_t)    (8)

and to these one-step TD updates:
    v_{t+1} = v_t + \alpha \delta_t f(s_t)    (9)
    w_{t+1} = w_t + \alpha \delta_t \psi(s_t, a_t)    (10)

Note: there must be analogous equations using eligibility traces.
Note: I think there is some way to do without an explicit policy and work directly in compatible feature vectors. That way one would not have to make sure they are compatible: by construction they would be compatible with some policy.

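Going back to the broken-trajectory slides, the instability that motivates OPACA can be seen in a standard toy illustration from the off-policy TD literature (the classic "w -> 2w" example; it is not from these slides). A state is used as a backup target many times but is itself never backed up to, and with linear function approximation the shared weight is dragged off to infinity.

```python
# Toy illustration (not from the slides): the classic "w -> 2w" example of
# off-policy divergence with linear function approximation. State A has
# feature 1, state B has feature 2, both valued by the single weight w.
# Off-policy training updates the A -> B transition again and again, using
# B's estimate as a target, while B itself is never backed up to.
gamma, alpha, w = 0.99, 0.1, 1.0
for step in range(100):
    delta = 0.0 + gamma * (2 * w) - 1 * w   # TD error on the A -> B transition
    w += alpha * delta * 1.0                # update only along A's feature
print(w)   # grows without bound: each update multiplies w by 1 + alpha*(2*gamma - 1)
```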

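Finally, here is the code sketch referred to after the Dyna slides: a minimal Python illustration of an option defined in low-level terms, an option model mapping states to predicted outcomes, and a Dyna-style planning backup that uses the option model to jump from the beginning of an option to its end in a single step. The class names, the tabular representation, and the way discounting is folded into the model are assumptions, not anything given in the talk.

```python
# Illustrative sketch only: a tabular option, an option model, and a
# Dyna-style planning backup over option models. Names are assumptions.
import numpy as np
from dataclasses import dataclass
from typing import Callable

@dataclass
class Option:
    """A way of behaving, defined entirely in low-level (per-step) terms."""
    policy: Callable[[int], int]         # low-level action for each state
    terminates: Callable[[int], bool]    # True in the option's terminal states

def primitive_as_option(action: int) -> Option:
    """Actions are a special case of options: act once, then terminate."""
    return Option(policy=lambda s: action, terminates=lambda s: True)

@dataclass
class OptionModel:
    """Predicted outcome of running the option from each state: the reward
    accumulated along the way and where (discounted) the option terminates."""
    reward: np.ndarray       # reward[s]: expected return while the option runs
    outcome: np.ndarray      # outcome[s, s']: discounted prob. of ending in s'

def dyna_backup(V: np.ndarray, model: OptionModel, s: int, alpha: float = 0.1):
    """One Dyna planning step using the option model: the option's long or
    variable time span is reduced to a single simulated transition."""
    target = model.reward[s] + model.outcome[s] @ V
    V[s] += alpha * (target - V[s])

# The model's own predictions (e.g. model.outcome[s] or model.reward[s]) can
# also serve as abstract state variables, in the spirit of PSRs.
```

In the talk's terms, many such models would be learned off-policy, in parallel, at 100 Hz, one per option.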