Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping
Paper by Andrew Ng, Daishi Harada, and Stuart Russell
Presentation by Kristin Branson
April 9, 2002

Reinforcement Learning

The reinforcement learning (RL) task: learn the best action to take in any situation so as to maximize the total reward earned. The agent learns which actions to take by interacting with its environment.

Markov Decision Processes (MDPs)

In an MDP, the environment is represented as a finite state automaton with which the agent interacts in discrete episodes. All information relevant to which action to take is contained in the environment's state (the Markov property). An MDP is a tuple M = (S, A, T = {P_sa(s')}, \gamma, R), where:
- S is the set of states of the environment.
- A is the set of actions the agent can take.
- P_sa(s') is the probability that s' is the next state after taking action a in state s.
- \gamma is the discount factor.
- R is the immediate reward function.

MDPs: Example

s_t is the state of the environment at time t, and r_t is the reward received at time t. A policy \pi is a mapping from states to actions: a_t = \pi(s_t). At each step the agent observes s_t, takes action a_t, and the environment returns r_{t+1} and s_{t+1}.
[Figure: the agent-environment interaction loop, and an example MDP with states "high" and "low", actions "search", "wait", and "recharge", transition probabilities \alpha and \beta, and rewards R_search and R_wait. From Sutton and Barto, 2000.]

Example: Foraging Robots

The robots' objective is to collectively find pucks and bring them home. Represent the 12D environment by the state variables:
- have-puck?
- at-home?
- near-intruder?
What should the immediate reward function be?
[http://www-robotics.usc.edu/~agents/]
[Mataric, 1994]

Reward Shaping

If a reward is given only when a robot drops a puck at home, learning will be extremely difficult: the delay between the action and the reward is large. Solution: reward shaping (intermediate rewards).
Add rewards and penalties for achieving subgoals and making errors:
- subgoal: grasped-puck
- subgoal: dropped-puck-at-home
- error: dropped-puck-away-from-home
Add progress estimators:
- intruder-avoiding progress function
- homing progress function
Adding intermediate rewards will potentially allow RL to handle more complex problems (a sketch of such a shaped reward follows the results below).

Reward Shaping Results

[Figure: percentage of the correct policy learned after 15 minutes, comparing no shaping, subgoaling, and subgoaling plus progress estimators. From Mataric, 1994.]
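As a concrete illustration of the shaping idea above, here is a minimal sketch of a shaped immediate reward for the foraging task. The State fields, the bonus and penalty values, and the distance-based progress term are hypothetical choices made for this sketch; they are not taken from Mataric's experiments or from the paper.

# A minimal sketch of a shaped immediate reward for the foraging task.
# The base reward is sparse (only dropping a puck at home is rewarded);
# shaping adds subgoal bonuses, an error penalty, and a progress estimator.
# All constants and State fields below are hypothetical, for illustration only.

from dataclasses import dataclass

@dataclass
class State:
    have_puck: bool
    at_home: bool
    near_intruder: bool
    distance_to_home: float   # used only by the (assumed) progress estimator

def base_reward(s: State, s_next: State) -> float:
    # Sparse base reward: only dropping a puck at home earns reward.
    return 1.0 if s.have_puck and not s_next.have_puck and s_next.at_home else 0.0

def shaped_reward(s: State, s_next: State) -> float:
    r = base_reward(s, s_next)
    # Subgoal bonus / error penalty (hypothetical values).
    if not s.have_puck and s_next.have_puck:
        r += 0.3                                  # subgoal: grasped-puck
    if s.have_puck and not s_next.have_puck and not s_next.at_home:
        r -= 0.3                                  # error: dropped-puck-away-from-home
    # Progress estimator: reward getting closer to home while carrying a puck.
    if s.have_puck and s_next.have_puck:
        r += 0.1 * (s.distance_to_home - s_next.distance_to_home)
    return r

Ad hoc bonuses like these can change which policy is optimal, for example by creating positive reward cycles; the potential-based form introduced in the next slides is designed to rule that out.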
Policy-Invariant Reward Transformations

Which reward-function transformations do not change the (near-)optimal policy? The answer is useful for reward shaping: often, shaping does not aim to change the ultimate policy, but to help the agent learn that policy. Policy-invariant shaping avoids shaping rewards that mislead the agent (example: positive reward cycles). Understanding policy-preserving reward transformations will also help us understand how precisely a reward function can be deduced from observing optimal behavior.

Outline

- Introduction to reinforcement learning and reward shaping
- Value functions
- Potential-based shaping functions
- Proof that potential-based shaping functions are policy invariant
- Proof that, given no other knowledge about the domain, potential-based shaping functions are necessary for policy invariance
- Experiments investigating the effects of different potential-based shaping reward functions on RL

Value Function Definitions

The state-value function estimates how good it is to be in a certain state, given the policy:
    V^\pi(s) = E_\pi[R \mid s_0 = s] = E_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{k+1} \mid s_0 = s\right]
The optimal state-value function is
    V^*(s) = \sup_\pi V^\pi(s)
The Q-function estimates how good it is to perform an action in a state, given the policy:
    Q^\pi(s,a) = E_\pi[R \mid s_0 = s, a_0 = a] = E_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{k+1} \mid s_0 = s, a_0 = a\right]
The optimal Q-function is
    Q^*(s,a) = \sup_\pi Q^\pi(s,a)

Recursive Relationships of Value Functions

The Q-functions are recursively related; there is one Bellman equation for each state:
    Q^\pi(s,a) = E_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a\right]
              = E_\pi\left[r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \mid s_t = s, a_t = a\right]
              = \sum_{s'} P_{sa}(s')\left(r(s,a,s') + \gamma\, E_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \mid s_{t+1} = s'\right]\right)
              = \sum_{s'} P_{sa}(s')\left(r(s,a,s') + \gamma\, Q^\pi(s', \pi(s'))\right)    (the Bellman equation)

Representation of Shaping Rewards

Under reward shaping, an increment is added to each immediate reward: the MDP M = (S, A, T, \gamma, R) is transformed into the MDP M' = (S, A, T, \gamma, R'), where
    r'(s,a,s') = r(s,a,s') + F(s,a,s')
and F : S \times A \times S \to \mathbb{R} is a shaping reward function. A potential-based shaping function is a difference of potentials: for all s, a, s',
    F(s,a,s') = \gamma\,\Phi(s') - \Phi(s),
where \gamma is the discount factor of the MDP. (A numerical check of what this transformation does to the optimal policy follows the proof outline below.)

Outline of Sufficiency Proof

Sufficiency Theorem: if F is a potential-based shaping function, then it is policy invariant. Let M be the original MDP and let M' be the original MDP plus the shaping function F(s,a,s') = \gamma\,\Phi(s') - \Phi(s).
- Claim 1: any policy that optimizes Q^*_{M'}(s,a) also optimizes Q^*_M(s,a).
- Claim 2: any policy that optimizes Q^*_M(s,a) also optimizes Q^*_{M'}(s,a).
From Claims 1 and 2, any optimal policy for M is also an optimal policy for M', and vice versa. Therefore F is policy invariant.

Sufficiency Theorem

If F is a potential-based shaping function, then it is policy invariant.
Claim 1: any policy that optimizes Q^*_{M'}(s,a) also optimizes Q^*_M(s,a). Begin with the Bellman equation for the optimal Q-function for M, subtract \Phi(s) from both sides, and rewrite the right-hand side using F:
    Q^*_M(s,a) = \sum_{s'} P_{sa}(s')\left(r(s,a,s') + \gamma \max_{a' \in A} Q^*_M(s',a')\right)
    Q^*_M(s,a) - \Phi(s) = \sum_{s'} P_{sa}(s')\left(r(s,a,s') + \gamma \max_{a' \in A} Q^*_M(s',a')\right) - \Phi(s)
                         = \sum_{s'} P_{sa}(s')\left(r(s,a,s') + \gamma\,\Phi(s') - \Phi(s) + \gamma \max_{a' \in A}\left(Q^*_M(s',a') - \Phi(s')\right)\right)
                         = \sum_{s'} P_{sa}(s')\left(r(s,a,s') + F(s,a,s') + \gamma \max_{a' \in A}\left(Q^*_M(s',a') - \Phi(s')\right)\right)
                         = \sum_{s'} P_{sa}(s')\left(r'(s,a,s') + \gamma \max_{a' \in A}\left(Q^*_M(s',a') - \Phi(s')\right)\right)

Proof of Sufficiency Theorem, cont.

The last line above is exactly the Bellman equation for Q^*_{M'}(s,a) under the substitution Q^*_{M'} = Q^*_M - \Phi; since the Bellman optimality equation has a unique solution, the substitution is valid:
    Q^*_{M'}(s,a) = \sum_{s'} P_{sa}(s')\left(r'(s,a,s') + \gamma \max_{a' \in A} Q^*_{M'}(s',a')\right)
Therefore, any policy that optimizes Q^*_{M'}(s,a) also optimizes Q^*_M(s,a) - \Phi(s). Since \Phi(s) does not depend on the action chosen in state s, such a policy also maximizes Q^*_M(s,a); that is, an optimal policy for M' is also an optimal policy for M. Since Q^*_{M'} = Q^*_M - \Phi and
    Q^*_M(s,a) = \sum_{s'} P_{sa}(s')\left(r(s,a,s') + \gamma\, V^*_M(s')\right),
it also follows that V^*_{M'} = V^*_M - \Phi.

Proof of Sufficiency Theorem, cont.

Claim 2: any policy that optimizes Q^*_M(s,a) also optimizes Q^*_{M'}(s,a). Begin with the Bellman equation for the optimal Q-function for M'.
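To complement the proof sketch, here is a small numerical check of the sufficiency theorem's conclusion, under assumptions chosen purely for illustration: a randomly generated finite MDP, a randomly chosen potential Phi, value iteration as the solver, and a hypothetical helper q_star. It verifies that Q^*_{M'} = Q^*_M - \Phi and that the greedy (optimal) policies of M and M' agree; it is an illustrative sketch, not the paper's proof.

import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 6, 3, 0.9

# Randomly generated MDP: transition probabilities P[s, a, s'] and rewards r[s, a, s'].
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
r = rng.normal(size=(n_states, n_actions, n_states))

# Potential-based shaping: F(s, a, s') = gamma * Phi(s') - Phi(s).
# F does not depend on the action, so its action axis has size 1 and broadcasts against r.
phi = rng.normal(size=n_states)
F = gamma * phi[None, None, :] - phi[:, None, None]

def q_star(P, r, gamma, iters=1000):
    """Value iteration on Q(s,a) = sum_s' P(s,a,s') * (r(s,a,s') + gamma * max_a' Q(s',a'))."""
    Q = np.zeros(P.shape[:2])
    for _ in range(iters):
        Q = np.einsum('ijk,ijk->ij', P, r + gamma * Q.max(axis=1)[None, None, :])
    return Q

Q_M      = q_star(P, r, gamma)       # optimal Q-function of the original MDP M
Q_Mprime = q_star(P, r + F, gamma)   # optimal Q-function of the shaped MDP M'

# Conclusion of the sufficiency theorem: Q*_{M'} = Q*_M - Phi, hence identical greedy policies.
assert np.allclose(Q_Mprime, Q_M - phi[:, None], atol=1e-6)
assert (Q_M.argmax(axis=1) == Q_Mprime.argmax(axis=1)).all()
print("Q*_{M'} = Q*_M - Phi holds; the optimal (greedy) policies of M and M' coincide.")

If F were not potential-based (for example, an arbitrary bonus attached to each transition), these assertions would in general fail, which is what the necessity result listed in the outline addresses.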

