Reinforcement Learning
CS472/CS473 – Fall 2005

Reinforcement Learning
• Supervised Learning:
  – Training examples: (x, y)
  – Direct feedback y for each input x
• Reinforcement Learning:
  – Sequence of decisions with eventual feedback
  – No teacher that critiques individual actions
  – Learn to act and to assign blame/credit to individual actions
  – Examples:
    • Playing a game: only after many actions is the final result known (win, loss, or draw)
    • Robot fetching bagels from the bakery
    • Navigating the Web to collect all CS pages
    • Control problems (e.g. reactor control)

Reinforcement Learning
• Issues:
  – Agent knows the full environment a priori vs. unknown environment
  – Agent can be passive (watch) or active (explore)
  – Feedback (i.e. rewards) in terminal states only, or a bit of feedback in any state
  – How to measure and estimate the utility of each action
  – Environment fully observable or partially observable
  – Have a model of the environment and the effects of actions … or not
  → Reinforcement Learning will address these issues!

Markov Decision Process
• Representation of environment:
  – finite set of states S
  – set of actions A for each state s ∈ S
• Process:
  – At each discrete time step, the agent
    • observes state s_t ∈ S and then
    • chooses action a_t ∈ A.
  – After that, the environment
    • gives the agent an immediate reward r_t
    • changes state to s_{t+1} (can be probabilistic)

Markov Decision Process
• Model:
  – Initial state: s_0
  – Transition function: T(s, a, s')
    → T(s, a, s') is the probability of moving from state s to s' when executing action a.
  – Reward function: R(s)
    → real-valued reward that the agent receives for entering state s.
• Assumptions:
  – Markov property: T(s, a, s') and R(s) depend only on the current state s, not on any states visited earlier.
  – Extension: the function R may be non-deterministic as well.

Example
[Figure: 4×3 grid world with START at (1,1) and terminal states +1 at (4,3) and −1 at (4,2). Each other state has a reward of −0.04.]
• move in the desired direction with prob. 80%
• move 90 degrees to the left with prob. 10%
• move 90 degrees to the right with prob. 10%

Policy
• Definition:
  – A policy π describes which action an agent selects in each state
  – a = π(s)
• Utility:
  – For now: U([s_0, …, s_N]) = Σ_i R(s_i)
  – Let P([s_0, …, s_N] | π, s_0) be the probability of state sequence [s_0, …, s_N] when following policy π from state s_0
  – Expected utility: U^π(s) = Σ U([s_0, …, s_N]) P([s_0, …, s_N] | π, s_0 = s)  (sum over all state sequences starting at s)
    → measure of the quality of policy π
  – Optimal policy π*: policy with maximal U^π(s) in each state s
[Figure: the 4×3 grid world with the optimal policy marked in each state.]

Optimal Policies for Other Rewards
[Figure: optimal policies for the 4×3 grid world under different reward values R(s).]

Utility (revisited)
• Problem: what happens to the utility value when
  – the state space has no terminal states, or
  – the policy never directs the agent to a terminal state?
  → the utility becomes infinite
• Solution:
  – Discount factor 0 < γ < 1
    → closer rewards count more than rewards far in the future
  – U([s_0, …, s_N]) = Σ_i γ^i R(s_i)
    → finite utility even for infinite state sequences

How to Compute the Utility for a Given Policy?
• Definition: U^π(s) = Σ [ Σ_i γ^i R(s_i) ] P([s_0, s_1, …] | π, s_0 = s)
• Recursive computation:
  – U^π(s) = R(s) + γ Σ_s' T(s, π(s), s') U^π(s')
[Figure: the 4×3 grid world with γ = 1.0 and R(s) = −0.04, annotated with the resulting utilities U^π(s) for the nine non-terminal states: 0.388, 0.611, 0.655, 0.660, 0.705, 0.762, 0.812, 0.868, 0.918.]

Bellman Update (for fixed π)
• Goal: solve the set of n = |S| equations (one for each state):
  U^π(s_1) = R(s_1) + γ Σ_s' T(s_1, π(s_1), s') U^π(s')
  …
  U^π(s_n) = R(s_n) + γ Σ_s' T(s_n, π(s_n), s') U^π(s')
• Algorithm [Policy Evaluation]:
  – i = 0; U^π_0(s) = 0 for all s
  – repeat
    • i = i + 1
    • for each state s in S do
      – U^π_i(s) = R(s) + γ Σ_s' T(s, π(s), s') U^π_{i−1}(s')
    • endfor
  – until the difference between U^π_i and U^π_{i−1} is small enough
  – return U^π_i
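The Policy Evaluation loop above translates almost line for line into code. Below is a minimal sketch in Python; the function name policy_evaluation, the dictionary layout of T and R, and the tiny two-state example at the end are illustrative choices, not part of the slides.

    def policy_evaluation(states, T, R, policy, gamma=0.9, eps=1e-6):
        """Iterative Bellman update for a fixed policy pi (Policy Evaluation).

        T[(s, a)] maps each successor state s' to T(s, a, s'),
        R[s] is the reward for entering state s,
        policy[s] is the action pi(s) selected in state s.
        """
        U = {s: 0.0 for s in states}          # U^pi_0(s) = 0 for all s
        while True:
            U_new = {}
            for s in states:
                a = policy[s]
                # U^pi_i(s) = R(s) + gamma * sum_s' T(s, pi(s), s') * U^pi_{i-1}(s')
                U_new[s] = R[s] + gamma * sum(p * U[s2] for s2, p in T[(s, a)].items())
            # stop when successive utility estimates differ by less than eps
            if max(abs(U_new[s] - U[s]) for s in states) < eps:
                return U_new
            U = U_new

    # Illustrative two-state MDP (not from the slides): action "go" usually
    # moves from s0 to s1; s1 loops back to itself and pays reward 1 each step.
    states = ["s0", "s1"]
    T = {("s0", "go"): {"s0": 0.2, "s1": 0.8},
         ("s1", "go"): {"s1": 1.0}}
    R = {"s0": -0.04, "s1": 1.0}
    policy = {"s0": "go", "s1": "go"}
    print(policy_evaluation(states, T, R, policy))   # roughly U(s1) = 10, U(s0) = 8.7

With γ < 1 each sweep is a contraction, so the estimates converge regardless of how U is initialized.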
How to Find the Optimal Policy π*?
• Is policy π optimal? How can we tell?
  – If π is not optimal, then there exists some state s where
    π(s) ≠ argmax_a Σ_s' T(s, a, s') U^π(s')
  – How do we find the optimal policy π*?

How to Find the Optimal Policy π*?
• Algorithm [Policy Iteration]:
  – repeat
    • U^π = PolicyEvaluation(π, S, T, R)
    • for each state s in S do
      – if [ max_a Σ_s' T(s, a, s') U^π(s') > Σ_s' T(s, π(s), s') U^π(s') ] then
        » π(s) = argmax_a Σ_s' T(s, a, s') U^π(s')
    • endfor
  – until π does not change any more
  – return π

Utility ⇔ Policy
• Equivalence:
  – If we know the optimal utility U(s) of each state, we can derive the optimal policy:
    π*(s) = argmax_a Σ_s' T(s, a, s') U(s')
  – If we know the optimal policy π*, we can compute the optimal utility of each state:
    Policy Evaluation algorithm
• Bellman equation:
  U(s) = R(s) + γ max_a Σ_s' T(s, a, s') U(s')
  ⇒ necessary and sufficient condition for the optimal U(s).

Value Iteration Algorithm
• Algorithm [Value Iteration]:
  – i = 0; U_0(s) = 0 for all s
  – repeat
    • i = i + 1
    • for each state s in S do
      – U_i(s) = R(s) + γ max_a Σ_s' T(s, a, s') U_{i−1}(s')
    • endfor
  – until the difference between U_i and U_{i−1} is small enough
  – return U_i
  → derive the optimal policy via π*(s) = argmax_a Σ_s' T(s, a, s') U(s')

Convergence of Value Iteration
• Value iteration is guaranteed to converge to the optimal U for 0 ≤ γ < 1
• Faster convergence for smaller γ
[Figure: utility estimates vs. number of iterations (0 to 30) for the grid-world states (4,3), (3,3), (2,3), (1,1), (3,1), (4,1), and (4,2).]

Reinforcement Learning
• Assumptions made so far:
  – known state space S
  – known transition model T(s, a, s')
  – known reward function R(s)
  ⇒ not realistic for many real agents
• Reinforcement Learning:
  – learn the optimal policy with an a priori unknown environment
  – assume a fully observable environment (i.e. the agent can tell its state)
  – the agent needs to explore the environment (i.e. experimentation)

Passive Reinforcement Learning
• Task: given a policy π, what is the utility function U^π?
  – similar to Policy Evaluation, but T(s, a, s') and R(s) are unknown
• Approach: the agent experiments in the environment
  – Trials: execute the policy from the start state until a terminal state is reached.
    The reward received in each state is written next to it:
    (1,1)−0.04 → (1,2)−0.04 → (1,3)−0.04 → (1,2)−0.04 → (1,3)−0.04 → (2,3)−0.04 → (3,3)−0.04 → (4,3)+1.0
    (1,1)−0.04 → (1,2)−0.04 → (1,3)−0.04 → (2,3)−0.04 → (3,3)−0.04 → (3,2)−0.04 → (3,3)−0.04 → (4,3)+1.0
    (1,1)−0.04 → (2,1)−0.04 → (3,1)−0.04 → (3,2)−0.04 → (4,2)−1.0

Direct Utility Estimation
• Data: trials of the form shown above, e.g.
  (1,1)−0.04 → (1,2)−0.04 → (1,3)−0.04 → (2,3)−0.04 → (3,3)−0.04 → (3,2)−0.04 → (3,3)−0.04 → (4,3)+1.0
• Idea:
  – Average the reward-to-go observed for each state over all trials in which it is visited; this sample average is a direct estimate of U^π(s).
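To make that averaging idea concrete, here is a minimal sketch in Python that treats the reward-to-go observed after each visit to a state as one sample of U^π(s) and averages the samples, using the three trials listed above. The function name direct_utility_estimate and the (state, reward) list encoding are illustrative assumptions, not from the slides.

    from collections import defaultdict

    def direct_utility_estimate(trials, gamma=1.0):
        """Average the observed reward-to-go of every state over all trials.

        Each trial is a list of (state, reward) pairs in the order visited.
        Returns a dict mapping state -> estimated U^pi(state).
        """
        totals = defaultdict(float)   # sum of observed rewards-to-go per state
        counts = defaultdict(int)     # number of samples per state
        for trial in trials:
            reward_to_go = 0.0
            # walk the trial backwards: reward-to-go(s_i) = r_i + gamma * reward-to-go(s_{i+1})
            for state, reward in reversed(trial):
                reward_to_go = reward + gamma * reward_to_go
                totals[state] += reward_to_go
                counts[state] += 1
        return {s: totals[s] / counts[s] for s in totals}

    # The three trials from the slide, encoded as (state, reward received in that state).
    trials = [
        [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04), ((1, 2), -0.04),
         ((1, 3), -0.04), ((2, 3), -0.04), ((3, 3), -0.04), ((4, 3), 1.0)],
        [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04), ((2, 3), -0.04),
         ((3, 3), -0.04), ((3, 2), -0.04), ((3, 3), -0.04), ((4, 3), 1.0)],
        [((1, 1), -0.04), ((2, 1), -0.04), ((3, 1), -0.04), ((3, 2), -0.04),
         ((4, 2), -1.0)],
    ]
    # For (1,1) the three observed rewards-to-go are 0.72, 0.72, and -1.16,
    # so the estimate is their average (about 0.09).
    print(direct_utility_estimate(trials))

Because each state's samples are averaged independently, this estimator ignores the Bellman constraints that link neighboring states, which is why it typically needs many more trials than the model-based methods above.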

