Policy Search

[Figure: the 4×3 grid world, R(s) = -0.04 in each non-terminal state, with the 0.8/0.1/0.1 motion-model diagram]

• Stochastic motion model:
  – move in the desired direction with probability 0.8
  – move 90 degrees to the left with probability 0.1
  – move 90 degrees to the right with probability 0.1
• Quiz: What is the probability that [up, up, right, right, right] ends in (4,3)?
  A = 1   B = 0.32768   C = 0.32776   D = 0.5
  (a brute-force check of this quiz appears at the end of this section)
• We still assume the problem is fully observable – the agent knows where it is and what the rewards are.

A Policy

[Figure: a policy shown on the 4×3 grid world, one action per cell]

Other Representations?

• How can a policy be represented?
  – An arrow for every grid cell
  – What about a continuous world?
    • Θ(x, y)
    • a neural network (NN)
    • a C++ function

[Figure: the grid world with R(s) = -0.04 in every non-terminal state]

• R(s) is the short-term reward for being in state s.
• Uπ(s) is the long-term reward obtained by following policy π from state s.

Optimal Policies for Other Rewards

[Figure: optimal policies for different settings of R(s)]

Markovian Model

• Model:
  – Initial state: s0
  – Transition function: T(s, a, s’) – the probability of moving from state s to s’ when executing action a
  – Reward function: R(s) – a real-valued reward the agent receives for entering state s
• Assumptions:
  – Markov property: T(s, a, s’) and R(s) depend only on the current state s, not on any states visited earlier.
  – Extension: the function R may be non-deterministic as well.

Markov Decision Process

• Representation of the environment:
  – a finite set of states S
  – a set of actions A for each state s in S
• Process:
  – At each discrete time step, the agent
    • observes state st in S and then
    • chooses action at in A.
  – After that, the environment
    • gives the agent an immediate reward rt and
    • changes state to st+1 (can be probabilistic).

Utilities

• Rating a state sequence [s0, s1, s2, …]
  – Additive rewards: U([s0, s1, s2, …]) = R(s0) + R(s1) + R(s2) + … = Σi R(si)
  – We want preferences to be stationary: if [s0, s1, s2, …] is better than [s0, s’1, s’2, …], then [s1, s2, …] is better than [s’1, s’2, …].
• Reward vs. utility

Discounted Utility

• Problem: what happens to the utility value when
  – the state space has no terminal states, or
  – the policy never directs the agent to a terminal state?
  → The utility becomes infinite.
• Solution: use discounted utility.
  – Closer rewards count more than rewards far in the future.
  – The utility stays finite even for infinite state sequences:
    U([s0, …, sN]) = Σi γ^i R(si) ≤ Σi γ^i Rmax = Rmax / (1 - γ)

Policy

• Definition:
  – A policy π describes which action an agent should select in each state: a = π(s).
• Utility of a policy (for now, additive utility):
  – Let P([s0, …, sN] | π, s0) be the probability of state sequence [s0, …, sN] when following policy π from state s0.
  – Expected utility: Uπ(s) = Σ U([s0, …, sN]) P([s0, …, sN] | π, s0) – a measure of the quality of policy π.
  – Optimal policy π*: the policy with maximal Uπ(s) in each state s.

Utility ↔ Policy

• Equivalence:
  – If we know the utility U(s) of each state, we can derive the corresponding policy:
    π*(s) = argmaxa Σs’ T(s, a, s’) U(s’)
  – If we know the policy π, we can compute the corresponding utility of each state: the Policy Evaluation algorithm.
• Bellman equation: U(s) = R(s) + γ maxa Σs’ T(s, a, s’) U(s’)
• Optimal utility ↔ optimal policy

How to Compute the Utility for a Given Policy?

• Definition:
  – Uπ(s) = Σ [ Σi γ^i R(si) ] P([s0, s1, …] | π, s0 = s)
    (the utility of each path, weighted by the likelihood of that path)
• Recursive computation:
  – Uπ(s) = R(s) + γ Σs’ T(s, π(s), s’) Uπ(s’)
• What is P([s0, s1, …] | π)?
  – P([s0, s1, …] | π) = T(s0, π(s0), s1) * T(s1, π(s1), s2) * …

[Figure: the grid world with all utility estimates initialized to 0; R(s) = -0.04 everywhere except the goals]

• Example: Uπ((3,3)) = -0.04 + 0.8 * 1 + 0.1 * 0 + 0.1 * 0 = 0.76
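To make a single step of this recursion concrete, here is a minimal Python sketch. It is not from the slides: the reading of the three terms (0.8 to the +1 goal, 0.1 bumping into the wall above and staying put, 0.1 slipping to a 0-valued neighbor) is an assumption based on the standard 4×3 grid. It reproduces the 0.76 above and can be reused to check the two quizzes that follow.

```python
def backup(reward, gamma, outcomes):
    """One Bellman expectation backup:
    U(s) = R(s) + gamma * sum of prob * U(successor) over outcomes."""
    return reward + gamma * sum(p * u for p, u in outcomes)

# Uπ((3,3)) with gamma = 1.0, R(s) = -0.04, and all utility estimates 0
# except the +1 goal: 0.8 to the goal, 0.1 bump (stay, value 0), 0.1 slip.
print(backup(-0.04, 1.0, [(0.8, 1.0), (0.1, 0.0), (0.1, 0.0)]))  # -> 0.76
```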
[Figure: the grid with Uπ((3,3)) = 0.76 filled in and cell (3,2) marked “?”]

• Quiz: Uπ((3,2)) = ?
  A = -0.04   B = 0.312   C = 0.428   D = 0.468

[Figure: the grid with Uπ((3,2)) = 0.468 filled in and cell (3,3) marked “?”]

• Quiz: using the updated neighbor values, Uπ((3,3)) = ?
  A = -0.04   B = 0.760   C = 0.883   D = 0.885

• Update rule: Uπ(s) = R(s) + γ Σs’ T(s, π(s), s’) Uπ(s’); here γ = 1.0 and R(s) = -0.04.

Convergence

• Repeatedly applying this update drives the estimates toward Uπ.

Bellman Update

• Goal: solve a set of n = |S| equations, one for each state:
  Uπ(s0) = R(s0) + γ Σs’ T(s0, π(s0), s’) Uπ(s’)
  …
  Uπ(sn) = R(sn) + γ Σs’ T(sn, π(sn), s’) Uπ(s’)
• Is this a set of linear equations? Why or why not?
• Algorithm [Policy Evaluation] (fix π; a Python sketch appears at the end of this section):
  – i = 0; Uπ0(s) = 0 for all s
  – repeat
    • i = i + 1
    • for each state s in S do
      – Uπi(s) = R(s) + γ Σs’ T(s, π(s), s’) Uπi-1(s’)
    • endfor
  – until the difference between Uπi and Uπi-1 is small enough
  – return Uπi

How to Find the Optimal Policy π*?

• Is policy π optimal? How can we tell?
  – If π is not optimal, then there exists some state where π(s) ≠ argmaxa Σs’ T(s, a, s’) Uπ(s’).
  – How do we find the optimal policy π*?
• Algorithm [Policy Iteration] (see the sketch at the end of this section):
  – repeat
    • Uπ = PolicyEvaluation(π, S, T, R)
    • for each state s in S do
      – if maxa Σs’ T(s, a, s’) Uπ(s’) > Σs’ T(s, π(s), s’) Uπ(s’) then
        » π(s) = argmaxa Σs’ T(s, a, s’) Uπ(s’)
    • endfor
  – until π does not change any more
  – return π

Value Iteration

• Optimal utility ↔ optimal policy: find the optimal utility directly.

[Figure: a 3×3 grid of current utility estimates – row 1: 1, -2, 5; row 2: 7, ?, 6; row 3: 3, 6, 5]

• Quiz: what is the best utility possible for the center cell? (options A–D)
• One backup per action (γ = 1.0, R = -0.04, 80/10/10 motion model):
  – toward the -2 above: -0.04 + 0.8*(-2) + 0.1*7 + 0.1*6 = -0.34
  – toward the 7 on the left: -0.04 + 0.8*7 + 0.1*(-2) + 0.1*6 = 5.96
  – toward the 6 below: -0.04 + 0.8*6 + 0.1*7 + 0.1*6 = 6.06
  – toward the 6 on the right: -0.04 + 0.8*6 + 0.1*6 + 0.1*(-2) = 5.16
  → the maximum over actions, 6.06, becomes the new estimate for the center cell.

Value Iteration Algorithm

• Algorithm [Value Iteration] (see the sketch at the end of this section):
  – i = 0; U0(s) = 0 for all s
  – repeat
    • i = i + 1
    • for each state s in S do
      – Ui(s) = R(s) + γ maxa Σs’ T(s, a, s’) Ui-1(s’)
    • endfor
  – until the difference between Ui and Ui-1 is small enough
  – return Ui
• Derive the optimal policy via π*(s) = argmaxa Σs’ T(s, a, s’) U(s’).

Two Approaches to Finding π*

• Policy Iteration
  – Local search: start with a random policy.
  – Evaluate the candidate policy using the Bellman update
    Uπ(s0) = R(s0) + γ Σs’ T(s0, π(s0), s’) Uπ(s’)
    – a linear set of equations.
  – Confirm the policy is optimal, or change it where a better action exists.
• Value Iteration
  – Find the optimal utilities directly by iterating
    Ui(s) = R(s) + γ maxa Σs’ T(s, a, s’) Ui-1(s’)
  – then derive the optimal policy from them.
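The sketches below are minimal Python renderings of the three algorithms above. They are illustrations under an assumed MDP encoding (T[(s, a)] as a list of (probability, successor) pairs, R as a reward dict, pi as a policy dict, actions[s] as the actions available in s), not code from the slides. First, Policy Evaluation:

```python
def policy_evaluation(states, T, R, pi, gamma=1.0, eps=1e-6,
                      terminals=frozenset()):
    """Iterative Policy Evaluation, as on the slide:
    U_i(s) = R(s) + gamma * sum_s' T(s, pi[s], s') * U_{i-1}(s')."""
    U = {s: 0.0 for s in states}                 # U_0(s) = 0 for all s
    while True:
        U_new = {}
        for s in states:
            if s in terminals:                   # terminal: utility is its reward
                U_new[s] = R[s]
                continue
            U_new[s] = R[s] + gamma * sum(p * U[s2] for p, s2 in T[(s, pi[s])])
        diff = max(abs(U_new[s] - U[s]) for s in states)
        U = U_new
        if diff < eps:                           # difference small enough
            return U
```

This also suggests the answer to the slide's question: with π fixed there is no max, so the |S| equations are linear in the unknowns Uπ(s) and could equally be solved exactly with a linear solver instead of by iteration. It is the max in the Bellman optimality equation that makes the optimal-utility system nonlinear.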
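Next, Policy Iteration, reusing policy_evaluation from the previous sketch under the same assumed encoding:

```python
import random

def policy_iteration(states, actions, T, R, gamma=1.0, terminals=frozenset()):
    """Policy Iteration: evaluate the current policy, then switch each state
    to a strictly better action if one exists; stop when nothing changes."""
    # start with a random policy
    pi = {s: random.choice(actions[s]) for s in states if s not in terminals}
    while True:
        U = policy_evaluation(states, T, R, pi, gamma, terminals=terminals)
        changed = False
        for s in states:
            if s in terminals:
                continue
            def q(a):
                # expected utility of taking action a in state s
                return sum(p * U[s2] for p, s2 in T[(s, a)])
            best = max(actions[s], key=q)
            if q(best) > q(pi[s]):               # strictly better action found
                pi[s] = best
                changed = True
        if not changed:
            return pi
```

Because a state's action only changes when it is strictly better, and a finite MDP has finitely many policies, the loop terminates (given exact, or sufficiently accurate, evaluation).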
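Value Iteration under the same assumed encoding; the final comprehension derives π* from the converged utilities exactly as the slide prescribes. Note the stopping test is the simple difference threshold from the slide; with γ = 1 it relies on terminal states for convergence.

```python
def value_iteration(states, actions, T, R, gamma=1.0, eps=1e-6,
                    terminals=frozenset()):
    """Value Iteration:
    U_i(s) = R(s) + gamma * max_a sum_s' T(s, a, s') * U_{i-1}(s')."""
    U = {s: 0.0 for s in states}                 # U_0(s) = 0 for all s
    while True:
        U_new = {}
        for s in states:
            if s in terminals:                   # terminal: no action to take
                U_new[s] = R[s]
                continue
            U_new[s] = R[s] + gamma * max(
                sum(p * U[s2] for p, s2 in T[(s, a)]) for a in actions[s])
        if max(abs(U_new[s] - U[s]) for s in states) < eps:
            # derive pi*(s) = argmax_a sum_s' T(s, a, s') * U(s')
            pi = {s: max(actions[s],
                         key=lambda a: sum(p * U_new[s2]
                                           for p, s2 in T[(s, a)]))
                  for s in states if s not in terminals}
            return U_new, pi
        U = U_new
```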
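Finally, the brute-force check promised for the opening quiz: propagate the state distribution for [up, up, right, right, right] under the 80/10/10 motion model. The grid details (wall at (2,2), absorbing terminals at (4,3) and (4,2), start at (1,1), bumping leaves the agent in place) are assumptions matching the standard 4×3 example the slides appear to use.

```python
WALL = (2, 2)
TERMINALS = {(4, 3), (4, 2)}

# perpendicular "slips": left and right of each heading
LEFT  = {'up': 'left', 'left': 'down', 'down': 'right', 'right': 'up'}
RIGHT = {v: k for k, v in LEFT.items()}
DELTA = {'up': (0, 1), 'down': (0, -1), 'left': (-1, 0), 'right': (1, 0)}

def move(state, direction):
    """Deterministic outcome of trying to move one cell; bumping stays put."""
    x, y = state
    dx, dy = DELTA[direction]
    nxt = (x + dx, y + dy)
    if nxt == WALL or not (1 <= nxt[0] <= 4 and 1 <= nxt[1] <= 3):
        return state
    return nxt

dist = {(1, 1): 1.0}                      # start with all mass at (1,1)
for action in ['up', 'up', 'right', 'right', 'right']:
    new_dist = {}
    for state, p in dist.items():
        if state in TERMINALS:            # terminal states are absorbing
            new_dist[state] = new_dist.get(state, 0.0) + p
            continue
        for d, q in [(action, 0.8), (LEFT[action], 0.1), (RIGHT[action], 0.1)]:
            nxt = move(state, d)
            new_dist[nxt] = new_dist.get(nxt, 0.0) + p * q
    dist = new_dist

print(dist.get((4, 3), 0.0))              # -> 0.32776 (answer C)
```

Only two realized paths contribute: the intended one (0.8^5 = 0.32768) and the all-slip path right, right, up, up, right (0.1^4 * 0.8 = 0.00008); every other permutation either bumps into the wall at (2,2) or falls into the -1 terminal at (4,2).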