CORNELL CS 4700 - Policy Search


Policy Search

[4x3 grid-world figure: R(s) = -0.04 for every non-terminal state; stochastic motion model 0.8 / 0.1 / 0.1]
• move in the desired direction with probability 80%
• move 90 degrees to the left with probability 10%
• move 90 degrees to the right with probability 10%

What is the probability that [up, up, right, right, right] ends in (4,3)?
A = 1   B = 0.32768   C = 0.32776   D = 0.5

We still assume the problem is fully observable – the agent knows where it is and what the rewards are.

A Policy

Other representations?
• How can a policy be represented?
  – An arrow for every grid cell
  – What about a continuous world?
    • Θ(x,y)
    • a neural network
    • a C++ function

[4x3 grid-world figure with R(s) = -0.04 in every non-terminal state]
• R(s) is the short-term reward in state s
• Uπ(s) is the long-term reward when following policy π from state s

Optimal Policies for Other Rewards

Markovian Model
• Model:
  – Initial state: S0
  – Transition function: T(s, a, s') – the probability of moving from state s to s' when executing action a
  – Reward function: R(s) – real-valued reward that the agent receives for entering state s
• Assumptions:
  – Markov property: T(s, a, s') and R(s) depend only on the current state s, not on any states visited earlier
  – Extension: the function R may be non-deterministic as well

Markov Decision Process
• Representation of the environment:
  – finite set of states S
  – set of actions A for each state s in S
• Process:
  – At each discrete time step, the agent
    • observes state s_t in S and then
    • chooses action a_t in A.
  – After that, the environment
    • gives the agent an immediate reward r_t
    • changes state to s_{t+1} (can be probabilistic)

Utilities
• Rating a state sequence [s0, s1, s2, …]
  – U([s0, …, sN]) = Σ_i R(s_i)
  – Additive rewards:
    • U_h([s0, s1, s2, …]) = R(s0) + R(s1) + R(s2) + …
  – We want preferences to be stationary:
    • [s0, s1, s2, …] better than [s0, s'1, s'2, …] implies [s1, s2, …] better than [s'1, s'2, …]
• Reward vs. Utility

Discounted Utility
• Problem:
  – What happens to the utility value when
    • either the state space has no terminal states
    • or the policy never directs the agent to a terminal state?
  ⇒ utility becomes infinite
• Solution:
  – Use discounted utility
  ⇒ closer rewards count more than rewards far in the future
  ⇒ finite utility even for infinite state sequences
  U([s0, s1, …]) = Σ_i γ^i R(s_i) ≤ Σ_i γ^i R_max = R_max / (1 – γ)

Policy
• Definition:
  – A policy π describes which action an agent should select in each state: a = π(s)
• Utility of a policy:
  – For now, additive utility
  – Let P([s0, …, sN] | π, s0) be the probability of state sequence [s0, …, sN] when following policy π from state s0
  – Expected utility: Uπ(s) = Σ U([s0, …, sN]) P([s0, …, sN] | π, s0), summed over state sequences
  ⇒ a measure of the quality of policy π
  – Optimal policy π*: the policy with maximal Uπ(s) in each state s

Utility ↔ Policy
• Equivalence:
  – If we know the utility U(s) of each state, we can derive the corresponding policy:
    π*(s) = argmax_a Σ_s' T(s, a, s') U(s')
  – If we know the policy π, we can compute the corresponding utility of each state: the PolicyEvaluation algorithm
• Bellman equation: U(s) = R(s) + γ max_a Σ_s' T(s, a, s') U(s')
• Optimal utility ⇒ optimal policy

How to Compute the Utility for a Given Policy?
• Definition:
  – Uπ(s) = Σ [ Σ_i γ^i R(s_i) ] P([s0, s1, …] | π, s0 = s)
    (utility of a path, weighted by the likelihood of that path)
• Recursive computation:
  – Uπ(s) = R(s) + γ Σ_s' T(s, π(s), s') Uπ(s')
  – What is P([s0, s1, …] | π)?
  – P([s0, s1, …] | π) = T(s0, π(s0), s1) * T(s1, π(s1), s2) * …

Worked example (4x3 grid, R(s) = -0.04 everywhere except the goals, γ = 1.0, all utilities initialized to 0):
• Uπ((3,3)) ← -0.04 + 0.8*1 + 0.1*0 + 0.1*0 = 0.76
• Uπ((3,2)) = ?
  A = -0.04   B = 0.312   C = 0.428   D = 0.468
• With Uπ((3,2)) = 0.468 in the grid: Uπ((3,3)) = ?
  A = -0.04   B = 0.760   C = 0.883   D = 0.885
• Uπ(s) = R(s) + γ Σ_s' T(s, π(s), s') Uπ(s')   Here: γ = 1.0, R(s) = -0.04
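The recursive backup above is easy to sanity-check in code. Below is a minimal Python sketch of the one-step backup Uπ(s) = R(s) + γ Σ_s' T(s, π(s), s') Uπ(s') under the 0.8/0.1/0.1 motion model; the function name and the way neighbor utilities are passed in are illustrative assumptions, not part of the slides.

def policy_backup(reward, gamma, u_intended, u_left, u_right):
    """One backup for a fixed policy action under the 0.8/0.1/0.1 motion model.

    u_intended: utility of the state reached by the intended move (prob 0.8)
    u_left:     utility of the state reached by slipping 90 degrees left (prob 0.1)
    u_right:    utility of the state reached by slipping 90 degrees right (prob 0.1)
    """
    return reward + gamma * (0.8 * u_intended + 0.1 * u_left + 0.1 * u_right)

# Worked example from the slides: state (3,3) with the +1 goal as the intended
# successor, gamma = 1.0, R(s) = -0.04, and all other utilities still at 0.
print(policy_backup(-0.04, 1.0, u_intended=1.0, u_left=0.0, u_right=0.0))  # ≈ 0.76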
Convergence

Bellman Update
• Goal: solve a set of n = |S| equations (one for each state):
  Uπ(s_0) = R(s_0) + γ Σ_s' T(s_0, π(s_0), s') Uπ(s')
  …
  Uπ(s_n) = R(s_n) + γ Σ_s' T(s_n, π(s_n), s') Uπ(s')
• Is this a set of linear equations? Why or why not?
• Algorithm [Policy Evaluation] (fix π):
  – i = 0; Uπ_0(s) = 0 for all s
  – repeat
    • i = i + 1
    • for each state s in S do
      – Uπ_i(s) = R(s) + γ Σ_s' T(s, π(s), s') Uπ_{i-1}(s')
    • endfor
  – until the difference between Uπ_i and Uπ_{i-1} is small enough
  – return Uπ_i

How to Find the Optimal Policy π*?
• Is policy π optimal? How can we tell?
  – If π is not optimal, then there exists some state where π(s) ≠ argmax_a Σ_s' T(s, a, s') Uπ(s')
  – How do we find the optimal policy π*?
• Algorithm [Policy Iteration]:
  – repeat
    • Uπ = PolicyEvaluation(π, S, T, R)
    • for each state s in S do
      – if max_a Σ_s' T(s, a, s') Uπ(s') > Σ_s' T(s, π(s), s') Uπ(s') then
        » π(s) = argmax_a Σ_s' T(s, a, s') Uπ(s')
    • endfor
  – until π does not change any more
  – return π

Value Iteration
• Optimal utility ⇒ optimal policy
• Find the optimal utility directly
• Example: a 3x3 neighborhood with known neighbor utilities
    1  -2   5
    7   ?   6
    3   6   5
  What is the best utility possible for the center cell?  A  B  C  D
  One-step backup for each of the four actions (R = -0.04, γ = 1):
  – up:    -0.04 + 0.8*(-2) + 0.1*7 + 0.1*6 = -0.34
  – left:  -0.04 + 0.8*7 + 0.1*(-2) + 0.1*6 = 5.96
  – down:  -0.04 + 0.8*6 + 0.1*7 + 0.1*6 = 6.06
  – right: -0.04 + 0.8*6 + 0.1*6 + 0.1*(-2) = 5.16

Value Iteration Algorithm
• Algorithm [Value Iteration]:
  – i = 0; U_0(s) = 0 for all s
  – repeat
    • i = i + 1
    • for each state s in S do
      – U_i(s) = R(s) + γ max_a Σ_s' T(s, a, s') U_{i-1}(s')
    • endfor
  – until the difference between U_i and U_{i-1} is small enough
  – return U_i
  ⇒ derive the optimal policy via π*(s) = argmax_a Σ_s' T(s, a, s') U(s')

Two approaches to finding π*
• Policy Iteration
  – Local search: start with a random policy
  – Evaluate the candidate policy using the Bellman update:
    Uπ(s_0) = R(s_0) + γ Σ_s' T(s_0, π(s_0), s') Uπ(s')
  – A linear set of equations
  – Confirm the policy is optimal or make changes
• Value Iteration
  – Find the optimal utilities by iteratively solving
    U_i(s) = R(s) + γ max_a Σ_s' T(s, a, s') U_{i-1}(s')
  – Derive the optimal policy to follow
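The Policy Evaluation and Policy Iteration pseudocode can be written out as a short program. The sketch below is a minimal Python version under an assumed tabular interface (a list `states`, a callable `actions(s)` listing the actions available in s, a transition function `T(s, a)` returning (next_state, probability) pairs, a reward table `R[s]`, and a policy table `policy[s]`); none of these names come from the slides, and terminal-state handling is omitted.

def policy_evaluation(states, T, R, policy, gamma=1.0, eps=1e-6):
    """Iterate Uπ_i(s) = R(s) + γ Σ_s' T(s, π(s), s') Uπ_{i-1}(s') until stable."""
    U = {s: 0.0 for s in states}
    while True:
        U_next = {
            s: R[s] + gamma * sum(p * U[s2] for s2, p in T(s, policy[s]))
            for s in states
        }
        if max(abs(U_next[s] - U[s]) for s in states) < eps:
            return U_next
        U = U_next

def policy_iteration(states, actions, T, R, policy, gamma=1.0):
    """Alternate evaluation and greedy improvement until the policy stops changing."""
    while True:
        U = policy_evaluation(states, T, R, policy, gamma)
        changed = False
        for s in states:
            # Expected utility of taking action a in s under the current utilities.
            def q(a):
                return sum(p * U[s2] for s2, p in T(s, a))
            best = max(actions(s), key=q)
            if q(best) > q(policy[s]):
                policy[s] = best
                changed = True
        if not changed:
            return policy, U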
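Value Iteration can be sketched against the same kind of assumed tabular interface (illustrative names `states`, `actions(s)`, `T(s, a)`, `R[s]`, not from the slides): iterate the update U_i(s) = R(s) + γ max_a Σ_s' T(s, a, s') U_{i-1}(s') and then read off the greedy policy. The quick check at the end only reproduces the slide's one-step arithmetic for the 3x3 center cell.

def value_iteration(states, actions, T, R, gamma=1.0, eps=1e-6):
    """Iterate U_i(s) = R(s) + γ max_a Σ_s' T(s, a, s') U_{i-1}(s') until stable."""
    U = {s: 0.0 for s in states}
    while True:
        U_next = {
            s: R[s] + gamma * max(
                sum(p * U[s2] for s2, p in T(s, a)) for a in actions(s)
            )
            for s in states
        }
        if max(abs(U_next[s] - U[s]) for s in states) < eps:
            return U_next
        U = U_next

def greedy_policy(states, actions, T, U):
    """Derive π*(s) = argmax_a Σ_s' T(s, a, s') U(s') from the utilities."""
    return {
        s: max(actions(s), key=lambda a: sum(p * U[s2] for s2, p in T(s, a)))
        for s in states
    }

# Quick check against the 3x3 example above: the best one-step backup for the
# center cell (R = -0.04, gamma = 1; neighbors: up -2, left 7, right 6, down 6).
backups = [
    -0.04 + 0.8 * (-2) + 0.1 * 7 + 0.1 * 6,   # up
    -0.04 + 0.8 * 7 + 0.1 * (-2) + 0.1 * 6,   # left
    -0.04 + 0.8 * 6 + 0.1 * 7 + 0.1 * 6,      # down
    -0.04 + 0.8 * 6 + 0.1 * 6 + 0.1 * (-2),   # right
]
print(max(backups))  # ≈ 6.06: moving toward the 6 below the center is best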

