Markov Decision Processes (MDPs)
• read Ch 17.1–17.2
• utility-based agents
  – goals encoded in a utility function U(s), or U: S → ℝ
• effects of actions encoded in a state transition function T: S × A → S
  – or T: S × A → pdf(S) for non-deterministic actions
• rewards/costs encoded in a reward function R: S × A → ℝ
• Markov property: the effects of an action depend only on the current state, not on the previous history
• the goal: maximize reward over time
  – long-term discounted reward
  – handles infinite horizons; encourages quicker achievement of reward
• "plans" are encoded in policies
  – mappings from states to actions: π: S → A
• how do we compute the optimal policy π* that maximizes long-term discounted reward?

• value function Vπ(s): expected long-term reward from starting in state s and following policy π
• derive a policy from V(s):
  – π(s) = argmax_{a∈A} E[R(s,a) + γ·V(T(s,a))]
  –       = argmax_{a∈A} Σ_{s'} p(s'|s,a)·(R(s,a) + γ·V(s'))
• the optimal policy comes from the optimal value function:
  – V*(s) = max_{a∈A} Σ_{s'} p(s'|s,a)·(R(s,a) + γ·V*(s'))
  – π*(s) = argmax_{a∈A} Σ_{s'} p(s'|s,a)·(R(s,a) + γ·V*(s'))
• these are Bellman's equations
  – (eqn 17.5)

• method 1: linear programming
  – n coupled equations, one per state
  – v1 = max(v2, v3, v4, ...)
  – v2 = max(v1, v3, v4, ...)
  – v3 = max(v1, v2, v4, ...)
  – solve for {v1, v2, v3, ...} using the GNU Linear Programming Kit (GLPK), etc. (see the LP sketch below)

Calculating V*(s)
• method 2: Value Iteration
  – initialize V(s) = 0 for all states
  – iteratively update the value of each state based on its neighbors
  – ...until the values converge (see the value-iteration sketch below)
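Method 1 above suggests solving the coupled equations with an LP solver such as GLPK. Below is one possible way to set that up in Python, using scipy's linprog in place of GLPK purely for illustration; the dictionary-based MDP representation (P[s][a] as (probability, next-state) pairs, R[s][a] as the reward) and all names are assumptions, not something given in the slides.

import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(states, actions, P, R, gamma=0.9):
    """Minimize sum_s V(s) subject to V(s) >= R(s,a) + gamma * sum_{s'} p(s'|s,a) V(s')
    for every (s, a); the minimizer satisfies the Bellman equations."""
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    A_ub, b_ub = [], []
    for s in states:
        for a in actions:
            row = np.zeros(n)
            row[idx[s]] -= 1.0                # -V(s)
            for p, s2 in P[s][a]:
                row[idx[s2]] += gamma * p     # + gamma * p(s'|s,a) * V(s')
            A_ub.append(row)                  # -V(s) + gamma * sum p V(s') <= -R(s,a)
            b_ub.append(-R[s][a])
    res = linprog(c=np.ones(n), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n)  # V(s) may be negative, so no sign bounds
    return {s: res.x[idx[s]] for s in states}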
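Method 2 (value iteration) maps directly onto a short loop. This is a minimal sketch under the same assumed representation; gamma and the convergence threshold theta are illustrative choices, and the last step extracts the greedy policy π* from V* as in the equations above.

def value_iteration(states, actions, P, R, gamma=0.9, theta=1e-6):
    """P[s][a] is a list of (prob, next_state) pairs; R[s][a] is the reward
    for taking action a in state s."""
    V = {s: 0.0 for s in states}              # initialize V(s) = 0 for all states
    while True:
        delta = 0.0
        for s in states:
            # Bellman backup: V(s) <- max_a sum_{s'} p(s'|s,a) (R(s,a) + gamma V(s'))
            best = max(
                sum(p * (R[s][a] + gamma * V[s2]) for p, s2 in P[s][a])
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:                     # ...until the values converge
            break
    # greedy policy: pi*(s) = argmax_a sum_{s'} p(s'|s,a) (R(s,a) + gamma V*(s'))
    pi = {
        s: max(actions,
               key=lambda a: sum(p * (R[s][a] + gamma * V[s2]) for p, s2 in P[s][a]))
        for s in states
    }
    return V, pi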