Markov Decision Processes
CS 188: Section Handout

Defining a Markov Decision Process (MDP)

• State Space: {S0, S1, S2, ...}
• Actions: {A0, A1, ...}
• Initial State: S0
• Transition Model: T(s, a, s'), the probability of going from s to s' with action a.
• Reward Function: R(s), the reward for being in state s.¹
• Discount Factor: γ, the discount for rewards: a reward r in t steps is worth rγ^t now. 0 < γ ≤ 1.

A solution to an MDP is called a policy, which is a function π(s) that maps from states to actions. For a particular policy π, every state has exactly one chosen action.

¹ Sometimes rewards have different structures, such as R(s, a, s'): the reward for moving from s to s' via action a.

Which of the following are MDPs?

Exercise: For each of the following tasks/games, describe an MDP formulation or state why it is not amenable to the MDP framework.

• Blackjack (21) with no betting
• Rock, Paper, Scissors
• Person trapped in a container ship who can yell for help on sunny days
• Tightrope-walking robot

A Very Simple Example: Golf

• State Space: {Tee, Fairway, Sand, Green}
• Actions: {Conservative, Power shot}
• Initial State: Tee
• Transition Model: T(s, a, s'), the probability of going from s to s' with action a:

  s        a             s'       T(s, a, s')
  Tee      Conservative  Fairway  0.9
  Tee      Conservative  Sand     0.1
  Tee      Power shot    Green    0.5
  Tee      Power shot    Sand     0.5
  Fairway  Conservative  Green    0.8
  Fairway  Conservative  Sand     0.2
  Sand     Conservative  Green    1.0

• Reward Function:

  s        R(s)
  Tee      -1
  Fairway  -1
  Sand     -2
  Green     3

Question: For the Conservative policy, what is the utility of being at the Tee? What about the Power shot policy?

Value iteration: an exact solution to MDPs

The quick and dirty story of value iteration:

• Solves the Bellman equation: U(s) = R(s) + γ max_a Σ_{s'} T(s, a, s') U(s')
• Starts with Û(s) = 0 for all s. Iterates through each state many times, updating Û(s).
• Iteration always converges to the correct answer given infinite time.

Exercise: Compute the value iteration updates for ...
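
The Question and the value iteration exercise above can be checked numerically. Below is a minimal Python sketch (not part of the handout) that encodes the golf MDP, evaluates the Conservative and Power-shot policies, and runs value iteration updates. It assumes γ = 1, treats Green as a terminal state with utility R(Green), and has the Power-shot policy fall back to Conservative outside the Tee, since the handout does not specify these details.

    # A minimal sketch: the golf MDP, policy evaluation, and value iteration.
    # Assumptions (not stated in the handout): gamma = 1.0, Green is terminal,
    # and "Power shot" is only available at the Tee.

    GAMMA = 1.0  # assumed discount factor

    # Transition model T(s, a, s') from the handout's table.
    T = {
        ("Tee", "Conservative"):     {"Fairway": 0.9, "Sand": 0.1},
        ("Tee", "Power shot"):       {"Green": 0.5, "Sand": 0.5},
        ("Fairway", "Conservative"): {"Green": 0.8, "Sand": 0.2},
        ("Sand", "Conservative"):    {"Green": 1.0},
    }

    # Reward function R(s) from the handout's table.
    R = {"Tee": -1, "Fairway": -1, "Sand": -2, "Green": 3}

    STATES = ["Tee", "Fairway", "Sand", "Green"]

    def actions(s):
        """Actions available in state s (Green has none, so it is terminal)."""
        return [a for (state, a) in T if state == s]

    def policy_evaluation(policy, iters=100):
        """Iterate U(s) = R(s) + gamma * sum_s' T(s, pi(s), s') U(s') for a fixed policy."""
        U = {s: 0.0 for s in STATES}
        for _ in range(iters):
            new_U = {}
            for s in STATES:
                if not actions(s):            # terminal state: utility is just R(s)
                    new_U[s] = R[s]
                    continue
                a = policy[s]
                new_U[s] = R[s] + GAMMA * sum(p * U[s2] for s2, p in T[(s, a)].items())
            U = new_U
        return U

    def value_iteration(iters=100):
        """Bellman updates: U(s) = R(s) + gamma * max_a sum_s' T(s, a, s') U(s')."""
        U = {s: 0.0 for s in STATES}
        for _ in range(iters):
            new_U = {}
            for s in STATES:
                acts = actions(s)
                if not acts:
                    new_U[s] = R[s]
                    continue
                new_U[s] = R[s] + GAMma * max(
                    sum(p * U[s2] for s2, p in T[(s, a)].items()) for a in acts
                ) if False else R[s] + GAMMA * max(
                    sum(p * U[s2] for s2, p in T[(s, a)].items()) for a in acts
                )
            U = new_U
        return U

    # Conservative everywhere; the Power-shot policy differs only at the Tee
    # (an assumption, since Power shot is only defined there).
    conservative = {"Tee": "Conservative", "Fairway": "Conservative", "Sand": "Conservative"}
    power_shot   = {"Tee": "Power shot",   "Fairway": "Conservative", "Sand": "Conservative"}

    print("Conservative policy:", policy_evaluation(conservative))
    print("Power shot policy:  ", policy_evaluation(power_shot))
    print("Value iteration:    ", value_iteration())

Because Green is absorbing and every non-terminal state reaches it within a few steps under any policy, the iterative updates settle after a handful of sweeps even with γ = 1; with γ < 1 the same code applies unchanged.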

