CS188 – Introduction to Artificial Intelligence
Section Handout #5, FORMULATING AND SOLVING MDPS
Klein, Fall 2007

Question 1 (Class)

Consider the above MDP, representing a robot on a balance beam. Each grid square is a state, and the available actions are right and left. The agent starts in state s2, and all states have reward 0 aside from the ends of the grid, s1 and s8, and the ground state, which have the rewards shown. Moving left or right results in a move left or right (respectively) with probability p. With probability 1 − p, the robot falls off the beam (transitions to ground and receives a reward of −1). Falling off, or reaching either endpoint, results in the end of the episode (i.e., these are terminal states). Note that terminal states receive no future reward.

a. For what values of p is the optimal action from s2 to move right if the discount γ is 1?

b. For what values of γ is the optimal action from s2 to move right if p = 1?

c. Given initial value estimates of zero, show the results of one, then two, rounds of value iteration.

d. We can develop learning updates that involve two actions instead of one. Write down the utility Uπ(s) of a state s under policy π in terms of the next two states s' and s'', given that

   Uπ(s) = Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ Uπ(s') ]

e. Write a two-step-look-ahead value iteration update that involves U(s) and U(s''), where s'' is the state two time steps later. Why would this update not be used in practice?

f. Write a two-step-look-ahead TD-learning update that involves U(s) and U(s'') for the observed state-action-state-action-state sequence s, a, s', a', s''.
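The rounds of value iteration asked for in part (c) can be checked mechanically. Below is a minimal sketch of value iteration for the beam MDP. Since the endpoint rewards appear only in the figure, which is not reproduced in this text, r_left and r_right are illustrative placeholders rather than the handout's actual values.

```python
# Minimal value iteration for the balance-beam MDP (states s1..s8 plus
# ground).  The figure's endpoint rewards are not reproduced in this
# text, so r_left and r_right below are illustrative placeholders.

def value_iteration(p, gamma, n_iters, r_left=1.0, r_right=10.0, r_fall=-1.0):
    """Return [U(s1), ..., U(s8)] after n_iters rounds, starting from zeros.

    s1, s8, and ground are terminal (no future reward).  From an interior
    state, a move succeeds with probability p; with probability 1 - p the
    robot falls to ground and receives r_fall.
    """
    U = [0.0] * 9                            # U[1..8]; terminals stay at 0
    entry_reward = {1: r_left, 8: r_right}   # reward for entering an endpoint
    for _ in range(n_iters):
        new_U = U[:]
        for s in range(2, 8):                # interior states s2..s7
            new_U[s] = max(
                p * (entry_reward.get(nxt, 0.0) + gamma * U[nxt])
                + (1 - p) * r_fall
                for nxt in (s - 1, s + 1)
            )
        U = new_U
    return U[1:]

# Part (c)-style check: two rounds from all-zero estimates.
print(value_iteration(p=0.9, gamma=1.0, n_iters=2))
```

Varying p and γ here also gives a quick sanity check on parts (a) and (b): the optimal action at s2 flips between left and right as the converged values change.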
g. Given initial q-value estimates of zero, show the result of Q-learning with learning rate α = 0.5 after two episodes: [s2, s3, ground] and [s2, s3, s4, s5, ground], where the agent always moves right. You need only write down the non-zero entries. For the purposes of Q-learning updates, terminal states should be treated as having a single action, die, which leads to future rewards of zero. Hint: q-values of terminal states which have been visited should not be zero.

Question 2 (Class): Golf as an MDP

We formulate golf as an MDP as follows:

State space: {Tee, Fairway, Sand, Green}
Actions: {Conservative shot, Power shot}
Initial state: Tee
Transition model: (note that actions not on this list have probability 0)
[Transition table not reproduced in this extract.]

Rewards (note: R(·,·,s) means that the reward is received for transitioning to state s, regardless of the action taken or the previous state):

  s        R(·,·,s)
  Fairway    -1
  Sand       -2
  Green       3

a. Consider the policy of always taking the "Conservative shot". What is the utility of the initial state under this policy?

b. Compute estimates of the utility of each state under the optimal policy using value iteration with 3 iterations. Show the utilities of each state at each iteration. Assume we start with all utilities set to zero.
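The updates in part (g) can be traced with a small tabular Q-learning sketch. Two assumptions not stated in the text: γ = 1, and, following the hint, the −1 falling reward is booked when ground takes its die action (which is what makes the visited terminal state's q-value non-zero). The transition tuples below encode that convention; adjust them if your course uses a different reward bookkeeping.

```python
# Tabular Q-learning for part (g): a minimal sketch.
# Assumed here (not stated in the text): gamma = 1, and the -1 falling
# reward is received when 'ground' takes its single die action, per the
# hint that visited terminal states should have non-zero q-values.

from collections import defaultdict

def q_learning(episodes, alpha=0.5, gamma=1.0):
    """episodes: list of episodes, each a list of (s, a, r, s') tuples.

    Applies Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma*max_a' Q(s',a'))
    in observed order; the max over an unseen state's actions defaults to 0.
    """
    Q = defaultdict(float)
    actions = defaultdict(set)   # actions observed from each state
    for episode in episodes:
        for s, a, r, s2 in episode:
            actions[s].add(a)
            best_next = max((Q[(s2, a2)] for a2 in actions[s2]), default=0.0)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
    return Q

# The two observed episodes, always moving right; 'done' is the
# post-terminal dummy state with no actions and zero future reward.
ep1 = [("s2", "right", 0, "s3"), ("s3", "right", 0, "ground"),
       ("ground", "die", -1, "done")]
ep2 = [("s2", "right", 0, "s3"), ("s3", "right", 0, "s4"),
       ("s4", "right", 0, "s5"), ("s5", "right", 0, "ground"),
       ("ground", "die", -1, "done")]
Q = q_learning([ep1, ep2])
print({k: v for k, v in Q.items() if v != 0.0})
# -> {('ground', 'die'): -0.75, ('s5', 'right'): -0.25}
```

Tracing by hand gives the same two non-zero entries: the first episode sets Q(ground, die) to −0.5; in the second episode, Q(s5, right) picks up half of that value (−0.25) before Q(ground, die) is updated again to −0.75.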