CMU CS 10701 - Reinforcement Learning

Reinforcement Learning
Some slides taken from previous 10701 recitations/lectures.

A (Fully Deterministic) World
The running example is a small deterministic grid world with two terminal rewards, R = 50 and R = 100.

Long-Term Reward
Starting from a given cell, the total reward is discounted by the time at which it is obtained:
value = Σ_t γ^t r_t,  with γ = 0.8.
One step away from the R = 50 cell the discounted value is 0.8 × 50 = 40; two steps away it is 0.8² × 50 = 32. Because the value of a state is built from the values of the states it leads to, we can reuse computation instead of recomputing each value from scratch.

Value of a Policy After k Time Steps
Run the policy for a fixed horizon and record the value of every state:
- V0: every state has value 0.
- V1: the states that collect a reward in one step have values 50 and 100; everything else is still 0.
- V2: values propagate one step back, e.g. 0.8 × 100 = 80 and 0.8 × 50 = 40.
- V3: another step back, e.g. 0.8 × 80 = 64.

Non-Deterministic World
Now each action succeeds with probability P = 0.7 and slips with probability P = 0.3, so backups use expected values. After two steps a state reaches 56 = 0.8 × (0.7 × 100 + 0.3 × 0); after three steps, values such as 69.44 = 0.8 × (0.7 × 100 + 0.3 × 56) and 44.8 = 0.8 × 56 appear.

Value Iteration
Each backup adds the immediate reward of following the policy to the discounted future reward:
V_{k+1}(s) = R(s, π(s)) + γ Σ_{s'} P(s' | s, π(s)) V_k(s').

Find the Best Policy
Ask the question in a slightly different way: what is the value of the best policy? Take the maximum over actions of the immediate reward plus the discounted future reward:
V*(s) = max_a [ R(s, a) + γ Σ_{s'} P(s' | s, a) V*(s') ].
The optimal policy is optimal at every state.
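The backup above is easy to run on a tiny MDP. The sketch below is a minimal illustration with γ = 0.8; the three-state, two-action transition table and rewards are illustrative assumptions, not the slides' grid world.

import numpy as np

# Minimal value-iteration sketch (gamma = 0.8).  The 3-state, 2-action MDP
# below is an illustrative assumption, not the grid world from the slides.
gamma = 0.8
n_states, n_actions = 3, 2

# R[s, a]: immediate reward for taking action a in state s.
R = np.array([[0.0,  0.0],
              [0.0, 50.0],
              [0.0,  0.0]])

# P[a, s, s']: probability of landing in s' after taking a in s.
P = np.zeros((n_actions, n_states, n_states))
P[0] = np.eye(n_states)               # action 0: stay put
P[1] = np.array([[0.3, 0.7, 0.0],     # action 1: move "right", succeeds w.p. 0.7
                 [0.0, 0.3, 0.7],
                 [0.0, 0.0, 1.0]])    # last state is absorbing

V = np.zeros(n_states)                # V0: every state starts at 0
for _ in range(1000):
    # Bellman optimality backup:
    # Q[s, a] = immediate reward + discounted expected future reward
    Q = R + gamma * np.einsum('ast,t->sa', P, V)
    V_new = Q.max(axis=1)             # value of the best action at each state
    if np.max(np.abs(V_new - V)) < 1e-9:
        break
    V = V_new

policy = Q.argmax(axis=1)             # the optimal policy is optimal at every state
print("V* =", V, " policy =", policy)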
Policy Learning Example
A sequence of slides fills in the value grid for the non-deterministic world (P = 0.7, P = 0.3) one state at a time, asking at each step what value the best action achieves: the cells next to the rewards get values 50 and 100 first, and values such as 40 then propagate backwards.

Backgammon
Something is wrong here: the state space of Backgammon is far too large for a table of values. Instead:
- Estimate V*(x) rather than the policy π(x); the policy is then read off by picking the move whose successor has the highest estimated value. In this setting P(x' | x, a = a') is 0 or 1 for every x', so the successor of each move is known.
- Approximate V*(x) using a neural net. The training target can be estimated from our current network, but since V* is a neural net we can't simply 'set' the value V*(x); instead, use the target value of V*(x) as a training example for the NN.
- The reward is 0 except when you win or lose.
- Dealing with huge state spaces: we can't visit every state, so instead play games against yourself to visit the most likely ones.

Unknown World
Possible questions:
1. I am in state X. What is the value of following a particular policy?
2. What is the best policy?
We do not know the transitions, the probabilities, or the rewards; we only know a state when we actually get there.

Value of a Policy
If the rewards and transitions are known, the value of a policy can be computed with the backups above. If they are not known, estimate it from experience with a temporal-difference update:
V_{t+1}(x_t) = α [ r_t + γ V_t(x_{t+1}) ] + (1 − α) V_t(x_t),
where x_{t+1} = δ(x_t, a_t) is the result of taking action a_t in state x_t.

Learning a Policy: Q-Learning
Define Q, which estimates both values and rewards:
Q(x, a) = R(x, a) + γ Σ_{x'} P(x' | x, a) V*(x'),  so that V*(x) = max_a Q(x, a).
Estimate Q the same way we estimated V:
Q_{t+1}(x_t, a_t) = α [ r_t + γ max_{a'} Q_t(x_{t+1}, a') ] + (1 − α) Q_t(x_t, a_t).

Q-Learning Example
With γ = 0.8 and α = 0.5, every Q entry starts at 0.
- Taking the transition that earns R = 50 updates its entry to 0.5 × (50 + 0.8 × 0) + 0.5 × 0 = 25.
- Taking the R = 0 transition that leads into that state updates its entry to 0.5 × (0 + 0.8 × 25) + 0.5 × 0 = 10.
- Revisiting the rewarding transition updates 25 to 0.5 × (50 + 0.8 × 0) + 0.5 × 25 = 37.5, and repeated visits keep pushing the estimates toward the true Q values.
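A minimal tabular Q-learning sketch that mirrors the update rule and the numbers above (γ = 0.8, α = 0.5); the state and action names and the particular transition sequence are assumptions chosen only to reproduce the 25 → 10 → 37.5 progression.

from collections import defaultdict

GAMMA, ALPHA = 0.8, 0.5
Q = defaultdict(float)        # Q[(state, action)]; every entry starts at 0

def q_update(state, action, reward, next_state, next_actions):
    """One Q-learning backup:
    Q(x,a) <- alpha*(r + gamma*max_a' Q(x',a')) + (1 - alpha)*Q(x,a)."""
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    Q[(state, action)] = (ALPHA * (reward + GAMMA * best_next)
                          + (1 - ALPHA) * Q[(state, action)])
    return Q[(state, action)]

# Replaying the slide numbers (state/action labels are made up):
print(q_update('B', 'right', 50, 'goal', []))       # 0.5*50            = 25.0
print(q_update('A', 'right',  0, 'B', ['right']))   # 0.5*(0.8*25)      = 10.0
print(q_update('B', 'right', 50, 'goal', []))       # 0.5*50 + 0.5*25   = 37.5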

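The Backgammon slides above say to approximate V*(x) with a neural net and, because the net's output cannot simply be 'set', to use the bootstrapped value r + γ V(x') as a training target. Below is a minimal sketch of that idea, with a linear approximator standing in for the network; the feature encoding and the sample transitions are illustrative assumptions, not the slides' setup.

import numpy as np

GAMMA, LR, N_FEATURES = 0.8, 0.01, 8

def features(state):
    # Hypothetical fixed encoding of a state as a feature vector; it stands
    # in for the board encoding that would be fed to the Backgammon network.
    return np.random.default_rng(state).normal(size=N_FEATURES)

def td_step(w, state, reward, next_state, terminal):
    """One TD(0) update of a linear value approximator.

    We cannot assign V(state) directly, so the bootstrapped target
    r + gamma * V(next_state) is used as a regression target and the
    weights take one gradient step on the squared error.
    """
    def v(s):
        return float(w @ features(s))
    target = reward + (0.0 if terminal else GAMMA * v(next_state))
    td_error = target - v(state)
    return w + LR * td_error * features(state)

# Self-play-style loop (illustrative transitions): the reward is 0 except at
# the end of the game, and we only train on states the games actually visit.
w = np.zeros(N_FEATURES)
for state, reward, next_state, terminal in [(3, 0.0, 4, False), (4, 1.0, 5, True)]:
    w = td_step(w, state, reward, next_state, terminal)
print(w)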
