Reinforcement Learning
Some slides taken from previous 10-701 recitations/lectures.

A (Fully Deterministic) World
A small grid world with two reward states, R = 50 and R = 100.

Long-Term Reward
Reward is discounted by the time at which I obtain it:

    value = \sum_t \gamma^t r_t, with \gamma = 0.8

Starting two steps from the R = 50 state, the total discounted reward is 50 \cdot 0.8^2 = 32.

We Can Reuse Computation!
The state holding the 50 reward is worth 50, and the state one step away is worth 0.8 \cdot 50 = 40; these values can be reused when evaluating every longer path.

Value of a Policy (deterministic world)
V_k is the value of running the policy for k time steps:

    V_0: every state is 0.
    V_1: the reward states become 50 and 100; everything else stays 0.
    V_2: the states one step from the rewards become 80 = 0.8 \cdot 100 and 40 = 0.8 \cdot 50.
    V_3: the state two steps from the 100 reward becomes 64 = 0.8 \cdot 80.

Non-Deterministic World
Now each action succeeds with probability P = 0.7 and slips with probability P = 0.3. Iterating as before:

    V_1: 50 and 100 at the reward states, 0 elsewhere.
    V_2: the state next to the 100 reward becomes 56 = 0.8 (0.7 \cdot 100 + 0.3 \cdot 0); the state next to the 50 reward becomes 40.
    V_3: those become 69.44 = 0.8 (0.7 \cdot 100 + 0.3 \cdot 56) and 44.8 = 0.8 \cdot 56.

Value Iteration
The value of a policy \pi is the immediate reward of following the policy plus the discounted future reward:

    V^\pi(s) = R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) \, V^\pi(s')

Find the BEST Policy
Ask the question in a slightly different way: what is the value of the best policy? Take the action that maximizes the immediate reward plus the discounted future reward:

    V^*(s) = \max_a \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, V^*(s') \right]

The optimal policy is optimal at every state! (A runnable value-iteration sketch follows at the end of these notes.)

Policy Learning Example
With the P = 0.7 / P = 0.3 transitions, the values are filled in one state at a time, propagating backward from the rewards: first the states holding the 50 and 100 rewards, then their neighbors (40 = 0.8 \cdot 50, and so on), until every state has a value and the greedy policy can be read off.

Backgammon
Something is wrong here: the state space of backgammon is far too large for a table of values.
Estimate V^*(x) instead of \pi(x), and approximate V^*(x) using a neural net.
The backup target can be estimated from our current network; in this case P(x' \mid x, a) is 0 or 1 for all x', and the reward is 0 except when you win or lose.
Since V^* is a neural net, we can't directly 'set' the value V^*(x); instead, use the target V^*(x) as a training example for the NN.
Dealing with huge state spaces: we can't visit every state, so instead play games against yourself to visit the most likely ones.

Unknown World
Possible questions:
1: I am in state x. What is the value of following a particular policy?
2: What is the best policy?
We do not know the transitions, the probabilities, or the rewards; we only know a state when we actually get there!

Value of a Policy?
If I know the rewards, I can evaluate the policy as before. If I do not know the rewards, estimate the value from experience with the temporal-difference update

    V_{t+1}(x_t) = \alpha \left[ r_t + \gamma V_t(x_{t+1}) \right] + (1 - \alpha) V_t(x_t)

where x_{t+1} is the result of taking action a in state x_t.

Learning a Policy: Q-Learning
Define Q, which estimates both values and rewards:

    Q(s, a) = r(s, a) + \gamma V^*(\delta(s, a))

where \delta(s, a) is the result of taking action a in state s. Estimate Q the same way we estimated V:

    Q_{t+1}(x_t, a_t) = \alpha \left[ r_t + \gamma \max_{a'} Q_t(x_{t+1}, a') \right] + (1 - \alpha) Q_t(x_t, a_t)

Q-Learning Example
All Q-values start at 0, with \gamma = 0.8 and \alpha = 0.5. Rewards are 0 except for stepping into the 50-reward state:

    Stepping into the reward state: Q = 0.5 (50 + 0.8 \cdot 0) + 0.5 \cdot 0 = 25.
    A zero-reward step into the state whose best action is now worth 25: Q = 0.5 (0 + 0.8 \cdot 25) + 0.5 \cdot 0 = 10.
    Stepping into the reward state again: Q = 0.5 (50 + 0.8 \cdot 0) + 0.5 \cdot 25 = 37.5.

(A Q-learning sketch that reproduces these numbers also follows at the end of these notes.)
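To make the value-iteration slides concrete, here is a minimal tabular sketch in Python. It is a sketch under assumptions: the chain of states (A, B, and goal states G50, G100), the transition table, and the function name value_iteration are illustrative stand-ins, not the slides' exact grid. With the slides' \gamma = 0.8 and a P = 0.7/0.3 step toward the 100 reward, it reproduces the values 56, 69.44, and 44.8 computed above.

    GAMMA = 0.8  # discount factor from the slides

    # State-based rewards, as on the slides: R(s) is collected in state s.
    R = {"A": 0.0, "B": 0.0, "G50": 50.0, "G100": 100.0}

    # T[s][a] = list of (probability, next_state) pairs; the goal states
    # are terminal and have no actions. These transitions are assumptions
    # chosen to match the slide numbers, not the slides' full grid.
    T = {
        "A": {"right": [(1.0, "B")]},                  # deterministic step
        "B": {"right": [(0.7, "G100"), (0.3, "B")]},   # noisy step to goal
        "G50": {},
        "G100": {},
    }

    def value_iteration(T, R, gamma, sweeps):
        """Apply the Bellman optimality backup `sweeps` times:
        V(s) <- R(s) + gamma * max_a sum_{s'} P(s'|s,a) * V(s')."""
        V = {s: 0.0 for s in T}  # V_0: all zeros
        for _ in range(sweeps):
            V = {
                s: R[s] + (gamma * max(
                       sum(p * V[s2] for p, s2 in T[s][a]) for a in T[s])
                   if T[s] else 0.0)
                for s in T
            }
        return V

    for k in range(4):
        print(f"V_{k}:", value_iteration(T, R, GAMMA, sweeps=k))

Running this prints V_0 through V_3, e.g. V_2(B) = 56 and V_3(B) = 69.44, V_3(A) = 44.8. Dropping the max over actions (fixing a = \pi(s)) turns the same loop into evaluation of a fixed policy rather than the optimal one.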
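The Q-learning example can be replayed the same way. This is a sketch under assumptions: the two-state corridor (s0, s1, goal) and the helper names max_q and q_update are illustrative, but the update rule and the constants \gamma = 0.8, \alpha = 0.5 come from the slides, and the three updates reproduce the 25, 10, and 37.5 values traced above.

    GAMMA, ALPHA = 0.8, 0.5  # discount and learning rate from the slides

    # Illustrative deterministic corridor: s0 -right-> s1 -right-> goal.
    # Reward is 0 everywhere except 50 for stepping into the goal.
    STEP = {("s0", "right"): ("s1", 0.0), ("s1", "right"): ("goal", 50.0)}

    # Tabular Q, initialized to zero; the terminal state has no actions.
    Q = {("s0", "right"): 0.0, ("s1", "right"): 0.0}

    def max_q(state):
        """max_{a'} Q(state, a'); 0 for states with no actions (terminals)."""
        vals = [q for (s, a), q in Q.items() if s == state]
        return max(vals) if vals else 0.0

    def q_update(state, action):
        """The Q-learning backup from the slides:
        Q(x,a) <- alpha*(r + gamma*max_{a'} Q(x',a')) + (1-alpha)*Q(x,a)."""
        next_state, r = STEP[(state, action)]
        Q[(state, action)] = (ALPHA * (r + GAMMA * max_q(next_state))
                              + (1 - ALPHA) * Q[(state, action)])

    # Replay the slides' three updates:
    q_update("s1", "right"); print(Q[("s1", "right")])  # 25.0
    q_update("s0", "right"); print(Q[("s0", "right")])  # 10.0
    q_update("s1", "right"); print(Q[("s1", "right")])  # 37.5

In a real run the agent would pick its own actions (e.g., epsilon-greedy) over many episodes; here the three transitions are hard-coded so the trace matches the slides exactly.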