UT Dallas CS 6375 - rl

Slide 1: Machine Learning, CS 6375 --- Spring 2015
Reinforcement Learning

Slide 2: Markov Decision Process
• Finite set of states S
• Set of actions A
• At each discrete time step the agent observes state s_t in S and chooses action a_t in A
• It then receives an immediate reward r_t
• And the state changes to s_{t+1}
• Markov assumption: s_{t+1} = δ(s_t, a_t) and r_t = r(s_t, a_t), i.e.,
– r_t and s_{t+1} depend only on the current state and action
– The functions δ and r may be nondeterministic
– The functions δ and r are not necessarily known to the agent

Slide 3: Reinforcement Learning Problem
Goal: learn to choose actions that maximize the discounted sum of future rewards r_0 + γ r_1 + γ^2 r_2 + ···, where 0 ≤ γ < 1.
Examples:
• A robot learning to dock on a battery charger
• Learning to choose actions that optimize factory output
• Learning to play Backgammon

Slide 4: Agent's Learning Task
Execute actions in the environment, observe the results, and
• Learn an action policy π: S → A that maximizes the expected discounted reward from any starting state in S
Note:
• The target function is π: S → A
• But we have no training examples of the form <s, a>
• Training examples are of the form <(s, a), r>

Slide 5: Backgammon Game
Learn to play Backgammon. Immediate reward:
• +100 if we win
• -100 if we lose
• 0 for all other states
The player can be trained by playing many (millions of) games against itself, and becomes approximately equal to the best human player.

Slide 6: We start with learning in Markov systems (states, rewards, and transitions, but no actions yet).

Slide 7: Predicting Delayed Rewards
Given the transition probabilities of a Markov system, e.g. Prob(next state = S5 | this state = S4) = 0.8, etc., what is the expected (discounted) sum of future rewards?
Just solve it! We use standard Markov system theory.

Slide 8: Learning Delayed Rewards
All you can see is a series of states and rewards:
S1(r=0) → S2(r=0) → S3(r=4) → S2(r=0) → S4(r=0) → S5(r=0)
Task: based on this sequence, estimate J*(S1), J*(S2), ···, J*(S6).

Slide 9: Idea 1: Supervised Learning
Assume γ = 0.5.
S1(r=0) → S2(r=0) → S3(r=4) → S2(r=0) → S4(r=0) → S5(r=0)
At t=1 we were in state S1 and eventually got a long-term discounted reward (LTDR) of 0 + γ·0 + γ^2·4 + γ^3·0 + γ^4·0 + ··· = 1.
At t=2, in state S2, LTDR = 2        At t=5, in state S4, LTDR = 0
At t=3, in state S3, LTDR = 4        At t=6, in state S5, LTDR = 0
At t=4, in state S2, LTDR = 0

State   LTDRs observed   Mean of LTDRs
S1      1                1 = J_est(S1)
S2      2, 0             1 = J_est(S2)
S3      4                4 = J_est(S3)
S4      0                0 = J_est(S4)
S5      0                0 = J_est(S5)

Slide 10: Supervised Learning Algorithm
• Watch a trajectory S[0] r[0] S[1] r[1] ··· S[T] r[T]
• For t = 0, 1, ···, T, compute the long-term discounted reward J[t] = r[t] + γ r[t+1] + γ^2 r[t+2] + ··· + γ^(T-t) r[T]
• Compute J_est(Si) = mean of J[t] over all transitions on the trajectory that begin in state Si
• You're done!

Slide 11: Idea 2: Certainty-Equivalent (CE) Learning
Idea: do model-based learning, i.e., use your data to estimate the underlying Markov system instead of trying to estimate J directly.
S1(r=0) → S2(r=0) → S3(r=4) → S2(r=0) → S4(r=0) → S5(r=0)
Estimated Markov system: draw in the transitions and their estimated probabilities. What are the estimated J values?

Slide 12: C.E. Method for Markov Systems
Initialize, for all Si, Sj:
Count[Si] = 0         (# times visited Si)
SumR[Si] = 0          (sum of rewards received in Si)
Trans[Si, Sj] = 0     (# times transitioned Si → Sj)
When we are in state Si, receive reward r, and move to Sj:
Count[Si] ← Count[Si] + 1
SumR[Si] ← SumR[Si] + r
Trans[Si, Sj] ← Trans[Si, Sj] + 1
Then at any time:
r_est(Si) = SumR[Si] / Count[Si]
P_est(i, j) = estimated Prob(next = Sj | this = Si) = Trans[Si, Sj] / Count[Si]

Slide 13: C.E. for Markov Systems (cont'd)
So at any time we have, for all Si and Sj, the estimates r_est(Si) and P_est(next = Sj | this = Si) = P_est(i, j).
So at any time we can solve the resulting set of linear equations
J_est(Si) = r_est(Si) + γ Σ_j P_est(i, j) J_est(Sj)
for the J_est values.
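
As a concrete illustration of the C.E. method above, here is a minimal Python sketch (my own, not from the original slides). It tallies Count, SumR, and Trans from the running example trajectory with γ = 0.5 and then solves the linear system J = r + γ P J; the variable and function names are illustrative assumptions, and a state that is never left simply keeps zero estimates.

import numpy as np

# Observed trajectory from the slides: (state, reward) pairs, with gamma = 0.5
trajectory = [("S1", 0), ("S2", 0), ("S3", 4), ("S2", 0), ("S4", 0), ("S5", 0)]
gamma = 0.5

states = sorted({s for s, _ in trajectory})
idx = {s: i for i, s in enumerate(states)}
n = len(states)

count = np.zeros(n)        # Count[Si]: times we were in Si and observed a successor
sum_r = np.zeros(n)        # SumR[Si]:  sum of rewards received in Si
trans = np.zeros((n, n))   # Trans[Si,Sj]: times we moved Si -> Sj

# Tally the statistics over consecutive pairs of the trajectory
for (s, r), (s_next, _) in zip(trajectory, trajectory[1:]):
    i, j = idx[s], idx[s_next]
    count[i] += 1
    sum_r[i] += r
    trans[i, j] += 1

# Certainty-equivalent estimates; states never left keep a zero reward and a zero row
r_est = np.divide(sum_r, count, out=np.zeros(n), where=count > 0)
P_est = np.divide(trans, count[:, None], out=np.zeros((n, n)), where=count[:, None] > 0)

# Solve J = r + gamma * P J, i.e. (I - gamma * P) J = r
J_est = np.linalg.solve(np.eye(n) - gamma * P_est, r_est)
for s in states:
    print(f"J_est({s}) = {J_est[idx[s]]:.3f}")

If set up as above, this should give roughly J_est(S1) ≈ 0.57, J_est(S2) ≈ 1.14, J_est(S3) ≈ 4.57, slightly different from the supervised-learning estimates on slide 9, because the C.E. model pools the two visits to S2 into one estimated transition distribution and propagates value through it rather than averaging truncated observed returns.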
Slide 14: Temporal Difference Learning
Model-free learning: learning without trying to estimate the parameters of a Markov system.
We only maintain a J_est array ... nothing else.
So you have J_est(S1), J_est(S2), ···, J_est(SN), and you observe a transition Si → Sj with reward r. What should you do?

Slide 15: TD Learning
On observing Si → Sj with reward r, we update
J_est(Si) ← (1 - α) J_est(Si) + α (r + γ J_est(Sj)),
i.e., we nudge the estimate closer to the observed reward plus the discounted expected future rewards. α is the learning rate.

Slide 16: Decaying Learning Rate
Good: α_t = 1/t
Bad:  α_t = α_0 (a constant)
Good: α_t = β / (β + t)   [e.g., β = 1000]
• Start with a large α: we are not confident in our current estimate, so we allow it to change a lot.
• Decrease α as we explore more: we become more and more confident in our estimate, so we do not want to change it much.

Slide 17: Learning Policies for MDPs
The heart of reinforcement learning.

Slide 18: General Problem
World: You are in state 34. Your immediate reward is 3. You have 3 actions.
Robot: I'll take action 2.
World: You are in state 77. Your immediate reward is -7. You have 2 actions.
Robot: I'll take action 1.
World: You're in state 34 (again). Your immediate reward is 3. You have 3 actions.
The Markov property means that once you have selected an action, the probability distribution over your next state is the same as the last time you tried that action in this state.
Learning from experience ...

Slide 19: The Task
• Hence, all you can see is a series of states and rewards, except that each arrow is now associated with an action:
S1(r=1) → S2(r=0) → S3(r=1) → S4(r=0) → S5(r=1) → S6(r=0), with actions a1, a2, a3, a4, a5 labelling the successive arrows.
– We do not know the model of the environment
– We do not know T(·, ·, ·), the transition probabilities
– We do not know R(·), the reward associated with each state
• The task is still the same as in an MDP: find an optimal policy.

Slide 20: Classes of Techniques

Slide 21: Model-Based Methods
• The C.E. methods are very similar to the Markov-system case, except that we now run value iteration for the MDP.

Slide 22: Model-Free Reinforcement Learning
Why not just use T.D.? Observe a transition Si → Sj under action a with reward r and apply the same update J_est(Si) ← (1 - α) J_est(Si) + α (r + γ J_est(Sj)).
What's wrong with this? Different actions are associated with a state: a single J_est value per state mixes together the outcomes of whichever actions we happened to take, and it cannot tell us which action to choose without a model of the transitions.

Slide 23: Q-Learning: Model-Free R.L.
Define Q*(Si, a) = expected sum of discounted future rewards if I start in state Si, then take action a, and am subsequently optimal.

Slide 24: Q-Learning Update
Note that
Q*(Si, a) = r_i + γ Σ_{Sj in SUCCS(Si)} P(Sj | Si, a) max_{a'} Q*(Sj, a'),
where r_i is the immediate reward in Si and SUCCS(Si) is the set of possible successor states.
In Q-learning we maintain a table of Q_est values instead of J_est values. When you see a transition Si → Sj under action a with reward r, do
Q_est(Si, a) ← α (r + γ max_{a'} Q_est(Sj, a')) + (1 - α) Q_est(Si, a).
Policy estimation: π_est(Si) = argmax_a Q_est(Si, a).

Slide 25: Q-Learning: Convergence
• Q-learning is guaranteed to converge to an optimal policy (given sufficient exploration and suitably decaying learning rates)
• It is a very general procedure (because it is completely model-free)
• It may be slow (because it is completely model-free)

Slide 26: Model Estimation: An Example
I observed a trajectory during which, when I moved Up from s = (1,1), I ended up in
s1 = (1,2) 10 times
s2 = (2,1) 2 times
So P(s1 | s, Up) = 10 / (10 + 2) = 0.83 and P(s2 | s, Up) = 2 / (10 + 2) = 0.17.
But ... how do we explore the environment? We have not yet said how the actions are generated.

Slide 27: Exploration Strategy
• How do we choose the next action while we are still learning?
In principle, we can compute a current estimate of the best policy: π_est(s) = argmax_a Q_est(s, a).
• What, then, is the best strategy for exploration?
– Greedy: always use π_est(s) when in state s?
– Random: pick actions at random?
– Mixed: sometimes use the current best action and sometimes pick an action at random (see the sketch after this list).
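
To connect the Q-learning update on slide 24 with the exploration question on slide 27, here is a small, hypothetical Python sketch of tabular Q-learning with a mixed (epsilon-greedy) strategy. The environment interface assumed here, env.reset() returning a state and env.step(a) returning (next_state, reward, done), is my own convention for illustration and is not defined in these notes; the update line itself is the one from slide 24.

import random
from collections import defaultdict

def q_learning(env, n_actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with a mixed (epsilon-greedy) exploration strategy.

    Assumes env.reset() -> state and env.step(a) -> (next_state, reward, done);
    that interface is an illustrative assumption, not part of the lecture notes.
    """
    Q = defaultdict(lambda: [0.0] * n_actions)  # Q_est[s][a], initialized to 0

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Mixed strategy: mostly greedy w.r.t. the current Q_est, sometimes random
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda act: Q[s][act])

            s_next, r, done = env.step(a)

            # Q-learning update (slide 24), treating terminal states as having
            # no future reward:
            # Q_est(Si,a) <- alpha*(r + gamma*max_a' Q_est(Sj,a')) + (1-alpha)*Q_est(Si,a)
            target = r + (0.0 if done else gamma * max(Q[s_next]))
            Q[s][a] = alpha * target + (1 - alpha) * Q[s][a]
            s = s_next

    # Policy estimation: pi_est(s) = argmax_a Q_est(s, a)
    policy = {s: max(range(n_actions), key=lambda act: Q[s][act]) for s in Q}
    return Q, policy

In practice the learning rate α would typically be decayed over time, for example with the β/(β + t) schedule from slide 16, and ε can likewise be reduced as the Q_est values become more reliable.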

