Lecture 22: Reinforcement Learning
6.825 Techniques in Artificial Intelligence

Lecture 22 • 1: Reinforcement Learning
When we talked about MDPs, we assumed that we knew the agent's reward function, R, and a model of how the world works, expressed as the transition probability distribution. In reinforcement learning, we would like an agent to learn to behave well in an MDP world, but without knowing anything about R or P when it starts out.

Lecture 22 • 2: Reinforcement Learning
It's called reinforcement learning because it's related to early mathematical psychology models of conditioning, or behavior learning, in animals.

Lecture 22 • 3: Reinforcement Learning
• Exploration
• Q-learning
• Extensions and examples
We'll look at the issue of exploration, talk about Q-learning, which is one of the most successful approaches to reinforcement learning, and then consider some extensions and examples.

Lecture 22 • 4: Reinforcement Learning
What do you do when you don't know how the world works?
So, how should you behave when you don't even know how the world works?

Lecture 22 • 5: Reinforcement Learning
One option:
One obvious strategy for dealing with an unknown world is to learn a model of the world, and then solve it using known techniques.

Lecture 22 • 6: Reinforcement Learning
• estimate R (reward function) and P (transition function) from data
You all know how to do parameter estimation now, by counting the number of times various events occur and taking ratios. So, you can estimate the next-state distribution P(s'|s,a) by counting the number of times the agent has taken action a in state s and looking at the proportion of the time that s' has been the next state. Similarly, you can estimate R(s) just by averaging all the rewards you've received when you were in state s.

Lecture 22 • 7: Reinforcement Learning
• solve for optimal policy given estimated R and P
Once you have estimated values for R and P, you can use value iteration to compute a policy that's optimal, if not for the real world, then for the world described by the estimated model.

Lecture 22 • 8: Reinforcement Learning
Of course, there's a question of how long you should gather data to estimate the model before it's good enough to use to find a policy. One way to get around this problem is to re-estimate the model on every step. Each time you get a piece of experience, you update your model estimates. Then, you run value iteration on the updated model. This might be too computationally expensive; if so, you can run value iteration over the continually updated model as a kind of background job, at whatever rate you can afford computationally.
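To make the model-based option above concrete, here is a minimal sketch in Python of estimating P and R by counting and then running value iteration on the estimated model. It assumes a small finite MDP with integer-indexed states and actions and a discount factor gamma; the function names and the use of NumPy are illustrative choices, not from the lecture.

# Sketch of the model-based option: estimate P and R from experience by counting,
# then solve the estimated MDP with value iteration. Assumes a small finite MDP;
# all names here are illustrative, not from the lecture.
import numpy as np

def estimate_model(experience, n_states, n_actions):
    """Estimate P(s'|s,a) and R(s) from a list of (s, a, r, s') tuples."""
    trans_counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros(n_states)
    reward_count = np.zeros(n_states)
    for s, a, r, s_next in experience:
        trans_counts[s, a, s_next] += 1.0
        reward_sum[s] += r
        reward_count[s] += 1.0
    # P(s'|s,a): proportion of times s' followed (s,a); uniform if (s,a) was never tried.
    totals = trans_counts.sum(axis=2, keepdims=True)
    P = np.where(totals > 0, trans_counts / np.maximum(totals, 1.0), 1.0 / n_states)
    # R(s): average of all rewards received in state s (0 if s was never visited).
    R = reward_sum / np.maximum(reward_count, 1.0)
    return P, R

def value_iteration(P, R, gamma=0.9, eps=1e-6):
    """Iterate V(s) <- R(s) + gamma * max_a sum_s' P(s'|s,a) V(s') to convergence."""
    V = np.zeros(R.shape[0])
    while True:
        Q = R[:, None] + gamma * np.einsum('san,n->sa', P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < eps:
            return Q.argmax(axis=1), V_new   # greedy policy and its value function
        V = V_new

In the re-estimate-on-every-step variant from slide 8, you would simply update the counts after each new piece of experience and rerun (or periodically rerun) value_iteration on the refreshed estimates.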
Lecture 22 • 9: Reinforcement Learning
Another option:
• estimate a value function directly
This approach is sound, and in a lot of cases, it's probably the right thing to do. But, interestingly enough, it's possible to find the optimal value function without ever estimating the state transition probabilities directly. We'll investigate an algorithm for doing that.

Lecture 22 • 10: Bandit Problems?
But first, let's think about how the agent should choose its actions. In most supervised learning problems, the agent is simply given a big collection of data and asked to learn something from it. In reinforcement learning, there is an interesting added dimension: the agent gets to choose its own actions and, therefore, it has very direct influence on the data it will receive. Now there are two, possibly opposing, reasons for the agent to choose an action: because it thinks the action will have a good result in the world, or because it thinks the action will give it more information about how the world works. Both of these things are valuable, and it's interesting to think about how to trade them off.

Lecture 22 • 11: Bandit Problems?
One view of the very simplest reinforcement learning problem, where there's a single state, is as a bandit problem. Imagine you enter a casino with k slot machines. Every time you pull an arm on a machine, it either pays off a dollar or nothing. Assume that each machine has a hidden probability of paying off, and that whenever you pull an arm, the outcome is independent of previous outcomes and is determined by the hidden payoff probability. This is called a k-armed bandit problem, because, in English slang, slot machines are sometimes called one-armed bandits (presumably because they take all your money).

Lecture 22 • 12: Bandit Problems?
Now, assume that I let you into the casino and tell you that you have 1000 chances to pull the arm of a machine, and that you should try to make as much money as possible during that time. What should you do?

Lecture 22 • 13: Bandit Strategies
• switch on a loser
There are a lot of strategies, even for this simple problem. One of the first ones that people studied was "switch on a loser". It says that you should pick one arm. As long as it keeps paying off, you should keep pulling it. As soon as it loses, go to the next arm, and so on. It can be shown that this strategy is better than behaving at random, but it's not optimal!

Lecture 22 • 14: Bandit Strategies
• always choose the apparent best
Another strategy would be to keep estimates of the payoff probabilities of each arm (by counting). Then, always choose the arm with the highest estimated probability of paying off. The problem with this strategy, which initially seems quite appealing, is that you might have initial bad luck with an arm that is actually pretty good, and arrive at a very low estimate of its probability of paying off. If this happens, you might never choose it again, thus preventing yourself from discovering that it's actually better than it seems to be now.
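As a concrete illustration of the two strategies on slides 13 and 14, here is a minimal sketch of a simulated k-armed bandit where each arm pays off with a hidden Bernoulli probability. The function names, the optimistic initial estimate for untried arms, and the particular payoff probabilities are illustrative assumptions, not from the lecture.

# Sketch of the two bandit strategies described above, on a simulated k-armed
# bandit where each arm pays $1 with a hidden probability. All names and the
# particular payoff probabilities are illustrative assumptions.
import random

def pull(p_arm):
    """Pull one arm: pays 1 with the arm's hidden probability, else 0."""
    return 1 if random.random() < p_arm else 0

def switch_on_loser(payoffs, n_pulls):
    """Keep pulling the current arm while it pays off; move on when it loses."""
    k, arm, total = len(payoffs), 0, 0
    for _ in range(n_pulls):
        reward = pull(payoffs[arm])
        total += reward
        if reward == 0:
            arm = (arm + 1) % k            # lost: switch to the next arm
    return total

def always_choose_apparent_best(payoffs, n_pulls):
    """Always pull the arm with the highest estimated payoff probability so far."""
    k = len(payoffs)
    wins, tries, total = [0] * k, [0] * k, 0
    for _ in range(n_pulls):
        # Estimated payoff probability per arm; untried arms get an optimistic 1.0
        # (an assumption) so that each arm is tried at least once.
        estimates = [wins[i] / tries[i] if tries[i] else 1.0 for i in range(k)]
        arm = max(range(k), key=lambda i: estimates[i])
        reward = pull(payoffs[arm])
        wins[arm] += reward
        tries[arm] += 1
        total += reward
    return total

hidden = [0.2, 0.5, 0.8]   # hidden payoff probabilities, unknown to the player
print(switch_on_loser(hidden, 1000), always_choose_apparent_best(hidden, 1000))

Running always_choose_apparent_best many times shows the failure mode described on slide 14: if a good arm happens to lose its first few pulls, its estimate drops and the greedy rule may never select it again.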

