Massachusetts Institute of Technology
16.412J/6.834J Intelligent Embedded Systems

Problem Set #4 – due Nov. 13

Problem 1 – Off to the Races

Consider the following racetrack.

[Figure: oval racetrack with ten numbered segments, 1 through 10]

It is a typical oval track with two straight-aways and two turns. Each straight-away has 4 segments, and each segment has an inside and an outside lane. The turns, each consisting of one segment, also have inside and outside lanes.

In this problem, a single race car is performing time trials. The goal is to minimize the time needed to traverse segments (to complete one lap, for example). The race car begins in segment 1 in the inside lane and moves in the counter-clockwise direction. It can travel at 0, 20, 40, or 60 mph. Its initial speed is 0.

Two types of actions are possible for the car: changing speed and changing lane. Speed-related actions are:
- go 20 mph faster (subject to the maximum-speed constraint of 60 mph)
- go 20 mph slower (subject to the minimum-speed constraint of 0 mph)
- stay at the same speed

Lane-related actions are:
- switch lanes
- stay in the same lane

State is represented by the car's
- segment
- lane
- speed

For straight segments, the state transition function has the following characteristics:
- If the action is to go 20 mph faster, and the speed in the current state is not already the maximum, the speed in the next state will be 20 mph faster with probability 0.9, or will not change with probability 0.1. If the speed in the current state is already the maximum, the speed in the next state will be the same with probability 1.
- If the action is to go 20 mph slower, and the speed in the current state is not already the minimum, the speed in the next state will be 20 mph slower with probability 0.9, or will not change with probability 0.1. If the speed in the current state is already the minimum, the speed in the next state will be the same with probability 1.
- If the action is to stay at the same speed, this will happen with probability 1.
- If the action is to switch lanes, this will happen with probability 0.9; the car will stay in the same lane with probability 0.1.
- If the action is to stay in the same lane, this will happen with probability 1.

The probabilities model unforeseen circumstances like oil slicks, driver hesitation, etc.

The transition function for the turns is the same as for the straight segments, with the following addition. The faster the car is going through a turn, the higher the likelihood that it will crash. The likelihood is even higher in the inside lane of the turn. A crash is modeled here as resetting the speed to 0 (as is done in many video games). Finally, the segment 8 turn is banked, so the likelihood of a crash there is lower than for the segment 3 turn. To summarize:

For the segment 3 turn, outside lane, the probability of a crash is as follows:

    Speed (mph)   Probability of crash
    0             0
    20            0
    40            0.1
    60            0.3

For the segment 3 turn, inside lane, the probability of a crash is as follows:

    Speed (mph)   Probability of crash
    0             0
    20            0.1
    40            0.3
    60            0.5

For the segment 8 turn, outside lane, the probability of a crash is as follows:

    Speed (mph)   Probability of crash
    0             0
    20            0
    40            0
    60            0.1

For the segment 8 turn, inside lane, the probability of a crash is as follows:

    Speed (mph)   Probability of crash
    0             0
    20            0
    40            0.1
    60            0.3

Reward is based solely on the current state (not the action). For the straight segments, the reward is based only on speed: 0 for 0 mph, 1 for 20 mph, 2 for 40 mph, and 3 for 60 mph. For the inside lanes on the turns, the reward is 0 for 0 mph, 2 for 20 mph, 4 for 40 mph, and 6 for 60 mph. For the outside lanes on the turns, the reward is the same as for straight segments.

The race car always transitions to the next segment on the track with probability 1, unless its speed is 0 or it crashes.

A. MDP

Implement the value iteration algorithm in the language of your choice. Please do not collaborate with other students when writing the code. Provide your commented listing. (An illustrative value-iteration sketch, with its assumptions stated, follows Part E.)

Run your algorithm for this problem. You may choose a suitable epsilon. What is the optimal value function V*? What is the optimal Q*(s,a) function for this value function? What is the optimal policy?

Plot |V(t) – V*| as a function of the iteration t, where V(t) is the approximate value function computed at iteration t. Use the maximum norm: the largest absolute difference in V over the state space.

B. Reinforcement Learning – Q-learning algorithm

Implement a simulator of the racetrack MDP. That is, implement a function that, when given a state s and an action a, returns the appropriate reward r and a new state s', drawn according to the probabilities described above. Provide your commented listing. (An illustrative simulator sketch follows Part E.)

Implement the Q-learning algorithm for this problem. Provide your commented listing.

Implement an exploration strategy that picks the apparently best action with probability 0.9 and a random action with probability 0.1. Combine this with the simulator and the Q-learning algorithm. Run this and show that it converges to the values obtained in Part A. (An illustrative Q-learning sketch using this exploration strategy also follows Part E.)

C. Reinforcement Learning – Q-learning convergence

- Run the Q-learning algorithm for 10 steps, starting in the initial state.
- Freeze the Q-values and run the policy using greedy selection for 25 steps, starting in the initial state. Calculate the total discounted reward.
- Repeat this experiment, first running Q-learning for 20 steps, 30 steps, etc.
- Plot the discounted reward as a function of the number of learning steps. This should converge to V*.

D. Reinforcement Learning – Dyna algorithm

Implement the Dyna algorithm, and replace the Q-learning algorithm in Part B with Dyna. Compare the results in terms of convergence based on the number of actions needed, and in terms of computational effort. (An illustrative Dyna-Q sketch follows Part E.)

E. Extra Credit – Prioritized Sweeping

Implement the prioritized sweeping algorithm, and compare it with Dyna and Q-learning.
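Illustrative sketch for the Part B simulator (Python, one possible choice of language), directly encoding the crash-probability tables and rewards given above. Everything beyond the problem statement is an assumption: an action is read as a (speed action, lane action) pair, the crash check uses the speed reached after the speed change and the lane reached after the lane change, a crashed car does not advance to the next segment, and the reward returned is that of the current state. All names (STATES, ACTIONS, simulate, reward) are illustrative, not prescribed.

import random

SPEEDS = [0, 20, 40, 60]
NUM_SEGMENTS = 10
TURNS = {3, 8}          # the two single-segment turns

# Crash probabilities from the tables above, indexed by (segment, lane) and speed.
CRASH_PROB = {
    (3, 'outside'): {0: 0.0, 20: 0.0, 40: 0.1, 60: 0.3},
    (3, 'inside'):  {0: 0.0, 20: 0.1, 40: 0.3, 60: 0.5},
    (8, 'outside'): {0: 0.0, 20: 0.0, 40: 0.0, 60: 0.1},
    (8, 'inside'):  {0: 0.0, 20: 0.0, 40: 0.1, 60: 0.3},
}

# Enumerations of the full state and action spaces (assumed action reading:
# one speed action and one lane action per step).
STATES = [(seg, lane, v) for seg in range(1, NUM_SEGMENTS + 1)
          for lane in ('inside', 'outside') for v in SPEEDS]
ACTIONS = [(sa, la) for sa in ('faster', 'slower', 'same')
           for la in ('switch', 'stay')]

def reward(state):
    # Reward depends only on the current state (segment, lane, speed).
    segment, lane, speed = state
    if segment in TURNS and lane == 'inside':
        return speed // 10          # 0, 2, 4, 6 on inside turn lanes
    return speed // 20              # 0, 1, 2, 3 on straights and outside turn lanes

def simulate(state, action):
    # One stochastic step of the racetrack MDP; returns (reward, next_state).
    segment, lane, speed = state
    speed_action, lane_action = action

    # Speed changes succeed with probability 0.9 when a change is possible.
    if speed_action == 'faster' and speed < 60 and random.random() < 0.9:
        speed += 20
    elif speed_action == 'slower' and speed > 0 and random.random() < 0.9:
        speed -= 20

    # Lane changes succeed with probability 0.9.
    if lane_action == 'switch' and random.random() < 0.9:
        lane = 'outside' if lane == 'inside' else 'inside'

    # Crash check on the turns; a crash resets the speed to 0.
    crashed = False
    if segment in TURNS and random.random() < CRASH_PROB[(segment, lane)][speed]:
        crashed, speed = True, 0

    # The car advances to the next segment unless it is stopped or crashed.
    if speed > 0 and not crashed:
        segment = segment % NUM_SEGMENTS + 1

    return reward(state), (segment, lane, speed)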
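For Part A, a minimal value-iteration sketch over a generic enumerated MDP. It assumes a function transitions(s, a) that enumerates (probability, next-state) pairs for the racetrack model; building that exact transition model is part of the exercise and is not shown here. The states and actions arguments could be the STATES and ACTIONS lists from the simulator sketch, and the discount factor gamma is a free choice, since the problem statement does not fix it.

def value_iteration(states, actions, transitions, reward, gamma=0.9, epsilon=1e-6):
    # transitions(s, a) -> list of (probability, next_state) pairs
    # reward(s)         -> state-based reward
    # Returns (V, Q, policy, history); history[t] is the iterate V(t), from
    # which |V(t) - V*| under the maximum norm can be plotted afterwards.
    V = {s: 0.0 for s in states}
    history = []
    while True:
        V_new = {
            s: reward(s) + gamma * max(
                sum(p * V[s2] for p, s2 in transitions(s, a)) for a in actions)
            for s in states
        }
        history.append(V_new)
        # Stop when the maximum-norm change between iterates drops below epsilon.
        if max(abs(V_new[s] - V[s]) for s in states) < epsilon:
            V = V_new
            break
        V = V_new

    # Q(s, a) and the greedy policy induced by the converged V.
    Q = {(s, a): reward(s) + gamma * sum(p * V[s2] for p, s2 in transitions(s, a))
         for s in states for a in actions}
    policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
    return V, Q, policy, history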
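For Part B, a sketch of tabular Q-learning combined with the 0.9/0.1 exploration strategy described above. The simulate function is the simulator sketched earlier; the learning rate alpha, the discount factor gamma, and the step budget are assumed values, not part of the problem statement.

import random

def q_learning(states, actions, simulate, start_state, gamma=0.9, alpha=0.1,
               num_steps=100000):
    # Tabular Q-learning driven by the stochastic simulator.
    Q = {(s, a): 0.0 for s in states for a in actions}
    s = start_state
    for _ in range(num_steps):
        # Exploration strategy: apparently best action with probability 0.9,
        # a uniformly random action with probability 0.1.
        if random.random() < 0.9:
            a = max(actions, key=lambda b: Q[(s, b)])
        else:
            a = random.choice(actions)

        r, s2 = simulate(s, a)

        # Standard Q-learning update toward the one-step lookahead target.
        target = r + gamma * max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2
    return Q

# Example call from the initial state (segment 1, inside lane, speed 0):
# Q = q_learning(STATES, ACTIONS, simulate, (1, 'inside', 0))

For Part C, the same Q table can be frozen after a chosen number of learning steps and evaluated by running greedy action selection for 25 steps from the initial state, accumulating gamma**t * r along the way.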
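For Part D, one way to structure Dyna-Q: the Q-learning loop above plus extra planning backups replayed from a learned model. Storing only the most recent outcome per (s, a), as done here, is a simplification for this stochastic track; keeping empirical transition counts and sampling from them would be more faithful. All hyperparameters are assumptions.

import random

def dyna_q(states, actions, simulate, start_state, gamma=0.9, alpha=0.1,
           num_steps=10000, planning_steps=20):
    Q = {(s, a): 0.0 for s in states for a in actions}
    model = {}                      # (s, a) -> last observed (r, s')
    s = start_state
    for _ in range(num_steps):
        # Same 0.9 / 0.1 exploration strategy as in Part B.
        if random.random() < 0.9:
            a = max(actions, key=lambda b: Q[(s, b)])
        else:
            a = random.choice(actions)

        # Real experience from the simulator, plus the usual Q-learning update.
        r, s2 = simulate(s, a)
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions)
                              - Q[(s, a)])
        model[(s, a)] = (r, s2)
        s = s2

        # Planning: replay previously observed transitions from the model.
        for _ in range(planning_steps):
            ps, pa = random.choice(list(model))
            pr, ps2 = model[(ps, pa)]
            Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps2, b)] for b in actions)
                                    - Q[(ps, pa)])
    return Q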
Problem 2 – Model-Based Diagnosis using OpSat

To be …

