UH COSC 4368 - Using Reinforcement Learning To Discover Paths in a Transportation World Group Project

Christoph F. Eick and Romita Banerjee
COSC 4368 Project Spring 2019
Using Reinforcement Learning To Discover Paths in a Transportation World
Group Project (usually 4-5 students per group)
Version 5

Deadlines: Groups 5-13: deliverables due April 18, 11p; 7-minute presentation on Monday, April 22. Groups 1-4: deliverables due April 20, 11a; 7-minute presentation on Wednesday, April 17.

In this project we will use reinforcement learning to learn and adapt "promising paths" in a robot-style world. Learning objectives of the COSC 4368 Group Project include:
- Understanding basic reinforcement learning concepts such as utilities, policies, learning rates, discount rates, and their interactions.
- Obtaining experience in designing agent-based systems that explore and learn in initially unknown environments and that are capable of adapting to changes.
- Learning how to conduct experiments that evaluate the performance of reinforcement learning systems, and learning to interpret such results.
- Development of visualization techniques summarizing how the agent moves, how the world and the q-table change, and the system performance.
- Development of path visualization and analysis techniques to interpret and evaluate the behavior of agent-based path-learning systems.
- Learning to develop AI software in a team.

Figure 1: Visualization of the PD-World.
Figure 2: An Urban Grid World.

In particular, in Project 2 you will use Q-learning/SARSA[1] for the PD-World (http://www2.cs.uh.edu/~ceick/ai/2019-World.pptx), conducting five experiments using different parameters and policies, and summarizing and interpreting the experimental results. Moreover, you will develop path visualization techniques that are capable of shedding light on what paths the learning system actually has learnt from the obtained Q-tables; we call such paths attractive paths in the remainder of this document.
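To make the Q-learning/SARSA distinction concrete, here is a minimal sketch of the two tabular update rules, using the experiment parameters (learning rate 0.3, discount rate 0.5). The dictionary-based Q-table, the state tuples, and the operator names are illustrative assumptions, not the assignment's required encoding; q-values default to 0, matching the initialization the project specifies.

```python
ALPHA = 0.3   # learning rate used in all experiments
GAMMA = 0.5   # discount rate (1.0 in Experiment 4)

def q_update(q, state, action, reward, next_state, actions):
    """One Q-learning step: bootstrap from the BEST applicable next action."""
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + ALPHA * (reward + GAMMA * best_next - old)

def sarsa_update(q, state, action, reward, next_state, next_action):
    """One SARSA step: bootstrap from the action ACTUALLY chosen next."""
    old = q.get((state, action), 0.0)
    nxt = q.get((next_state, next_action), 0.0)
    q[(state, action)] = old + ALPHA * (reward + GAMMA * nxt - old)
```

Under PGREEDY the two updates coincide; they differ only when the policy explores, which is why Experiments 2 and 3 can produce different q-tables from the same step budget.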
In the experiments you conduct, the learning rate is α=0.3 and the discount rate is assumed to be γ=0.5 (except in Experiment 4), and we assume that q-values are initialized with 0 at the beginning of the experiment. The following 3 policies will be used in the experiments:
• PRANDOM: If pickup or dropoff is applicable, choose this operator; otherwise, choose an applicable operator randomly.
• PEXPLOIT: If pickup or dropoff is applicable, choose this operator; otherwise, apply the applicable operator with the highest q-value (break ties by rolling a dice for operators with the same q-value) with probability 0.8, and choose a different applicable operator randomly with probability 0.2.
• PGREEDY: If pickup or dropoff is applicable, choose this operator; otherwise, apply the applicable operator with the highest q-value (break ties by rolling a dice for operators with the same q-value).

Figure 3: Visualization of an Attractive Path for a Search Problem

[1] SARSA is a variation of Q-learning that uses the q-value of the actually chosen action and not the q-value of the best action!

The five experiments you conduct are as follows:
1. In Experiment 1 you use α=0.3 and run the Q-learning algorithm[2] for 4000 steps with policy PRANDOM; then run PGREEDY for 4000 steps[3]. Display and interpret the Q-tables you obtained in the middle and at the end of the experiment.
2. In Experiment 2 you use α=0.3 and run the Q-learning algorithm for 8000 steps with policy PEXPLOIT; however, use policy PRANDOM for the first 200 steps of the experiment, and then switch to PEXPLOIT for the remainder of the experiment. Analyze the performance variables and summarize what was learnt by analyzing the q-table at different stages of the experiment.
3. In Experiment 3 you use α=0.3 and run the SARSA q-learning variation for 8000 steps with policy PEXPLOIT; however, use policy PRANDOM for the first 200 steps of the experiment, and then switch to PEXPLOIT for the remainder of the experiment.
When analyzing Experiment 3, center on comparing the performance of Q-learning and SARSA. Also report the final q-table of this experiment.
4. Experiment 4 is the same as Experiment 3, except that you use a discount rate of γ=1.0 (instead of 0.5). However, just focus on analyzing the impact of the discount rate γ on the performance variables in this experiment, and not on other matters[4].
5. Experiment 5 is the same as Experiment 2; that is, α=0.3 and γ=0.5 and you use PEXPLOIT with Q-learning, except that for the first 200 operator applications you use PRANDOM; however, after the agent reaches a terminal state the second time, you will swap the pickup and drop-off locations. When analyzing the results of this experiment, focus on how well and how quickly the q-learning approach was able to adapt to this change.

For all experiments, if the agent reaches a terminal state, restart the experiment by resetting the PD-World to the initial state, but do not reset the Q-table. Run each experiment twice, and report[5] and interpret the results, e.g. utilities computed and rewards obtained in various stages of each experiment.

Assess which experiment obtained the best results[6]. Next, analyze the various q-tables you created and try to identify attractive paths[7] in the obtained q-tables, if there are any. Moreover, briefly assess if your system gets better after it has solved a few PD-World problems (reached the terminal state at least once). Briefly analyze to what extent the

[2] You have the option to use the SARSA algorithm instead, if you prefer that! Make clear in your report which algorithm you used in Experiment 1.
[3] Do not reset the q-table after 4000 steps!
[4] For groups with only 4 students, conducting Experiment 4 is optional!
[5] Additionally, report the following Q-tables for Experiment 2 (or Experiment 3 if you prefer that; in this case you will only need to report the final Q-table of Experiment 2) in your report: a) when the first drop-off location is filled (the fifth block has been delivered to it), b) when a terminal state is reached, and c) the final Q-table of each experiment. The Q-table in the screenshot should be presented as a matrix, with s rows (states) and t columns (operators). Thus, the Q-table for the recommended state space has 25 x 2 rows and 6 columns; however, the q-values for the drop-off and pickup operators do not need to be reported.
[6] Provide graphs that show how the algorithm's performance variables changed over the duration of the experiment in the three experiments.
[7] A
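The three policies described above differ only in how they act when no pickup/dropoff is forced. Here is a minimal sketch of that selection logic; the `applicable` helper, the operator names, and the string policy labels are illustrative assumptions (a real PD-World implementation would derive the applicable operators from walls, the carried block, and the cell contents).

```python
import random

def applicable(state, operators):
    # Hypothetical helper: in the real PD-World this would filter out
    # operators blocked by walls or pickup/dropoff preconditions.
    return operators

def choose_action(q, state, operators, policy):
    """Action selection for PRANDOM / PEXPLOIT / PGREEDY (a sketch)."""
    ops = applicable(state, operators)
    # All three policies take pickup/dropoff whenever it is applicable.
    for forced in ("pickup", "dropoff"):
        if forced in ops:
            return forced
    best_q = max(q.get((state, a), 0.0) for a in ops)
    best = [a for a in ops if q.get((state, a), 0.0) == best_q]
    if policy == "PRANDOM":
        return random.choice(ops)
    if policy == "PGREEDY":
        return random.choice(best)            # "roll a dice" on ties
    if policy == "PEXPLOIT":
        if random.random() < 0.8 or len(best) == len(ops):
            return random.choice(best)        # exploit with probability 0.8
        return random.choice([a for a in ops if a not in best])
    raise ValueError(f"unknown policy: {policy}")
```

Uniform choice over the tied-best operators implements the "roll a dice" tie-breaking; the `len(best) == len(ops)` guard covers the edge case where every operator is tied, so PEXPLOIT has no "different" operator to fall back on.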

