
COSC 4368 Group Project, Spring 2019
Learning Paths from Feedback: Using Reinforcement Learning for a Transportation World
Eick: Q-Learning and SARSA for the PD World

The PD World

Goal: transport blocks from the pickup cells to the drop-off cells in a 5x5 grid of cells (1,1) through (5,5).

- Pickup cells: (1,1), (3,3), (5,5)
- Drop-off cells: (5,1), (5,3), (4,5)
- Initial state: the agent is in cell (1,5) and each pickup cell contains 5 blocks.
- Terminal state: the drop-off cells contain 5 blocks each.

Operators

There are six operators:

- North, South, East, and West are applicable in every state; each moves the agent to the cell in that direction, except that leaving the grid is not allowed.
- Pickup is applicable only if the agent is in a pickup cell that contains at least one block and the agent does not already carry a block.
- Dropoff is applicable only if the agent is in a drop-off cell that contains fewer than 5 blocks and the agent carries a block.

Initial state of the PD World: each pickup cell contains 5 blocks, the drop-off cells contain 0 blocks, and the agent always starts in position (1,5).

Rewards in the PD World

- Picking up a block from a pickup cell: +13
- Dropping off a block in a drop-off cell: +13
- Applying North, South, East, or West: -1

Policies

- PRandom: if Pickup or Dropoff is applicable, choose this operator; otherwise choose an applicable operator randomly.
- PExploit: if Pickup or Dropoff is applicable, choose this operator; otherwise, with probability 0.80 apply the applicable operator with the highest q-value (break ties by rolling a die among operators with the same utility), and with probability 0.20 choose a different applicable operator randomly.
- PGreedy: if Pickup or Dropoff is applicable, choose this operator; otherwise apply the applicable operator with the highest q-value (break ties by rolling a die among operators with the same utility).

Performance Measures

a. The bank account of the agent.
b. The number of operators applied to reach a terminal state from the initial state; this can happen multiple times in a single experiment.

State Space of the PD World

The actual state space of the PD World is as follows:

  (i, j, x, a, b, c, d, e, f)

where (i,j) is the position of the agent; x is 1 if the agent carries a block and 0 if not; and a, b, c, d, e, f are the numbers of blocks in cells (1,1), (3,3), (5,5), (5,1), (5,3), and (4,5), respectively.

- Initial state: (1,5,0,5,5,5,0,0,0)
- Terminal states: (*,*,0,0,0,0,5,5,5)

Remark: the actual reinforcement learning approach will likely use a simplified state space that aggregates multiple states of the actual state space into a single state in the reinforcement learning state space.

Mapping State Spaces to RL State Spaces

Most worlds have enormously large, or even non-finite, state spaces. Moreover, how quickly Q/TD learning learns is inversely proportional to the size of the state space. Consequently, smaller state spaces are used as RL state spaces, and the original state space is rarely used as the RL state space:

  World State Space --reduction--> RL State Space

Recommended Reinforcement Learning State Space

In this approach, reinforcement learning states have the form (i,j,x), where (i,j) is the position of the agent and x is 1 if the agent carries a block, otherwise 0. That is, the state space has only 50 states.

Discussion:

1. The algorithm initially learns paths between pickup cells and drop-off cells (different paths for x=1 and for x=0).
2. Minor complication: the q-values of those paths will decrease as soon as the particular pickup cell runs out of blocks or the particular drop-off cell cannot store any further blocks, as it is no longer attractive to visit these locations.

Suggestion: use this reinforcement learning state space for this project, and no other space.

Alternative Reinforcement Learning Search Space

Reinforcement learning states have the form (i,j,x,s,t,u), where (i,j) is the position of the agent; x is 1 if the agent carries a block, otherwise 0; and s, t, u are boolean variables whose meaning depends on whether the agent carries a block:

- Case 1, x=0 (the agent does not carry a block): s is 1 if cell (1,1) contains at least one block; t is 1 if cell (3,3) contains at least one block; u is 1 if cell (5,5) contains at least one block.
- Case 2, x=1 (the agent carries a block): s is 1 if cell (5,1) contains fewer than 5 blocks; t is 1 if cell (5,3) contains fewer than 5 blocks; u is 1 if cell (4,5) contains fewer than 5 blocks.

There are 400 states total in this reinforcement learning state space.

Analysis of Attractive Paths

See also:
http://horstmann.com/gridworld/gridworld-manual.html
http://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_td.html

TD Q-Learning for the PD World

Remark: this is the QL approach you must use.

Goal: measure the utility of using action a in state s, denoted by Q(a,s). The following update formula is used every time the agent reaches state s' from state s by applying action a:

  Q(a,s) := (1-α)*Q(a,s) + α*[R(s,a,s') + γ*max_{a'} Q(a',s')]

where α is the learning rate and γ is the discount factor; a' has to be an operator applicable in s' (e.g., Pickup and Dropoff are not applicable in pickup/drop-off cells that are empty/full); and R(s,a,s') is the reward of reaching s' from s by applying a (for the PD World: -1 for moving, +13 for picking up or dropping off blocks).

SARSA

Approach: SARSA selects, using the policy, the action a' to be applied to s', and then updates the q-values as follows:

  Q(a,s) := Q(a,s) + α*[R(s) + γ*Q(a',s') - Q(a,s)]

SARSA vs. Q-Learning: SARSA uses the actually taken action for the update and is therefore more …
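The operators and rewards of the PD World can be sketched in code. This is a minimal illustration under stated assumptions, not part of the assignment: all names (PDWorld, applicable, step, terminal) are hypothetical, and North is assumed to decrease the first coordinate i.

```python
# Minimal sketch of the PD World mechanics; all names are hypothetical.
PICKUPS = {(1, 1), (3, 3), (5, 5)}     # pickup cells
DROPOFFS = {(5, 1), (5, 3), (4, 5)}    # drop-off cells
MOVES = {"north": (-1, 0), "south": (1, 0), "west": (0, -1), "east": (0, 1)}

class PDWorld:
    def __init__(self):
        self.pos = (1, 5)                              # agent start position
        self.carry = 0                                 # x: 1 if carrying a block
        self.blocks = {**{c: 5 for c in PICKUPS},      # pickup cells hold 5 blocks
                       **{c: 0 for c in DROPOFFS}}     # drop-off cells hold 0

    def applicable(self):
        """Operators applicable in the current state."""
        ops = [m for m, (di, dj) in MOVES.items()
               if 1 <= self.pos[0] + di <= 5 and 1 <= self.pos[1] + dj <= 5]
        if not self.carry and self.pos in PICKUPS and self.blocks[self.pos] > 0:
            ops.append("pickup")
        if self.carry and self.pos in DROPOFFS and self.blocks[self.pos] < 5:
            ops.append("dropoff")
        return ops

    def step(self, op):
        """Apply an applicable operator and return its reward."""
        if op == "pickup":
            self.blocks[self.pos] -= 1
            self.carry = 1
            return 13
        if op == "dropoff":
            self.blocks[self.pos] += 1
            self.carry = 0
            return 13
        di, dj = MOVES[op]                             # a move costs -1
        self.pos = (self.pos[0] + di, self.pos[1] + dj)
        return -1

    def terminal(self):
        """Terminal state: every drop-off cell contains 5 blocks."""
        return all(self.blocks[c] == 5 for c in DROPOFFS)
```

From the start cell (1,5), only South and West stay on the grid under this coordinate convention, and Pickup first becomes applicable once the agent reaches a non-empty pickup cell.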
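The PExploit policy described above can be sketched as follows. The helper name pexploit and the q-table layout (a dict keyed by (state, operator), missing entries read as 0.0) are assumptions for illustration.

```python
import random

def pexploit(Q, s, ops, explore=0.20):
    """Sketch of PExploit (hypothetical helper): Pickup/Dropoff take priority;
    otherwise exploit the highest-q operator with probability 0.80, or choose
    a different applicable operator with probability 0.20."""
    if "pickup" in ops:
        return "pickup"
    if "dropoff" in ops:
        return "dropoff"
    best_q = max(Q.get((s, a), 0.0) for a in ops)
    best = [a for a in ops if Q.get((s, a), 0.0) == best_q]
    others = [a for a in ops if a not in best]
    if others and random.random() < explore:
        return random.choice(others)      # exploratory move (prob. 0.20)
    return random.choice(best)            # break ties by rolling a die
```

PRandom and PGreedy differ only in the last step: PRandom replaces it with a uniform choice over all applicable operators, and PGreedy always returns a tie-broken highest-q operator.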
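The reduction from the actual state space to the recommended RL state space is a simple projection, sketched below with a hypothetical helper name (rl_state); it also checks the 50-state count (5 x 5 positions times 2 values of x).

```python
# Hypothetical helper mapping a full PD World state (i,j,x,a,b,c,d,e,f)
# to the recommended RL state (i,j,x); the block counts are aggregated away.
def rl_state(full_state):
    i, j, x = full_state[:3]
    return (i, j, x)

# The recommended RL state space: 5 * 5 positions * 2 values of x = 50 states.
ALL_RL_STATES = {(i, j, x)
                 for i in range(1, 6)
                 for j in range(1, 6)
                 for x in (0, 1)}
```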
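The TD Q-learning update formula translates directly into code. This is a sketch, not the required implementation: the q-table layout (a defaultdict keyed by (state, action)) and the values of alpha and gamma are illustrative assumptions.

```python
from collections import defaultdict

# One TD Q-learning update, mirroring the formula above:
#   Q(a,s) := (1-alpha)*Q(a,s) + alpha*[R(s,a,s') + gamma * max_{a'} Q(a',s')]
def q_learning_update(Q, s, a, reward, s2, ops2, alpha=0.3, gamma=0.5):
    """Update Q[(s, a)] after reaching s2 from s via a; ops2 lists the
    operators applicable in s2, so the max runs only over those."""
    best_next = max(Q[(s2, a2)] for a2 in ops2)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (reward + gamma * best_next)

Q = defaultdict(float)   # unseen (state, action) pairs default to 0.0
```

Restricting the max to ops2 matters in the PD World because Pickup and Dropoff are not applicable in every state, so their q-values must not leak into updates where they cannot be chosen.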
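The SARSA update can be sketched the same way; again the q-table layout and the alpha/gamma values are illustrative assumptions. The key difference from the Q-learning sketch is that a2 is the action the policy actually selected for s2, not the argmax.

```python
from collections import defaultdict

# One SARSA update, mirroring the formula above:
#   Q(a,s) := Q(a,s) + alpha*[R(s) + gamma*Q(a',s') - Q(a,s)]
def sarsa_update(Q, s, a, reward, s2, a2, alpha=0.5, gamma=0.5):
    """Update Q[(s, a)] using the action a2 the policy chose for s2."""
    Q[(s, a)] += alpha * (reward + gamma * Q[(s2, a2)] - Q[(s, a)])

Q = defaultdict(float)   # unseen (state, action) pairs default to 0.0
```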
