Markov Decision Processes & Reinforcement Learning
Megan Smith
Lehigh University, Fall 2006

Outline
- Stochastic Process
- Markov Property
- Markov Chain
- Markov Decision Process
- Reinforcement Learning
- RL Techniques
- Example Applications

Stochastic Process
- Quick definition: a random process.
- Often viewed as a collection of indexed random variables.
- Useful to us: a set of states, with probabilities of being in those states, indexed over time.
- We'll deal with discrete stochastic processes.
- (Image: http://en.wikipedia.org/wiki/Image:AAMarkov.jpg)

Stochastic Process Example
- Classic example: the random walk.
- Start at state X0 at time t0.
- At time ti, take a step Zi, where P(Zi = -1) = p and P(Zi = 1) = 1 - p.
- At time ti, the state is Xi = X0 + Z1 + ... + Zi.
- (Image: http://en.wikipedia.org/wiki/Image:Random_Walk_example.png)

Markov Property
- Also thought of as the "memoryless" property.
- A stochastic process is said to have the Markov property if the probability that state Xn+1 takes any given value depends only upon state Xn.
- Whether the property holds depends very much on how the states are described.

Markov Property Example
- Checkers: the current state is the current configuration of the board.
- That configuration contains all the information needed for the transition to the next state.
- Thus, each configuration can be said to have the Markov property.

Markov Chain
- A discrete-time stochastic process with the Markov property.
- Industry example: Google's PageRank algorithm, a probability distribution representing the likelihood that random link-following ends up on a given page.
- http://en.wikipedia.org/wiki/PageRank

Markov Decision Process (MDP)
- A discrete-time stochastic control process; an extension of Markov chains.
- Differences: the addition of actions (choice) and of rewards (motivation).
- If the action taken in each state is fixed, an MDP reduces to a Markov chain.

Description of MDPs
- A tuple (S, A, Pa(., .), R(.)):
  - S: the state space
  - A: the action space
  - Pa(s, s') = Pr(st+1 = s' | st = s, at = a)
  - R(s): the immediate reward at state s
- The goal is to maximize some cumulative function of the rewards.
- Finite MDPs have finite state and action spaces.

Simple MDP Example
- The recycling-robot MDP: the robot can search for a trash can, wait for someone to bring it one, or go home and recharge its battery.
- It has two energy levels: high and low.
- Searching runs down the battery, waiting does not, and a depleted battery earns a very low reward.
- (Image: news.bbc.co.uk)

Transition Probabilities

  s = st   s' = st+1   a = at     Pa(s,s')   Ra(s,s')
  high     high        search     α          Rsearch
  high     low         search     1 - α      Rsearch
  low      high        search     1 - β      -3
  low      low         search     β          Rsearch
  high     high        wait       1          Rwait
  high     low         wait       0          Rwait
  low      high        wait       0          Rwait
  low      low         wait       1          Rwait
  low      high        recharge   1          0
  low      low         recharge   0          0

Transition Graph
- (Figure: transition graph for the recycling robot, drawn with state nodes and action nodes.)

Solution to an MDP = Policy π
- A policy gives the action to take from a given state, regardless of history.
- The solution keeps two arrays indexed by state: V, the value function, i.e. the expected discounted sum of rewards from following the policy; and π, the action to be taken in each state (the policy).
- It alternates two basic steps:
  1. π(s) := argmaxa ∑s' Pa(s, s') V(s')
  2. V(s) := R(s) + γ ∑s' Pπ(s)(s, s') V(s')
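A minimal sketch of this two-step solution in Python, run on the recycling-robot MDP above, is given below. The slides leave α, β, Rsearch, and Rwait symbolic, so the numeric values here (and the discount factor γ) are illustrative assumptions; and since the transition table attaches rewards to transitions rather than to states, the sketch takes the expectation over Ra(s,s') instead of using a state reward R(s). Folding the greedy action choice and the value update into one pass makes this the value iteration variant listed next.

```python
# Value iteration on the recycling-robot MDP from the transition table above.
# GAMMA, ALPHA, BETA, R_SEARCH, and R_WAIT are assumed stand-in values; the
# slides leave them symbolic.

GAMMA = 0.9                    # discount factor (assumed)
ALPHA, BETA = 0.8, 0.4         # α and β from the table (assumed values)
R_SEARCH, R_WAIT = 2.0, 1.0    # stand-ins for Rsearch and Rwait

# transitions[state][action] = list of (next state, probability, reward);
# the zero-probability rows of the table are omitted.
transitions = {
    "high": {
        "search": [("high", ALPHA, R_SEARCH), ("low", 1 - ALPHA, R_SEARCH)],
        "wait":   [("high", 1.0, R_WAIT)],
    },
    "low": {
        "search":   [("high", 1 - BETA, -3.0), ("low", BETA, R_SEARCH)],
        "wait":     [("low", 1.0, R_WAIT)],
        "recharge": [("high", 1.0, 0.0)],
    },
}

V = {s: 0.0 for s in transitions}        # value function, indexed by state
policy = {s: None for s in transitions}  # action to take in each state

for _ in range(100):  # sweep until (approximately) converged
    for s in transitions:
        best_a, best_v = None, float("-inf")
        for a, outcomes in transitions[s].items():
            # expected reward plus discounted value of the successor state
            v = sum(p * (r + GAMMA * V[s2]) for s2, p, r in outcomes)
            if v > best_v:
                best_a, best_v = a, v
        policy[s], V[s] = best_a, best_v  # greedy π update and V update together

print(policy)
print(V)
```

With these stand-in numbers the sweep settles on searching in the high state and recharging in the low state; changing α, β, or the rewards can change the resulting policy.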
Variants
- Value iteration
- Policy iteration
- Modified policy iteration
- Prioritized sweeping

Value Iteration
- Collapses the two basic steps into a single update:
  V(s) := R(s) + γ maxa ∑s' Pa(s, s') V(s')
- Example run on a four-state MDP (states PU, PF, RU, RF):

  k   Vk(PU)   Vk(PF)   Vk(RU)   Vk(RF)
  1    0        0       10       10
  2    0        4.5     14.5     19
  3    2.03     8.55    18.55    24.18
  4    4.76    11.79    19.26    29.23
  5    7.45    15.30    20.81    31.82
  6   10.23    17.67    22.72    33.68

Why So Interesting?
- If the transition probabilities are known, solving the MDP is a straightforward computational problem.
- If the transition probabilities are unknown, however, it becomes a problem for reinforcement learning.

Typical Agent
- In reinforcement learning (RL), the agent observes a state and takes an action; afterward, it receives a reward.

Mission: Optimize Reward
- Rewards are calculated in the environment and are used to teach the agent how to reach a goal state.
- The reward must signal what we ultimately want achieved, not necessarily subgoals.
- Rewards may be discounted over time.
- In general, the agent seeks to maximize the expected return.

Value Functions
- Vπ, the state-value function for policy π, answers: how good is it to be in this state?
- Vπ is the unique solution to its Bellman equation, which expresses the relationship between the value of a state and the values of its successor states:
  Vπ(s) = ∑a π(s, a) ∑s' Pa(s,s') [Ra(s,s') + γ Vπ(s')]

Another Value Function
- Qπ, the action-value function for policy π, gives the value of taking action a in state s under π: the expected return starting from s, taking action a, and thereafter following policy π.
- (Figure: backup diagrams for (a) Vπ and (b) Qπ.)

Dynamic Programming
- Classically, a collection of algorithms used to compute optimal policies given a perfect model of the environment as an MDP.
- The classical view is not so useful in practice, since we rarely have a perfect model of the environment.
- It provides the foundation for the other methods, but is not practical for large problems.

DP Continued
- Use value functions to organize and structure the search for good policies.
- Turn the Bellman equations into update rules.
- Iterative policy evaluation applies these updates using full backups.

Policy Improvement
- When should we change the policy?
- If picking a new action a from state s and thereafter following the current policy π is at least as good as following π throughout, i.e. Qπ(s, a) >= Vπ(s), then the changed policy π' is at least as good overall: Vπ'(s) >= Vπ(s) for every state.
- This follows from the policy improvement theorem.

Policy Iteration
- Keep improving the policy π and recalculating Vπ.
- A finite MDP has only a finite number of policies, so convergence is guaranteed in a finite number of iterations.

Remember Value Iteration?
- Value iteration truncates policy iteration by combining one sweep of policy evaluation and one sweep of policy improvement in each of its sweeps.

Monte Carlo Methods
- Require only episodic experience – on-line or simulated interaction with an environment.
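Since Monte Carlo methods learn directly from sampled episodes, a minimal sketch of first-visit Monte Carlo policy evaluation is given below. The environment interface is an assumption: two hypothetical callables, reset() returning an initial state and step(state, action) returning (next_state, reward, done). Each state's value is estimated as the average return following the first visit to that state in each episode.

```python
from collections import defaultdict

def mc_policy_evaluation(reset, step, policy, n_episodes=1000, gamma=0.9):
    """Estimate V(s) for a fixed policy from sampled episodes
    (first-visit Monte Carlo)."""
    returns = defaultdict(list)  # state -> observed first-visit returns
    for _ in range(n_episodes):
        # Generate one episode by following the policy.
        episode, state, done = [], reset(), False
        while not done:
            next_state, reward, done = step(state, policy(state))
            episode.append((state, reward))
            state = next_state
        # Walk backward accumulating the discounted return G; repeatedly
        # overwriting first_return[s] leaves the return from s's earliest
        # (first) visit, which is what first-visit MC records.
        G, first_return = 0.0, {}
        for s, r in reversed(episode):
            G = r + gamma * G
            first_return[s] = G
        for s, g in first_return.items():
            returns[s].append(g)
    # Average the sampled returns to estimate the value function.
    return {s: sum(g) / len(g) for s, g in returns.items()}
```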