Markov Decision Processes & Reinforcement Learning
Megan Smith
Lehigh University, Fall 2006

Outline
- Stochastic Process
- Markov Property
- Markov Chain
- Markov Decision Process
- Reinforcement Learning
- RL Techniques
- Example Applications

Stochastic Process
- Quick definition: a random process.
- Often viewed as a collection of indexed random variables.
- Useful to us: a set of states, with probabilities of being in those states, indexed over time.
- We'll deal with discrete stochastic processes.
- (Image: http://en.wikipedia.org/wiki/Image:AAMarkov.jpg)

Stochastic Process Example
- Classic example: the random walk.
- Start at state X0 at time t0.
- At time ti, take a step Zi, where P(Zi = -1) = p and P(Zi = 1) = 1 - p.
- At time ti, the state is Xi = X0 + Z1 + ... + Zi.
- (Image: http://en.wikipedia.org/wiki/Image:Random_Walk_example.png)

Markov Property
- Also thought of as the "memoryless" property.
- A stochastic process is said to have the Markov property if the probability that state Xn+1 takes any given value depends only upon state Xn.
- Whether the property holds depends very much on how the states are described.

Markov Property Example
- Checkers: the current state is the current configuration of the board.
- That configuration contains all the information needed for the transition to the next state.
- Thus, each configuration can be said to have the Markov property.

Markov Chain
- A discrete-time stochastic process with the Markov property.
- Industry example: Google's PageRank algorithm, a probability distribution representing the likelihood that random link-following ends up on a given page.
- http://en.wikipedia.org/wiki/PageRank

Markov Decision Process (MDP)
- A discrete-time stochastic control process; an extension of Markov chains.
- Differences: the addition of actions (choice) and of rewards (motivation).
- If the action taken in each state is fixed, an MDP reduces to a Markov chain.

Description of MDPs
- A tuple (S, A, Pa(., .), R(.)):
  - S: the state space
  - A: the action space
  - Pa(s, s') = Pr(st+1 = s' | st = s, at = a)
  - R(s): the immediate reward at state s
- The goal is to maximize some cumulative function of the rewards.
- Finite MDPs have finite state and action spaces.

Simple MDP Example
- The recycling-robot MDP: the robot can search for a trash can, wait for someone to bring it one, or go home and recharge its battery.
- It has two energy levels: high and low.
- Searching runs down the battery, waiting does not, and a depleted battery earns a very low reward.
- (Image: news.bbc.co.uk)

Transition Probabilities

  s = st   s' = st+1   a = at     Pa(s,s')   Ra(s,s')
  high     high        search     α          Rsearch
  high     low         search     1 - α      Rsearch
  low      high        search     1 - β      -3
  low      low         search     β          Rsearch
  high     high        wait       1          Rwait
  high     low         wait       0          Rwait
  low      high        wait       0          Rwait
  low      low         wait       1          Rwait
  low      high        recharge   1          0
  low      low         recharge   0          0

Transition Graph
- (Figure: transition graph for the recycling robot, drawn with state nodes and action nodes.)

Solution to an MDP = Policy π
- A policy gives the action to take from a given state, regardless of history.
- The solution keeps two arrays indexed by state: V, the value function, i.e. the expected discounted sum of rewards from following the policy; and π, the action to be taken in each state (the policy).
- It alternates two basic steps:
  1. π(s) := argmaxa ∑s' Pa(s, s') V(s')
  2. V(s) := R(s) + γ ∑s' Pπ(s)(s, s') V(s')
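A minimal sketch of this two-step solution in Python, run on the recycling-robot MDP above, is given below. The slides leave α, β, Rsearch, and Rwait symbolic, so the numeric values here (and the discount factor γ) are illustrative assumptions; and since the transition table attaches rewards to transitions rather than to states, the sketch takes the expectation over Ra(s,s') instead of using a state reward R(s). Folding the greedy action choice and the value update into one pass makes this the value iteration variant listed next.

```python
# Value iteration on the recycling-robot MDP from the transition table above.
# GAMMA, ALPHA, BETA, R_SEARCH, and R_WAIT are assumed stand-in values; the
# slides leave them symbolic.

GAMMA = 0.9                    # discount factor (assumed)
ALPHA, BETA = 0.8, 0.4         # α and β from the table (assumed values)
R_SEARCH, R_WAIT = 2.0, 1.0    # stand-ins for Rsearch and Rwait

# transitions[state][action] = list of (next state, probability, reward);
# the zero-probability rows of the table are omitted.
transitions = {
    "high": {
        "search": [("high", ALPHA, R_SEARCH), ("low", 1 - ALPHA, R_SEARCH)],
        "wait":   [("high", 1.0, R_WAIT)],
    },
    "low": {
        "search":   [("high", 1 - BETA, -3.0), ("low", BETA, R_SEARCH)],
        "wait":     [("low", 1.0, R_WAIT)],
        "recharge": [("high", 1.0, 0.0)],
    },
}

V = {s: 0.0 for s in transitions}        # value function, indexed by state
policy = {s: None for s in transitions}  # action to take in each state

for _ in range(100):  # sweep until (approximately) converged
    for s in transitions:
        best_a, best_v = None, float("-inf")
        for a, outcomes in transitions[s].items():
            # expected reward plus discounted value of the successor state
            v = sum(p * (r + GAMMA * V[s2]) for s2, p, r in outcomes)
            if v > best_v:
                best_a, best_v = a, v
        policy[s], V[s] = best_a, best_v  # greedy π update and V update together

print(policy)
print(V)
```

With these stand-in numbers the sweep settles on searching in the high state and recharging in the low state; changing α, β, or the rewards can change the resulting policy.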
Variants
- Value iteration
- Policy iteration
- Modified policy iteration
- Prioritized sweeping

Value Iteration
- Collapses the two basic steps into a single update:
  V(s) := R(s) + γ maxa ∑s' Pa(s, s') V(s')
- Example run on a four-state MDP (states PU, PF, RU, RF):

  k   Vk(PU)   Vk(PF)   Vk(RU)   Vk(RF)
  1    0        0       10       10
  2    0        4.5     14.5     19
  3    2.03     8.55    18.55    24.18
  4    4.76    11.79    19.26    29.23
  5    7.45    15.30    20.81    31.82
  6   10.23    17.67    22.72    33.68

Why So Interesting?
- If the transition probabilities are known, solving the MDP is a straightforward computational problem.
- If the transition probabilities are unknown, however, it becomes a problem for reinforcement learning.

Typical Agent
- In reinforcement learning (RL), the agent observes a state and takes an action; afterward, it receives a reward.

Mission: Optimize Reward
- Rewards are calculated in the environment and are used to teach the agent how to reach a goal state.
- The reward must signal what we ultimately want achieved, not necessarily subgoals.
- Rewards may be discounted over time.
- In general, the agent seeks to maximize the expected return.

Value Functions
- Vπ, the state-value function for policy π, answers: how good is it to be in this state?
- Vπ is the unique solution to its Bellman equation, which expresses the relationship between the value of a state and the values of its successor states:
  Vπ(s) = ∑a π(s, a) ∑s' Pa(s,s') [Ra(s,s') + γ Vπ(s')]

Another Value Function
- Qπ, the action-value function for policy π, gives the value of taking action a in state s under π: the expected return starting from s, taking action a, and thereafter following policy π.
- (Figure: backup diagrams for (a) Vπ and (b) Qπ.)

Dynamic Programming
- Classically, a collection of algorithms used to compute optimal policies given a perfect model of the environment as an MDP.
- The classical view is not so useful in practice, since we rarely have a perfect model of the environment.
- It provides the foundation for the other methods, but is not practical for large problems.

DP Continued
- Use value functions to organize and structure the search for good policies.
- Turn the Bellman equations into update rules.
- Iterative policy evaluation applies these updates using full backups.

Policy Improvement
- When should we change the policy?
- If picking a new action a from state s and thereafter following the current policy π is at least as good as following π throughout, i.e. Qπ(s, a) >= Vπ(s), then the changed policy π' is at least as good overall: Vπ'(s) >= Vπ(s) for every state.
- This follows from the policy improvement theorem.

Policy Iteration
- Keep improving the policy π and recalculating Vπ.
- A finite MDP has only a finite number of policies, so convergence is guaranteed in a finite number of iterations.

Remember Value Iteration?
- Value iteration truncates policy iteration by combining one sweep of policy evaluation and one sweep of policy improvement in each of its sweeps.

Monte Carlo Methods
- Require only episodic experience – on-line or simulated interaction with an environment.
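Since Monte Carlo methods learn directly from sampled episodes, a minimal sketch of first-visit Monte Carlo policy evaluation is given below. The environment interface is an assumption: two hypothetical callables, reset() returning an initial state and step(state, action) returning (next_state, reward, done). Each state's value is estimated as the average return following the first visit to that state in each episode.

```python
from collections import defaultdict

def mc_policy_evaluation(reset, step, policy, n_episodes=1000, gamma=0.9):
    """Estimate V(s) for a fixed policy from sampled episodes
    (first-visit Monte Carlo)."""
    returns = defaultdict(list)  # state -> observed first-visit returns
    for _ in range(n_episodes):
        # Generate one episode by following the policy.
        episode, state, done = [], reset(), False
        while not done:
            next_state, reward, done = step(state, policy(state))
            episode.append((state, reward))
            state = next_state
        # Walk backward accumulating the discounted return G; repeatedly
        # overwriting first_return[s] leaves the return from s's earliest
        # (first) visit, which is what first-visit MC records.
        G, first_return = 0.0, {}
        for s, r in reversed(episode):
            G = r + gamma * G
            first_return[s] = G
        for s, g in first_return.items():
            returns[s].append(g)
    # Average the sampled returns to estimate the value function.
    return {s: sum(g) / len(g) for s, g in returns.items()}
```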