A Tutorial on the Partially Observable Markov Decision Process and Its Applications
Lawrence Carin
June 7, 2006

Outline
• Overview of Markov decision processes (MDPs)
• Introduction to partially observable Markov decision processes (POMDPs)
• Some applications of POMDPs
• Conclusions

Overview of MDPs

Markov decision processes
The MDP is defined by the tuple <S, A, T, R>:
• S is a finite set of states of the world.
• A is a finite set of actions.
• T: S × A → Π(S) is the state-transition function, giving the probability that an action changes the world state from one state to another, T(s, a, s').
• R: S × A → ℝ is the reward the agent receives for performing an action in a given world state, R(s, a).

Two properties of the MDP
• The action-dependent state transition is Markovian.
• The state is fully observable after taking action a.

Illustration of MDPs
[Figure: the agent-environment loop. The AGENT observes state s and chooses action a through the policy π : s → a; the WORLD changes state according to T(s, a, s').]

Objective of MDPs
• Find the optimal policy π, mapping state s to action a, that maximizes the value function V(s):

V(s) = \max_{a \in A} \Big[ R(s, a) + \sum_{s' \in S} T(s, a, s')\, V(s') \Big]

Introduction to POMDPs
The POMDP is defined by the tuple <S, A, T, R, Ω, O>:
• S, A, T, and R are defined as in MDPs.
• Ω is a finite set of observations through which the agent can experience its world.
• O: S × A → Π(Ω) is the observation function, giving the probability of making a certain observation after performing a particular action and landing in state s', O(s', a, o).

Differences between MDPs and POMDPs
• The state is hidden after taking action a.
• The hidden state information is inferred from the action- and state-dependent observation function O(s', a, o), so there is uncertainty about the state s in a POMDP.

A new concept in POMDPs: the belief state b(s)

b(s_t) = \Pr(s_t = s \mid o_1, a_1, o_2, a_2, \ldots, o_{t-1}, a_{t-1}, o_t)

The belief state b(s) evolves according to Bayes' rule:

b'(s') = \frac{O(s', a, o) \sum_{s \in S} T(s, a, s')\, b(s)}{\Pr(o \mid a, b)}    (1)

[Figure: a belief tree over states s1, s2, s3. With n control intervals remaining, belief b branches on observations o1 and o2 into the updated beliefs b' = T(b | a, o1) and b' = T(b | a, o2) with n-1 control intervals remaining.]

Illustration of POMDPs
[Figure: the POMDP agent-environment loop. The WORLD, governed by T(s, a, s') and O(s', a, o), emits observation o; inside the AGENT, the state estimator SE updates the belief b using (1), and the policy π : b → a selects the next action.]
SE: state estimator using (1)
π: policy search
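To make the two update rules above concrete, here is a minimal Python/NumPy sketch (not taken from the slides) of the MDP Bellman backup and the belief update of equation (1). The array layout, function names, and the tiny two-state example are assumptions made for illustration: T[s, a, s'] stores T(s, a, s'), R[s, a] stores R(s, a), and O[s', a, o] stores O(s', a, o).

```python
# A minimal sketch (assumptions noted above): finite-horizon value iteration for
# an MDP and the POMDP belief update of equation (1).
import numpy as np


def mdp_value_iteration(T, R, horizon):
    """Backward induction on V(s) = max_a [ R(s,a) + sum_s' T(s,a,s') V(s') ]."""
    n_states, n_actions, _ = T.shape
    V = np.zeros(n_states)              # V_0(s) = 0
    Q = np.zeros((n_states, n_actions))
    for _ in range(horizon):
        Q = R + T @ V                   # Q[s, a] = R(s,a) + sum_s' T(s,a,s') V(s')
        V = Q.max(axis=1)               # V_n(s) = max_a Q(s, a)
    return V, Q.argmax(axis=1)          # value function and greedy policy


def belief_update(b, a, o, T, O):
    """Equation (1): b'(s') = O(s',a,o) * sum_s T(s,a,s') b(s) / Pr(o | a, b)."""
    unnormalized = O[:, a, o] * (b @ T[:, a, :])   # numerator, a vector over s'
    prob_o = unnormalized.sum()                    # Pr(o | a, b), the normalizer
    if prob_o == 0.0:
        raise ValueError("observation has zero probability under (b, a)")
    return unnormalized / prob_o


if __name__ == "__main__":
    # Tiny two-state, two-action, two-observation model, made up for illustration.
    T = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.7, 0.3], [0.1, 0.9]]])   # T[s, a, s']
    R = np.array([[1.0, 0.0],
                  [0.0, 2.0]])                 # R[s, a]
    O = np.array([[[0.8, 0.2], [0.6, 0.4]],
                  [[0.3, 0.7], [0.1, 0.9]]])   # O[s', a, o]
    print(mdp_value_iteration(T, R, horizon=10))
    print(belief_update(np.array([0.5, 0.5]), a=0, o=1, T=T, O=O))
```

Note that the Bellman equation above is the finite-horizon, undiscounted form, so the sketch simply repeats the backup for a fixed number of stages rather than iterating to a fixed point.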
Objective of POMDPs
• Find the optimal policy π for the POMDP, mapping belief point b to action a, that maximizes the value function V(b):

V_n(b) = \max_{a \in A} \Big[ \sum_{s \in S} b(s)\, R(s, a) + \sum_{o \in \Omega} \Pr(o \mid a, b)\, V_{n-1}(b'_{a,o}) \Big]
       = \max_{a \in A} \Big[ \sum_{s \in S} b(s)\, R(s, a) + \sum_{o \in \Omega} \sum_{s' \in S} \sum_{s \in S} \Pr(o \mid s', a)\, \Pr(s' \mid s, a)\, b(s)\, V_{n-1}(b'_{a,o}) \Big]    (2)

The first term is the expected immediate reward, and b'_{a,o} is the belief obtained from b via (1) after taking action a and making observation o.

Optimal value function
• The optimal value function of a finite-horizon POMDP is piecewise linear and convex:

V_n(b) = \max_{k} \sum_{s \in S} \alpha_n^k(s)\, b(s) = \max_{k} \big[ \alpha_n^k \cdot b \big]    (3)

[Figure: V(b) plotted over the belief simplex Pr(s1) ∈ [0, 1]: a piecewise linear, convex surface, with belief points p1, …, p5 and the actions a(p1), a(p2), …, a(p5) associated with the dominating linear segments.]

Substituting (3) and (1) into (2):

V_n(b) = \max_{a \in A} \Big[ \sum_{s \in S} b(s)\, R(s, a) + \sum_{o \in \Omega} \max_{k} \sum_{s' \in S} \sum_{s \in S} \alpha_{n-1}^k(s')\, O(s', a, o)\, T(s, a, s')\, b(s) \Big]
       = \max_{a \in A} \Big[ \sum_{s \in S} b(s)\, R(s, a) + \sum_{o \in \Omega} \sum_{s' \in S} \sum_{s \in S} \alpha_{n-1}^{l(a,b,o)}(s')\, O(s', a, o)\, T(s, a, s')\, b(s) \Big]    (4)

Maximizing over k yields the index l(a, b, o) of the α-vector associated with belief point b; (4) gives the optimal value of belief point b.

Approaches to solving the POMDP problem
• Exact algorithms: find all α-vectors over the entire belief space; exact, but intractable for large problems.
• Approximate algorithms: find α-vectors for a subset of the belief space; fast, and able to handle large problems.

Point-Based Value Iteration (PBVI)
• Focus on a finite set of belief points.
• Maintain an α-vector for each point.
[Figure: belief points b0, b1, b3, b4, b5 scattered in the belief simplex, each with its own α-vector.]

Region-Based Value Iteration (RBVI)
• RBVI maintains an α-vector for each convex region over which the optimal value function is linear.
• RBVI simultaneously determines the α-vectors for all relevant convex regions based on all available belief points.

RBVI (cont'd)
• The piecewise linear value function is reformulated by introducing hidden variables z(b) = k, denoting b ∈ B_k.
• The belief space is partitioned using hyper-ellipsoids.
• The joint distribution of V(b) and b can then be written down; its parameters are estimated by Expectation-Maximization (EM), alternating an E step and an M step.

Applications of POMDPs
• Application of partially observable Markov decision processes to robot navigation in a minefield
• Application of partially observable Markov decision processes to feature selection
• Application of partially observable Markov decision processes to sensor scheduling

Some considerations in applying POMDPs to new problems
• How to define the state
• How to obtain the transition and observation matrices
• How to set the reward

References
1. L. P. Kaelbling, M. L. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, Vol. 101, 1998.
2. R. D. Smallwood and E. J. Sondik. The optimal control of partially observable Markov processes over a finite horizon. Operations Research, 21:1071–1088, 1973.
3. J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: an anytime algorithm for POMDPs. International Joint Conference on Artificial Intelligence (IJCAI), Acapulco, Mexico, Aug. 2003.
4. D. P. Bertsekas. Dynamic Programming and Optimal Control, Vol. 1 & 2. Athena Scientific, Belmont, Massachusetts, 2001.
5. R. Bellman. Dynamic Programming. Princeton University Press, 1957.
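As a concluding illustration, here is a minimal sketch of the point-based backup of equations (2)-(4) at a single belief point: given the α-vectors representing V_{n-1}, it forms, for each action, the backed-up vector using the maximizing index l(a, b, o) from (4) and keeps the best one for b. This is not the slides' (or Pineau et al.'s) reference implementation; the array layout follows the earlier sketch, and all names are illustrative assumptions.

```python
# A minimal sketch of the point-based backup of equations (2)-(4) at one belief
# point b. Assumed shapes: T[s, a, s'], R[s, a], O[s', a, o];
# alphas[k, s'] holds the alpha-vectors representing V_{n-1}.
import numpy as np


def point_based_backup(b, alphas, T, R, O):
    """Return (alpha, value, action) for belief b, per equation (4)."""
    n_states, n_actions, _ = T.shape
    n_obs = O.shape[2]
    best = (None, -np.inf, None)
    for a in range(n_actions):
        # g[k, o, s] = sum_s' alphas[k, s'] * O(s', a, o) * T(s, a, s')
        g = np.einsum("kt,to,st->kos", alphas, O[:, a, :], T[:, a, :])
        # l(a, b, o): index of the alpha-vector maximizing b . g[k, o, :]
        l = (g @ b).argmax(axis=0)                       # shape (n_obs,)
        # Backed-up vector: immediate reward plus the chosen vectors summed over o
        alpha_a = R[:, a] + g[l, np.arange(n_obs), :].sum(axis=0)
        value_a = float(b @ alpha_a)                     # b . alpha_a
        if value_a > best[1]:
            best = (alpha_a, value_a, a)
    return best


def pbvi_sweep(belief_points, alphas, T, R, O):
    """One point-based sweep: back up every belief point and keep its alpha-vector."""
    return np.array([point_based_backup(b, alphas, T, R, O)[0]
                     for b in belief_points])
```

Starting from a single all-zero α-vector for V_0 and a fixed set of belief points, repeating pbvi_sweep for n stages yields a piecewise linear approximation of the finite-horizon value function, which is the idea behind the point-based methods described above.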