CS664 Lecture #5: Markov chains

Some source material taken from Joseph Chang:
http://www.stat.yale.edu/~jtc5/jtc.html

Coins with memory

• Suppose that the coin acts the way that gamblers think
  – Look back at the last result
  – Produce the opposite answer (probability p) or the same answer (probability 1 - p)
• At p = .5, what percentage of heads do we expect in the limit?
  – What about at p = .1? ("Stubborn" coin)
  – What about at p = .9? ("Flighty" coin)
  – What about at p = 0? ("Stuck" coin)

Markov chains

• Generalization of a finite automaton
• Probabilistic transitions (edge weights)

[Diagram: states H and T; each state loops to itself with probability 1 - p and crosses to the other state with probability p]

Markov chain evolution

• Distribution over states at a given time
• Taking a step updates the distribution
  – According to the edge weights
  – Consider the "Markov frog"

[Diagram: states s1 and s2; s1 loops to itself with probability 1/3 and hops to s2 with probability 2/3; s2 loops to itself with probability 3/5 and hops to s1 with probability 2/5]

Example in action

[Figure: the same two-state frog chain, with the distribution over states shown evolving step by step]

Transition matrix

• The "probability mass" moved to a state is a linear combination of the masses at adjacent states
  – Coefficients are the edge weights

Some notation

• Stochastic vector π has non-negative elements that sum to 1
  – Stochastic matrix K has stochastic columns
  – π_n is the distribution after n steps

Stationary distributions

• For the frog chain, π = (3/8, 5/8) is stationary:

    1/3 · 3/8 + 2/5 · 5/8 = 3/8
    2/3 · 3/8 + 3/5 · 5/8 = 5/8

Perron-Frobenius theorem

• If a Markov chain is strongly connected and has self-loops, it converges to a unique stationary distribution
  – No matter what the starting distribution
• Multiple self-loops are not required
  – Need to avoid "oscillating" cases

Convergence rates

• Nothing in this theorem about the rate!
• There are some complicated theorems on this topic
  – Nothing that guarantees fast convergence for the cases of interest

Markov coin revisited

• The transition matrix is given by (reconstructed here, taking p as the probability of flipping after heads and q after tails):

      K = | 1-p    q  |
          |  p    1-q |

  – What about p = q = 0?

Markov chains in vision

• Vital tool for many vision problems
  – Basis for trigrams, hence Efros & Leung
    • Images have "local" structure
• Major application: sampling
  – Generating answers from a distribution
• Major application: energy minimization
  – Also known as optimization
  – Elegant way to formulate most vision problems
  – Lots of interesting and powerful algorithms

Energy function

[Figure: an energy function E(x, y) over candidates (x, y), marking a local min, the global min, and a candidate point]

Gradient descent

[Figure: gradient descent rolling downhill on E(x, y)]

Properties of E

• Local versus global minimum
• If E is convex, every local minimum is a global minimum
  – Issue becomes convergence speed
  – In vision, we're rarely so lucky
• We can compute the global min sometimes
  – Other times compute a "strong" local min

Tradeoffs of optimization

• Advantages
  – Clean separation between what you want to compute and how you compute it
  – Easy to add new constraints (terms)
  – Simple to explain
• Disadvantages
  – Optimization is often difficult
  – Separation of what and how can hurt you

Complexity

• In complete generality, computing the global min requires exhaustive search
• Consider 2 energy functions
  – Uniform (flat everywhere)
  – Uniform with a well somewhere
• True even if P=NP

Consequences

• Consider an optimization method that can find the global min of an arbitrary E
  – Must require exponential time
  – Asymptotically the same as exhaustive search
• Might work for a particular problem
• Strong methods have limited E
  – You need to understand and exploit the structure of the problem

General-purpose methods

• Example: genetic algorithms
  – Not a method taken seriously by reputable academics, in vision or elsewhere
• Population of candidate solutions
  – Representation is key
• Create new population
  – Crossovers, mutations
  – Replace the worst (highest E) candidates

Gradient descent alternatives?
• If E is convex we can just roll downhill
  – Otherwise, the risk is getting stuck in a local min
• What if we sometimes move uphill?

Metropolis(E, T)

1. Generate a random change ("sampling")
2. If the energy is lower, go there
3. If the energy is higher:
   3.1. Go there with probability ∝ exp(-ΔE/T)
   3.2. Otherwise, stay at the old candidate

Metropolis properties

• We can do nothing (step 3.2)
• Gradient descent at low T, random search at high T
• Randomized algorithm
  – Output is a distribution over candidates
  – Hence, a distribution over energies

Random walks on graphs

• Suppose we pick an edge uniformly at random from the outgoing edges
  – Undirected graph with self-loops
  – What is the stationary distribution?
    • More likely to end up at a node with many (incoming) edges, i.e. high degree

[Figure: small undirected graph; nodes labeled with their degrees (2, 3, 3, 2)]

Biased random walks

• What if we want a different stationary distribution?
  – E.g., a high-degree node should be "unpopular"
  – Solution: change the transition probabilities
    • I.e., don't pick an outgoing edge uniformly
• Weight the outgoing edges by their relative popularity in the desired distribution

Example

[Figure: the graph above with node weights rescaled ("Multiply by 2", "Multiply by 2/3") to produce the desired stationary distribution]
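
The last few slides can be checked numerically. The sketch below power-iterates two random walks on a small undirected graph with node degrees 2, 3, 3, 2 (the particular edge set, and the Metropolis-style reweighting rule, are illustrative assumptions; the slide's own figure is not recoverable from the text). The unbiased walk converges to a stationary distribution proportional to degree; reweighting each hop i → j by min(1, deg(i)/deg(j)) makes the high-degree nodes "unpopular" enough that the stationary distribution becomes uniform.

```python
# An assumed 4-node undirected graph with degrees 2, 3, 3, 2.
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
n = len(adj)

def walk(P, steps=200):
    """Power iteration: repeatedly push all probability mass along the
    transition probabilities P[i][j] = Pr(i -> j)."""
    pi = [1.0] + [0.0] * (n - 1)          # start with all mass on node 0
    for _ in range(steps):
        pi = [sum(P[i][j] * pi[i] for i in range(n)) for j in range(n)]
    return pi

# Unbiased walk: pick an outgoing edge uniformly at random.
P_unif = [[1 / len(adj[i]) if j in adj[i] else 0.0 for j in range(n)]
          for i in range(n)]
pi_unif = walk(P_unif)    # -> (0.2, 0.3, 0.3, 0.2), proportional to degree

# Biased walk (a Metropolis-style reweighting, assumed here): accept a
# proposed hop i -> j with probability min(1, deg(i)/deg(j)), otherwise
# stay put.  Detailed balance then holds for the uniform distribution.
P_bias = [[0.0] * n for _ in range(n)]
for i in adj:
    for j in adj[i]:
        P_bias[i][j] = (1 / len(adj[i])) * min(1, len(adj[i]) / len(adj[j]))
    P_bias[i][i] = 1 - sum(P_bias[i])     # rejected moves become self-loops
pi_bias = walk(P_bias)    # -> (0.25, 0.25, 0.25, 0.25)
```

Both chains satisfy the Perron-Frobenius conditions above (strongly connected, aperiodic), so the starting distribution does not matter; starting from any node yields the same limits.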