
CMSC828G Principles of Data Mining – Lecture #10

• Today's Reading:
  – HMS, 9.6, 6.4, 7.3.2, 7.4, 8.4
• Today's Lecture:
  – Probabilistic Model-based Clustering
  – Descriptive Modeling
• Upcoming Due Dates:
  – H2 due Thu 3/7

Families of Clustering Algorithms
• Partition-based methods – e.g., K-means
• Hierarchical clustering – e.g., hierarchical agglomerative clustering
• Probabilistic model-based clustering – e.g., mixture models

Probabilistic Model-based Clustering
• Assume a probability model for each component cluster
• Mixture model: $f(x) = \sum_{k=1}^{K} w_k f_k(x; \theta_k)$
• where the $f_k$ are component distributions
• component families: Gaussian, Poisson, exponential

Gaussian Mixture Models (GMM)
• K components
• model for each component cluster: $N(\mu_k, \sigma_k)$
• $p(x) = \sum_{k=1}^{K} w_k f_k(x; \mu_k, \sigma_k)$
(figure: three Gaussian components with means $\mu_1$, $\mu_2$, $\mu_3$)
(much borrowed from Andrew Moore's Clustering with Gaussian Mixtures lecture notes)

GMM, cont.
• Generative model:
  – choose component $k$ with probability $w_k$
  – generate $X \sim N(\mu_k, \sigma_k)$
(figure: a point $X$ generated from the component with mean $\mu_1$)

GMM, cont.
• But we need to learn a model from the data…
• $D = \{x(1), \ldots, x(n)\}$
• First, suppose we know $w_1, \ldots, w_K$: $P(x \mid w_j, \mu_1, \ldots, \mu_K, \sigma_1, \ldots, \sigma_K) = ?$
• Next: $P(x \mid \mu_1, \ldots, \mu_K, \sigma_1, \ldots, \sigma_K) = ?$
• And: $P(D \mid \mu_1, \ldots, \mu_K, \sigma_1, \ldots, \sigma_K) = ?$

Look Familiar?
• $P(D \mid \mu_1, \ldots, \mu_K, \sigma_1, \ldots, \sigma_K)$ gives us the probability of the data for a particular choice of $\mu_1, \ldots, \mu_K, \sigma_1, \ldots, \sigma_K$
• How do we choose the $\mu_i$, $\sigma_i$?
• Find the $\mu_i$ that maximize the likelihood
• Set $\frac{\partial}{\partial \mu_j} \log P(D \mid \mu_1, \ldots, \mu_K, \sigma_1, \ldots, \sigma_K) = 0$ and solve for the $\mu_i$
• Problem: we get a bunch of non-linear, non-analytically-solvable equations
• One solution: gradient descent… slow
• Instead…

Expectation Maximization (EM)
• Dempster, Laird, and Rubin, 1977
• extremely popular recently
• applicable to a wide range of problems
• many uses: hidden Markov models, …
• the basic idea is quite simple…

EM Setup
• Let $D = \{x(1), \ldots, x(n)\}$ be $n$ observed data vectors
• Let $H = \{z(1), \ldots, z(n)\}$ be $n$ values of a hidden variable $Z$ (these might be the cluster labels)
• Then the log-likelihood of the observed data is
  $l(\theta) = \log p(D \mid \theta) = \log \sum_H p(D, H \mid \theta)$
• both $\theta$ and $H$ are unknown
• Let $Q(H)$ be any probability distribution over $H$. Then
  $l(\theta) = \log \sum_H Q(H)\,\frac{p(D, H \mid \theta)}{Q(H)} \;\ge\; \sum_H Q(H) \log \frac{p(D, H \mid \theta)}{Q(H)} \;=\; \sum_H Q(H) \log p(D, H \mid \theta) + \sum_H Q(H) \log \frac{1}{Q(H)} \;=\; F(Q, \theta)$
• $F(Q, \theta)$ is a lower bound on $l(\theta)$

Jensen's Inequality
• http://www.engineering.usu.edu/classes/ece/7680/lecture2/node5.html

EM Algorithm
• The EM algorithm alternates between
  – maximizing $F$ with respect to the distribution $Q$ with $\theta$ fixed
  – maximizing $F$ with respect to $\theta$ with $Q$ held fixed
• E-step: $Q^{k+1} = \arg\max_Q F(Q, \theta^k)$
• M-step: $\theta^{k+1} = \arg\max_\theta F(Q^{k+1}, \theta)$
• Maximum for the E-step: $Q^{k+1} = p(H \mid D, \theta^k)$
• Maximum for the M-step: $\theta^{k+1} = \arg\max_\theta \sum_H p(H \mid D, \theta^k) \log p(D, H \mid \theta)$
• Intuition:
  – In the E-step, we estimate the distribution over the hidden variables, conditioned on a particular setting of the parameter vector $\theta^k$
  – In the M-step, we choose a new set of parameters $\theta^{k+1}$ to maximize the expected log-likelihood of the observed data

Notes
• Often both the E and M steps can be solved in closed form
• Neither the E step nor the M step can decrease the log-likelihood
• Under relatively general conditions the algorithm is guaranteed to converge to a local maximum of the log-likelihood
• We must specify a starting point for the algorithm, for example a random choice of $\theta$ or $Q$
• We must specify stopping criteria, or convergence detection
• Computational complexity: the number of iterations and the time to compute the E and M steps

Back to our example…
• We want to fit a normal mixture distribution:
  $f(x) = \sum_{k=1}^{K} w_k f_k(x; \mu_k, \sigma_k)$
• $w_k$ is the prior probability of $x$ belonging to component $k$
• $\theta = (w_1, \ldots, w_K, \mu_1, \ldots, \mu_K, \sigma_1, \ldots, \sigma_K)$ is the parameter vector
• E-step: the probability of data point $x$ coming from the $k$th class is
  $\hat{P}(k \mid x) = \frac{w_k f_k(x; \mu_k, \sigma_k)}{f(x)}$
• M-step:
  $\hat{w}_k = \frac{1}{n} \sum_{i=1}^{n} \hat{P}(k \mid x(i))$
  $\hat{\mu}_k = \frac{1}{n \hat{w}_k} \sum_{i=1}^{n} \hat{P}(k \mid x(i))\, x(i)$
  $\hat{\sigma}_k^2 = \frac{1}{n \hat{w}_k} \sum_{i=1}^{n} \hat{P}(k \mid x(i))\, \big(x(i) - \hat{\mu}_k\big)^2$
(a code sketch of these updates appears below)
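To make the E-step/M-step updates above concrete, here is a minimal one-dimensional sketch in Python (not from the lecture). It implements the responsibilities $\hat{P}(k \mid x(i))$ and the weight, mean, and variance re-estimates given above; the initialization scheme, the fixed iteration count, and the small numerical-stability constants are illustrative choices rather than anything prescribed by these notes.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Univariate Gaussian density N(mu, sigma^2), evaluated elementwise."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def gmm_em(x, K, n_iters=100, seed=0):
    """EM for a one-dimensional K-component Gaussian mixture, following the
    E-step / M-step updates in the slides above."""
    rng = np.random.default_rng(seed)
    n = len(x)
    # Starting point: uniform weights, K distinct data points as means,
    # and the overall standard deviation for every component.
    w = np.full(K, 1.0 / K)
    mu = rng.choice(x, size=K, replace=False)
    sigma = np.full(K, x.std() + 1e-8)

    for _ in range(n_iters):
        # E-step: responsibilities P_hat(k | x(i)) = w_k f_k(x(i)) / f(x(i))
        dens = np.array([w[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(K)])
        resp = dens / dens.sum(axis=0, keepdims=True)            # shape (K, n)

        # M-step: re-estimate weights, means, and variances
        nk = resp.sum(axis=1)                                    # n * w_hat_k
        w = nk / n                                               # w_hat_k
        mu = (resp @ x) / nk                                     # mu_hat_k
        var = (resp * (x - mu[:, None]) ** 2).sum(axis=1) / nk   # sigma_hat_k^2
        sigma = np.sqrt(var + 1e-12)

    return w, mu, sigma

# Example: recover two well-separated components from synthetic data.
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])
weights, means, stds = gmm_em(data, K=2)
print(weights, means, stds)
```

A fixed iteration count is used here only for brevity; in practice one would stop when the change in log-likelihood falls below a threshold, as discussed in the Notes slide.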
EM Demo
• http://diwww.epfl.ch/mantra/tutorial/english/gaussian/html/

EM Comments
• Complexity of EM for multivariate Gaussian mixtures with K components: dominated by the calculation of the K covariance matrices
  – with $p$ dimensions, there are $O(Kp^2)$ covariance parameters to be estimated
  – each requires summing over the $n$ data points and cluster weights, leading to $O(Kp^2 n)$ time per step
• Often there are large increases in likelihood over the first few iterations, followed by slow convergence; the likelihood as a function of the iteration number is not necessarily concave

Likelihood as a Function of EM Iteration
(figure 9.2 in the text)

Probabilistic Model-based Clustering
• The model provides a full distributional description for each component; we may be able to interpret differences among the distributions
• Given the model, each data point has a K-component vector of probabilities that it belongs to each group (which is why this is called soft clustering)
• The EM algorithm can be extended to MAP and Bayesian estimation
• The method can be extended to data that are not in $p$-dimensional vector form: mixtures of sequence models, e.g. Markov models, and others
• Key cost: the assumption of a parametric model
• and finally… how do we choose K?

How to Choose K
• Choose the K that maximizes the likelihood?
• NOT.
• As K is increased, the value of the likelihood at its maximum cannot decrease
• This is the problem of scoring models with different complexities

Model Selection
• First, as we mentioned before, there are two possible goals in constructing a model
• Goal #1: summarization – we would like to describe the data as precisely as possible
  – given that our model is comprehensible (subjective) or compact (objective)
  – a general approach based on data compression and information-theoretic arguments results in the score function
    $\mathrm{score}(\theta, M) = -\log p(D \mid \theta, M) - \log p(\theta, M)$
  – i.e., score(θ, M) = (# of bits to transmit the data not accounted for by the model) + (# of bits to transmit the model)

Model Selection, cont.
• Goal #2: generalize from the available data to new data
  – goodness of fit is part of the objective
  – but, since the data are not the entire population, we want to learn a model that will generalize to new data instances
• In both cases, we want a score function that strikes a compromise between how well the model fits the data and the simplicity of the model, although the theoretical motivations are quite different (a small numerical illustration for the mixture-model case follows below)

Bias-Variance
• Model too flexible ⇒ overfits the data ⇒ high variance
• Model too restrictive ⇒ can't fit the data ⇒ high bias
• Bias-variance
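To illustrate why the maximized likelihood alone cannot be used to choose K, and how a penalized score of the kind above behaves, here is a small sketch that reuses the hypothetical gmm_em and normal_pdf helpers from the earlier code block. The parameter count 3K − 1 (K means, K standard deviations, K − 1 free weights) and the BIC-style $\tfrac{1}{2}\log n$ penalty are one common concrete choice, shown only as an assumed example; the lecture does not prescribe a specific penalty.

```python
import numpy as np

def mixture_log_likelihood(x, w, mu, sigma):
    """log p(D | theta) = sum_i log sum_k w_k N(x(i); mu_k, sigma_k^2)."""
    dens = np.array([w[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(len(w))])
    return np.log(dens.sum(axis=0)).sum()

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])

for K in range(1, 6):
    w, mu, sigma = gmm_em(data, K)                    # from the earlier sketch
    ll = mixture_log_likelihood(data, w, mu, sigma)   # best achievable value cannot decrease with K
    n_params = 3 * K - 1                              # K means + K std devs + (K-1) free weights
    score = -ll + 0.5 * n_params * np.log(len(data))  # fit term + complexity penalty (BIC-style)
    print(f"K={K}  log-likelihood={ll:9.1f}  penalized score={score:9.1f}")
```

The fit term alone keeps improving as K grows; the penalized score is what allows models of different complexity to be compared.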

