CMSC828G Principles of Data Mining — Lecture #10

• Today's Reading:
  – HMS, 9.6, 6.4, 7.3.2, 7.4, 8.4
• Today's Lecture:
  – Probabilistic Model-based Clustering
  – Descriptive Modeling
• Upcoming Due Dates:
  – H2 due Thu 3/7

Families of Clustering Algorithms
• Partition-based methods, e.g., K-means
• Hierarchical clustering, e.g., hierarchical agglomerative clustering
• Probabilistic model-based clustering, e.g., mixture models

Probabilistic Model-based Clustering
• Assume a probability model for each component cluster
• Mixture model:
  $f(x) = \sum_{k=1}^{K} w_k f_k(x; \theta_k)$
  where the $f_k$ are the component distributions
• Component distributions: e.g., Gaussian, Poisson, exponential

Gaussian Mixture Models (GMM)
• K components
• Model for each component cluster: $N(\mu_k, \sigma_k)$
  $p(x) = \sum_{k=1}^{K} w_k f_k(x; \mu_k, \sigma_k)$
(much borrowed from Andrew Moore's Clustering Gaussian Mixtures lecture notes)

GMM cont.
• Generative model:
  – choose component k with probability $w_k$
  – generate $X \sim N(\mu_k, \sigma_k)$

GMM cont.
• But we need to learn a model from the data…
• $D = \{x(1), \dots, x(n)\}$
• First, suppose we know $w_1, \dots, w_K$:
  $P(x \mid w_j, \mu_1, \dots, \mu_K, \sigma_1, \dots, \sigma_K) = \,?$
• Next:
  $P(x \mid \mu_1, \dots, \mu_K, \sigma_1, \dots, \sigma_K) = \,?$
• And:
  $P(D \mid \mu_1, \dots, \mu_K, \sigma_1, \dots, \sigma_K) = \,?$

Look Familiar?
• $P(D \mid \mu_1, \dots, \mu_K, \sigma_1, \dots, \sigma_K)$ gives us the probability of the data for a particular choice of $\mu_1, \dots, \mu_K, \sigma_1, \dots, \sigma_K$
• How do we choose the $\mu_i$, $\sigma_i$? Find the $\mu_i$ that maximize the likelihood
• Set
  $\frac{\partial}{\partial \mu_j} \log P(D \mid \mu_1, \dots, \mu_K, \sigma_1, \dots, \sigma_K) = 0$
  and solve for the $\mu_i$
• Problem: we get a bunch of non-linear equations with no analytical solution
• One solution: gradient descent… slow
• Instead…

Expectation Maximization (EM)
• Dempster, Laird, and Rubin, 1977
• Extremely popular; applicable to a wide range of problems
• Many uses, e.g., hidden Markov models
• The basic idea is quite simple…

EM Setup
• Let $D = \{x(1), \dots, x(n)\}$ be n observed data vectors
• Let $H = \{z(1), \dots, z(n)\}$ be n values of a hidden variable Z (these might be the cluster labels)
• Then the log-likelihood of the observed data is
  $l(\theta) = \log p(D \mid \theta) = \log \sum_H p(D, H \mid \theta)$
• Both θ and H are unknown
• Let Q(H) be any probability distribution over H. Then
  $\log \sum_H p(D, H \mid \theta) = \log \sum_H Q(H) \frac{p(D, H \mid \theta)}{Q(H)} \ge \sum_H Q(H) \log \frac{p(D, H \mid \theta)}{Q(H)}$
  $= \sum_H Q(H) \log p(D, H \mid \theta) + \sum_H Q(H) \log \frac{1}{Q(H)} = F(Q, \theta)$
  which is a lower bound on $l(\theta)$; the inequality is Jensen's inequality

Jensen's Inequality
• http://www.engineering.usu.edu/classes/ece/7680/lecture2/node5.html

EM Algorithm
• The EM algorithm alternates between:
  – E-step: maximize F with respect to the distribution Q with θ fixed:
    $Q^{k+1} = \arg\max_Q F(Q, \theta^k)$
  – M-step: maximize F with respect to θ with Q fixed:
    $\theta^{k+1} = \arg\max_\theta F(Q^{k+1}, \theta)$
• Maximum for the E-step:
  $Q^{k+1} = p(H \mid D, \theta^k)$
• Maximum for the M-step:
  $\theta^{k+1} = \arg\max_\theta \sum_H p(H \mid D, \theta^k) \log p(D, H \mid \theta)$
• Intuition:
  – In the E-step, we estimate the distribution over the hidden variables, conditioned on a particular setting of the parameter vector $\theta^k$
  – In the M-step, we choose a new set of parameters $\theta^{k+1}$ to maximize the expected log-likelihood of the observed data

Notes
• Often both the E and M steps can be solved in closed form
• Neither the E step nor the M step can decrease the log-likelihood
• Under relatively general conditions, the algorithm is guaranteed to converge to a local maximum of the log-likelihood
• We must specify a starting point for the algorithm, for example a random choice of θ or Q
• We must specify stopping criteria, or convergence detection
• Computational complexity: number of iterations × time to compute the E and M steps

Back to our example…
• We want to fit a normal mixture distribution:
  $f(x) = \sum_{k=1}^{K} w_k f_k(x; \mu_k, \sigma_k)$
  where $w_k$ is the prior probability of x belonging to component k, and $\theta = (w_1, \dots, w_K, \mu_1, \dots, \mu_K, \sigma_1, \dots, \sigma_K)$ is the parameter vector
• E-step: the probability that data point x comes from the kth class is
  $\hat{P}(k \mid x) = \frac{w_k f_k(x; \mu_k, \sigma_k)}{f(x)}$
• M-step:
  $\hat{w}_k = \frac{1}{n} \sum_{i=1}^{n} \hat{P}(k \mid x(i))$
  $\hat{\mu}_k = \frac{1}{n \hat{w}_k} \sum_{i=1}^{n} \hat{P}(k \mid x(i)) \, x(i)$
  $\hat{\sigma}_k^2 = \frac{1}{n \hat{w}_k} \sum_{i=1}^{n} \hat{P}(k \mid x(i)) \, (x(i) - \hat{\mu}_k)^2$
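To make the updates concrete, here is a minimal NumPy sketch (our own illustration, not code from the lecture) of EM for a one-dimensional Gaussian mixture, implementing the E-step responsibilities and M-step re-estimates exactly as written above. The function and variable names (em_gmm_1d, resp, Nk) are assumptions of the sketch; a practical implementation would also compute densities in log space, monitor convergence of the log-likelihood, and restart from several initializations.

```python
import numpy as np

def em_gmm_1d(x, K, n_iter=100, seed=0):
    """Fit a K-component 1-D Gaussian mixture to data x by EM (sketch)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    # Initialization: uniform weights, means drawn from the data, pooled variance.
    w = np.full(K, 1.0 / K)
    mu = rng.choice(x, size=K, replace=False)
    var = np.full(K, x.var())

    for _ in range(n_iter):
        # E-step: P_hat(k | x(i)) = w_k N(x(i); mu_k, sigma_k) / f(x(i))
        dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
               / np.sqrt(2.0 * np.pi * var)            # (n, K) weighted densities
        resp = dens / dens.sum(axis=1, keepdims=True)  # (n, K) responsibilities

        # M-step: weighted re-estimates from the formulas above.
        Nk = resp.sum(axis=0)                          # effective counts, n * w_hat_k
        w = Nk / n
        mu = (resp * x[:, None]).sum(axis=0) / Nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk

    return w, mu, np.sqrt(var)

# Usage: two well-separated components.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2.0, 0.5, 300), rng.normal(3.0, 1.0, 700)])
w, mu, sigma = em_gmm_1d(x, K=2)
print(w, mu, sigma)   # weights near (0.3, 0.7), means near (-2, 3)
```

Note how the M-step divides each weighted sum by $n \hat{w}_k$ (the Nk counts), matching the re-estimation formulas on the slide: each parameter is a responsibility-weighted average.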
EM Demo
• http://diwww.epfl.ch/mantra/tutorial/english/gaussian/html/

EM Comments
• Complexity of EM for multivariate Gaussian mixtures with K components: dominated by the calculation of the K covariance matrices
  – With p dimensions, there are O(Kp²) covariance parameters to be estimated
  – Each requires summing over the n data points and cluster weights, leading to O(Kp²n) per step
• Often there are large increases in likelihood over the first few iterations, followed by slow convergence; the likelihood as a function of iterations is not necessarily concave

[Figure: likelihood as a function of EM iteration — see Figure 9.2 in the text]

Probabilistic Model-based Clustering
• The model provides a full distributional description for each component; we may be able to interpret differences between the distributions
• Given the model, each data point has a K-component vector of probabilities of belonging to each group (which is why this is called soft clustering)
• The EM algorithm can be extended for MAP and Bayesian estimation
• The method can be extended to data that are not in p-dimensional vector form: mixtures of sequence models, e.g., Markov models, among others
• Key cost: the assumption of a parametric model
• And finally… how do we choose K?

How to Choose K
• Choose the K that maximizes the likelihood? NOT.
• As K is increased, the value of the likelihood at its maximum cannot decrease
• This is the problem of scoring models with different complexities

Model Selection
• First, as we mentioned before, there are two possible goals in constructing a model
• Goal #1: summarization — we would like to describe the data as precisely as possible
  – given that our model is comprehensible (subjective) or compact (objective)
  – a general approach, based on data compression and information-theoretic arguments, results in the score function
    $\text{score}(\theta, M) = -\log p(D \mid \theta, M) - \log p(\theta, M)$
  – i.e., score(θ, M) = # of bits to transmit the data that is not accounted for by the model + # of bits to transmit the model
  – (this score is instantiated in the sketch at the end of this section)

Model Selection, cont.
• Goal #2: generalize from the available data to new data
  – goodness of fit is part of the objective
  – but, since the data are not the entire population, we want to learn a model that will generalize to new data instances
• In both cases, we want a score function that strikes a compromise between how well the model fits the data and the simplicity of the model, although the theoretical motivations for the two goals are quite different

Bias-Variance
• Model too flexible ⇒ overfits the data ⇒ high variance
• Model too restrictive ⇒ can't fit the data ⇒ high bias
• Bias-variance tradeoff
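Tying the choose-K question to the two-part score above, here is a hedged sketch that scores mixture fits for several K, reusing em_gmm_1d and the data x from the earlier sketch. The (d/2)·log n model-cost term is a common BIC-style stand-in for the $-\log p(\theta, M)$ term, our assumption here rather than anything prescribed in the lecture.

```python
import numpy as np

def loglik_gmm_1d(x, w, mu, sigma):
    """Log-likelihood of x under a fitted 1-D Gaussian mixture."""
    # f(x(i)) = sum_k w_k N(x(i); mu_k, sigma_k); sum the log over i.
    dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / sigma ** 2) \
           / (sigma * np.sqrt(2.0 * np.pi))
    return np.log(dens.sum(axis=1)).sum()

def two_part_score(x, w, mu, sigma):
    """score = bits for the data (negative log-likelihood) + model cost."""
    # The (d/2) log n penalty is a BIC-style stand-in for -log p(theta, M).
    n, K = len(x), len(w)
    d = 3 * K - 1   # free parameters: K-1 weights, K means, K std devs
    return -loglik_gmm_1d(x, w, mu, sigma) + 0.5 * d * np.log(n)

# Fit K = 1..5 and keep the K with the smallest score; unlike the raw
# likelihood, the penalized score does not always prefer larger K.
best_K = min(range(1, 6), key=lambda K: two_part_score(x, *em_gmm_1d(x, K)))
print(best_K)
```

This is exactly the compromise the bias-variance discussion calls for: the first term falls as the model gets more flexible, while the second term grows with the number of components.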