CMSC828G Principles of Data Mining
Lecture #11

• Today's Reading:
  – HMS, 6.4, 9.2
• Today's Lecture:
  – Descriptive Modeling
  – Graphical Models
  – Class survey
• Upcoming Due Dates:
  – H2 due Thu 3/7

Describing Data
• The canonical descriptive strategy is to describe the data in terms of their underlying distribution.
• As usual, we have a p-dimensional data matrix with variables X1, …, Xp.
• The joint distribution is P(X1, …, Xp).
• The joint gives us complete information about the variables.
• Given the joint distribution, we can answer any question about the relationships among any subset of variables, e.g.:
  – are X2 and X5 independent?
  – generating approximate answers to queries for large databases, i.e., selectivity estimation.
• Given a query (conditions that observations must satisfy), estimate the fraction of rows that satisfy this condition (the selectivity of the query).
• These estimates are needed during query optimization.
• If we have a good approximation of the joint distribution of the data, we can use it to efficiently compute approximate selectivities.

Difficulty: Curse of Dimensionality
• Consider the joint distribution for multivariate categorical data.
• With p variables, each taking m values, the joint distribution requires specifying O(m^p) different probabilities.
• This exponential number of parameters is a problem for:
  – estimation
  – representation and reasoning

CoD: Estimation
• We can think of m^p cells, {c_1, …, c_{m^p}}, each containing n_i observations.
• The expected number of data points in cell i, given a random sample of size n from p(x), is E_{p(x)}[n_i] = n p_i.
• Suppose p(x) is uniform, i.e., p_i ≈ 1/m^p. Then E_{p(x)}[n_i] ≈ n / m^p.
• If n < 0.5 m^p, then the expected number of points in any cell is closer to 0 than to 1.
• If we use the MLE estimator, p_i = 0 for every empty cell.
• Fundamental problem: if we have a data set of size n over p variables and we want to increase the number of variables from p to 2p, then to keep the expected number of data points in each cell the same, we must increase the size of the data set by a factor of m^p!

CoD: Reasoning
• Even if we can reliably estimate a full joint distribution from the data, it is exponential in both space and time to manipulate it directly.
• For example, the marginal distribution of any single variable X_j is
  p(x_j) = Σ_{X_1, …, X_{j-1}, X_{j+1}, …, X_p} p(X_1, …, X_{j-1}, X_j, X_{j+1}, …, X_p),
  which requires summing over all the other variables, i.e., O(m^p) summations.
• Working directly with the full joint distribution is therefore feasible only for relatively low-dimensional problems.
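To make the cost of this reasoning concrete, here is a minimal NumPy sketch (not from the slides; the distribution, m, and p below are made up) that stores a full joint table over p categorical variables and marginalizes onto one variable by summing over all the other axes, touching every one of the m^p cells:

import numpy as np

m, p = 4, 6                        # 4 values per variable, 6 variables
rng = np.random.default_rng(0)

# A made-up full joint distribution: m**p = 4096 non-negative numbers summing to 1.
joint = rng.random(size=(m,) * p)
joint /= joint.sum()

j = 2                              # marginalize onto variable X_j (0-indexed)
other_axes = tuple(k for k in range(p) if k != j)
p_xj = joint.sum(axis=other_axes)  # O(m**p) work: every cell is visited

print(p_xj, p_xj.sum())            # a length-m marginal that sums to 1

Doubling the number of variables from p to 2p would square the number of cells in this table, which is exactly why the full joint becomes unmanageable beyond small p.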
Models
• parametric
• nonparametric
• mixture distributions
• semi-parametric

Parametric Models
• assume a particular, relatively simple, functional form
• e.g., uniform distribution, normal distribution, exponential, Poisson
• typically a relatively small number of parameters
• often closed-form solutions for parameter estimates that require a single pass through the data
• important to test the assumptions made by the model:
  – using simple visualizations
  – using statistical goodness-of-fit tests

Nonparametric Models
• take a local, data-driven weighted average around the point of interest
• simplest version: the histogram
  – the density estimate is just the (scaled) number of points in each bin
  – problems:
    • not smooth
    • choosing the number of bins, bin locations, and widths
  – OK for large data sets and small p
  – regardless, it is generally useful to look at histograms with a large number of bins, since they can provide information on outliers, multimodality, skewness, tail behavior, etc.

Nonparametric Models (cont.)
• kernel density estimates
• the density at any point x is proportional to a weighted sum over all points in the training data set
• the weights are defined by an appropriately chosen kernel function
• in 1-D:
  f(x) = (1/n) Σ_{i=1..n} w_i, where w_i = K((x − x(i)) / h)
  K(t) = 1 − |t| for |t| ≤ 1; K(t) = 0 otherwise
• h determines smoothness; many possible choices for K
• we'll talk about this more when we discuss SVMs
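A small illustrative sketch (not from the slides) of this 1-D kernel density estimate, using the triangular kernel K(t) = 1 − |t| given above; the sample and the bandwidth h are made up, and a 1/h factor is included so the estimate integrates to 1:

import numpy as np

def triangular_kernel(t):
    # K(t) = 1 - |t| for |t| <= 1, and 0 otherwise
    return np.where(np.abs(t) <= 1, 1 - np.abs(t), 0.0)

def kde(x, data, h):
    # Slide formula f(x) = (1/n) * sum_i K((x - x_i)/h), with an extra 1/h
    # so that the triangular kernel (which integrates to 1) yields a proper density.
    w = triangular_kernel((x - data) / h)
    return w.sum() / (len(data) * h)

rng = np.random.default_rng(1)
data = rng.normal(loc=0.0, scale=1.0, size=200)   # made-up 1-D sample
h = 0.5                                           # bandwidth: controls smoothness

grid = np.linspace(-3, 3, 7)
print([round(kde(x, data, h), 3) for x in grid])

Larger h averages over more neighbors and smooths the estimate; smaller h tracks the sample more closely but is noisier.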
Mixture Distributions
• Assume a probability model for each component.
• Mixture model: f(x) = Σ_{k=1..K} w_k f_k(x; θ_k)
• The distribution is a linear combination of simpler distributions, where the f_k are the component distributions.
• Components: Gaussian, Poisson, exponential.
• Unlike simple parametric models, there is typically no closed-form solution for maximizing the score; one approach is to use EM.

Semi-parametric Models
• "A general class of functional forms in which the number of adaptive parameters can be increased in a systematic way to build ever more flexible models, but where the total number of parameters in the model can be varied independently from the size of the data set." (Bishop)
• e.g., mixture models (where K varies), neural networks, graphical models

Graphical Models
• In the next 3-4 lectures, we will be studying graphical models,
  e.g., Bayesian networks (Bayes nets, belief nets), Markov networks, etc.
• We will study:
  – representation
  – reasoning
  – learning
• Materials based on an upcoming book by Nir Friedman and Daphne Koller.
• Slides courtesy of Nir Friedman.

Probability Distributions
• Let X1, …, Xp be random variables, and let P be a joint distribution over X1, …, Xp.
• If the variables are binary, then we need O(2^p) parameters to describe P.
• Can we do better? Key idea: use properties of independence.

Independent Random Variables
• Two variables X and Y are independent if
  – P(X = x | Y = y) = P(X = x) for all values x, y
  – that is, learning the value of Y does not change the prediction of X.
• If X and Y are independent, then P(X, Y) = P(X | Y) P(Y) = P(X) P(Y).
• In general, if X1, …, Xp are independent, then
  – P(X1, …, Xp) = P(X1) ··· P(Xp)
  – requires only O(p) parameters.

Conditional Independence
• Unfortunately, most random variables of interest are not independent of each other.
• A more suitable notion is that of conditional independence.
• Two variables X and Y are conditionally independent given Z if
  – P(X = x | Y = y, Z = z) = P(X = x | Z = z) for all values x, y, z
  – that is, learning the value of Y does not change the prediction of X once we know the value of Z.
  – notation: I(X, Y | Z)

Example: Naïve Bayesian Model
• A common model in early diagnosis:
  – symptoms are conditionally independent given the disease (or fault).
• Thus, if
  – X1, …, Xp denote the symptoms exhibited by the patient (headache, high fever, etc.) and
  – H denotes the hypothesis about the patient's health,
• then P(X1, …, Xp, H) = P(H) P(X1 | H) … P(Xp | H).
• This naïve Bayesian model allows a compact representation of the joint distribution.
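A tiny illustrative sketch (not from the slides; the symptoms and probabilities are made up) of how the factorization above is used: a binary model over p symptoms and H is specified by only 2p + 1 numbers instead of the 2^(p+1) − 1 needed for the full joint.

import numpy as np
from itertools import product

# Made-up numbers: 3 binary symptoms, binary hypothesis H (0 = healthy, 1 = sick).
p_H = np.array([0.9, 0.1])            # P(H=0), P(H=1)
p_X_given_H = np.array([              # row i: P(Xi=1 | H=0), P(Xi=1 | H=1)
    [0.05, 0.80],
    [0.10, 0.70],
    [0.02, 0.60],
])

def joint(xs, h):
    # P(X1,...,Xp, H) = P(H) * prod_i P(Xi | H)
    prob = p_H[h]
    for i, x in enumerate(xs):
        q = p_X_given_H[i, h]
        prob *= q if x == 1 else 1 - q
    return prob

# The factorized joint still sums to 1 over all 2**(p+1) assignments,
# but uses only 2*p + 1 parameters (here 7 instead of 15).
total = sum(joint(xs, h) for xs in product([0, 1], repeat=3) for h in (0, 1))
print(total)  # ~1.0

# Example diagnosis: P(H=1 | X1=1, X2=1, X3=0) via Bayes' rule on the factorized joint.
num = joint((1, 1, 0), 1)
den = num + joint((1, 1, 0), 0)
print(num / den)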

