CMSC 828G Principles of Data Mining, Lecture #8

• Today's Reading:
  – HMS, chapters 4 and 5
• Today's Lecture:
  – Parameter estimation, cont.
  – Hypothesis testing
  – Data mining: the component view
• Upcoming Due Dates:
  – H1 due now

Statistical Inference

• Statistical inference: inferring properties of an unknown distribution from data generated by that distribution.
• Estimate the parameters of the model from data.
• Use the likelihood function: the probability the model assigns to the data (Model → Data).

Statistical Inference, cont.

• Assumes that the sample has been drawn from the population at random.
• The model specifies the distribution for the population: the probability that a particular value of the variable will appear in the sample.
• If we have a model M for the data, we can compute the probability that a random sampling process will lead to the data D = {x(1), …, x(n)}.
• If we assume each data point is independent ('drawn at random'), then:

  p(D | θ, M) = ∏_{i=1}^n p(x(i) | θ, M)

Statistical Inference, cont.

• Based on this value, we can decide how realistic the assumed model is.
• If the assumed model is unlikely to have generated the data, we might reject it; this is the principle behind hypothesis testing.
• We can also estimate population values for the parameters.

Parameter Estimation

• Maximum likelihood estimation, cont.
• Bayesian estimation

Likelihood Function

• Let D = {x(1), …, x(n)}, sampled independently from the same distribution p(x | θ): 'independent and identically distributed' (iid).
• The likelihood function L(θ | x(1), …, x(n)) captures the probability of the data as a function of θ:

  L(θ, D) = L(θ | x(1), …, x(n)) = p(x(1), …, x(n) | θ) = ∏_{i=1}^n p(x(i) | θ)

Likelihood

• L(X, θ) for a parameter θ and data X is the relative probability of getting the data under the different possible values of θ.
• We must remember that it is not a probability distribution.
• For a probability distribution, θ is fixed and the random variable X ranges over all its possible values.
• For the likelihood function, X = x is fixed and you consider how the probability of getting this x changes as you change θ.

Maximum Likelihood Estimation (MLE)

• The most widely used method of parameter estimation.
• Using the likelihood function L(θ | x(1), …, x(n)), choose the value of θ that maximizes it:

  θ̂_MLE = argmax_θ L(θ | x(1), …, x(n)) = argmax_θ ∏_{i=1}^n p(x(i) | θ)

Sufficient Statistics

• (Informal definition) s(D) is a sufficient statistic for θ if the likelihood L(θ) depends on D only through s(D), i.e., L(θ, D) = L(θ, s(D)).
• E.g., for our binomial example, r, the number of customers who have purchased milk, is a sufficient statistic (assuming n, the number of customers, is known).
• Why is this important? For large datasets, instead of working with the entire data set, we can compute and store just the sufficient statistics.

Bayesian Estimation

• Frequentist approach: the parameters of a population are fixed but unknown, and the data is a random sample from that population.
• Bayesian approach: the data is known, and the parameters are the random variables. θ has a distribution of possible values, and the observed data provide evidence for different values.

Bayesian Estimation, cont.

• Before observing the data, we have a distribution over possible values of θ, called the prior distribution p(θ).
• By analyzing the data, we update our beliefs to take the observations into account. This yields a distribution over possible values of θ given D, called the posterior distribution p(θ | D).
• We use Bayes' theorem to obtain the posterior:

  p(θ | D) = p(D | θ) p(θ) / p(D)

Bayesian Estimation, cont.

• The posterior gives a distribution over parameter values. If we would like a single value (a point estimate, as in the MLE case), we can use the same principle and choose the value of θ that maximizes the posterior.
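As a concrete illustration of the MLE machinery above, here is a minimal sketch for the binomial milk example, where the sufficient statistic r (number of milk buyers) and n give the closed-form estimate θ̂ = r/n. The data values are made up for illustration; the brute-force grid search is just a sanity check against the closed form.

```python
import math

# Hypothetical data: 1 = customer bought milk, 0 = did not.
data = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
n = len(data)
r = sum(data)  # sufficient statistic: number of milk buyers

def log_likelihood(theta: float) -> float:
    """Binomial log-likelihood: log L(theta | D) = r*log(theta) + (n-r)*log(1-theta)."""
    return r * math.log(theta) + (n - r) * math.log(1 - theta)

# Closed-form MLE for a binomial proportion.
theta_mle = r / n

# Brute-force check: maximize the log-likelihood over a fine grid in (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=log_likelihood)

print(theta_mle)    # 0.7
print(theta_grid)   # 0.7
```

Working with the log-likelihood rather than the likelihood itself is standard practice: the maximizer is the same, and the product of many small probabilities becomes a numerically stable sum.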
• This is called the maximum a posteriori (MAP) method:

  θ̂_MAP = argmax_θ p(θ | D)

• MLE is a special case of MAP with a 'flat' prior.

Bayesian Estimation, cont.

• For a given data set D and a particular model, p(D) is constant, so we often write:

  p(θ | D) ∝ p(D | θ) p(θ)
  posterior ∝ likelihood × prior

Example

• Consider our example of customers' milk-buying habits, where we are trying to estimate θ, the probability that a customer will buy milk.
• The most commonly chosen prior for a binomial random variable is the Beta distribution, with parameters α > 0, β > 0:

  p(θ) ∝ θ^(α−1) (1 − θ)^(β−1)

• The relative sizes of α and β control the location and spread of the distribution of θ:

  E[θ] = α / (α + β),   var(θ) = αβ / [(α + β)² (α + β + 1)]

• The variance is inversely proportional to α + β, so the size of the sum controls the narrowness of the prior: if α + β is large, the prior will be sharply peaked.

Example, cont.

• Beta prior: p(θ) ∝ θ^(α−1) (1 − θ)^(β−1)
• Likelihood for the binomial model: L(θ | D) = θ^r (1 − θ)^(n−r)
• Posterior:

  p(θ | D) ∝ p(D | θ) p(θ) = θ^r (1 − θ)^(n−r) · θ^(α−1) (1 − θ)^(β−1) = θ^(r+α−1) (1 − θ)^(n−r+β−1)

• The Beta prior acts like α − 1 prior successes and β − 1 prior failures; we can think of it as having an equivalent sample size of α + β − 2.
• The posterior is a Beta distribution with parameters r + α and n − r + β.
• The Beta is the CONJUGATE PRIOR for the BINOMIAL.

Bayesian Approach to Prediction

• A fully Bayesian approach is characterized by maintaining a distribution over models.
• To make a prediction about a new data point x(n+1) not in our training set D, we average over all possible values of θ, weighted by the posterior probability p(θ | D):

  p(x(n+1) | D) = ∫ p(x(n+1), θ | D) dθ = ∫ p(x(n+1) | θ) p(θ | D) dθ

• Of course, computationally this is much more challenging than the maximum likelihood approach… MCMC.

Classical Hypothesis Testing

• Two hypotheses: H0 (the null hypothesis) and H1 (the alternative hypothesis).
• The outcome of a hypothesis test is 'reject H0' or 'do not reject H0'.
• Example: H0: there is no difference in taste between Coke and Pepsi, against H1: there is a difference.
• The hypotheses are often statements about population parameters such as the expected value and variance:
  – For example, H0 might be that the expected height of ten-year-old boys in the Scottish population is not different from that of ten-year-old girls.
• A hypothesis might also be a statement about the distributional form of a characteristic of interest:
  – For example, that the height of ten-year-old boys is normally distributed.
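The taste-test example above can be sketched as an exact binomial test. Under H0, each taster is equally likely to prefer either drink (θ = 0.5); the two-sided p-value sums the probability of every outcome no more likely than the one observed. The counts here (18 of 25 tasters preferring one drink) are made up for illustration.

```python
from math import comb

def binomial_two_sided_p(k: int, n: int, p0: float = 0.5) -> float:
    """Exact two-sided binomial test p-value: total probability, under H0
    (success probability p0), of all outcomes no more likely than observing k."""
    pmf = [comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(n + 1)]
    return sum(prob for prob in pmf if prob <= pmf[k] + 1e-12)

# Hypothetical taste test: 18 of 25 tasters prefer one drink over the other.
p_value = binomial_two_sided_p(18, 25)
print(round(p_value, 4))

# Reject H0 at the 5% significance level if p_value < 0.05.
print(p_value < 0.05)  # True
```

This is the same "sum of outcomes at least as unlikely" convention used by common statistics libraries for exact two-sided tests; with a symmetric null (p0 = 0.5) it reduces to doubling the one-tailed probability.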



UMD CMSC 828G - Principles of Data Mining
