Princeton COS 424 - Model-Based Classification


Model-Based Classification
David M. Blei
February 29, 2012

Probability models

• A probability model is a joint distribution of a set of observations.
• Often, a model is indexed by a parameter. Each value of the parameter gives a different distribution of the data.
  – The parameter of a Bernoulli is the probability of heads.
  – The parameters of a Gaussian are its mean and variance.
• Many models (but not all) assume the data are independent and identically distributed.
• For a boring example, consider N coin flips, each of which comes up heads with probability π,

      p(x_1, \ldots, x_N \mid \pi) = \prod_{n=1}^{N} p(x_n \mid \pi).    (1)

  Each term is a Bernoulli,

      p(x_n \mid \pi) = \pi^{1[x_n = h]} (1 - \pi)^{1[x_n = t]}.    (2)

• Suppose we flip a coin N times and record the outcomes.
• Further suppose that we think that the probability of heads is π. (This is distinct from whatever the probability of heads "really" is.)
• Given π, the probability of an observed sequence is

      p(x_1, \ldots, x_N \mid \pi) = \prod_{n=1}^{N} \pi^{1[x_n = h]} (1 - \pi)^{1[x_n = t]}.    (3)

• As a function of π, the probability of a data set is the likelihood function.
• Taking logs, this is the log likelihood function,

      L(\pi) = \sum_{n=1}^{N} 1[x_n = h] \log \pi + 1[x_n = t] \log(1 - \pi).    (4)

• The maximum likelihood estimate (MLE) is the value of the parameter that maximizes the log likelihood (equivalently, the likelihood).
• In the Bernoulli example, it is the proportion of heads,

      \hat{\pi} = \frac{1}{N} \sum_{n=1}^{N} 1[x_n = h].    (5)

• In a Gaussian, it is the empirical mean

      \hat{\mu} = \frac{1}{N} \sum_{n=1}^{N} x_n    (6)

  and empirical variance

      \hat{\sigma}^2 = \frac{1}{N} \sum_{n=1}^{N} (x_n - \hat{\mu})^2.    (7)

• In a sense, this is the value that best explains our observations.
• Why is the MLE good?
  – The MLE is consistent.
  – Flip a coin N times with true bias π*.
  – Estimate the parameter from x_1, ..., x_N with the MLE π̂.
  – Then lim_{N→∞} π̂ = π*.
  – This is a good thing. It lets many statisticians sleep at night.
• (R demonstration: Bernoulli data sets)
• (Slides: Modeling approval ratings with a Gaussian)

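A minimal sketch of the sort of R demonstration mentioned above (the actual demo code is not in these notes): simulate Bernoulli and Gaussian data, compute the MLEs of equations (5)–(7), and check consistency empirically. The variable names and parameter settings are illustrative only.

```r
## Sketch only: the names (true.pi, flips, etc.) and settings are illustrative.
set.seed(424)

## Bernoulli: the MLE is the proportion of heads (equation 5).
n <- 1000
true.pi <- 0.3
flips <- rbinom(n, size = 1, prob = true.pi)   # 1 = heads, 0 = tails
pi.hat <- mean(flips)

## Gaussian: the MLEs are the empirical mean and variance (equations 6-7).
## Note the 1/N variance, not the 1/(N-1) used by var().
x <- rnorm(n, mean = 2, sd = 1.5)
mu.hat <- mean(x)
sigma2.hat <- mean((x - mu.hat)^2)

## Consistency: as N grows, the Bernoulli MLE approaches the true bias.
sapply(c(10, 100, 1000, 10000),
       function(m) mean(rbinom(m, size = 1, prob = true.pi)))
```
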
Graphical models

• A graphical model represents a joint distribution of random variables; we use it to show models. (Also called a "Bayesian network.")
• Semantics:
  – Nodes are random variables; edges denote possible dependence.
  – Shaded nodes are observed; unshaded nodes are hidden.
  – GMs with shaded nodes represent posterior distributions.
• Each graphical model is a family of distributions.
• Connects computing about models to graph structure (COS 513).
• Connects independence assumptions to graph structure (COS 513).
• Here we'll use it as a schematic for factored joint distributions. (Show the classification graphical model.)
• Return briefly to the Aliens/Watch example.
  – It's true that many members of this family do not have X ⊥⊥ Y | Z.
  – The class discussion revealed conditions where Z is dependent on X and Y, but they are still conditionally independent. (That is a subfamily of this family.)
  – I conjectured that for conditional independence between X and Y, one of the dependencies of Z had to be broken. However, I couldn't prove it.

Classification set-up

• In classification we observe two sets of data. One set contains data ("features") labeled with a category. The other set contains unlabeled features.
• The idea is to fit a model of how the label relates to the features. Then, given new unlabeled data, predict the category. This is supervised learning.
• More formally,
  – The fully observed data (also called the "training data") are {x_i, c_i}_{i=1}^n, where c_i is in one of k categories and the features x_i are a vector of values.
  – The partially observed data are x_new. Our goal is to predict c_new.
• Here are some examples. In each, what are the features? What are the labels?
  – Classify images into semantic categories
  – Classify news articles into sections of the newspaper
  – Classify genetic code as exon or intron
  – Classify radar blips as friendly or unfriendly
  – Classify credit cards as stolen or not stolen
  – Others?

Basic idea

• Classification with generative models links statistical modeling to classification problems.
• We can model many kinds of data (features x) with appropriate probability distributions.
  – Continuous features can be modeled with a Gaussian.
  – Binary features can be modeled with a Bernoulli.
  – Positive features can be modeled with a Gamma.
  – Etc.
• Recall that our training set is a collection of labeled feature vectors (x_i, c_i). The idea is to fit a distribution of features conditional on the class label.
  – A different Gaussian for each class label
  – A different Bernoulli for each class label
  – A different Gamma distribution for each class label
  – Etc.
• To classify a new feature vector x, we compute the conditional distribution of the class label given the features, p(c | x).
• For example:
  – Suppose images are represented with continuous-valued image features such as color intensities, texture features, and others.
  – Each training image is labeled with one category, such as "outdoor", "indoor", or "sports". (This data set exists. It's called Caltech 101.)
  – To build our classifier, we fit a Gaussian distribution to each feature conditional on the class label. For example, we would find a distribution of color for indoor scenes, for outdoor scenes, for portraits, and for each other category.[1]
  – To classify a new image, we would consider each label and look at the probability of its features given that label. (It's a little more complicated than this, as we'll see below, and this process will emerge naturally from Bayes rule.)

[1] Actually, we could fit a conditional distribution of the whole vector using a multivariate Gaussian. But, for now, let's assume we fit one distribution for each feature.

Modeling assumptions

• Data
  – The training data are D = {x_i, c_i}_{i=1}^n.
  – The data to predict are {x_new}.
• The graphical models for fitting and prediction are the following.

  (Figure: two graphical models. "Fitting": a plate over i = 1, ..., n containing the nodes c_i and x_i, with parameter boxes π and θ_j, j = 1, ..., k. "Prediction": nodes c and x, with the parameters π and θ_j now fixed.)

• Some details
  – Small boxes are parameters. Unshaded are fit; shaded are fixed.
  – The joint distribution of a feature vector and label is

        p(x, c \mid \pi, \theta_{1:k}) = p(c \mid \pi) \, p(x \mid c, \theta_{1:k}).    (8)

  – The parameters θ_{1:k} are the conditional distributions of the features. The parameter π is the probability of seeing each class.
  – The second term p(x | c, θ_{1:k}) selects the right class. (More below.)
• How do we "select" the right class in p(x | c, θ_{1:k}) and p(c | π)?
  – The class label c is represented as a k-vector with a single one. For example, a data point in the fourth class (here with k = 7) has c = ⟨0, 0, 0, 1, 0, 0, 0⟩.
  – The second term is

        p(x \mid c, \theta_{1:k}) = \prod_{j=1}^{k} p(x \mid \theta_j)^{c_j}.    (9)

    (Confirm that this equals the intended class-conditional distribution: when c indicates class j, the product reduces to p(x | θ_j).)

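A minimal sketch, not from the notes, of fitting this model by maximum likelihood: the MLE of π is the vector of class proportions, and, following the footnote's simplification, θ_j is one Gaussian per feature per class, fit with the empirical mean and variance of that class's training data. The function and argument names (fit.classifier, train.x, train.c) are hypothetical.

```r
## Sketch only: fit.classifier, train.x (an n-by-d feature matrix), and
## train.c (a length-n vector of class labels) are hypothetical names.
fit.classifier <- function(train.x, train.c) {
  classes <- sort(unique(train.c))

  ## pi: the probability of seeing each class; its MLE is the class proportion.
  pi.hat <- table(factor(train.c, levels = classes)) / length(train.c)

  ## theta_j: one Gaussian per feature per class, fit with the MLEs of
  ## equations (6)-(7) using only the training points labeled with class j.
  theta <- lapply(classes, function(j) {
    xj <- train.x[train.c == j, , drop = FALSE]
    mu <- colMeans(xj)
    list(mu = mu, sigma2 = colMeans(sweep(xj, 2, mu)^2))
  })
  names(theta) <- classes

  list(classes = classes, pi = pi.hat, theta = theta)
}
```
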


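The notes cut off just before the prediction step, so the following is only a sketch of where the argument is headed: by Bayes rule, p(c | x_new) is proportional to p(c | π) p(x_new | c, θ_{1:k}), computed here in logs for numerical stability. The names (classify, model, x.new) are hypothetical; model is the output of the fit.classifier sketch above.

```r
## Sketch only: compute p(c | x.new) by Bayes rule, up to the normalizing
## constant p(x.new), using the class-conditional Gaussians fit above.
classify <- function(model, x.new) {
  log.post <- sapply(model$classes, function(j) {
    th <- model$theta[[as.character(j)]]
    log(model$pi[[as.character(j)]]) +
      sum(dnorm(x.new, mean = th$mu, sd = sqrt(th$sigma2), log = TRUE))
  })
  ## Normalize to get p(c | x.new), then report the most probable class.
  post <- exp(log.post - max(log.post))
  post <- post / sum(post)
  list(posterior = post, label = model$classes[which.max(post)])
}
```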