UCI ICS 273A - Clustering & Dimensionality Reduction

Slide overview: Lecture 4 · What is Unsupervised Learning? · Why Discover Structure? · Clustering: K-means · K-means · Vector Quantization · Mixtures of Gaussians · MoG Clustering · EM Algorithm: E-step · EM Algorithm: M-step · EM-MoG · Dimensionality Reduction · Principal Components Analysis · PCA · PCA properties · Homework

Lecture 4: Unsupervised Learning
Clustering & Dimensionality Reduction
273A Intro Machine Learning

What is Unsupervised Learning?
• In supervised learning we were given attributes and targets (e.g. class labels). In unsupervised learning we are given only the attributes.
• Our task is to discover structure in the data.
• Example I: the data may be structured in clusters. (Is this a good clustering?)
• Example II: the data may live on a lower-dimensional manifold.

Why Discover Structure?
• Data compression: if you have a good model you can encode the data more cheaply.
• Example: to encode the data I have to encode the x and y position of each data-case. However, I could also encode the offset and angle of the line plus the deviations from the line. Small numbers can be encoded more cheaply than large numbers at the same precision.
• This idea is the basis for model selection: the complexity of your model (e.g. the number of parameters) should be such that you can encode the data-set with the fewest number of bits (up to a certain precision).
• Homework: argue why a larger dataset requires a more complex model to achieve maximal compression.

Why Discover Structure? (continued)
• Often the result of an unsupervised learning algorithm is a new representation of the same data. This new representation should be more meaningful and can be used for further processing (e.g. classification).
• Example I: clustering. The new representation is the label of the cluster to which a data-point belongs. This tells us how similar data-cases are.
• Example II: dimensionality reduction. Instead of a 100-dimensional vector of real numbers, the data are now represented by a 2-dimensional vector which can be drawn in the plane.
• The new representation is smaller and hence computationally more convenient.
• Example: a text corpus has about 1M documents, each represented by a 20,000-dimensional vector of word counts over the vocabulary (entries for "the", "a", "on", ...). Dimensionality reduction turns this into a (say) 50-dimensional vector per document. In the new representation, documents that are on the same topic, but do not necessarily share keywords, have moved closer together!

Clustering: K-means
• We iterate two operations:
  1. Update the assignment of data-cases to clusters.
  2. Update the locations of the clusters.
• Notation: $z_i \in \{1, 2, \ldots, K\}$ is the cluster assignment of data-case $i$, $\mu_c \in \mathbb{R}^d$ is the location of cluster $c$ in $d$-dimensional space, and $x_i \in \mathbb{R}^d$ is the location of data-case $i$.
• Iterate until convergence:
  1. For each data-case, compute the distance to every cluster and pick the closest one:
     $z_i = \arg\min_c \| x_i - \mu_c \| \quad \forall i$
  2. For each cluster, compute the mean location of all data-cases assigned to it:
     $\mu_c = \frac{1}{N_c} \sum_{i \in S_c} x_i \quad \forall c$
  where $S_c$ is the set of data-cases assigned to cluster $c$ and $N_c$ is the number of data-cases in cluster $c$.

K-means
• Cost function: $C = \sum_{i=1}^N \| x_i - \mu_{z_i} \|^2$
• Each step of k-means decreases this cost function.
• Initialization is often very important since $C$ has very many local minima. A relatively good initialization: place the cluster locations on $K$ randomly chosen data-cases.
• How to choose $K$? Add a complexity term, $C' = C + \frac{1}{2}[\#\text{parameters}] \log N$, and minimize over $K$ as well; or use cross-validation; or Bayesian methods.
• Homework: derive the k-means algorithm by showing that step 1 minimizes $C$ over $z$ with the cluster locations fixed, and step 2 minimizes $C$ over the cluster locations with $z$ fixed.
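The two update steps translate almost directly into code. The sketch below is not from the lecture notes; it is a minimal numpy illustration of the algorithm as stated above, assuming the data are supplied as an (N, d) array X and using the recommended initialization on K randomly chosen data-cases. The function and parameter names (kmeans, n_iters, seed) are my own.

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal k-means sketch. X: (N, d) data array, K: number of clusters."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    mu = X[rng.choice(N, size=K, replace=False)].copy()   # init on K random data-cases
    for _ in range(n_iters):
        # Step 1: assign every data-case to its closest cluster location.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)   # (N, K)
        z = dists.argmin(axis=1)
        # Step 2: move every cluster location to the mean of its assigned data-cases
        # (an empty cluster simply keeps its old location).
        new_mu = np.array([X[z == c].mean(axis=0) if np.any(z == c) else mu[c]
                           for c in range(K)])
        if np.allclose(new_mu, mu):            # nothing moved: converged
            break
        mu = new_mu
    cost = np.sum((X - mu[z]) ** 2)            # C = sum_i ||x_i - mu_{z_i}||^2
    return z, mu, cost

A call such as z, mu, cost = kmeans(X, K=3) returns the hard assignments, the cluster locations and the final value of the cost C; running it from several random initializations and keeping the lowest-cost solution is the usual guard against the local minima mentioned above.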
Vector Quantization
• K-means divides the space into a Voronoi tessellation.
• Every point on a tile is summarized by that tile's code-book vector ("+"). This clearly allows for data compression!

Mixtures of Gaussians
• K-means assigns each data-case to exactly one cluster. But what if the clusters overlap? We may be uncertain about which cluster a data-case really belongs to.
• The mixture-of-Gaussians algorithm assigns data-cases to clusters with a certain probability.

MoG Clustering
• Gaussian density:
  $N[x; \mu, \Sigma] = \frac{1}{(2\pi)^{d/2} \sqrt{\det \Sigma}} \exp\left[ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right]$
• The covariance $\Sigma$ determines the shape of the density contours.
• Idea: fit these Gaussian densities to the data, one per cluster.

EM Algorithm: E-step
• Responsibilities:
  $r_{ic} = \frac{\pi_c \, N[x_i; \mu_c, \Sigma_c]}{\sum_{c'=1}^K \pi_{c'} \, N[x_i; \mu_{c'}, \Sigma_{c'}]}$
• $r_{ic}$ is the probability that data-case $i$ belongs to cluster $c$.
• $\pi_c$ is the a priori probability of being assigned to cluster $c$.
• Note that if the Gaussian has high probability on data-case $i$ (i.e. the bell shape sits on top of the data-case), then it claims high responsibility for that data-case.
• The denominator just normalizes the responsibilities to one: $\sum_{c=1}^K r_{ic} = 1 \;\; \forall i$.
• Homework: imagine there are only two identical Gaussians and both have their means equal to $x_i$ (the location of data-case $i$). Compute the responsibilities for data-case $i$. What happens if one Gaussian has a much larger variance than the other?

EM Algorithm: M-step
• $N_c \leftarrow \sum_i r_{ic}$  (the total responsibility claimed by cluster $c$)
• $\mu_c \leftarrow \frac{1}{N_c} \sum_i r_{ic} \, x_i$  (a weighted sample mean in which every data-case is weighted by the probability that it belongs to cluster $c$)
• $\Sigma_c \leftarrow \frac{1}{N_c} \sum_i r_{ic} \, (x_i - \mu_c)(x_i - \mu_c)^T$  (a weighted sample covariance)
• $\pi_c \leftarrow \frac{N_c}{N}$  (the expected fraction of data-cases assigned to cluster $c$)
• Homework: show that k-means is a special case of these E and M steps.

EM-MoG
• EM stands for "expectation maximization". We won't go through the derivation.
• If we are forced to decide, we assign a data-case to the cluster that claims the highest responsibility.
• For a new data-case, we compute responsibilities as in the E-step and pick the cluster with the largest responsibility.
• The E and M steps are iterated until convergence (which is guaranteed).
• Every step increases the following objective function (the total log-probability of the data under the model we are learning):
  $L = \sum_{i=1}^N \log \left[ \sum_{c=1}^K \pi_c \, N[x_i; \mu_c, \Sigma_c] \right]$
  (One E/M iteration is sketched in code below, after the next slide.)

Dimensionality Reduction
• Instead of being organized in clusters, the data may lie approximately on a (perhaps curved) manifold.
• Most information in
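The E-step and M-step above translate nearly line for line into numpy. The sketch below is not from the lecture; it is a minimal illustration under a few assumptions: the data sit in an (N, d) array X, the current parameters are pi with shape (K,), mu with shape (K, d) and Sigma with shape (K, d, d), and the densities are evaluated directly rather than in log space, so it is numerically naive for high-dimensional data. All function and variable names are my own.

import numpy as np

def gaussian_density(X, mu, Sigma):
    """N[x; mu, Sigma] evaluated at every row of X (shape (N, d))."""
    d = X.shape[1]
    diff = X - mu                                               # (N, d)
    Sigma_inv = np.linalg.inv(Sigma)
    quad = np.einsum('nd,de,ne->n', diff, Sigma_inv, diff)      # (x-mu)^T Sigma^-1 (x-mu)
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def em_step(X, pi, mu, Sigma):
    """One E-step and M-step for a mixture of K Gaussians."""
    N, d = X.shape
    K = len(pi)
    # E-step: unnormalized responsibilities pi_c * N[x_i; mu_c, Sigma_c]
    unnorm = np.stack([pi[c] * gaussian_density(X, mu[c], Sigma[c]) for c in range(K)], axis=1)
    log_lik = np.log(unnorm.sum(axis=1)).sum()       # L = sum_i log sum_c pi_c N[x_i; mu_c, Sigma_c]
    r = unnorm / unnorm.sum(axis=1, keepdims=True)   # each row of r now sums to 1
    # M-step
    Nc = r.sum(axis=0)                               # total responsibility claimed by each cluster
    mu_new = (r.T @ X) / Nc[:, None]                 # weighted sample means
    Sigma_new = np.empty((K, d, d))
    for c in range(K):
        diff = X - mu_new[c]
        Sigma_new[c] = (r[:, c, None] * diff).T @ diff / Nc[c]   # weighted sample covariances
    pi_new = Nc / N                                  # expected fraction of data-cases per cluster
    return r, pi_new, mu_new, Sigma_new, log_lik

Repeatedly calling em_step and monitoring log_lik (which should not decrease) follows the EM-MoG procedure described above; when forced to decide, the hard assignments are r.argmax(axis=1).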

