**Unformatted text preview:**

'&$%CHAPTER 9 UNSUPERVISEDLEARNING1'&$%What is unsupervised learning?– We don’t observe outcome values/labels except featurevariables X.– The goal is to understand and characterize thestructures/distributions within feature data (similar to densityestimation).– It is useful for data exploration and dimension reduction.2'&$%Principal component analysis– It is one of the most popular methods examining intrinsicstructure of X.– Aims to identify the combination which explains most of datavariances (principal directions).– The method only uses the second moments of data.3'&$%Calculation of principal components– Suppose X1, ..., Xnto be mean-zero observed feature values inRp.– We look for a matrix˜Vp×qsuch that˜V = (˜V1, ...,˜Vq) haveorthogonal unit columns and the projection of X onto˜V -spacecontains the most variability, i.e.,nXi=1kXi−˜V˜VTXik2is minimal.– Write X = (X1, ..., Xn)T. This is equivalent to minimizingTrace³X − X˜V˜VT´³X − X˜V˜VT´T= Trace(I −˜V˜VT)TXTX(I −˜V˜VT).4'&$%Solutions to PCA– Consider singular-value decomposition:Xn×p= Un×pdiag{d1, ..., dp}Vp×p,d1≥ d2≥ ... ≥ dp≥ 0.– We aim to minimizeTrace(I−˜V˜VT)TVTDTV(I−˜V˜VT) = Trace(I−˜V˜VT)VTDTV.– It shows˜V consists of the first q-columns of V.– The first q-columns of UDT, the projection on the top qprincipal directions, are called principal components.5'&$%The choice of PC #– Small # PC is believed to be sufficient.– Useful for data visualization and dimension reduction.– Normally, the top q PCs are retained if their explainedvariation is at least 70% or 80%.6'&$%Latent component analysis– Assume that the feature variables are multiple indirectmeasurements of a few independent latent sources.– Recovering these latent sources understands feature structuresand achieves dimension reduction.– Two approaches: factor analysis and independent componentanalysis.– Both assumeXp×1= Ap×qSq×1+ ²p×1,S : latenet components, ² : independent noise.7'&$%Factor analysis– S = (S1, ..., S1) is assumed to be from Gaussian anduncorrelated.– It is different from the PCA as FA looks for the commonfactors explaining maximal variability, while the PCA considersthe total variability.– Estimation is based onCov(X) = AAT+ diag(var(²1), ..., var(²n)).– SVD plays important role in estimating A; or the MLE is usedfor estimation.– A is subject to rotation.– FA is most popular in social science.8'&$%Independent component analysis– The difference from FA is that S is not necessarily Gaussianbut its components are independent.– Actually, the solution to estimating A and S relies onminimizing some entropy so uses higher moments other thanthe second one.– See ICA.pdf.9'&$%Multidimensional scaling– This method projects original X to a much lower-dimensionalspace.– It is (probably only) useful for viewing X.– The goal of the projection is to ensure pairwise distances beforeand after projections to be consistent as much as possible.– Minimize (Stress function)Xi6=j(d(Xi, Xj) − kZi− Zjk)21/2.– Gradient descent algorithm is used of minimization.– Can be modified to add weights to each pair or just keepdistance ranks to be consistent.10'&$%Variants of MDS– Sammon mapping:Xi6=j(dij− kZi− Zjk)2dij.– Ranks:Xi6=j(dij− g(kZi− Zjk)is minimized for both Z’s and g’s.– Example of MDS.11'&$%Cluster analysis– Search for clusters of subjects so that within-cluster subjectsare most similar but between-cluster subjects are mostdifferent.– Look for a map: C : {1, ..., n} − − > {1, ..., K} from subject IDto cluster ID.– Within-cluster distance (loss):12nXi,j=1KXk=1I(C(i) = C(j) = k)d(Xi, Xj). (1)– Between-cluster distance (loss):12nXi,j=1KXk=1I(C(i) = k, C(j) 6= k)d(Xi, Xj). (2)– Either minimize (1) or maximize (2).12'&$%K-means cluster analysis– Applies when the distance is the Euclidean distance.– The within–cluster distance is equivalent tonXi=1KXk=1I(C(i) = k)kXi− mkk2,where mkis the k-cluster mean.– An iterative procedure is used to update mkand clustermembership.13'&$%K-medoids cluster analysis– It applies to general proximity matrix.– Replace mean mkby the point Xi(medoid) in the same clusterwhich has the least summed distance from the other points inthe cluster.– Iteratively update the medoid and cluster membership.14'&$%Hierarchical clustering– Either agglomerative (bottom-up) or divisive (top-down).– At each level, either merge two clusters or split clusters in anoptimal sense.– The way of defining between-cluster distance includes singlelinkage, complete linkage and group average.– The output is called a dendrogram.15'&$%Linkage definitions– Single linkage:d(C1, C2) = mini∈C1,j∈C2d(Xi, Xj).– Complete linkage:d(C1, C2) = maxi∈C1,j∈C2d(Xi, Xj).– Group average:d(C1, C2) =1n1n2Xi∈C1,j∈C2d(Xi, Xj).– Single linkage can produce clusters with very large diameterand the complete linkage is oppositive; while the group averageis a compromise between the two

View Full Document