Unformatted text preview:

ClusteringStructure in High-Dimensional DataClustering vs ClassificationTodayCentral DogmaExpression MicroarraysExpression MicroarraysExpression Microarray Data MatrixClustering and Classification in GenomicsClustering Expression DataWhy Cluster Genes by Expression?Clustering AlgorithmsK-Means ClusteringMore FormallyCost CriterionFuzzy K-MeansK-Means as a Generative ModelUnsupervised LearningIf We Have Labeled PointsIf We Have Labeled PointsIf We Know Cluster CentersWhat If We Have Neither?Expectation Maximization (EM)Expectation Maximization (EM)Revisiting K-MeansRevisiting K-MeansRevisiting K-MeansRevisting Fuzzy K-MeansRevisiting Fuzzy K-MeansK-Means, Viterbi learning & EMImplications: Non-globular ClustersBut How Many clusters?Hierarchical clusteringVisualization of resultsHierarchical clusteringDistance between clusters(Dis)Similarity MeasuresEvaluating Cluster PerformanceEvaluating clusters – Hypergeometric DistributionSimilar Genes Can ClusterClusters and Motif DiscoveryNext LectureMIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, Networks, EvolutionFall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.ClusteringLecture 3 September 16, 2008Computational Biology: Genomes, Networks, EvolutionStructure in High-Dimensional DataGyulassy, Atilla, et al. "Topologically Clean Distance Fields." IEEE Transactions on Visualization and Computer Graphics 13, no. 6 (2007): 1432-1439.• Structure can be used to reduce dimensionality of data• Structure can tell us something useful about the underlying phenomena• Structure can be used to make inferences about new data©2007 IEEE. Used with permission.Clustering vs Classification• Objects characterized by one or more features• Classification– Have labels for some points– Want a “rule” that will accurately assign labels to new points– Supervised learning•Clustering– No labels– Group points into clusters based on how “near” they are to one another– Identify structure in data– Unsupervised learningExpression in Exp 1Expression in Exp 2Today• Microarray Data• K-means clustering• Expectation Maximization• Hierarchical ClusteringCentral DogmaphenotypeDNAmRNAproteinWe can measure amounts of mRNA for everygene in a cellExpression Microarrays• A way to measure the levels of mRNA in every gene• Two basic types– Affymetrix gene chips– Spotted oligonucleotides• Both work on same principle– Put DNA probe on slide– Complementary hybridizationExpression Microarrays• Measure the level of mRNA messages in a cellDNA 1DNA 3DNA 5DNA 6 DNA 4 DNA 2cDNA 4cDNA 6HybridizeGene 1Gene 3Gene 5Gene 6 Gene 4 Gene 2MeasureRTRNA 4RNA 6Expression Microarray Data Matrix• Genes are typically given as rows• Experiment are given by columnsn experimentsm genesClustering and Classification in Genomics• Classification¾ Microarray data: classify cell state (i.e. AML vs ALL) using expression data¾ Protein/gene sequences: predict function, localization, etc.• Clustering¾ Microarray data: groups of genes that share similar function have similar expression patterns – identify regulons¾ Protein sequence: group related proteins to infer function¾ EST data: collapse redundant sequencesClustering Expression Data• Cluster Experiments– Group by similar expression profiles• Cluster Genes– Group by similar expression in different conditionsGene 1Gene 2ExperimentExperiment 1Experiment 2GenesWhy Cluster Genes by Expression?• Data Exploration– Summarize data– Explore without getting lost in each data point– Enhance visualization• Co-regulated Genes– Common expression may imply common regulation –Predictcis-regulatory promoter sequences• Functional Annotation– Similar function from similar expressionGCN4His2 Amino AcidsHis3 Amino AcidsUnknown Amino AcidsClustering Algorithms• Partitioning– Divides objects into non-overlapping clusters such that each data object is in exactly one subset• Agglomerative– A set of nested clusters organized as a hierarchyK-Means ClusteringThe Basic Idea• Assume a fixed number of clusters, K• Goal: create “compact” clustersMore Formally1. Initialize K centers ukFor each iteration n until convergence2. Assign each xithe label of the nearest center, where the distance between xiand ukis3. Move the position of each ukto the centroid of the points with that labelikiikx with label j( 1) , x # with label kxkn += =∑xμ x()2,ik i kd =−xμCost Criterion()()2123 n iwith label kCOST , , ,...,kik=−∑∑μ xxxx x x μWe can think of K-means as trying to create clusters that minimize a cost criterion associated with the size of the clusterμ1μ2μ3Minimizing this means minimizing each cluster term separately:()ii2222iwith label k with label kwith label2 kOptimum , t22= he centroidiikkikkik iikkk−= − += − +∑∑ ∑∑∑xxxx μ xxuμ x u xxuxuxFuzzy K-Means• Initialize K centers uk• For each point calculate the probability of membership for each category• Move the position of each ukto the weighted centroid :• IterateK-meansFuzzy K-meansiiii ix with label j x with label j( 1) P( | ) P( | )bbkkkn +=∑∑μxμxμxiP(label K | , )kxμOf course, K-Means just special case whereii1 if is closest toP(label K | , )0kk⎧=⎨⎩xμx μotherwiseK-Means as a Generative ModelSamples drawn from two equally normal distributions with unit variance - a Gaussian Mixture ModelxiModel ofP(X,Labels)μ1μ2()()21|exp22ijijPπ⎧⎫−⎪⎪=−⎨⎬⎪⎪⎩⎭xuxuUnsupervised LearningLearn?Samples drawn from two equally normal distributions with unit variance - a Gaussian Mixture Modelxiμ1μ2()()21|exp22ijijPπ⎧⎫−⎪⎪=−⎨⎬⎪⎪⎩⎭xuxuIf We Have Labeled PointsNeed to estimate unknown gaussian centers from dataIn general, how could we do this? How could we “estimate” the “best” uk?Choose ukto maximize probability of modelxiμ1 ?μ2?If We Have Labeled PointsNeed to estimate unknown gaussian centers from dataIn general, how could we do this? How could we “estimate” the “best” uk?Given a set of xi, all with label k, we can find the maximum likelihood μkfrom() ()()2i2arg11arg max log P | arg max log2mi2 niiiiiπ⎧⎫⎧⎫⎛⎞=−−+⎨⎬⎨ ⎬⎜⎟⎝⎠⎩⎭⎩⎭=−∑∏∑μμμx μ xuxuSolution is the centroidof the xiIf We Know Cluster CentersNeed to estimate labels for the dataLearn?xiDistance measure for


View Full Document
Download Clustering
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Clustering and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Clustering 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?