CMU CS 10810 - Clustering expression data - D667917

Home> Schools> Carnegie Mellon University> Computer Science (CS) > CS 10810> Clustering expression data

DOC PREVIEW

CMU CS 10810 - Clustering expression data

School name Carnegie Mellon University

Course Cs 10810- Computational Genomics

Pages 47

This preview shows page 1-2-3-22-23-24-45-46-47 out of 47 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 47 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 47 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 47 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 47 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 47 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 47 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 47 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 47 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 47 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 47 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Clustering expression data10-810 /02-710Computational GenomicsGoal• Data organization (for further study)• Functional assignment• Determine different patterns • Classification• Relations between experimental conditions • Subsets of genes related to subset of experimentsGenesExperimentsBothClustering metric• A key issue in clustering is to determine the similarity / distance metric.• Often, such metric has a bigger impact on the results than the actual clustering algorithm used• When determining the metric we should take into account our assumptions about the data and the goals of the clustering algorithm.Clustering algorithms• We can divide clustering methods into roughly three types: 1. hierarchical agglomerative clustering - For example, hierarchical clustering 2. Model based - For example, k-means, Gaussian mixtures3. Iterative partitioning (top down)- For example, graph based algorithmsHierarchical clustering• Probably the most popular clustering algorithm in this area• First presented in this context by Eisen in 1998• Agglomerative (bottom-up)• Algorithm:1. Initialize: each item a cluster2. Iterate:• select two most similar clusters• merge them3. Halt: when there is only one cluster leftdendrogramSimilarity criteria: Single Link• cluster similarity = similarity of two most similar members- Potentially long and skinny clustersExample: single link045890791003602054321543210458079030543)2,1(543)2,1(123458}8,9min{},min{9}9,10min{},min{3}3,6min{},min{5,25,15),2,1(4,24,14),2,1(3,23,13),2,1(=========dddddddddExample: single link04507054)3,2,1(54)3,2,1(123450458079030543)2,1(543)2,1(045890791003602054321543215}5,8min{},min{7}7,9min{},min{5,35),2,1(5),3,2,1(4,34),2,1(4),3,2,1(======ddddddExample: single link04507054)3,2,1(54)3,2,1(123450458079030543)2,1(543)2,1(045890791003602054321543215},min{5),3,2,1(4),3,2,1()5,4(),3,2,1(==dddHierarchical: Complete Link• cluster similarity = similarity of two least similar members+ tight clustersHierarchical: Average Link• cluster similarity = average similarity of all pairsthe most widely used similarity measureRobust against noiseSimilarity measure• In most cases the correlation coefficient ((normalized dot product) is used• The correlation coefficient is defined as:• Advantages:- Identifies relationships regardless of absolute unit changes- A simple way around missing values• Disadvantages- Not a metricyxiyixiyxyxystdxstdyxσσµµρ∑−−==))(()()(),cov(,Cluster resultsCombining several time series yeast datasetsValidationModel based clustering • In model based clustering methods we assume a generative model by which the data was generated• Our goal is to recover the parameters of such model, and use these to cluster the genesModel based clusteringFor simplicity we'll start with the following assumptions:• clusters are exclusive (single gene, single cluster) • we are searching for a fixed number of clusters (k)• variation of profiles within a cluster can be modeled as a multi-variate GaussianClustering algorithm1. initialize cluster models2. iterate until convergence: - assign genes to clusters- estimate cluster models on the basis of the genes assigned to themOur model: Gaussian mixtures• We assume a generative model that works in the following way• In order to generate a new point, we first chose a cluster 1≤i≤k according to p(i)• Next, we select the point using i’s probability distribution model• We assume that the profiles (vectors x =[x1,…,xn]) within each cluster are normally distributed such that x~N(µ,∑).2/)()(2/12/1||)2(1),|(iiTixxiniiexpµµπµ−Σ−−−Σ=ΣLikelihood• Given our model, and a set of parameters for each of the clusters, we can compute the joint likelihood of our data.• Our goal is to find a set of parameters that will maximize the above likelihood∏∏=i jijxpjpMDL )|()()|(Initialize• The easiest way is to chose a random gene as a center for each of the clusters• Initialization is a key aspect of this algorithm (and of other EM type algorithms we have discussed). It is wise to re-run the algorithm several times and chose the highest likelihood result as our clusters.• We will need to chose the variance / covariance for each clusterE step: Assigning profiles to clusters• Simple way: assign each gene (profile xj) to the cluster that gives the highest probability to it. In other words, gene j is assigned to cluster i when• Better way: assign each gene partially to different clusters based on the relative probabilities that the cluster models give to the profile• Each gene profile will consequently be associated with k assignment probabilitiesijxpxpjjii≠∀>)|(),|(σµσµ∑=jjjiixpipxpxip),|()(),|()|(σµσµRe-computing the parameters• We can re-estimate the Gaussian models on the basis of the partial (or simple) assignments • Each cluster i sees a vector of m (the number of genes) assignment probabilities representing the degree to which profiles are assigned to the clusterwi1= P(i|x1)…...wim= P(i|xm)• To re-estimate the cluster models we simply find the weighted mean and the covariance of the profiles, where the weighting is given by the above assignment probabilitiesRe-computing the parametersM step: Re-computing the parameters• To re-estimate the cluster models we simply find the weighted mean and the covariance of the profiles, where the weighting is given by the above assignment probabilities• We also determine the cluster distribution by setting• It can be shown that such a computation is the MLE for the classparameters• The two steps (E and M) are repeated until the parameters no longer change∑∑∑=k jkkjjiijxpxpip),|(),|()(σµσµSecond (and final) iterationThe importance of initializationsThe importance of initializations: Step 1The importance of initializations: Step 2The importance of initializations: Step 5The importance of initializations; Convergence0 10 20−101420 10 20−101380 10 20−0.500.5620 10 20−101350 10 20−0.500.5490 10 20−101530 10 20−101510 10 20−101450 10 20−0.500.5580 10 20−0.500.5500 10 20−101500 10 20−101210 10 20−0.500.5460 10

View Full Document