
10-810: Advanced Algorithms and Models for Computational Biology
Optimal leaf ordering and classification

Hierarchical clustering
• As we mentioned, it is one of the most popular methods for clustering gene expression data.
• One of its main advantages is the global overview of the entire experiment in one figure.
• Biologists often omit the tree and use the figure to determine functional assignments.

Clustering tree
• For n leaves there are n-1 internal nodes.
• Each flip of an internal node creates a new linear ordering.
• There are 2^(n-1) possible linear orderings of the leaves of the tree.

Importance of the Ordering
• Genes that are adjacent in the linear ordering are often hypothesized to share a common function.
• Ordering can help determine relationships between genes and clusters in time series data analysis.

Hierarchical clustering (figure panels: Initial structure, Permuted)

Some heuristics
• Due to the large number of possible orderings (2^(n-1)), finding the optimal ordering was considered impractical by Eisen [Eisen98].
• Thus, some heuristics have been suggested for this problem:
- Order genes based on their expression levels [Eisen98]
- Order clusters using the results of a one-dimensional SOM (Cluster)
- Order leaves and internal nodes based on similarity to the parent's siblings [Alon99]

Problem Definition
Denote by Φ the space of possible linear orderings consistent with the tree, and by v1 … vn the tree leaves. Our goal is to find an ordering that maximizes the similarity of adjacent elements:

max_{φ∈Φ} Σ_{i=1}^{n-1} S(v_{φ(i)}, v_{φ(i+1)})

where S is the similarity matrix.

Computing the Optimal Similarity
Recursively compute the optimal similarity L_T(u,w) for any pair of leaves (u,w) that could be on the two corners (leftmost and rightmost) of T. For a leaf u∈T, C_T(u) is the set of all possible corner leaves of T when u is on one corner of T.

If T has subtrees T1 and T2, then for all u∈T1 and all w∈T2:
L_T(u,w) = max_{m∈C_T1(u), k∈C_T2(w)} [ L_T1(u,m) + L_T2(k,w) + S(m,k) ]

The improved computation splits this into two steps. For all u∈T1 and all k∈T2:
LL(u,k) = max_{m∈C_T1(u)} [ L_T1(u,m) + S(m,k) ]
Then, for all w∈T2:
L_T(u,w) = max_{k∈C_T2(w)} [ LL(u,k) + L_T2(k,w) ]

Algorithm Complexity
Time complexity: F(n) = Θ(n³). By induction: if T has subtrees T1 and T2 with |T| = n, |T1| = s and |T2| = r, we have
F(n) ≤ s·r² + s²·r + F(s) + F(r) ≤ (s+r)³ = n³
For the complete balanced binary tree with n leaves we have: ?
Space complexity: we store one value for each pair of leaves, and we use pointers to reconstruct the path we took. Thus, space complexity is O(n²).
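To make the recurrence concrete, here is a minimal Python sketch of the dynamic program (not taken from the lecture). The nested-tuple tree encoding and the function name optimal_leaf_ordering_score are illustrative assumptions, and only the optimal score is returned; the full algorithm also keeps back-pointers so that the ordering itself can be recovered.

    def optimal_leaf_ordering_score(tree, S):
        """Best sum of similarities between adjacent leaves over all orderings
        consistent with a binary clustering tree.

        tree : nested tuples of leaf indices, e.g. ((0, 1), (2, 3)); a bare int is a leaf
        S    : n x n similarity matrix (list of lists)
        """
        def solve(t):
            # Returns (leaves, L) where L[u][w] is the best score for this subtree
            # with leaf u on one corner and leaf w on the other corner.
            if not isinstance(t, tuple):                  # a single leaf
                return [t], {t: {t: 0.0}}
            leaves1, L1 = solve(t[0])
            leaves2, L2 = solve(t[1])
            L = {v: {} for v in leaves1 + leaves2}
            for u in leaves1:
                # Improved step from the slides: LL[k] is the best score of the left
                # subtree with u on its outer corner, plus S(m, k) across the junction.
                LL = {k: max(L1[u][m] + S[m][k] for m in L1[u]) for k in leaves2}
                for w in leaves2:
                    best = max(LL[k] + L2[w][k] for k in L2[w])
                    L[u][w] = best
                    L[w][u] = best                        # the mirror image of the same ordering
            return leaves1 + leaves2, L

        _, L = solve(tree)
        return max(score for row in L.values() for score in row.values())

    # Example: four leaves; the best consistent ordering is 0,1,2,3 with score 2.0
    S = [[1.0, 0.9, 0.1, 0.2],
         [0.9, 1.0, 0.3, 0.1],
         [0.1, 0.3, 1.0, 0.8],
         [0.2, 0.1, 0.8, 1.0]]
    print(optimal_leaf_ordering_score(((0, 1), (2, 3)), S))   # prints 2.0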
Running Time – Random Inputs (table: number of leaves, number of values, time to compute S, improved O(n⁴), O(n³), improved O(n³))

Running Time – Biological Datasets (table: type of dataset, number of genes, number of experiments, time to compute S, improved O(n⁴), O(n³), improved O(n³); datasets include cell cycle cdc15 (800 genes, 24 experiments), different sources / Eisen (979 genes, 79 experiments), cell cycle (Spellman) and environmental response (Young))

Results – Synthetic Data (figure panels: Input, Hierarchical clustering, Optimal ordering)

Biological Results
• Spellman identified 800 genes as cell cycle regulated in Saccharomyces cerevisiae.
• Genes were assigned to five groups, termed G1, S, S/G2, G2/M and M/G1, which approximate the commonly used cell cycle groups in the literature.
• This assignment was performed using a 'phasing' method, which is a supervised classification algorithm.
• In addition to the phasing method, the authors clustered these genes using hierarchical clustering.

Cell Cycle – 24 experiments of the cdc15 temperature sensitive mutant (figure panels: Hierarchical clustering, Optimal ordering)

Classification

Types of classifiers
• We can divide the large variety of classification approaches into roughly two main types:
1. Generative:
- build a generative statistical model
- e.g., mixture model
2. Discriminative:
- directly estimate a decision rule/boundary
- e.g., logistic regression

Golub et al.
• 38 training samples (27 ALL, 11 AML).
• Each gene was initially compared to an idealized expression pattern: 11111111111111000000000000000000 for class 1 and similarly 000000000000000000000011111111111111 for the second class.
• The actual selection was done by setting:
p(g,c) = (µ1(g) − µ2(g)) / (σ1(g) + σ2(g))
• Large values of |p(g,c)| indicate strong correlation between the gene and the classes, and the sign of p(g,c) depends on the class in which this gene is expressed.
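As a concrete illustration of this selection step (not the lecture's own code), the short sketch below computes p(g,c) for every gene and keeps the genes with the largest |p(g,c)|. The function names, the 0/1 label encoding, and the plain top-k cut are assumptions; Golub et al. take the most informative genes from each class separately, which this simplified version does not enforce.

    import numpy as np

    def signal_to_noise(X, y):
        """Compute p(g,c) = (mu1(g) - mu2(g)) / (sigma1(g) + sigma2(g)) for every gene.

        X : (n_samples, n_genes) expression matrix
        y : (n_samples,) array of labels, 1 for class 1 (e.g. ALL) and 0 for class 2 (e.g. AML)
        """
        mu1, mu2 = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
        s1, s2 = X[y == 1].std(axis=0), X[y == 0].std(axis=0)
        return (mu1 - mu2) / (s1 + s2)

    def select_informative_genes(X, y, k=50):
        """Indices and scores of the k genes with the largest |p(g,c)|."""
        p = signal_to_noise(X, y)
        top = np.argsort(-np.abs(p))[:k]
        return top, p[top]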
Weighted voting
• Use a subset of the selected genes (50).
• Set a_g = p(g,c) and b_g = (µ1(g) + µ2(g)) / 2.
• Given a new sample X, we set the vote of gene g to:
v_g = a_g (x_g − b_g)
• A positive value is a vote for class 1 and a negative value is a vote for the second class.

Voting strength
• The votes are summed for each of the two classes.
• The decision is made by using:
PS = (V_win − V_lose) / (V_win + V_lose)
• PS determines our confidence in the classification result.
• How do we choose PS?

Testing the classifier
• Cross validation.
• Test set: 34 samples:
- 20 ALL
- 14 AML
• 29 of the 34 had a classification value higher than the threshold, and all of these were predicted correctly.

Classification results
Selected genes
Can we do better?

Generative classifiers
• A mixture of two Gaussians, one Gaussian per class:
X ∈ class 1 ⇒ X ~ N(µ1, σ1)
X ∈ class 0 ⇒ X ~ N(µ0, σ0)
• where X corresponds to, e.g., a tissue sample (expression levels across the genes).
• Three basic problems we need to address:
- decisions
- estimation
- variable (feature) selection

Decision: Bayesian classifiers
• Given a probabilistic model and an unlabeled data vector X, we can use Bayes rule to determine the class:
p(class=1 | X) = P(X | class=1) P(class=1) / [ P(X | class=1) P(class=1) + P(X | class=0) P(class=0) ]
• We compute p(class=1 | X) and p(class=0 | X) and choose the class with the highest probability.
• This method can easily be extended to multiple classes.

Decision boundary
• Using a Bayes classifier, the decision comes down to the following (log) likelihood ratio:
log [ p(X | µ1, σ1) p(class=1) / ( p(X | µ0, σ0) p(class=0) ) ] > 0 ⇒ class 1
• Why include the priors? The prior class probabilities P(class) bias our decisions towards one class or the other.
• Decision boundary: the set of points X at which this log ratio equals zero, i.e. where p(class=1 | X) = p(class=0 | X).
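To make the decision rule concrete, here is a minimal sketch of a diagonal-Gaussian (naive-Bayes-style) classifier of the kind described above, assuming one independent Gaussian per gene within each class. The function names, the default equal priors, and the small variance floor are illustrative assumptions rather than the lecture's own implementation.

    import numpy as np

    def fit_class_gaussians(X, y):
        """Per-gene mean and standard deviation for each class.

        X : (n_samples, n_genes) expression matrix
        y : (n_samples,) array of 0/1 class labels
        """
        params = {}
        for c in (0, 1):
            Xc = X[y == c]
            # small floor on sigma so constant genes do not cause division by zero
            params[c] = (Xc.mean(axis=0), Xc.std(axis=0) + 1e-6)
        return params

    def log_likelihood_ratio(x, params, prior1=0.5):
        """log[ p(x|class 1) P(class 1) / (p(x|class 0) P(class 0)) ];
        positive values favour class 1, negative values favour class 0."""
        def log_gauss(v, mu, sigma):
            return np.sum(-0.5 * ((v - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi)))
        return (log_gauss(x, *params[1]) + np.log(prior1)) - \
               (log_gauss(x, *params[0]) + np.log(1.0 - prior1))

    def classify(x, params, prior1=0.5):
        # Bayes decision rule: choose class 1 exactly when the log ratio is positive
        return 1 if log_likelihood_ratio(x, params, prior1) > 0 else 0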

