Sequence clustering

Introduction

Data clustering is one of the key tools used in various incarnations of data mining: trying to make sense of large datasets. It is therefore natural to ask whether clustering approaches can enhance our understanding of biological sequences. A common application of clustering is the analysis of 16S rRNA sequences extracted from an environment. In theory, each distinct rRNA sequence should represent a single organism; in practice, due to sequencing errors and recent evolutionary changes, a single organism is represented by many closely related rRNA sequences. Thus, by clustering the set of sequences we might infer how many organisms are present in our sample.

Clustering basics

Clustering approaches can be classified into three main categories:

• Agglomerative - the algorithm starts with each sequence as an independent cluster, then iteratively joins sequences together to create larger clusters.
• Divisive - the algorithm starts with all the sequences in one cluster, then iteratively breaks up the cluster into a collection of smaller clusters.
• Partitioning/partition methods (e.g., k-means) - the algorithm starts with a pre-defined set of clusters, then refines the cluster membership to improve cluster quality.

Various approaches use different criteria for when the clustering should stop, usually based on some concept of cluster quality. The basic goal is to create clusters such that all sequences within a cluster are similar to each other (clusters are compact), and sequences contained in different clusters are dissimilar to each other (clusters are well separated). There are a variety of mathematical measures that capture one or both of these criteria (see http://machaon.karanagai.com/validation_algorithms.html).
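To make the partitioning category concrete, here is a minimal pure-Python sketch of k-means on 1-D data. The function name and the toy data are illustrative, not from the notes: the algorithm simply alternates between assigning each point to its nearest center and moving each center to the mean of its assigned points.

```python
import random

def kmeans_1d(points, k, iters=100, seed=0):
    """Minimal k-means on 1-D data: pick k initial centers, then
    alternate the assignment and refinement steps until stable."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[i].append(p)
        # Refinement step: move each center to its cluster's mean.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break  # cluster membership no longer changes
        centers = new_centers
    return sorted(centers)

# Two well-separated groups; the centers converge near 1.0 and 10.0.
print(kmeans_1d([1.0, 1.2, 0.8, 10.0, 10.4, 9.6], k=2))
```

Real sequence clustering would replace the absolute difference with an alignment-based distance, but the refine-until-stable structure is the same.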
It is important to note that the quality of a cluster can be assessed either in an unsupervised way (the clustering algorithm and the validation algorithm have access to the same information) or in a supervised way (the validation algorithm has access to additional information that can be used to evaluate cluster quality).

Hierarchical clustering

Hierarchical clustering is a variant of agglomerative clustering that continues the agglomeration process until only one cluster remains. Essentially, through hierarchical clustering you construct a tree, having the sequences as leaves, that captures the relationships between the sequences at various levels. Clusters can be constructed by cutting the tree at particular points (imagine slicing a head of broccoli with one slice - each floret becomes an individual cluster); i.e., for a given node, all its descendants are part of the same cluster. The basic hierarchical clustering algorithm proceeds as follows, in a greedy manner:

1. Compute all pair-wise distances between the sequences.
2. Merge together the sequences that are closest (most similar) to each other.
3. Compute the distance between the newly created cluster and all other sequences/clusters.
4. Repeat from 2.

Different hierarchical clustering methods differ in the way they define the distance between already computed clusters, or between clusters and individual sequences.
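The greedy merge loop above can be sketched in a few lines of pure Python. This toy version clusters 1-D numbers rather than sequences and uses single linkage (smallest pairwise distance) as its cluster-to-cluster distance; the function names and the stopping parameter `target_k` are illustrative additions, not part of the notes.

```python
def agglomerate(points, target_k):
    """Greedy agglomerative clustering of 1-D points, stopping
    when only target_k clusters remain."""
    # Step 1: every point starts as its own cluster.
    clusters = [[p] for p in points]

    def single_linkage(a, b):
        # Cluster distance = smallest distance over all cross pairs.
        return min(abs(x - y) for x in a for y in b)

    # Steps 2-4: merge the closest pair of clusters, then repeat.
    while len(clusters) > target_k:
        pairs = [(single_linkage(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)           # closest pair wins
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

print(agglomerate([1, 2, 9, 10, 30], target_k=3))
# -> [[1, 2], [9, 10], [30]]
```

Running the loop all the way to `target_k=1` and recording each merge would yield the full tree described above; cutting that tree at different heights recovers different clusterings.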
Thus we have:

• Nearest neighbor (single linkage) - the distance between clusters i and j is equal to the smallest distance among all pairs of sequences, the first from cluster i and the second from cluster j.
• Furthest neighbor (complete linkage) - the distance between clusters i and j is equal to the largest distance among all pairs of sequences, the first from cluster i and the second from cluster j.
• Average neighbor (average linkage, UPGMA) - the distance between clusters i and j is equal to the average distance among all pairs of sequences, the first from cluster i and the second from cluster j.
• Ward's clustering - the clusters merged are the ones whose union has the lowest variance in within-cluster distances.

Semi-supervised hierarchical clustering (VI-cut)

In general, the goal of clustering is to uncover relationships within a set of previously uncharacterized sequences. In many cases, however, some of these relationships are known, or can be inferred. In the 16S rRNA example above, we might already know the organisms associated with some of the sequences. This information can be used to validate the clusters constructed through one of the clustering methods. A good metric for comparing the "true" clustering to the computed clustering is the Variation of Information metric:

VI(C1, C2) = H(C1) + H(C2) - 2 I(C1, C2),

where H is the entropy of a clustering and I is the mutual information between the two clusterings. VI(C1, C2) = 0 if and only if C1 and C2 are the same clustering. Navlakha et al. (paper on syllabus page) showed that the VI distance has a nice decomposition property (it is easy to compute the VI distance between two clusterings given the distances between subsets of these clusterings), which allows the construction of a dynamic programming algorithm for finding the "perfect" cut-set in a hierarchical clustering tree - i.e., the set of cuts in the tree that maximizes the concordance between the resulting clusters and a previously known clustering of a subset of the sequences.
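The VI formula above is straightforward to compute when each clustering is represented as a per-item label list. A minimal sketch (function names are mine; natural log is used, so VI is in nats):

```python
from collections import Counter
from math import log

def entropy(labels):
    """H(C) = -sum over clusters of p * log(p), p = cluster fraction."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def mutual_information(c1, c2):
    """I(C1, C2) from the joint distribution of cluster co-membership."""
    n = len(c1)
    joint = Counter(zip(c1, c2))
    p1, p2 = Counter(c1), Counter(c2)
    return sum((nij / n) * log((nij / n) / ((p1[a] / n) * (p2[b] / n)))
               for (a, b), nij in joint.items())

def variation_of_information(c1, c2):
    """VI(C1, C2) = H(C1) + H(C2) - 2 I(C1, C2)."""
    return entropy(c1) + entropy(c2) - 2 * mutual_information(c1, c2)

a = [0, 0, 1, 1]  # clustering C1 as one label per item
b = [1, 1, 0, 0]  # the same partition with relabeled clusters
c = [0, 1, 0, 1]  # a genuinely different partition
print(variation_of_information(a, b))  # identical partitions -> ~0.0
print(variation_of_information(a, c) > 0)
```

Note that VI depends only on the partitions, not on the label names, which is why `a` and `b` are at distance zero.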
This results in a semi-supervised clustering approach.

Phylogenetic trees

The hierarchical clustering approaches described above can be applied to any type of data, as long as a distance can be defined between the objects being clustered. One biological application of clustering attempts to infer the evolutionary history that gave rise to a set of sequences observed today. In this setting, the distance between sequences attempts to capture the evolutionary relatedness of the sequences, i.e., the time (usually measured as a number of mutations) that separates the sequences from each other. Thus, the goal is to reconstruct a tree that best captures the "heritage" of a set of sequences, the root of the tree being the most recent common ancestor of all the sequences. In the context of protein sequences, the BLOSUM and PAM matrices discussed during the sequence alignment lectures provide a reasonable approximation of the evolutionary distance. For nucleotide sequences, several models of evolution apply that attempt to estimate the likelihood that a certain base will mutate into another base during a given evolutionary time. The evolutionary time separating two sequences can be estimated from the number of identities in the alignment between the sequences.

Neighbor-joining

Simple hierarchical clustering approaches have been applied in the
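As an illustration of the nucleotide evolutionary models mentioned above, one standard choice (not named in the notes) is the Jukes-Cantor model, which assumes every base mutates to each other base at the same rate and corrects the observed mismatch fraction p into an estimated distance d = -(3/4) ln(1 - (4/3) p). A sketch, assuming pre-aligned sequences of equal length:

```python
from math import log

def jukes_cantor(seq1, seq2):
    """Estimate evolutionary distance between two aligned nucleotide
    sequences under the Jukes-Cantor model (an assumed choice here):
    d = -(3/4) * ln(1 - (4/3) * p), p = fraction of mismatched sites."""
    assert len(seq1) == len(seq2), "sequences must be aligned"
    p = sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)
    if p == 0:
        return 0.0  # identical sequences: zero distance
    return -0.75 * log(1 - (4.0 / 3.0) * p)

print(jukes_cantor("ACGT", "ACGT"))  # -> 0.0
# One mismatch in 20 sites: the corrected distance slightly
# exceeds the raw mismatch fraction p = 0.05.
print(jukes_cantor("ACGTACGTACGTACGTACGT", "ACGTACGTACGTACGTACGA"))
```

The correction accounts for multiple substitutions at the same site, which is why d grows faster than p as sequences diverge; such distances are the typical input to neighbor-joining.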


UMD CMSC 858W - Sequence clustering
