Stanford CS 374 - Probcons - D2889799

Home> Schools> Stanford University> Computer Science (CS) > CS 374> Probcons

DOC PREVIEW

Stanford CS 374 - Probcons

School name Stanford University

Course Cs 374- Algorithms in Biology

Pages 42

This preview shows page 1-2-3-20-21-40-41-42 out of 42 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 42 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 42 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 42 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 42 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 42 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 42 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 42 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 42 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 42 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

PROBCONS: Probabilistic Consistency-based Multiple Sequence Alignment Chuong B. Do,1 Mahathi S. P. Mahabhashyam,1 Michael Brudno,1 and Serafim Batzoglou1,2 1Department of Computer Science, Stanford University, Stanford, California 94305, USA 2 Corresponding author. E-MAIL [email protected]; FAX (650) 725-1449.ABSTRACT To study gene evolution across a wide range of organisms, biologists need accurate tools for multiple sequence alignment of protein families. Obtaining accurate alignments, however, is a difficult computational problem because of not only the high computational cost but also the lack of proper objective functions for measuring alignment quality. In this paper, we introduce prob-abilistic consistency, a novel scoring function for multiple sequence comparisons. We present PROBCONS, a practical tool for progressive protein multiple sequence alignment based on prob-abilistic consistency, and evaluate its performance on several standard alignment benchmark datasets. On the BAliBASE, SABmark, and PREFAB benchmark alignment databases, PROB-CONS achieves statistically significant improvement over other leading methods while maintain-ing practical speed. PROBCONS is publicly available as a web resource. Source code and execu-tables are available under the GNU Public License at http://probcons.stanford.edu.INTRODUCTION Given a set of biological sequences, a multiple alignment provides a way of identifying and visualizing patterns of sequence conservation by organizing homologous positions across different sequences in col-umns. As sequence similarity often implies divergence from a common ancestor or functional similarity, sequence comparisons facilitate evolutionary and phylogenetic studies (Phillips et al. 2000, Castillo-Davis et al. 2004), and isolation of the most relevant regions (Attwood 2002) for a variety of biological analy-ses. In particular, conserved amino acid stretches in proteins are strong indicators of preserved three-dimensional structural domains, so protein alignments have been widely used in aiding structure predic-tion (Rost and Sander 1994, Jones 1999) and characterization of protein families (Sonnhammer et al. 1998, Johnson and Church 1999, Bateman et al. 2004). However, when sequence identity falls below 30%, called the ‘twilight zone’ of protein alignments, the accuracies of most automatic sequences align-ment methods drop considerably (Rost 1999, Thompson et al. 1999b). As a result, alignment quality is often the limiting factor in biological analyses of amino acid sequences (Jaroszewski et al. 2002). The problem of alignment construction consists of defining either explicitly or implicitly an objective function for assessing alignment quality and employing an efficient algorithm to find the optimal, or a near optimal, alignment according to the objective function. Two-sequence alignments are usually evalu-ated by addition of match/mismatch scores for aligned pairs of positions and affine gap penalties for un-aligned amino acids (Needleman and Wunsch 1970, Smith et al. 1981). Quantitatively, scores for aligned residues are given by log-odds (Altschul 1991) substitution matrices such as PAM (Dayhoff 1978), GONNET (Gonnet et al. 1992), or BLOSUM (Henikoff and Henikoff 1992). Estimation of appropriate gap penalties, however, is often regarded as a “black art” based on trial-and-error (Vingron and Waterman 1994). For two sequences of length L, an optimal alignment according to this metric may be computed in O(L2) time (Gotoh 1982) and O(L) space (Myers and Miller 1988) via dynamic programming. Pair-hidden Markov models (HMMs) provide an alternative formulation of the sequence alignment problem in which alignment generation is directly modeled as a first-order Markov process involvingstate emissions and transitions. In this approach, model parameters obtain an intuitive probabilistic inter-pretation and can be trained on real data using standard supervised or unsupervised maximum likelihood methods. The Viterbi (1967) algorithm computes the highest probability alignment of two input se-quences according to an alignment pair-HMM. In the standard 3-state pair-HMM for alignment, the Viterbi algorithm may be viewed as an instantiation of the Needleman-Wunsch algorithm in which alignment parameters are determined by a log-odds transformation of the HMM scoring scheme (Durbin et al. 1998). Since they specify a conditional probability distribution over the space of all suboptimal alignments, pair-HMMs also allow the computation of the posterior probability, P(xi ~ yj ∈ a* | x, y), that particular positions xi and yj of the two sequences x and y, respectively, will be matched in a randomly generated alignment a*. Running the Needleman-Wunsch algorithm with these posterior probabilities as substitu-tion scores and no gap penalties gives rise to the maximum expected accuracy alignment method (see Methods), also known as optimal accuracy alignment (Holmes and Durbin 1998). In the general case of multiple sequence comparisons, theoretically sound and biologically motivated scoring methods are not straightforward to devise. In practice, ad hoc sum-of-pairs schemes (Carrillo and Lipman 1998), which combine the projected pairwise log-odds scores for all pairs of sequences in the alignment and their weighted variants (Altschul et al. 1989) are commonly used. Unfortunately, direct application of dynamic programming is too inefficient for alignment of more than a few sequences. In-stead, a variety of heuristic strategies have been proposed, including genetic algorithms (Notredame and Higgins 1996), simulated annealing (Kim et al. 1994), alignment to a profile HMM (Krogh et al. 1994, Eddy 1995), or greedy assemblage of multiple segment-to-segment comparisons (Morgenstern et al. 1996). By far, the most popular heuristic strategies involve tree-based progressive alignment (Feng and Doolittle 1987) in which pairwise alignment steps for groups of sequences are assembled into a complete multiple alignment. As with any hierarchical approach, however, errors at early stages in the alignment not only propagate to the final alignment but also may increase the likelihood of misalignment due to in-correct conservation signals. Post-processing steps such as iterative refinement (Gotoh 1996) alleviate some of the errors made during progressive alignment. Consistency-based schemes take the alternative view that “prevention is the best

View Full Document


School:
Email:
New Password:
Confirm Password:

This preview shows page 1-2-3-20-21-40-41-42 out of 42 pages.

Stanford CS 374 - Probcons

Sign up for free to view:

Please select your school