##
This **preview** shows page *1-2-17-18-19-35-36*
out of 36 **pages**.

*View Full Document*

End of preview. Want to read all 36 pages?

Upload your study docs or become a GradeBuddy member to access this document.

View Full Document**Unformatted text preview:**

A Brief Introduction to Multiple Sequence Analysisthrough GCG’s SeqLabThe SeqLab Graphical User Interface (GUI) is a ‘front-end’ to the GCG WisconsinSequence Analysis Package. It provides an intuitive alternative to the command line byallowing menu-driven access to most of GCG’s 150-some different programs and is a greatway to develop, refine, and analyze multiple sequence alignments. So what’s so greatabout a multiple sequence alignment? They are:very useful in the development of PCR primers and hybridization probes;great for producing annotated, publication quality, graphics and illustrations;invaluable in structure/function studies through homology inference;essential for building “Profiles” for remote homology similarity searching; andrequired for molecular evolutionary phylogenetic inference programs such as thosefrom PAUP* (Phylogenetic Analysis Using Parsimony [and other methods]) andPHYLIP (PHYLogeny Inference Package).This introductory tutorial will illustrate many of SeqLab's multitude of features, just the ‘tip-of-the-iceberg,’ hopefully whetting your appetite enough to encourage further exploration.Tuesday, July 24, 2007, 7:00 PMA GCG¥ Wisconsin Package™ SeqLab® tutorial for the Woods Hole MarineBiological Laboratory’s Workshop on Molecular Evolution.Author and Instructor: Steven M. ThompsonSteven M. ThompsonPage 2Steve ThompsonBioInfo 4U2538 Winnwood CircleValdosta, GA, USA [email protected]¥GCG is the Genetics Computer Group, part of Accelrys Inc.,producer of the Wisconsin Package™ for sequence analysis. 2007 BioInfo 4USteven M. ThompsonSteven M. ThompsonPage 3IntroductionThe power and sensitivity of sequence based computational methods dramatically increases with the additionof more data. More data yields stronger analyses — if done carefully! Otherwise, it can confound the issue.The patterns of conservation become clearer by comparing the conserved portions of sequences amongst alarger and larger dataset. Those areas most resistant to change are functionally the most important to themolecule. The basic assumption is that those portions of sequence of crucial functional value are mostconstrained against evolutionary change. They will not tolerate many mutations. Not that mutations do notoccur in these portions, just that most mutations in the region are lethal so we never see them. Other areasof sequence are able to drift more readily being less subject to evolutionary pressure. Therefore, sequencesend up a mosaic of quickly and slowly changing regions over evolutionary time. However, in order to learnanything by comparing sequences, we need to know how to compare them. We can use those constrainedportions as ‘anchors’ to create a sequence alignment allowing comparison, but this brings up the alignmentproblem and ‘similarity.’ It is easy to see that sequences are aligned when they have identical symbols atidentical positions, but what happens when symbols are not identical or the sequences are not the samelength. How can we know that the most similar portions of our sequences are aligned, when is an alignmentoptimal, and does optimal mean biologically correct? How can anybody figure any of this out?Multiple sequence dynamic programmingAs seen in pairwise dynamic programming, looking at every possible position by sliding one sequence alongevery other sequence, just will not work for alignment. Therefore, dynamic programming reduces the problemback down to N2. But how do you work with more than just two sequences at a time? It becomes a muchharder problem. You could painstakingly manually align all your sequences using some type of editor, andmany people do just that, but some type of an automated solution is desirable, at least as a starting point tomanual alignment. However, solving the dynamic programming algorithm for more than just two sequencesrapidly becomes intractable. Dynamic programming’s complexity, and hence its computational requirements,increases exponentially with the number of sequences in the dataset being compared (complexity=[sequencelength]number of sequences). Mathematically this is an N-dimensional matrix, quite complex indeed. Pairwisedynamic programming solves a two-dimensional matrix, and the complexity of the solution is equal to thelength of the longest sequence squared. Well, a three-member standard dynamic programming sequencecomparison would be a matrix with three axes, the length of the longest sequence cubed, and so forth. Youcan at least draw a three-dimensional matrix, but more than that becomes impossible to even visualize. Itquickly boggles the mind!Several different heuristics have been employed over the years to simplify the complexity of the problem.One program, MSA (Gupta et al., 1995), attempts to globally solve the N-dimensional matrix equation using abounding box trick. However, the algorithm’s complexity precludes its use in most situations, except with verysmall datasets. One way to still globally solve the algorithm and yet reduce its complexity is to restrict thesearch space to only the most conserved ‘local’ portions of all the sequences involved. This approach is usedby the program PIMA (Smith and Smith, 1992). MSA and PIMA are both available through the Internet atSteven M. ThompsonPage 4 several bioinformatics servers (in particular the Baylor College of Medicine’s Search Launcher [Smith, et al.1996] at http://searchlauncher.bcm.tmc.edu/) where you can investigate these resources on your own time.How the algorithm worksThe most common implementations of automated multiple alignment modify dynamic programming byestablishing a pairwise order in which to build the alignment. This modification is known as pairwise,progressive dynamic programming. Originally attributed to Feng and Doolittle (1987), this variation of thedynamic programming algorithm generates a global alignment, but restricts its search space at any one timeto a local neighborhood of the full length of only two sequences. Consider a group of sequences. First all arecompared to each other, pairwise, using a quick variation of standard dynamic programming. Thisestablishes an order for the set, most to least similar. The algorithm then takes the top two, most similarsequences, and aligns them using normal dynamic programming. Next it creates a consensus of the two, andaligns that