# Introduction to Multiple Sequence Analysis through GCG’s SeqLab

**View the full content.**View Full Document

0 0 9 views

**Unformatted text preview:**

A Brief Introduction to Multiple Sequence Analysis through GCG s SeqLab The SeqLab Graphical User Interface GUI is a front end to the GCG Wisconsin Sequence Analysis Package It provides an intuitive alternative to the command line by allowing menu driven access to most of GCG s 150 some different programs and is a great way to develop refine and analyze multiple sequence alignments So what s so great about a multiple sequence alignment They are very useful in the development of PCR primers and hybridization probes great for producing annotated publication quality graphics and illustrations invaluable in structure function studies through homology inference essential for building Profiles for remote homology similarity searching and required for molecular evolutionary phylogenetic inference programs such as those from PAUP Phylogenetic Analysis Using Parsimony and other methods and PHYLIP PHYLogeny Inference Package This introductory tutorial will illustrate many of SeqLab s multitude of features just the tipof the iceberg hopefully whetting your appetite enough to encourage further exploration Tuesday July 25 2006 7 00 PM A GCG Wisconsin Package SeqLab tutorial for the Woods Hole Marine Biological Laboratory s Workshop on Molecular Evolution Author and Instructor Steven M Thompson Steven M Thompson Page 2 4 3 06 Steve Thompson BioInfo 4U 2538 Winnwood Circle Valdosta GA USA 31601 7953 stevet bio fsu edu 229 249 9751 GCG is the Genetics Computer Group part of Accelrys Inc producer of the Wisconsin Package for sequence analysis 2006 BioInfo 4U Steven M Thompson Steven M Thompson Page 3 4 3 06 Introduction The power and sensitivity of sequence based computational methods dramatically increases with the addition of more data More data yields stronger analyses if done carefully Otherwise it can confound the issue The patterns of conservation become clearer by comparing the conserved portions of sequences amongst a larger and larger dataset Those areas most resistant to change are functionally the most important to the molecule The basic assumption is that those portions of sequence of crucial functional value are most constrained against evolutionary change They will not tolerate many mutations Not that mutations do not occur in these portions just that most mutations in the region are lethal so we never see them Other areas of sequence are able to drift more readily being less subject to evolutionary pressure Therefore sequences end up a mosaic of quickly and slowly changing regions over evolutionary time However in order to learn anything by comparing sequences we need to know how to compare them We can use those constrained portions as anchors to create a sequence alignment allowing comparison but this brings up the alignment problem and similarity It is easy to see that sequences are aligned when they have identical symbols at identical positions but what happens when symbols are not identical or the sequences are not the same length How can we know that the most similar portions of our sequences are aligned when is an alignment optimal and does optimal mean biologically correct How can anybody figure any of this out Multiple sequence dynamic programming As seen in pairwise dynamic programming looking at every possible position by sliding one sequence along every other sequence just will not work for alignment Therefore dynamic programming reduces the problem back down to N 2 But how do you work with more than just two sequences at a time It becomes a much harder problem You could painstakingly manually align all your sequences using some type of editor and many people do just that but some type of an automated solution is desirable at least as a starting point to manual alignment However solving the dynamic programming algorithm for more than just two sequences rapidly becomes intractable Dynamic programming s complexity and hence its computational requirements increases exponentially with the number of sequences in the dataset being compared complexity sequence length number of sequences Mathematically this is an N dimensional matrix quite complex indeed Pairwise dynamic programming solves a two dimensional matrix and the complexity of the solution is equal to the length of the longest sequence squared Well a three member standard dynamic programming sequence comparison would be a matrix with three axes the length of the longest sequence cubed and so forth You can at least draw a three dimensional matrix but more than that becomes impossible to even visualize It quickly boggles the mind Several different heuristics have been employed over the years to simplify the complexity of the problem One program MSA Gupta et al 1995 attempts to globally solve the N dimensional matrix equation using a bounding box trick However the algorithm s complexity precludes its use in most situations except with very small datasets One way to still globally solve the algorithm and yet reduce its complexity is to restrict the search space to only the most conserved local portions of all the sequences involved This approach is used by the program PIMA Smith and Smith 1992 MSA and PIMA are both available through the Internet at Steven M Thompson Page 4 4 3 06 several bioinformatics servers in particular the Baylor College of Medicine s Search Launcher Smith et al 1996 at http searchlauncher bcm tmc edu where you can investigate these resources on your own time How the algorithm works The most common implementations of automated multiple alignment modify dynamic programming by establishing a pairwise order in which to build the alignment This modification is known as pairwise progressive dynamic programming Originally attributed to Feng and Doolittle 1987 this variation of the dynamic programming algorithm generates a global alignment but restricts its search space at any one time to a local neighborhood of the full length of only two sequences Consider a group of sequences First all are compared to each other pairwise using a quick variation of standard dynamic programming This establishes an order for the set most to least similar The algorithm then takes the top two most similar sequences and aligns them using normal dynamic programming Next it creates a consensus of the two and aligns that consensus to the third sequence using standard dynamic programming Now create a consensus of the first three sequences and align that to the forth most similar Subgroups are clustered together similarly This