Alignment ProblemKey IssuesTypes of AlignmentLocal vs. Global Pairwise AlignmentsHow do you compare alignments?Scoring MatricesMultiple Sequence AlignmentImprovementsIterative AlgorithmsSoftwareComparisonsWhy Genome Sequencing?Modern Sequencing MethodsSlide 14Automated SequencingReads and ContigsClone-by-Clone vs. ShotgunIn a Perfect WorldDifficulties?The Fruit FlySlide 21AbstractionOverlap-Layout-ConsensusApproximation AlgorithmsHandling RepeatsSlide 26HybridizationSequencing-By-HybridizationBridges of KönigsbergPros and ConsGraph PreprocessingSlide 32Sizes of genomes and numbers of genesSequencing parametersSequence accuracySequence accuracy and sequencing costSequencing coverageNCBI Genome SummaryAlignment Problem(Optimal) pairwise alignment consists of considering all possible alignments of two sequences and choosing the optimal one.Sub-optimal (heuristic) alignment algorithms are also very important: e.g. BLASTKey IssuesTypes of alignments (local vs. global)The scoring systemThe alignment algorithmMeasuring alignment significanceTypes of AlignmentGlobal—sequences aligned from end-to-end.Local—alignments may start in the middle of either sequenceUngappe d—no insertions or deletions are allowedOther types: overlap alignments, repeated match alignmentsLocal vs. Global Pairwise AlignmentsA global alignment includes all elements of the sequences and includes gaps.A global alignment may or may not include "end gap" penalties. Global alignments are better indicators of homology and take longer to compute. A local alignment includes only subsequences, and sometimes is computed without gaps.Local alignments can find shared domains in divergent proteins and are fast to computeHow do you compare alignments?Scoring schemeWhat events do we score?MatchesMismatchesGapsWhat scores will you give these events?What assumptions are you making?Score your alignmentScoring MatricesHow do you determine scores?What is out there already for your use?DNA versus Amino Acids?TTACGGAGCTTCCTGAGATCCMultiple Sequence AlignmentGlobal versus Local AlignmentsProgressive alignmentEstimate guide treeDo pairwise alignment on subtreesClustalXImprovementsConsistency-based AlgorithmsT-Coffee - consistency-based objective function to minimize potential errorsGenerates pair-wise global (Clustal)Local (Lalign)Then combine, reweight, progressive alignmentIterative AlgorithmsEstimate draft progressive alignment (uncorrected distances)Improved progressive (reestimate guide tree using Kimura 2-parameter)Refinement - divide into 2 subtrees, estimate two profiles, then re-align 2 profilesContinue refinement until convergenceSoftwareClustalT-CoffeeMUSCLE (limited models)MAFFT (wide variety of models)ComparisonsSpeedMuscle>MAFFT>CLUSTALW>T-COFFEEAccuracyMAFFT>Muscle>T-COFFEE>CLUSTALWLots more work to do here!Why Genome Sequencing?Modern Sequencing MethodsSanger (1982) introduced a sequencing method amenable to automation.Whole-genome sequencing: Clone-By-Clone vs. Shotgun AssemblyDrosophila melongaster sequenced (Myers et al. 2000)Homo sapien sequenced (Venter et al. 2001)Main idea: Obtain fragments of all possible lengths, ending in A, C, T, G.Using gel electrophoresis, we can separate fragments of differing lengths, and then assemble them.Sanger (1982) introduced chain-termination sequencing.Automated SequencingPerkin-Elmer 3700:Can sequence ~500bp with 98.5% accuracyReads and ContigsSequencing machines are limited to about ~500-750bp, so we must break up DNA into short and long fragments, with reads on either end.Reads are then assembled into contigs, then scaffolds.Clone-by-Clone vs. ShotgunTraditionally, long fragments are mapped, and then assembled by finding a minimum tiling path. Then, shotgun assembly is used to sequence long fragments.Shotgun assembly is cheaper, but requires more computational resources.Drosophila was successfully sequenced using shotgun assembly.In a Perfect WorldDifficulties?Good coverage does not guarantee that we can “see” repeats.Read coverage is generally not “truly” random, due to complications in fragmentation and cloning.Any automated approach requires extensive post-processing.Phrap www.phrap.orgThe Fruit FlyDrosophila melongaster was sequenced in 2000 using whole genome shotgun assembly.Genome size is ~120Mbp for euchromatic (coding) portion, with roughly 13,600 genes.The genome is still being refined.NIH used a Clone-By-Clone strategy; Celera used shotgun assembly.Celera used 300 sequencing machines in parallel to obtain 175,000 reads per day.Efforts were combined, resulting in 8x coverage of the human genome; consensus sequence is 2.91 billion base pairs.AbstractionThe basic question is: given a set of fragments from a long string, can we reconstruct the string?What is the shortest common superstring of the given fragments?Overlap-Layout-ConsensusConstruct a (directed) overlap graph, where nodes represent reads and edges represent overlap. Paths are contigs in this graph. Problem: Find the consensus sequence by finding a path that visits all nodes in layout graph.Note: This is an idealization, since we must handle errors!Approximation AlgorithmsThe shortest common superstring problem is NP-complete.Greedily choosing edges is a 4-approximation, conjectured to be a 2-approximation.Another idea: TSP has a 2-approximation if the edge weights are metric (Waterman et al. 1976 gives such metrics).Handling RepeatsWe can estimate how much coverage a given set of overlapping reads should yield, based on coverage.Repeats will “seem” to have unusually good coverage.Celera’s algorithms are proprietary, but there is no explicit way to handle repeats in the overlap-layout-consensus paradigm.The Big PictureHybridizationSuppose we had a way to probe fragments of length k that were present in our sequence, from a hybridization assay.Commercial products: Affymetrix GeneChip, Agilent, Amersham, etc.Sequencing-By-HybridizationThen instead of reads, we have regularly sized fragments, k-mers.Construct a multigraph G with (k-1)-mers as nodes, with edges representing k-mers. G is a de Bruijn graph.Idea: An Eulerian path in G corresponds to the assembled sequence, and we don’t lose repeats (Pevzner 1989).Bridges of KönigsbergTheorem (Euler 1736): A graph has a
View Full Document