Stanford CS 374 - Lecture 4 - Mapping genomes onto each other -- Synteny detection

School name Stanford University

Course CS 374 - Algorithms in Biology

Mapping genomes onto each other -- Synteny detection

I Background and Introduction
    Why map genomes onto each other?
    How is this genomic mapping achieved?
II Motivation
    Why aren’t current methods good enough?
III The PASH Methodology
    3) Algorithmic Complexity of the Pash method
        The Worst-case scenario…
        In a Practical scenarios…
    4) Significance of Similarities
IV Results and Conclusion

Mapping genomes onto each other - Synteny detection
CS374 Fall 2004, Lecture 4, 10/07/04
Lecturer: Aswath Manohar
Scribe: Chirag Bhatt

Based on the following papers:
1. Ken J. Kalafus, Andrew R. Jackson, and Aleksandar Milosavljevic, “Pash: Efficient Genome-Scale Sequence Anchoring by Positional Hashing”, Genome Research, 672-678, December 2003.

Additional Resources:
1. http://www.genboree.org; Interactive version of Figure 3 and the Virtual Genome Painting program.
2. http://www.hgsc.bcm.tmc.edu/projects/rat/; Rat Genome Project
3. http://www.ncbi.nih.gov/; National Center for Biotechnology Information home page.

I Background and Introduction

Why map genomes onto each other?

One expected benefit of genome sequencing is the identification of functional DNA elements through comparative methods. A comparison of the mouse and human genomes has revealed that approximately 5% of these genomes are under purifying selection (Waterston et al. 2002). Rat/human or rat/mouse genome comparisons yield similar statistics (Rat Genome Sequencing Consortium 2004), yet only about a third of this conserved sequence is accounted for by known genes, indicating that a large set of functional elements remains uncharacterized. Identification of functional elements by genome comparison depends heavily on the quality of sequence alignments.

How is this genomic mapping achieved?

Many dynamic programming algorithms can be used to identify these similarities, such as the standard dynamic programming algorithms of Needleman and Wunsch (1970) and Smith and Waterman (1981).
However, these methods are computationally very expensive. More recent, faster algorithms such as LAGAN exist, but they perform well only at the megabase scale (i.e., after some pre-processing has been done at the genome scale).

Even faster comparisons are achieved by the various “seed-and-extend” methods. In a seed-and-extend method, one or more exactly matching k-mers (“seeds” or “hot spots”) provide initial evidence of possible similarity. These seeds are then extended into sequence alignments.

The extension step is more accurate than the seeding step, but it is computationally expensive, so these methods quickly abandon most candidate similarities because they do not immediately yield alignments that are likely to be statistically significant.

II Motivation

Why aren’t current methods good enough?

Seed-and-extend methods have two main drawbacks. First, large-scale comparisons, such as genome-scale comparisons, are computationally expensive, which restricts such methods to labs that have access to large computing clusters. Second, current implementations of seed-and-extend methods provide few options to trade sensitivity for speed.

To address these limitations, the Positional Hashing method (Pash) was developed. Positional Hashing represents sequences as collections of short k-mers, rather than as individual bases, throughout the comparison process. Local clusters of matching k-mers are collated together to identify sequence similarity.
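
The seeding idea shared by seed-and-extend methods and Pash (exactly matching k-mers as initial evidence of similarity) can be sketched as follows. This is only an illustrative sketch, not the Pash implementation: the function names `kmer_index` and `find_seeds` are hypothetical, positions are 0-based, and the whole of sequence `s` is indexed in memory, which would not scale to genome-sized inputs.

```python
from collections import defaultdict


def kmer_index(seq, k):
    """Map each k-mer in seq to the list of 0-based positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return index


def find_seeds(s, t, k):
    """Return sorted (pos_in_s, pos_in_t) pairs where the same k-mer occurs in both."""
    index = kmer_index(s, k)
    seeds = []
    for pos_t in range(len(t) - k + 1):
        # index.get avoids inserting empty entries for k-mers absent from s.
        for pos_s in index.get(t[pos_t:pos_t + k], []):
            seeds.append((pos_s, pos_t))
    return sorted(seeds)


if __name__ == "__main__":
    print(find_seeds("ACGTACGT", "TTACGTAA", 4))
```

Each returned pair is one candidate seed; a seed-and-extend method would go on to extend it into an alignment, whereas the approach described in these notes looks for local clusters of such matches.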
Whereas other methods achieve parallelism by requiring users to divide the sequences into many subsequences and perform all pairwise comparisons between them (thus incurring a quadratic penalty), Positional Hashing achieves seamless parallelism in linear time by assigning computing nodes to compare subsets of diagonals.

III The PASH Methodology

1) Dividing the Comparison Problem Across Diagonals (refer to figure on the next page)

Consider the comparison matrix shown in Fig. 1A, where each dimension represents the entire genome of one of the species we want to compare. Any sequence similarity appears along one of the diagonals of this comparison matrix. Hence, if we observe a match of a string of k consecutive base pairs (i.e., a k-mer) along any of the diagonals, we have a k-mer match. These matches are the “seeds” or “hot spots” referred to in the discussion of seed-and-extend methods.

In contrast to other sensitive comparison methods, Positional Hashing divides the comparison problem into the sub-problems of finding similarities within subsets of diagonals, each subset consisting of diagonals L base pairs apart (Fig. 1B). These sub-problems are each independently solvable on a separate node of a computer cluster. To further localize detection of similarities, diagonals are divided into diagonal segments, also of length L (Fig. 1C, dashed lines).

How is the comparison problem divided into subsets? (Fig. 1A => 1B)

The alignment diagonals that start at the same position modulo a fixed distance L (typically around 500 bp) are jointly referred to as a “diagonal” and are denoted by D(d), d = 0, …, L - 1.
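
As a rough illustration of this grouping, the sketch below assigns a k-mer match to one of the L classes D(d) and partitions a list of matches accordingly. It rests on assumptions that are not stated in the notes: positions are 0-based, a match at position `pos_s` in S and `pos_t` in T is taken to lie on the alignment diagonal with offset `pos_t - pos_s`, and the names `diagonal_class` and `group_by_class` are hypothetical.

```python
from collections import defaultdict


def diagonal_class(pos_s, pos_t, L):
    """Class d in 0..L-1 of the alignment diagonal through (pos_s, pos_t)."""
    # Python's % returns a non-negative result for a positive modulus,
    # so negative offsets also map into 0..L-1.
    return (pos_t - pos_s) % L


def group_by_class(seeds, L):
    """Partition seed matches by diagonal class d."""
    groups = defaultdict(list)
    for pos_s, pos_t in seeds:
        groups[diagonal_class(pos_s, pos_t, L)].append((pos_s, pos_t))
    return dict(groups)


if __name__ == "__main__":
    print(group_by_class([(0, 0), (0, 4), (1, 2), (5, 2)], L=4))
```

Because no match belongs to more than one class, each group can be handed to a different computing node without any pairwise comparison between groups, which is the property the notes rely on for parallelism.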
The two compared sequences, say S and T, are conceptually divided into the following non-overlapping subsequences of length L:

    S_i  = S[i*L + 1, …, (i + 1)*L],    where i  = 0, …, |S|/L - 1
    T_i’ = T[i’*L + 1, …, (i’ + 1)*L],  where i’ = 0, …, |T|/L - 1

How are the hash tables built? (Fig. 1B => 1C)

Positional hash tables H(d)_{j,j+d}, where j = 0, …, L - k, which correspond to the diagonal D(d), contain the indices i and i’ of k-mers S_i[j + 1, …, j + k] and T_i’[d + j + 1, …, d + j + k] for all i and i’. Identical k-mers are translated into the same hash key, and their corresponding indices are consequently collected in the same hash
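
The table definition above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: a Python dictionary keyed by the k-mer string stands in for the hash table, offsets are 0-based, a T k-mer whose offset d + j + k exceeds L is assumed to continue into the following segment of T, and k-mers that would run past the end of T are skipped. The names `positional_hash_table` and `matches_on_diagonal` are hypothetical.

```python
from collections import defaultdict


def positional_hash_table(s, t, L, k, d, j):
    """Sketch of one table H(d)_{j,j+d}: k-mer -> (S segment indices, T segment indices).

    S_i contributes s[i*L + j : i*L + j + k]; T_i' contributes
    t[i'*L + d + j : i'*L + d + j + k] (assumed to spill into the next segment).
    """
    table = defaultdict(lambda: ([], []))
    for i in range(len(s) // L):
        start = i * L + j
        table[s[start:start + k]][0].append(i)
    for i2 in range(len(t) // L):
        start = i2 * L + d + j
        if start + k <= len(t):  # skip k-mers that would run past the end of t
            table[t[start:start + k]][1].append(i2)
    return table


def matches_on_diagonal(s, t, L, k, d):
    """Collect (i, i', j) triples where S_i and T_i' share a k-mer in class D(d)."""
    hits = []
    for j in range(L - k + 1):  # j = 0, ..., L - k
        for s_ids, t_ids in positional_hash_table(s, t, L, k, d, j).values():
            hits.extend((i, i2, j) for i in s_ids for i2 in t_ids)
    return sorted(hits)


if __name__ == "__main__":
    print(matches_on_diagonal("ACGTAAAA", "TTACGTCC", L=4, k=2, d=2))
```

In this toy input, segment S_0 = "ACGT" matches T at a shift of two bases, so every offset j produces a hit for d = 2 and none for d = 0. Several hits sharing the same (i, i’) pair form the kind of local cluster of matching k-mers that the notes describe as evidence of similarity.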
