Stanford CS 374 - Lecture 4 - Mapping genomes onto each other -- Synteny detection

Mapping genomes onto each other -- Synteny detection
CS374 Fall 2004, Lecture 4, 10/07/04
Lecturer: Aswath Manohar
Scribe: Chirag Bhatt

Outline:
I Background and Introduction
    Why map genomes onto each other?
    How is this genomic mapping achieved?
II Motivation
    Why aren’t current methods good enough?
III The PASH Methodology
    3) Algorithmic Complexity of the Pash method
        The worst-case scenario…
        In practical scenarios…
    4) Significance of Similarities
IV Results and Conclusion

Based on the following paper:
1. Ken J. Kalafus, Andrew R. Jackson, and Aleksandar Milosavljevic, “Pash: Efficient Genome-Scale Sequence Anchoring by Positional Hashing”, Genome Research, 672-678, December 2003.

Additional Resources:
1. http://www.genboree.org; Interactive version of Figure 3 and the Virtual Genome Painting program.
2. http://www.hgsc.bcm.tmc.edu/projects/rat/; Rat Genome Project
3. http://www.ncbi.nih.gov/; National Center for Biotechnology Information home page.

I Background and Introduction

Why map genomes onto each other?

One expected benefit of genome sequencing is the identification of functional DNA elements through comparative methods. A comparison of the mouse and human genomes has revealed that approximately 5% of these genomes are under purifying selection (Waterston et al. 2002). Rat/human or rat/mouse genome comparisons yield similar statistics (Rat Genome Sequencing Consortium 2004), yet only about a third of this conserved sequence is accounted for by known genes, indicating that a large set of functional elements remains uncharacterized. Identification of functional elements by genome comparison depends heavily on the quality of sequence alignments.

How is this genomic mapping achieved?

Dynamic programming algorithms, such as the standard algorithms of Needleman and Wunsch (1970) and Smith and Waterman (1981), can be used to identify these similarities.
However, these methods are computationally very expensive. Faster, more recent algorithms such as LAGAN exist, but they perform well at the megabase scale, i.e., after some pre-processing has been done at the genome scale.

Even faster comparisons are achieved by the various “seed-and-extend” methods. In a seed-and-extend method, one or more exactly matching k-mers (“seeds” or “hot spots”) provide initial evidence of possible similarity. These seeds are then extended into sequence alignments.

The extension step is more accurate than the seeding step, but it is computationally expensive, so these methods quickly abandon most candidate similarities because they do not immediately yield alignments that are likely to be statistically significant.

II Motivation

Why aren’t current methods good enough?

The seed-and-extend approach has two main drawbacks. First, large-scale comparisons, such as genome-scale comparisons, are computationally expensive, which restricts such methods to labs that have access to large computing clusters. Second, current implementations of seed-and-extend methods provide few options to trade sensitivity for speed.

To address these limitations, the Positional Hashing method (Pash) was developed. Positional Hashing represents sequences as collections of short k-mers, rather than as individual bases, throughout the comparison process. Local clusters of matching k-mers are collated to identify sequence similarity.
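To make the notion of a k-mer seed concrete, here is a minimal Python sketch that indexes the k-mers of one sequence and reports every exact k-mer match with a second sequence. The function names (`kmer_index`, `find_seeds`) and the toy sequences are illustrative assumptions, not taken from Pash or from any seed-and-extend tool, and the sketch covers only seeding, not extension or clustering.

```python
from collections import defaultdict


def kmer_index(seq, k):
    """Map each k-mer in seq to the list of 0-based positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return index


def find_seeds(s, t, k):
    """Return sorted (pos_in_s, pos_in_t) pairs where the same k-mer starts in both."""
    index = kmer_index(s, k)
    seeds = []
    for pos_t in range(len(t) - k + 1):
        # .get avoids inserting empty entries for k-mers absent from s
        for pos_s in index.get(t[pos_t:pos_t + k], []):
            seeds.append((pos_s, pos_t))
    return sorted(seeds)


# Hypothetical toy input: two short sequences sharing the motif "TACGTA".
seeds = find_seeds("ACGTACGT", "TTACGTAA", k=4)
print(seeds)  # [(0, 2), (1, 3), (3, 1), (4, 2)]
```

Each reported pair is one "hot spot". A seed-and-extend method would next try to extend such pairs into alignments, whereas Positional Hashing keeps working with the k-mer matches themselves.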
Whereas other methods achieve parallelism by requiring users to divide the sequences into many subsequences and perform all pairwise comparisons between them (thus incurring a quadratic penalty), Positional Hashing achieves seamless parallelism in linear time by assigning computing nodes to compare subsets of diagonals.

III The PASH Methodology

1) Dividing the Comparison Problem Across Diagonals (refer to Fig. 1)

Consider the comparison matrix shown in Fig. 1A, where each dimension represents the entire genome of one of the species we want to compare. Any sequence similarity would occur along one of the diagonals of this comparison matrix. Hence, if we observe a match of a string of k consecutive base pairs (i.e., a k-mer) along any of the diagonals, we have a k-mer match. These matches are the “seeds” or “hot spots” referred to in the discussion of seed-and-extend methods.

In contrast to other sensitive comparison methods, Positional Hashing divides the comparison problem into the sub-problems of finding similarities within subsets of diagonals, each subset consisting of diagonals L base pairs apart (Fig. 1B). These sub-problems are each independently solvable on a separate node of a computer cluster. To further localize detection of similarities, diagonals are divided into diagonal segments, also of length L (Fig. 1C, dashed lines).

How is the comparison problem divided into subsets? (Fig. 1A => 1B)

The alignment diagonals that start at the same position modulo a fixed distance L (typically around 500 bp) are jointly referred to as a “diagonal” and are denoted by D(d), d = 0, …, L - 1.
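The grouping of alignment diagonals into the classes D(d) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: seeds are given as 0-based (position in S, position in T) pairs, the diagonal offset is taken as position in T minus position in S, and the helper names are hypothetical. Each resulting group corresponds to one independent sub-problem that could be handed to a separate compute node.

```python
def diagonal_class(pos_s, pos_t, L):
    """Class d of the diagonal through (pos_s, pos_t): its offset modulo L.

    Assumption: offset = pos_t - pos_s. Python's % always returns a value
    in 0..L-1, even when the offset is negative.
    """
    return (pos_t - pos_s) % L


def group_by_diagonal_class(seeds, L):
    """Partition seed matches into the L classes D(0), ..., D(L-1)."""
    groups = {d: [] for d in range(L)}
    for pos_s, pos_t in seeds:
        groups[diagonal_class(pos_s, pos_t, L)].append((pos_s, pos_t))
    return groups


# Hypothetical seeds and a toy L = 4 (the notes quote L of roughly 500 bp).
groups = group_by_diagonal_class([(0, 0), (5, 5), (0, 4), (2, 3), (7, 0)], L=4)
print(groups)
# {0: [(0, 0), (5, 5), (0, 4)], 1: [(2, 3), (7, 0)], 2: [], 3: []}
```

Note that (0, 0) and (0, 4) lie on different diagonals that are exactly L apart, so they fall into the same class D(0).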
The two compared sequences, say S and T, are conceptually divided into the following non-overlapping subsequences of length L:

    S_i  = S[i*L + 1, …, (i + 1)*L],    where i  = 0, …, |S|/L - 1
    T_i' = T[i'*L + 1, …, (i' + 1)*L],  where i' = 0, …, |T|/L - 1

How are the hash tables built? (Fig. 1B => 1C)

Positional hash tables H^(d)_{j, j+d}, where j = 0, …, L - k, which correspond to the diagonal D(d), contain the indices i and i' of k-mers S_i[j + 1, …, j + k] and T_i'[d + j + 1, …, d + j + k] for all i and i'. Identical k-mers are translated into the same hash key, and their corresponding indices are consequently collected in the same hash
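The segmentation into S_i and T_i' and the positional hash tables for one diagonal class d can be sketched as follows. This is a simplified illustration, not the Pash implementation: it uses 0-based indexing, drops any trailing partial segment, keys a single Python dictionary by (j, k-mer) instead of keeping one table per j, and, as a labeled assumption, only considers offsets j for which the T window d + j, …, d + j + k - 1 stays inside its segment. The function names are hypothetical.

```python
from collections import defaultdict


def segments(seq, L):
    """Split seq into non-overlapping length-L segments (partial tail dropped)."""
    return [seq[i * L:(i + 1) * L] for i in range(len(seq) // L)]


def positional_hash(s, t, L, k, d):
    """For diagonal class d, map (j, kmer) to (S segment indices, T segment indices).

    Only keys seen in both sequences are returned, i.e. k-mer matches between
    S_i at offset j and T_i' at offset d + j.
    """
    table = defaultdict(lambda: ([], []))
    s_segs, t_segs = segments(s, L), segments(t, L)
    # Assumption: require d + j + k <= L so the T window stays in its segment.
    for j in range(L - k - d + 1):
        for i, seg in enumerate(s_segs):
            table[(j, seg[j:j + k])][0].append(i)
        for i2, seg in enumerate(t_segs):
            table[(j, seg[d + j:d + j + k])][1].append(i2)
    return {key: hit for key, hit in table.items() if hit[0] and hit[1]}


# Hypothetical toy input: L = 4, k = 2, diagonal class d = 1.
hits = positional_hash("ACGTTTGA", "CCTTGACG", L=4, k=2, d=1)
print(hits)  # {(0, 'AC'): ([0], [1]), (1, 'CG'): ([0], [1])}
```

In this toy run, two adjacent k-mers of segment S_0 match segment T_1 at consecutive offsets along the same diagonal class, which is the kind of local cluster of matching k-mers that indicates sequence similarity.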

