Stanford CS 374 - Study Notes

BIOINFORMATICS Vol. 21 Suppl. 1 2005, pages i351–i358
doi:10.1093/bioinformatics/bti1018

De novo identification of repeat families in large genomes

Alkes L. Price, Neil C. Jones and Pavel A. Pevzner∗
Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA 92093-0114, USA

Received on January 15, 2005; accepted on March 27, 2005

Every time we compare two species that are closer to each other than either is to humans, we get nearly killed by unmasked repeats.
Webb Miller (Personal communication)

ABSTRACT

Motivation: De novo repeat family identification is a challenging algorithmic problem of great practical importance. As the number of genome sequencing projects increases, there is a pressing need to identify the repeat families present in large, newly sequenced genomes. We develop a new method for de novo identification of repeat families via extension of consensus seeds; our method enables a rigorous definition of repeat boundaries, a key issue in repeat analysis.

Results: Our RepeatScout algorithm is more sensitive and is orders of magnitude faster than RECON, the dominant tool for de novo repeat family identification in newly sequenced genomes. Using RepeatScout, we estimate that ∼2% of the human genome and 4% of mouse and rat genomes consist of previously unannotated repetitive sequence.

Availability: Source code is available for download at http://www-cse.ucsd.edu/groups/bioinformatics/software.html

Contact: [email protected]

INTRODUCTION

Repetitive DNA comprises a significant fraction of eukaryotic genomes, e.g.
∼20% of Caenorhabditis elegans and Caenorhabditis briggsae genomes (Stein et al., 2003) and ∼50% of the human genome (International Human Genome Consortium, 2001) have been identified as repetitive DNA. Repeat identification is a critical part of the analysis of a newly sequenced genome, both because repeats drive genome evolution in diverse ways (Kazazian, 2004) and because of a pragmatic need for thorough repeat masking prior to performing homology searches.¹ RepeatMasker (Smit and Green, http://repeatmasker.org) is an important and widely used tool for identifying and masking individual repeat elements given an existing library of repeat families such as Repbase Update (Jurka, 1998, 2000). However, RepeatMasker does not address the need to build such libraries for newly sequenced genomes; it, in fact, highlights this need. As the number of genome sequencing projects increases, there is a pressing need to identify the repeat families present in large, newly sequenced genomes.

∗ To whom correspondence should be addressed.
¹ See Bourque et al. (2004) and Gibbs et al. (2004) for a description of difficulties with the identification of synteny blocks and similarity anchors caused by inadequate repeat masking of the rat genome.

All existing algorithms for building a set of repeat families start with a set of pairwise similarities, such as a set of pairwise alignments generated by WU-BLAST (Gish, http://blast.wustl.edu) or REPuter (Kurtz and Schleiermacher, 1999; Kurtz et al., 2000). The early single linkage clustering approach (Agarwal and States, 1994) first merges overlapping substrings appearing in the set of pairwise similarities, then uses the pairwise similarities to group the merged substrings into repeat families. The single linkage clustering approach has been significantly extended and improved in two recent algorithms, RepeatFinder (Volfovsky et al., 2001) and RECON (Bao and Eddy, 2002).
A different approach to gluing pairwise similarities into repeat families, which captures the mosaic subrepeat structure exhibited by some repeat families, is implemented in the RepeatGluer algorithm (Pevzner et al., 2004). Also of interest is the PILER algorithm (Edgar and Myers, 2005), which achieves high specificity in distinguishing different types of repeats, at the sacrifice of some sensitivity.

Using a set of pairwise similarities as the starting point for building a set of repeat families has two disadvantages. First, as the authors of RECON state, ‘difficulty in defining element boundaries causes ... problems in clustering related elements into families’ (Bao and Eddy, 2002). Second, for large repeat-rich genomes, producing a set of pairwise similarities can be a prohibitively computationally intensive task. For example, Alu repeats, with >10^6 copies in the human genome, give rise to ∼10^12 pairwise alignments, making explicit construction of a set of pairwise similarities computationally infeasible. One way around this problem is to iteratively run on a small sample of the genome and then analyze progressively larger samples after masking the repeat families already identified (Bao and Eddy, 2002). This has shortcomings as well: repeat families constructed from a small sample of the genome will be less accurate (particularly with respect to boundaries), and mosaic subrepeats which have already been masked will be excluded from appearing in other repeat families, leading to fragmentation. Emphasizing these difficulties, Bao and Eddy conclude that ‘the problem of automated repeat sequence family classification is inherently messy and ill-defined and does not appear to be amenable to a clean algorithmic attack.’

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

In this paper, we describe a surprisingly simple and fast method which addresses these problems.
Our RepeatScout algorithm builds a set of repeat families by using high-frequency l-mers (i.e. short substrings of length l) as seeds, and greedily extends each seed to a progressively longer consensus sequence, following the dynamically inferred alignments between the consensus sequence and its occurrences in the genome. The straightforward, transparent nature of our approach is a significant advantage over other repeat family identification algorithms. In addition, our approach utilizes an efficient method of similarity search and enables a rigorous definition of repeat boundaries.

In the past few years, the RECON algorithm (Bao and Eddy, 2002) has become the dominant tool for de novo repeat family identification in newly sequenced genomes. For example, RECON has been used to construct a library of C. briggsae repeat families (Stein et al., 2003), making this an ideal test bed. A library of C. briggsae repeat families can be evaluated by analyzing the set of repeat elements
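
The seed-selection idea described in the introduction (using high-frequency l-mers as seeds) can be illustrated with a short Python sketch. This is not the RepeatScout implementation: the function names and the `min_count` threshold are hypothetical, the sketch ignores reverse-complement strands, and it does not attempt the greedy consensus-extension step at all. It only shows how l-mers might be counted and ranked by frequency.

```python
from collections import Counter


def count_lmers(genome: str, l: int) -> Counter:
    """Count every length-l substring (l-mer) of the sequence."""
    return Counter(genome[i:i + l] for i in range(len(genome) - l + 1))


def high_frequency_seeds(genome: str, l: int, min_count: int) -> list[tuple[str, int]]:
    """Return l-mers seen at least min_count times, most frequent first.

    min_count is an illustrative cutoff, not a value taken from the paper.
    """
    counts = count_lmers(genome, l)
    seeds = [(lmer, n) for lmer, n in counts.items() if n >= min_count]
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(seeds, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    toy_genome = "ACGTACGTTTACGT"
    for lmer, n in high_frequency_seeds(toy_genome, l=4, min_count=2):
        print(lmer, n)
```

On a real genome a plain dictionary of all l-mers may be too memory-hungry, so a production tool would likely use a more compact counting structure; the sketch favors clarity over scale.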

