Stanford CS 262 - CS262 Lecture 11 – Cont’d Fragment Assembly

School name Stanford University

Course Cs 262- Computational Genomics


Scribed by: Jasmyn Pangilinan
CS262: Lecture 11 – Cont’d Fragment Assembly
February 13, 2007

I. Review of Terminology

insert – a fragment of the target DNA sequence that has been incorporated into a circular genome (a vector) so that it can be replicated, cloned, and then sequenced with gel electrophoresis. Sequencing proceeds from the restriction site that marks the boundary where the insert was placed inside the vector towards the middle of the insert.

double-barreled sequencing – sequencing both the forward and the reverse strand of an insert, which yields two reads called forward and reverse mate pairs (or forward and reverse linked reads). The two reads lie a known distance apart in the target sequence.

vector – a circular genome that hosts the fragment.

BAC – bacterial artificial chromosome, a kind of vector that can incorporate fragments of 70 to 300 kb (100-200 kb on average).

read – a word that comes out of a sequencing machine, 500-900 base pairs long on average. Typically there are two reads per clone, sequenced at either end of the insert, a known distance apart, give or take a standard deviation of ~20%.

coverage – the average number of reads (or inserts) that cover a position in the target sequence. There is a distinction between sequence and physical coverage: sequence coverage is the average number of reads that cover a genomic position, whereas physical coverage is the average number of inserts whose span, from one end read to the other, contains the position.

shotgun sequencing – the process of obtaining many reads from random locations in a target DNA sequence, detecting overlaps between them, and assembling them with algorithms.

II. Whole Genome Shotgun Sequencing

Whole genome shotgun (WGS) sequencing is the process of breaking a target genomic region into many fragments and incorporating them into vectors to make clone inserts. Both ends of each fragment are then sequenced; the two resulting reads are called mate pair reads.
Typically a ~500 base pair (bp) read is sequenced from each end of the insert, with a known distance between the two reads that depends on the insert size (plasmids are 2-10 kb, while cosmids and fosmids are 40 kb). This gives us a large number of mate pairs, which we assemble with the algorithms described today.

[Figure: the genome is cut many times at random; the fragments are cloned into plasmids (2-10 kbp) or cosmids (40 kbp); the two ~500 bp end reads of each insert lie a known distance apart.]

III. Fragment Assembly

How can we put an assembly together despite problems such as errors in the reads? Sequencing errors occur at a rate of ~1%, depending on the exact technology and on the position within the read (the ends of a read tend to have more errors than the middle). However, the most difficult problem in assembly is repeats.

How can we put reads together and reconstruct the original sequence? Two reads may align well with each other, but a good alignment by itself is not enough to conclude that they came from the same region of the genome. We need to determine whether an alignment between two reads is due to a true overlap or to a repeat.

Sequence assembly is a difficult problem. To scale assembly to the whole-genome level we want linear-time algorithms, so every step that touches all of the reads has to run in roughly linear time. Otherwise the problem is too large for any computer today.

Overview of the main algorithm for assembling a genome:

1. Overlap detection – find pairs of reads that overlap.

2. Merge Reads into Contigs – merge some pairs of reads into contigs (short for contiguous sequences) while avoiding mistakes due to repeats or errors in the reads.

3. Link Contigs into Supercontigs – create larger structures by linking contigs into supercontigs.

Some general details: Contigs tend to extend up to the boundaries of repeats; as soon as we see a repeat, we stop. Supercontigs are ordered lists of contigs that span across repeats.
Contigs that flank a repeat are put together using read pair information. A supercontig gives us an effective multiple alignment of the reads within it: the reads are ordered and oriented with respect to one another, so we can arrange them into multiply-aligned columns.

4. Consensus – the consensus sequence is derived from the multiple alignment within the supercontigs. Each base is derived using weighted voting. (Alternative: take the maximum-quality letter.)

More details about the algorithms for each of the above steps:

1. Find Overlapping Reads

A. k-mers and Repeats

Example: a mammalian genome at 10x coverage requires about 60 million reads with an average read length of 500 bp (3 x 10^9 bp x 10 / 500 bp = 6 x 10^7 reads). We need an algorithm that finds every pair of overlapping reads without spending computation quadratic in the number of reads.

The idea is to use a BLAST-like approach and find k-mers shared between pairs of reads. For almost perfect alignments we could choose a k-mer length of 50 bases. However, a k-mer size of 20-25 bp is typically preferred in order to accommodate sequencing errors.

Start by listing all words of a given constant length (the k-mer size) that occur within each read and its reverse complement. We include the reverse complement because we do not know which strand of the DNA the read was taken from (we need to order and orient the reads with respect to one another).

[Figure: three example reads, each followed by the words of length 9 that it contains]

aaactgcagtacggatct: aaactgcag aactgcagt … gtacggatc tacggatct
gggcccaaactgcagtac: gggcccaaa ggcccaaac … actgcagta ctgcagtac
gtacggatctactacaca: gtacggatc tacggatct … ctactacac tactacaca

Example: a 524 bp read generates about 1000 (word, position) occurrences. With k = 25, for instance, there are 524 - 25 + 1 = 500 words on each strand.

Create a table indexed by read, position in the read, word, and orientation. Then order the entries lexicographically by word, then by the read the word occurs in, the orientation, and the position. Because the words have constant length, a linear-time sort such as radix sort applies. A comparison sort such as the STL's sort also works, but it takes O(n log n) time rather than linear time.
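The word table and its sort can be sketched in Python as follows. This is a minimal illustration, not the lecture's implementation: the names `reverse_complement` and `kmer_table`, the "+"/"-" orientation labels, and the choice to record reverse-strand positions in reverse-complement coordinates are all assumptions. Python's built-in sort is a comparison sort, so it stands in for the linear-time radix sort mentioned above.

```python
from typing import List, Tuple

# Translation table for complementing lowercase DNA letters.
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a lowercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def kmer_table(reads: List[str], k: int) -> List[Tuple[str, int, str, int]]:
    """List every k-mer of every read and of its reverse complement.

    Each entry is (word, read_id, orientation, position), so that sorting
    the tuples orders them by word, then read, then orientation, then
    position. Positions on the "-" strand are offsets into the reverse
    complement (an assumption made for this sketch).
    """
    table = []
    for read_id, read in enumerate(reads):
        for orient, seq in (("+", read), ("-", reverse_complement(read))):
            for pos in range(len(seq) - k + 1):
                table.append((seq[pos:pos + k], read_id, orient, pos))
    # Comparison sort, O(n log n); a radix sort on the fixed-length words
    # would bring this step down to linear time.
    table.sort()
    return table


# The three example reads from the notes, with words of length 9.
example_reads = [
    "aaactgcagtacggatct",
    "gggcccaaactgcagtac",
    "gtacggatctactacaca",
]
example_table = kmer_table(example_reads, 9)
```

Each 18-letter read contributes 18 - 9 + 1 = 10 words per strand, so `example_table` holds 60 entries, and entries with the same word end up adjacent after the sort.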
Now we can find all pairs of reads that share a word, which is easy once the table is sorted. However, this cannot be done in one step because it would require too much memory. This …

[Figure: the word table before and after sorting]

(read, pos., word, orient.)
aaactgcag aactgcagt actgcagta … gtacggatc tacggatct
gggcccaaa ggcccaaac gcccaaact … actgcagta ctgcagtac
gtacggatc tacggatct acggatcta … ctactacac tactacaca

(word, read, orient., pos.)
aaactgcag aactgcagt …
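Finding the pairs of reads that share a word from a sorted table might look like the sketch below. It is an illustration under stated assumptions rather than the method from the lecture: the function name `candidate_pairs` and the `max_reads_per_word` cutoff are hypothetical, and the whole table is held in memory, so the sketch does not address the memory limit raised above.

```python
from collections import Counter
from itertools import combinations, groupby
from operator import itemgetter


def candidate_pairs(sorted_table, max_reads_per_word=100):
    """Count shared words for every pair of reads.

    sorted_table holds (word, read_id, orientation, position) tuples that
    are already sorted by word, so equal words are adjacent and can be
    grouped in a single pass.
    """
    shared = Counter()
    for _word, rows in groupby(sorted_table, key=itemgetter(0)):
        read_ids = sorted({row[1] for row in rows})
        # Hypothetical safeguard (not from the notes): a word found in very
        # many reads would generate a quadratic number of pairs, so skip it.
        if len(read_ids) > max_reads_per_word:
            continue
        for pair in combinations(read_ids, 2):
            shared[pair] += 1
    return shared


# A few forward-strand entries taken from the three example reads.
example_table = [
    ("actgcagta", 0, "+", 2),
    ("actgcagta", 1, "+", 8),
    ("gggcccaaa", 1, "+", 0),
    ("gtacggatc", 0, "+", 8),
    ("gtacggatc", 2, "+", 0),
    ("tacggatct", 0, "+", 9),
    ("tacggatct", 2, "+", 1),
]
print(dict(candidate_pairs(example_table)))  # {(0, 1): 1, (0, 2): 2}
```

A shared word only makes two reads candidates for overlap. As the notes stress, the match may still come from a repeat rather than a true overlap.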
