Stanford CS 262 - Lecture 10 - Computational Genomics

School: Stanford University

Course: CS 262 - Computational Genomics

CS262 Computational Genomics (Lecturer: Serafim Batzoglou)
Scribe: Satish Viswanatham
Lecture 10: DNA Sequencing and Fragment Assembly
February 8, 2007

The aim of DNA sequencing is to discover the nucleotide sequence of a given DNA. Because of its wide research implications for disease and biological discovery in general, there have been both government-funded and private efforts (Celera Genomics) to push forward the sequencing of the human genome. Using the shotgun approach, Celera Genomics completed a draft of the human genetic code in 2000. For whole-genome sequencing, the currently preferred protocol is Shotgun Sequencing, advocated by Gene Myers.

Gel Electrophoresis

Gel electrophoresis is the main technology used to perform DNA sequencing today. The protocol was invented by Fred Sanger in 1975, who received a Nobel Prize for this work. DNA sequencing was initially very expensive and laborious, but the process has since become much cheaper and more efficient.

The gel electrophoresis machine can sequence a read of 500-1000 nucleotides from the ends of a DNA fragment. Reads longer than 1000 nucleotides have too much noise in the signal. In 1990, the cost of sequencing was 30 dollars per nucleotide. In 2003, the cost was 1 cent per nucleotide. The trend over the years shows that the cost is decreasing at an exponential rate. However, gel electrophoresis is likely to be replaced by other methods by the end of the decade.

DNA must be excised into fragments of random length using restriction enzymes (REs). An RE-digested fragment can then be inserted into a vector (cut with the same set of restriction enzymes) to form a clone. The clones can be replicated by the cellular machinery to give rise to many copies of different lengths 1, 2, ..., thus forming a library of clones.

A vector is basically a circular piece of DNA that can be replicated.
Examples of vectors are: plasmid (2,000 to 10,000 bp), cosmid (40,000 bp), BAC (bacterial artificial chromosome, 70,000 to 300,000 bp), and YAC (yeast artificial chromosome, > 300,000 bp). The vector contains restriction sites that allow a DNA fragment to be inserted in a site-specific manner. After the fragment is inserted, it is replicated as part of the vector by the cellular machinery. During sequencing, chain extension stops when a modified nucleotide (a dideoxynucleoside) is incorporated in place of a normal nucleotide. Also, we know exactly where in the vector the fragment has been incorporated, because restriction sites flank the fragment. Many types of vectors can be used, depending on the size of the insert. The length of the actual fragment insert can be controlled to +/- 20%. This procedure may have an error rate of about 1%, with the occasional insertion or deletion of a nucleotide.

Procedure

1. Start with a DNA primer that anneals to (is complementary to) the restriction site on the vector.
2. Grow the DNA chain starting from the primer.
3. The nucleosides that are incorporated include normal a, c, g, and t as well as modified a, c, g, and t. The modified nucleosides serve as reaction-terminating points once they are incorporated into the growing chain.
4. If we perform the sequencing procedure with sufficiently many copies of the clone, the reaction has a high likelihood of stopping at every possible position in the sequence, giving products of length 1, 2, 3, 4, 5, ..., n (where n = length of the read).
5. Separate the products using gel electrophoresis (the smaller the fragment, the faster it migrates when a current is applied).
6. By looking at which color-labeled dideoxynucleoside occupies the last position for each length, we can infer the complementary letter in the original DNA sequence at that position and piece together the complete DNA sequence.

Data Analysis – PHRED et al.

The readout of gel electrophoresis is marked by 4 different colors that represent the four different terminating dideoxynucleosides.
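The chain-termination procedure and its labeled readout can be illustrated with a small simulation. This is only a sketch under simplifying assumptions that are not in the notes: strand orientation is ignored, every position has the same termination probability `p_stop`, and there is no signal noise. The names `sanger_products`, `read_gel`, and `p_stop` are hypothetical.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def sanger_products(template, copies=20000, p_stop=0.05, seed=0):
    """Simulate chain termination on many copies of one template.

    Returns {product_length: labeled terminating base}.
    Assumption: each added base is a terminator with probability p_stop.
    """
    rng = random.Random(seed)
    products = {}
    for _ in range(copies):
        for pos, base in enumerate(template, start=1):
            # The growing chain adds the complement of the template base.
            if rng.random() < p_stop:
                # A modified base was incorporated: the chain stops here.
                products[pos] = COMPLEMENT[base]
                break
    return products

def read_gel(products):
    """Order products by length (shortest migrates fastest), read the
    terminal labels, and complement them to recover the template."""
    synthesized = "".join(products[n] for n in sorted(products))
    return "".join(COMPLEMENT[b] for b in synthesized)

template = "ACGTTGCAAGCT"
print(read_gel(sanger_products(template)))
```

With enough copies, every product length from 1 to n is represented, so reading the terminal labels in order of length recovers the template, as in steps 4 to 6 of the procedure.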
In the readout, peaks reflect the intensity of the readout (how many products there are for each length). However, DNA sequencing is a noisy process that may give rise to ambiguous or hard-to-interpret peak readouts. In addition, it is challenging to obtain long reads. PHRED (Phil’s Read Editor), designed by Phil Green, is a standard signal-processing program that translates the readout into an actual DNA sequence.

PHRED filters, smooths, and corrects for length compressions. It works similarly to a dynamic-programming-based HMM parsing of the read: what is the most likely state (letter) for each position? Today there are better programs than PHRED, but labs usually use PHRED because it is regarded as the standard. A read is typically 500-1000 nt, and a quality score (a lower bound on the accuracy) is assigned to each read.

Quality Score = $-10 \log_{10} \Pr(\mathrm{Error})$

A quality score of 30 translates to a 1/1000 chance of error. A quality score of 40 is the maximum meaningful quality score you can assign to a read.

An important method that dramatically changed the way we do sequencing is Double-Barreled Sequencing, which was invented around 1990. This method reads both the right and left ends of a fragment, producing a pair of linked reads that came from the same fragment.

Shotgun Sequencing

1. Take a DNA segment of the genome.
2. Excise it into fragments at random positions.
3. Get 2 reads (500 bp) for each fragment, one from each end.
4. Cover each region with approximately 7x redundancy.
5. Match overlaps of reads (an overlap of 50 letters or more is very unlikely to occur by chance).
6. Extend the reads into longer and longer sequences.

Definition of Coverage

L = length of genomic segment
N = number of reads
l = length of each read
C = coverage

$C = N l / L$

With a coverage of 1, many reads duplicate one another and many gaps can remain in the genome being sequenced. The same goes for coverage 2 and 3. Reads are distributed randomly, and a good coverage is typically >= 7x.
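The quality-score and coverage formulas above are easy to check numerically. The sketch below uses hypothetical function names. The last two functions rely on an assumption the notes do not state: read start positions follow a Poisson model, with no minimum overlap required. They are rough estimates rather than exact results.

```python
import math

def phred_quality(p_error):
    """Quality score Q = -10 * log10(Prob(Error))."""
    return -10.0 * math.log10(p_error)

def error_probability(q):
    """Inverse of phred_quality: Prob(Error) = 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)

def coverage(n_reads, read_len, genome_len):
    """Coverage C = N * l / L."""
    return n_reads * read_len / genome_len

def reads_needed(target_cov, read_len, genome_len):
    """Smallest N such that N * l / L >= target coverage."""
    return math.ceil(target_cov * genome_len / read_len)

def uncovered_fraction(c):
    """Assumed Poisson model: a base is missed by all reads with
    probability about e^(-C)."""
    return math.exp(-c)

def expected_gap_spacing(c, read_len):
    """Assumed estimate: about N * e^(-C) gaps over L bases, so
    roughly one gap every l * e^C / C bases."""
    return read_len * math.exp(c) / c

print(phred_quality(0.001), error_probability(40))
print(reads_needed(7, 500, 100_000), uncovered_fraction(1))
print(expected_gap_spacing(10, 500))
```

Under this model, a coverage of 1 leaves roughly a third of the bases uncovered, which is consistent with the remark that low coverage leaves many gaps.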
According to the Lander-Waterman model, if the read distribution is uniform, a coverage of 10x will give you a gapped region every 1,000,000 nucleotides (reads are usually 500 nucleotides long).

Overlapping reads may differ from each other by up to 5%. Such aligned reads are assembled into larger regions. The most popular tool for this purpose was PHRAP; better tools (not quadratic time) are available now.

Sequence assembly is difficult because of repeats, which are numerous in higher organisms. This is because higher organisms have more energy resources to spend on DNA replication without feeling the cost, so the incentive to keep their genomes concise is low. 50% of the human DNA is
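The overlap-matching step of assembly described in this section (overlaps of 50 letters or more, with up to 5% difference) can be sketched as a suffix-prefix comparison between two reads. This is an illustrative simplification, not how PHRAP works: it counts substitutions only, ignores insertions and deletions, and checks every candidate length. The names `overlap_length` and `merge_reads` are hypothetical.

```python
import random

def overlap_length(left, right, min_len=50, max_diff=0.05):
    """Longest suffix of `left` matching a prefix of `right`.

    A candidate overlap of k letters is accepted when the fraction of
    mismatching positions is at most max_diff (substitutions only).
    Returns 0 if no overlap of at least min_len letters qualifies.
    """
    longest = min(len(left), len(right))
    for k in range(longest, min_len - 1, -1):
        suffix, prefix = left[-k:], right[:k]
        mismatches = sum(a != b for a, b in zip(suffix, prefix))
        if mismatches <= max_diff * k:
            return k
    return 0

def merge_reads(left, right, k):
    """Join two reads that overlap by k letters into one longer sequence."""
    return left + right[k:]

# Toy example: two reads drawn from a random sequence, sharing 60 letters.
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(200))
left, right = genome[:120], genome[60:180]
k = overlap_length(left, right)
print(k, merge_reads(left, right, k) == genome[:180])
```

Comparing every pair of reads this way takes time quadratic in the number of reads, which is the cost the faster tools mentioned in this section avoid.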
