DOC PREVIEW
Berkeley COMPSCI 294 - Lecture Notes

This preview shows page 1-2-3 out of 8 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 8 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 8 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 8 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 8 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

CS 294-8 Computational Biology for Computer Scientists Spring 2003 Lecture 6: February 6, 2003 Lecturer: Gene Myers Scribe: Bill Kramer Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only the permission of the Instructor. 6.1 Basic sequencing problem Let bi and ei be the beginning and end of a sequence of nucleotides. Then the basic sequencing problem can be described as Given ? > 0 and R = ??Riir1?, find the shortest string and pairs ?? ebRiii1,?such that for every i if bi < ei then C[bi, ei] ?? ri else C[ei, bi] ?? ri Where C[,] is the shortest string such that ?reads ri ? (bi, ei) such that C[bi, ei] = ri.. C[bi, ei] ?? ri indicates C approximately matches ri to some value. In order to deal with this problem, two Watson-Crick complementary strands have to be accounted for, one left to right and the other right o left since the orientation of the strands is not known. Figure 6.1 shows this. However, matching under the shortest sequence criteria creates a problem if the sequence contains repeats of perfectly matching segments, as described in Lecture 5.4. The signature of such a repeat will be the fact that the apparent coverage a genome G in the reads of the repeat segment are greater than the rest of the G. The Kalmogorov-Smirnov test statistic measures this deviation – yet it looks only at the maximum deviation from the distribution.G bi ei bi ei Left to right Right to Left Figure 6.1: The two possible orientations of a sequence beginning at bi and ending at ei The actual challenge is the Kalmogorov-Smirnov is looking only for local deviations, which in a long sequence look like noise. Thus, the shortest sequence approach does not yield very accurate sequences. 1 0 G0 Figure 6.2: A graph of cumulative distribution of the Kalmogorov-Smirnov statistic shown in the blue solid line.. The idela cumulative distribution is shown as the dashed line. 6.2 The Overlap, Layout and Consensus Paradigm A better approach is to look for overlaps first, then produce a layout of the sequence and finally try to reach an agreement that the sequence is a reasonable representation. 6.2.1 Overlap For the overlap stage, it is necessary to compare every read against every other read, looking for significant overlap. Significant overlap can be defined as the two reads having matching nucleotides at their suffix or prefix ends for some N positions, where N is an arbitrary threshold.There are eight possible situations, based on the orientation of sequences. Say A and B are two sequences. We do not know their orientation – they may be 3’ to 5’ or 5’ to 3’. Figure 6.3 shows four of these situations, and the other four come from reversing the direction of A and B. A “Innie” B A B “Outie” A B A B Figure 6.3: Possible combinations of two overlapping reads A and B. A mirror image of these orientations are possible if the orientation of A and B are reversed. Thus for an overlap, it is necessary to compare A to B and A to B complement. From the comparisons, it is possible to create a directed graph where every read is represented as a vertex and every overlap as an edge. If there is overlap (above a threshold value) between read A and B, then an edge is put into the graph between vertices A and B. The graph will be bi-directed, meaning that each edge will have two direction arrows – one at each end of the edge. The arrows will either point into or out of a vertex. An arrow on vertex A will go away from the vertex if it corresponds to the overlap occurring at the suffix of A. It will point to the vertex if the overlay is at the prefix of A. Reads are always considered to be from 5’ to 3’, but it is not possible to tell which complement a read comes from. Hence the outdegree of a vertex is defined as the suffix matches and the indegree as the prefix matches. Proper matches are those for which there is no complete overlap – that is where neither sequence completely contains the other sequence. Containments where one read is entirely overlapped by another, is usually handled as a special case and is typically indicated with double line edges. However, for the purposes here they are ignored. The orientation of the arrows in the graphs entirely relative. Figure 6.4 shows the graph of two reads for the proper matches. Figure 6.5 shows the graph that corresponds to a possible match of four reads.A “Innie” B A B “Outie” A B A A A B B B A B A B Figure 6.4: Overlap graphs for possible configurations of overlapping reads. Read Overlaps C A D B A B C D Overlap Graph Figure 6.5: A bi-directed overlap graph for four reads with the indicated overlap All to all matching in this manner quickly gets to be very large. For the Human genome, there were typically 40 million reads, each with 30 Megabases. This then requires 16 x 1020 compares. 6.2.2 Layout Using the transitive property, is it possible to create a layout of the reads from the overlap graph. A layout is a spanning forest and is the minimal representation of a spanning tree. Any spanning forest of the overlap graph produces a possible layout of the reads. But there can be conflicts and uncertainties in the arrangement from a spanning tree. The question is how to provide a weight to the edge in order to get an consistent, optimum layout. Is it possible get a weight that corresponds to the length of the overlap? In the graph, paths of the proper edges, with possibly trees of contained edges included guarantees a consistent layout. This eliminates conflicts.C A H B I D G E F Figure 6.6: A possible overlap of reads and the corresponding spanning forest. A B C D E F G H I Figure 6.7: A graph of the proper edges and containments of the above forest. The next step to be considered is approximate matching. If A approximately matches B and B approximately matches C, does this imply A approximately matches C? No – because approximate matching is not transitive. If you have a string of length N, it is almost certainly the case that a string of 2 log N will exist by change as a substring. The minimum information in the layout graph has to do with alignment of strings, but also the length of the overlap. If you weigh every proper edge ri with the length of the overlap for the reads Li, then it is possible to create a metric or


View Full Document

Berkeley COMPSCI 294 - Lecture Notes

Documents in this Course
"Woo" MAC

"Woo" MAC

11 pages

Pangaea

Pangaea

14 pages

Load more
Download Lecture Notes
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Lecture Notes and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Lecture Notes 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?