CS 2427 Algorithms in Molecular Biology
Lecture 2, 13 January 2006

Lecturer: Michael Brudno
Scribe Notes by: Alex Hertel

1 Today's Topics

- Sequencing The Human Genome
- BAC by BAC Technique
- Celera's Technique
- The Role of Graph Theory

2 Sequencing The Human Genome

The sequencing of the human genome, also known as the Human Genome Project, is one of the most ambitious and impressive research projects ever attempted. It was completed several times, lastly in 2003, though it may be completed again.

Our genome consists of approximately 3 × 10^9 base pairs (b.p.) and is divided into 23 pairs of chromosomes. Each human chromosome is linear; that is to say, each chromosome has a distinct start and end point. This is in contrast with the chromosomes of bacteria, which are circular. Also, bacteria typically have only one chromosome.

DNA is a massive molecule which is not only coiled in the well-known double helix structure, but this structure is further folded over and folded upon itself. A testament to how compressed it is within the nucleus of a cell is the fact that if it were not wrapped, our genome would be about 4 meters long, but instead it fits in a micron-sized cell nucleus.

3 BAC by BAC Technique

Since the human genome is so long, it clearly cannot be sequenced manually; the entire process must be highly automated for us to stand any chance of success. Unfortunately, our automated methods for taking a strip of DNA and sequencing it can only reliably handle strands that contain about 500 b.p. In other words, our methods cannot come even remotely close to sequencing one whole chromosome, let alone the entire genome. We are therefore forced to cut up the DNA into smaller pieces and sequence them.

The standard technique for doing this was previously the BAC by BAC method. BAC stands for Bacterial Artificial Chromosome. This name comes from the fact that during this method we use E. coli bacteria to perfectly duplicate strands of human DNA by inserting those strands into the circular E. coli genome and letting it divide. The BAC by BAC technique works as follows:

1. Make many copies of the human genome to be sequenced.

2. Take a number of copies of the human genome and split them up into much smaller strands of DNA, each consisting of about 150,000 b.p. Each of these strands is called a BAC. The overall idea is that we will sequence the BACs and then use the fact that they overlap to put all of those sequences together in the obvious way, thereby obtaining a sequencing of the entire human genome. We therefore must make sure not only that all of the BACs together contain the entire human genome, but also that the overlaps are sufficiently large for us to be able to reliably put them together again. Figure 1 below shows an example of how the genome can be tiled using many BACs.

[Figure 1: Many overlapping BAC strands (150,000 b.p. each) compared to the entire human genome (3 × 10^9 b.p.).]

3. Once we have the BACs, we have to find out where they came from in the genome and then choose the smallest set of BACs that completely covers the entire genome. This process is very expensive in human time.

4. The next step is to sequence all of these BACs. It is easy to see that if we correctly sequence the BACs, then that will give us a correct sequencing of the entire human genome. Sequencing the BACs is done analogously to sequencing the genome: we take each BAC, make copies of it, and then randomly cut them up into small pieces which are approximately 4,000 b.p. long. We take these and use our automated sequencing technology to sequence approximately 500 b.p. at each end, as shown below in Figure 2. Each of these 500 b.p. sequences is called a read, and it tells us not only what is on one strand of the DNA but, by complementary base pairing, immediately gives us the other strand as well. For the smaller 4,000 b.p. pieces, the gap between the two reads is 3,000 b.p. long (4,000 b.p. minus two 500 b.p. reads). It is also normal to use longer 40,000 b.p. pieces with a much larger gap. One important note is that each read is done from the 5' to the 3' end, so the two reads from each fragment are from opposite strands and opposite ends of the fragment. With enough overlapping reads, it becomes possible to put them together in order to sequence the BAC.

[Figure 2: A 4,000 b.p. fragment with a 500 b.p. read at each end and a 3,000 b.p. gap between them. Each DNA fragment consists of two strands, each of which is read from the opposite direction.]

4 Celera's Technique

Celera Genomics, a private company, and its founder Craig Venter developed a technique which can be automated to a greater degree than the traditional method. Celera's idea was to cut out the intermediate step of having to sequence the BACs. Instead, they split the original DNA sequence into 40,000 and 4,000 b.p. segments directly (these are called cosmids and plasmids, respectively), read them, and then assembled the reads for the whole genome.

Since the splitting of the genome into the shorter pieces is relatively random, ensuring that the entire genome gets covered requires sequencing a number of base pairs approximately ten times the size of the genome, or 3 × 10^10 b.p. At about 500 b.p. per read, this translates to approximately 6 × 10^7 reads. This entire process is called whole genome shotgun sequencing, since the process of fragmenting DNA strands is reminiscent of the tiny shot pellets coming out of a shotgun.

With a sufficient number of reads, it is not hard to see that the entire problem of sequencing the human genome has been reduced to the combinatorial problem of amalgamating semi-overlapping strings correctly.

5 The Role of Graph Theory

It turns out that the problem of putting all of the reads back together, to form either a BAC or the human genome itself, can be expressed as a problem in graph theory. Simply take every individual read and create a node corresponding to its sequence. For any two nodes (reads) that have an overlapping segment, create a directed edge from the first node to the second one. It is not hard to see that a Hamiltonian Path (a simple path containing each node exactly once) in the graph will yield a correct amalgamation of the genome fragments, and therefore give us a correct sequencing of the BAC or genome. We will call graphs with reads as nodes and overlaps as edges string graphs.

Bidirected Graphs

There is a problem, however: although each of the two reads from a fragment is done
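
As an illustration of the string-graph construction from Section 5, here is a minimal Python sketch. It is not the method used in the lecture or by any real assembler. It assumes error-free reads that all come from the same strand (so it ignores the strand issue raised under Bidirected Graphs), and it takes an overlap to mean that a suffix of one read equals a prefix of another. It also searches for a Hamiltonian path by brute force, which is only feasible for a handful of reads. The function names, the min_len threshold, and the toy reads are all hypothetical.

```python
from itertools import permutations


def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no such overlap of at least min_len characters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def build_overlap_graph(reads, min_len=3):
    """Nodes are read indices; graph[i][j] = overlap length for edge i -> j."""
    graph = {i: {} for i in range(len(reads))}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = overlap(a, b, min_len)
                if k:
                    graph[i][j] = k
    return graph


def hamiltonian_path(graph):
    """Brute-force search for a path visiting every node exactly once.

    Tries every ordering of the nodes, so it is exponential in the
    number of reads and only meant for toy inputs.
    """
    for order in permutations(graph):
        if all(nxt in graph[cur] for cur, nxt in zip(order, order[1:])):
            return list(order)
    return None


def assemble(reads, min_len=3):
    """Merge reads along a Hamiltonian path of the overlap graph."""
    graph = build_overlap_graph(reads, min_len)
    path = hamiltonian_path(graph)
    if path is None:
        return None
    sequence = reads[path[0]]
    for cur, nxt in zip(path, path[1:]):
        # Append only the part of the next read that extends past the overlap.
        sequence += reads[nxt][graph[cur][nxt]:]
    return sequence


# Hypothetical toy reads, given out of order.
toy_reads = ["TACCTTA", "ATGGCGT", "GCGTACC"]
print(assemble(toy_reads))
```

With these three toy reads the graph has only the edges ATGGCGT -> GCGTACC and GCGTACC -> TACCTTA, so there is a single Hamiltonian path. Real read sets contain repeats and sequencing errors, which give many competing edges and make the choice of path far less clear-cut.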


Toronto CSC 2427 - Lecture 2 Notes
