4/29 - Lecture 8: DNA Sequencing

Quick aside on upcoming topics: sequencing; gene recognition (applying HMMs); large-scale genomic alignment (human, mouse); multiple alignment; protein alignment across species (many short sequences); DNA microarrays; finding regulatory motifs; and, time permitting, phylogeny & rearrangements and RNA structural folding. Exciting stuff!

Organisms are characterized by their genomes: a specific long sequence of 4 nucleotides, A, C, G, T. This sequence captures all the information, in genes, needed for the organism to reproduce, develop, and perform all biological functions. In each cell, there is a full copy of the genome. Different genes are “expressed,” i.e. “turned on,” in different types of cells. Genes are transcribed and then translated into proteins.

Challenge: find the exact sequence of nucleotides in a given organism, e.g. human. This is challenging both for technology and for computational methods.

Frequent question: which human was sequenced? Two answers:
1. Craig Venter, former CEO of Celera. “Anyone who doesn’t want his genome sequenced shouldn’t be in this business.” (paraphrased)
2. It doesn’t matter: we’re all very similar.

Polymorphism rate: the fraction of letters that differ between 2 organisms of the same species. In humans, the rate is very low, ~1/1,000-1/10,000.

SNP: single nucleotide polymorphism (pronounced “snip”). There are also areas with longer differences. Pathological cases, such as extra copies of an entire chromosome or fusion of two chromosomes, cause disease states. We’ll focus on SNPs, which occur only once every few thousand letters in humans, a relatively low rate. A small sea creature can vary by as much as 10%. SNP Consortium: project to identify all human SNPs.

Why are humans so similar?
Generally, if you look at a small population, genetic variation is reduced with each successive generation.
Mate AA and BB -> AB, AB.
Mate AB and AB -> 50% chance AB, 25% AA, 25% BB.
With enough generations, you’re likely to lose either A or B.

Humans: a small population in Africa interbred for a while, then went on to populate the rest of the earth, having already lost much variation.
~130k years ago, humans left Africa.
20k years ago, humans entered North America.

Chart showing genetic variation in Africa ~150k years ago. A subset left and went to Mesopotamia; there was smaller genetic variation in this group, and they were the ones who populated the rest of the earth. Interestingly, this means there is MORE variation within Africa than you see in the rest of the world. Different colors in the graphic indicate genetic variation.

How much of this variation has to do with skin color? Not much! Interbreeding for ~1000 years can change skin color in a population. Biologists who study human variation are considering doing their studies only on Africans: more for their money, as it were.

How do we sequence DNA? Can’t just stick it in a machine and get 1M letters out. Can only sequence ~500 letters at a time.

Start by breaking DNA into pieces by shaking it (though this doesn’t mean that when you jump up and down, your DNA breaks apart). Do this with many copies of the genome, and you get overlapping pieces.

Then incorporate those fragments into biological hosts, generally circular genomes where we can insert DNA into a known location: BACs (bacterial artificial chromosomes), plasmids, YACs. Each type incorporates a different-sized fragment when mixed. This helps because we then know the approximate size of a fragment. Not precise: sometimes you get chimeras, where 2 pieces join together, etc., but we know the approximate length.

Plasmid: 2,000-10,000
Cosmid: 40,000
BAC: 70-300k
YAC: >300k

Synthesize DNA starting from the restriction site. A primer sticks to the complementary site and starts DNA synthesis.
This is done in a “DNA soup” which contains many, many individual nucleotides. It also incorporates one type of di-deoxynucleotide (A’s, G’s, C’s, or T’s) which, when incorporated, causes the growing strand to end at that point. Di-deoxynucleotides are marked such that they can be seen in a gel. At the end, you have fragments of all sizes, and you know what nucleotide they end in.

Run these fragments from one side of a gel to the other. In each column, put the mixture of fragments that end in a different nucleotide, and introduce current. Bigger fragments travel slower and don’t get as far. At the end, we can see bands where the different sizes ended up.

This is why we can only sequence so many at a time: it’s easy to tell the difference between molecules 1 base long and 2 bases long, but it becomes impossible to distinguish DNA that’s 1000 vs. 1001 base pairs long.

[Figure: DNA is shaken into DNA fragments; a fragment plus a vector (circular genome: bacterium, plasmid) is joined at a known location (restriction site).]

[Slide: (extremely clean) output for DNA sequencing.] This data has been filtered, smoothed, and corrected for concentration. As you’d expect, you’ll see fewer long molecules; this is corrected for. The Y axis represents the strength of the signal read by the machine.

Question: how long does this take? Dr. B isn’t sure, but gives an example: a lab can sequence ~7x coverage of a mammalian genome (~30M reads) in ~1 year.

Very interesting signal processing problem: do the best you can reading the signal.
Electropherograms: the output of the reading.
PHRED: method for calling the letters (“A”, “G”, etc.), used by almost all labs. (By Phil Green at UW.) There are many better methods out there now, but inertia makes labs reluctant to switch, despite the potential gain in accuracy.
Output of PHRED is a read, ~500-700 letters long.
It also gives quality scores: -10 * log10(Prob(error)). So a score of 30 means ~1/1000 base calls are wrong.

Sequencing from both ends is referred to as double-barreled sequencing.

How to sequence segments longer than 500?
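Before answering that, a quick sketch of the quality-score formula above, Q = -10 * log10(Prob(error)), and its inverse. The function names are made up for illustration and are not PHRED's own interface; summing per-base error probabilities to get an expected error count for a read is a standard use of these scores, not something stated in lecture.

```python
import math


def phred_quality(p_error: float) -> float:
    """Quality score Q = -10 * log10(P(error)) for one base call."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def error_probability(quality: float) -> float:
    """Inverse of phred_quality: P(error) = 10 ** (-Q / 10)."""
    return 10.0 ** (-quality / 10.0)


def expected_errors(qualities) -> float:
    """Expected number of wrong base calls in a read, given per-base scores."""
    return sum(error_probability(q) for q in qualities)


if __name__ == "__main__":
    print(phred_quality(0.001))          # score for a 1-in-1000 error rate
    print(error_probability(20))         # error rate implied by a score of 20
    print(expected_errors([30] * 500))   # a 500-letter read, all bases at Q30
```

Each 10 points of quality is a factor of 10 in error probability, so a 500-letter read with every base at score 30 is expected to contain about half an error.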
Shotgun sequencing. Cut into many pieces, ~7x coverage. Then sequence one or both ends of each fragment. Do this many times so that you have overlaps between reads. Each time you find an overlap like this, you may be able to stitch these reads together.

Sometimes you get surprises as to the length, e.g. for a specific archaeon (archaea, as opposed to eukaryotes and bacteria): expected ~2M, got ~4.5-5M.

Coverage: need enough redundancy. Can calculate statistically what’s needed with the Lander-Waterman method. Coverage = nl/L, i.e. number of reads (n) times average length of reads (l), divided by the length of the genome (L). With redundancy of 10 and reads 500 long, expect 1 gap per million letters: pretty good, considered the gold standard. Frequency of gaps depends on coverage and length of segments.
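The coverage numbers above can be checked with a small sketch. The notes only give Coverage = nl/L; the gap formulas below are the usual Lander-Waterman approximations under the assumption that reads start at uniformly random positions and that any overlap, however short, is detectable. Function names are hypothetical.

```python
import math


def coverage(n_reads: int, read_len: float, genome_len: float) -> float:
    """Coverage c = n * l / L."""
    return n_reads * read_len / genome_len


def fraction_uncovered(c: float) -> float:
    """Assumed Poisson model: chance a given letter is hit by no read is e^-c."""
    return math.exp(-c)


def expected_gaps(n_reads: int, read_len: float, genome_len: float) -> float:
    """Expected number of gaps, approximated as n * e^-c.

    A read ends a contig when no other read starts inside it, which happens
    with probability about e^-c under the assumptions stated above.
    """
    c = coverage(n_reads, read_len, genome_len)
    return n_reads * math.exp(-c)


def gaps_per_million_letters(c: float, read_len: float) -> float:
    """Same estimate per 1,000,000 letters: (c / l) * e^-c * 1e6."""
    return 1e6 * (c / read_len) * math.exp(-c)


if __name__ == "__main__":
    # Redundancy 10 with reads 500 long, as in the notes.
    print(gaps_per_million_letters(10, 500))
    # The ~7x coverage used for shotgun sequencing, same read length.
    print(gaps_per_million_letters(7, 500))
```

With coverage 10 and 500-letter reads this gives about 0.9 gaps per million letters, in line with the "1 gap per million" figure. Because of the e^-c factor, dropping to 7x coverage raises the gap count by roughly an order of magnitude.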