CS262 Lecture 9, Win07, Batzoglou

Conditional Random Fields
A brief description

HMMs and CRFs
•Features used in the objective function P(x, π) for an HMM: a_kl and e_k(b), where b ∈ Σ
•Viterbi recurrence:
  V_l(i) = V_k(i – 1) + (log a_kl + log e_l(x_i)) = V_k(i – 1) + g(k, l, x_i)
•Let's generalize g:  V_k(i – 1) + g(k, l, x, i)
[Figure: Viterbi trellis of states 1…K over positions x_1 x_2 x_3 … x_N]

CRFs – Features
1. Define a set of features that may be important
 •Number the features 1…n: f_1(k, l, x, i), …, f_n(k, l, x, i)
 •Features are indicator (true/false) variables
2. Train a weight w_j for each feature f_j
Then, g(k, l, x, i) = Σ_{j=1…n} w_j f_j(k, l, x, i)
[Figure: a feature at states (i – 1, i) may look at many positions of x, e.g. x_1…x_10]

"Features" that depend on many positions in x
•The score of a parse depends on all of x at each position
•We can still do Viterbi, because state i only "looks" at the previous state i – 1 and the constant sequence x
[Figure: graphical models of an HMM vs. a CRF over x_1…x_6]

Conditional Training
•Hidden Markov Model training:
 Given training sequence x and "true" parse π, maximize P(x, π)
•Disadvantage: P(x, π) = P(π | x) P(x)
 P(π | x) is the quantity we care about, so as to get a good parse
 P(x) is a quantity we don't care so much about, because x is always given

Conditional Training
P(x, π) = P(π | x) P(x)
P(π | x) = P(x, π) / P(x)
Recall: F_j(x, π) = # times feature f_j occurs in (x, π)
 = Σ_{i=1…N} f_j(π_{i–1}, π_i, x, i)   ; count f_j in (x, π)
In HMMs, denote by w_j the weight of the j-th feature: w_j = log(a_kl) or log(e_k(b))
Then,
 HMM: P(x, π) = exp[Σ_{j=1…n} w_j F_j(x, π)]
 CRF: Score(x, π) = exp[Σ_{j=1…n} w_j F_j(x, π)]

Conditional Training
In HMMs,
 P(π | x) = P(x, π) / P(x)
 P(x, π) = exp[w^T F(x, π)]
 P(x) = Σ_π exp[w^T F(x, π)] =: Z
Then, in a CRF we can do the same to normalize Score(x, π) into a probability:
 P_CRF(π | x) = exp[Σ_{j=1…n} w_j F_j(x, π)] / Z
QUESTION: Why is this a probability?

Conditional Training
1. We need to be given a set of sequences x and "true" parses π
2. Calculate Z by a sum-of-paths algorithm similar to the HMM forward algorithm
 •We can then easily calculate P(π | x)
3. Calculate the partial derivative of P(π | x) w.r.t. each parameter w_j (not covered; akin to forward/backward)
 •Update each parameter with gradient descent
4. Continue until convergence to the optimal set of weights
 P(π | x) = exp[Σ_{j=1…n} w_j F_j(x, π)] / Z is convex

Conditional Random Fields – Summary
1. Ability to incorporate complicated non-local feature sets
 •Do away with some independence assumptions of HMMs
 •Parsing is still equally efficient
2. Conditional training
 •Train parameters that are best for parsing, not modeling
 •Need labeled examples: sequences x and "true" parses π
 (We can train on unlabeled sequences, but it is unreasonable to train too many parameters this way)
 •Training is significantly slower: many iterations of forward/backward

DNA Sequencing

Some Terminology
insert    a fragment that was incorporated in a circular genome, and can be copied (cloned)
vector    the circular genome (host) that incorporated the fragment
BAC       Bacterial Artificial Chromosome, a type of insert–vector combination, typically of length 100–200 kb
read      a 500–900 bp word that comes out of a sequencing machine
coverage  the average number of reads (or inserts) that cover a position in the target DNA piece
shotgun sequencing  the process of obtaining many reads from random locations in DNA, to detect overlaps and assemble

Sequencing and Fragment Assembly
[Figure: reads sampled from a genome of 3×10^9 nucleotides]
•50% of human DNA is composed of repeats
•Error! A repeat can cause two distant regions to be glued together during assembly

What can we do about repeats?
Two main approaches:
•Cluster the reads
•Link the reads

Sequencing and Fragment Assembly
[Figure: a repeat R flanked by regions A, B, C, D; do the reads assemble as C R D and A R B, or as A R D and C R B?]

Strategies for whole-genome sequencing
1. Hierarchical – Clone-by-clone
 i. Break the genome into many long pieces
 ii. Map each long piece onto the genome
 iii. Sequence each piece with shotgun
 Example: Yeast, Worm, Human, Rat
2. Online version of (1) – Walking
 i. Break the genome into many long pieces
 ii. Start sequencing each piece with shotgun
 iii. Construct the map as you go
 Example: Rice genome
3. Whole genome shotgun
 One large shotgun pass on the whole genome
 Example: Drosophila, Human (Celera), Neurospora, Mouse, Rat, Dog

Whole Genome Shotgun Sequencing
[Figure: the genome is cut many times at random; forward–reverse paired reads (~500 bp each, at a known distance) come from plasmids (2–10 kbp) and cosmids (40 kbp)]

Fragment Assembly (in whole-genome shotgun sequencing)

Fragment Assembly
Given N reads, where N ~ 30 million, we need to …
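The text breaks off at "we need to"; the first task in fragment assembly is finding overlapping reads. A minimal sketch of exact suffix–prefix overlap detection (the function name and the naive scan are illustrative; a real assembler handling ~30 million error-prone reads would use k-mer indexing and approximate matching):

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of read a that matches a prefix of read b,
    requiring at least min_len matching characters; 0 if no such overlap."""
    start = 0
    while True:
        # jump to the next occurrence in a of b's first min_len characters
        start = a.find(b[:min_len], start)
        if start == -1:
            return 0
        # verify that the entire suffix of a from here matches a prefix of b
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1

print(overlap("AGTACGA", "ACGAT"))  # suffix "ACGA" matches b's prefix -> 4
```

Computing this score for promising read pairs gives the edges of the overlap graph from which contigs are later merged.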
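Returning to the CRF section's question "Why is this a probability?": dividing by Z = Σ_π exp[Σ_j w_j F_j(x, π)] makes the scores of all parses sum to 1. A toy sketch with hypothetical indicator features, computing Z by brute-force enumeration over every parse (a real implementation would use the forward-style sum-of-paths recursion the notes mention):

```python
import itertools
import math

def log_score(x, pi, weights, features):
    # log Score(x, pi) = sum_j w_j * F_j(x, pi), where F_j counts
    # feature j over all positions i (previous state is None at i = 0)
    return sum(
        w * sum(f(pi[i - 1] if i > 0 else None, pi[i], x, i)
                for i in range(len(x)))
        for w, f in zip(weights, features)
    )

def crf_prob(x, pi, states, weights, features):
    # P_CRF(pi | x) = exp(log Score) / Z; Z enumerated by brute force here
    Z = sum(math.exp(log_score(x, p, weights, features))
            for p in itertools.product(states, repeat=len(x)))
    return math.exp(log_score(x, pi, weights, features)) / Z

# Hypothetical toy model: two states, two indicator features
f_emit = lambda prev, k, x, i: 1.0 if k == x[i] else 0.0  # state matches symbol
f_switch = lambda prev, k, x, i: 1.0 if prev not in (None, k) else 0.0  # state change
states, weights, feats = ("a", "b"), (1.5, -0.5), (f_emit, f_switch)

x = "ab"
total = sum(crf_prob(x, p, states, weights, feats)
            for p in itertools.product(states, repeat=len(x)))
print(round(total, 6))  # probabilities over all parses sum to 1
```

The weights here are arbitrary: whatever their values, exp of any real number is positive and the normalization forces the parse distribution to sum to 1, which is exactly why P_CRF(π | x) is a valid probability.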