Conditional Random Fields

CS262 Lecture 9, Win07, Batzoglou

[Figure: CRF trellis with states 1, 2, …, K at each position of x1, x2, …, xK]

“Features” that depend on many positions in x

Vl(i+1) = maxk [ Vk(i) + g(k, l, x, i+1) ]

where

g(k, l, x, i) = Σj=1…n wj fj(k, l, x, i)

[Figure: a feature at positions (i-1, i) may look at all of x1, …, x10, not just xi]

“Features” that depend on many positions in x

•Score of a parse depends on all of x at each position
•Can still do Viterbi, because state i only “looks” at prev.
state i-1 and the constant sequence x

[Figure: graphical models of HMM vs. CRF over states π1…π6 and sequence x1…x6]

How many parameters are there, in general?

•Arbitrarily many parameters!
For example, let fj(k, l, x, i) depend on xi-5, xi-4, …, xi+5
•Then, we would have up to K²·|Σ|^11 parameters!

Advantage: powerful, expressive model
•Example: “if there are more than 50 6’s in the last 100 rolls, but in the surrounding 18 rolls there are at most 3 6’s, this is evidence we are in the Fair state”
•Interpretation: the casino player is afraid of being caught, so he switches to Fair when he sees too many 6’s
•Example: “if there are any CG-rich regions in the vicinity (window of 2000 positions), then favor predicting lots of genes in this region”

Question: how do we train these parameters?

Conditional Training

•Hidden Markov Model training:
Given training sequence x and “true” parse π,
maximize P(x, π)
•Disadvantage:
P(x, π) = P(π | x) · P(x)
P(π | x): the quantity we care about, so as to get a good parse
P(x): a quantity we don’t care much about, because x is always given

Conditional Training

Recall:
F(j, x, π) = # times feature fj occurs in (x, π)
= Σi=1…N fj(πi-1, πi, x, i) ; count fj in (x, π)

In HMMs, let’s denote by wj the weight of the jth feature: wj = log(akl) or log(ek(b))

Then,
HMM: P(x, π) = exp[Σj=1…n wj F(j, x, π)]
CRF: Score(x, π) = exp[Σj=1…n wj F(j, x, π)]

Conditional Training

In HMMs,
P(π | x) = P(x, π) / P(x)
P(x, π) = exp[Σj=1…n wj F(j, x, π)]
P(x) = Σπ exp[Σj=1…n wj F(j, x, π)] =: Z

Then, in CRFs we can do the same to normalize Score(x, π) into a probability:
PCRF(π | x) = exp[Σj=1…n wj F(j, x, π)] / Z

QUESTION: Why is this a probability???

Conditional Training

1. We are given a training set of sequences x and “true” parses π
2. Calculate Z by a sum-of-paths algorithm similar to the HMM forward algorithm
•We can then easily calculate P(π | x)
3. Calculate the partial derivative of P(π | x) w.r.t.
each parameter wj (not covered—akin to forward/backward):

d/dwj log P(π | x) = F(j, x, π) − Eπ’[F(j, x, π’)]

•Update each parameter by gradient ascent!
4. Continue until convergence to the optimal set of weights
log PCRF(π | x) = Σj=1…n wj F(j, x, π) − log Z is concave in the weights, so gradient ascent reaches the global optimum!

Conditional Random Fields—Summary

1. Ability to incorporate complicated non-local feature sets
•Do away with some of the independence assumptions of HMMs
•Parsing is still equally efficient
2. Conditional training
•Train the parameters that are best for parsing, not for modeling
•Need labeled examples—sequences x and “true” parses π
(Can train on unlabeled sequences, however it is unreasonable to train too many parameters this way)
•Training is significantly slower—many iterations of forward/backward

DNA Sequencing

DNA sequencing

How we obtain the sequence of nucleotides of a species:

…ACGTGACTGAGGACCGTGCGACTGAGACTGACTGGGTCTAGCTAGACTACGTTTTATATATATATACGTCGTCGTACTGATGACTAGATTACAGACTGATTTAGATACCTGACTGATTTTAAAAAAATATT…

Which representative of the species?

Which human?
•Answer one: [figure]
•Answer two: it doesn’t matter

Polymorphism rate: the number of letter changes between two different members of a species
Humans: ~1/1,000
Other organisms have much higher polymorphism rates
Why? Population size!

Human population migrations

•Out of Africa, Replacement
Single mother of all humans (“Eve”), ~150,000 yr
Single father of all humans (“Adam”), ~70,000 yr
Humans out of Africa ~40,000 years ago replaced others (e.g., Neandertals)
Evidence: mtDNA
•Multiregional Evolution
Fossil records show a continuous change of morphological features
Proponents of the theory doubt mtDNA and other genetic evidence

Why humans are so similar

A small population that interbred reduced the genetic variation
Out of Africa ~40,000 years ago
H = 4Nu / (1 + 4Nu)
(H: heterozygosity, N: effective population size, u: mutation rate)

Migration of human variation

[Map figures]
http://info.med.yale.edu/genetics/kkidd/point.html

Human variation in Y chromosome

[Figure]

DNA Sequencing – Overview

(Timeline: 1975 → 2015)
•Gel electrophoresis
Predominant, old technology, by F. Sanger
•Whole-genome strategies
Physical mapping
Walking
Shotgun sequencing
•Computational fragment assembly
•The future—new sequencing technologies
Pyrosequencing, single-molecule methods, …
Assembly techniques
•Future variants of sequencing
Resequencing of humans
Microbial and environmental sequencing
Cancer genome sequencing

DNA Sequencing

Goal:
Find the complete sequence of A, C, G, T’s in DNA
Challenge:
There is no machine that takes long DNA as an input and reads off the complete sequence; machines can only read short pieces of DNA at a time
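The CRF recurrences in this lecture are easy to make concrete. Below is a minimal sketch, assuming a two-state Fair/Loaded casino model with invented feature functions and weights (none of these numbers come from the slides); only the recurrences themselves, the Viterbi maximization over g(k, l, x, i) and the sum-of-paths computation of Z, follow the lecture.

```python
import math

# Two hypothetical states (0 = Fair, 1 = Loaded) and invented features,
# used only for illustration.
K = 2
x = [1, 6, 6, 6, 2]   # observed die rolls

def f_emit_six(k, l, x, i):
    # fires when the current state is Loaded and the current roll is a 6
    return 1.0 if l == 1 and x[i] == 6 else 0.0

def f_stay(k, l, x, i):
    # fires when the state does not change between positions i-1 and i
    return 1.0 if k == l else 0.0

def f_sixes_nearby(k, l, x, i):
    # a non-local feature: Loaded, with >= 2 sixes in a window around i
    return 1.0 if l == 1 and x[max(0, i - 2): i + 3].count(6) >= 2 else 0.0

features = [f_emit_six, f_stay, f_sixes_nearby]
w = [0.5, 1.0, 0.8]   # assumed weights w_j

def g(k, l, x, i):
    # g(k, l, x, i) = sum_j w_j * f_j(k, l, x, i)
    return sum(wj * fj(k, l, x, i) for wj, fj in zip(w, features))

def viterbi_score(x):
    # V_l(i+1) = max_k [ V_k(i) + g(k, l, x, i+1) ]: best parse score
    V = [0.0] * K
    for i in range(1, len(x)):
        V = [max(V[k] + g(k, l, x, i) for k in range(K)) for l in range(K)]
    return max(V)

def log_Z(x):
    # the same recurrence with max replaced by log-sum-exp gives log Z,
    # the sum-of-paths normalizer over all parses pi
    V = [0.0] * K
    for i in range(1, len(x)):
        V = [math.log(sum(math.exp(V[k] + g(k, l, x, i)) for k in range(K)))
             for l in range(K)]
    return math.log(sum(math.exp(v) for v in V))
```

With these pieces, PCRF(π | x) = exp(Score(x, π)) / Z assigns every parse a probability, and log Z always exceeds the Viterbi score, since Z sums over all parses including the best one.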