Dynamic Programming(cont’d)CS 498 SSSaurabh SinhaPrevious lecture cont’dAffine Gap Penalties• In nature, a series of k indels often come as a singleevent rather than a series of k single nucleotideevents:Normal scoring wouldgive the same scorefor both alignmentsThis is morelikely.This is lesslikely.Accounting for Gaps• Gaps- contiguous sequence of spaces in one of the rows• Score for a gap of length x is: -(ρ + σx) where ρ >0 is the penalty for introducing a gap: gap opening penalty ρ will be large relative to σ: gap extension penalty because you do not want to add too much of a penalty forextending the gap.Affine gap penalty in DP• When computing si,j, need to look at si,j-1,si,j-2, si,j-3,…. and si-1,j, si-2,j, …• Each cell needs O(n) time for update• O(n2) cells• Therefore, O(n3) algorithm• We can still do this in O(n2) timeAffine Gap PenaltyRecurrencessi,j = s i-1,j - σ max s i-1,j –(ρ+σ)si,j = s i,j-1 - σ max s i,j-1 –(ρ+σ)si,j = si-1,j-1 + δ (vi, wj) max s i,j s i,jContinue Gap in w (deletion)Start Gap in w (deletion): from middleContinue Gap in v (insertion)Start Gap in v (insertion):from middleMatch or MismatchEnd deletion: from topEnd insertion: from bottomReading assignmentSection 6.10 (J & P)Multiple AlignmentGene Prediction• Gene: A sequence of nucleotides codingfor protein• Gene Prediction Problem: Determine thebeginning and end positions of genes in agenomeGenePrediction:ComputationalChallengeThe Genetic CodeSOURCE:http://www.bioscience.org/atlases/genecode/genecode.htm• In 1961 Sydney Brenner and Francis Crickdiscovered frameshift mutations• Systematically deleted nucleotides fromDNA– Single and double deletions dramaticallyaltered protein product– Effects of triple deletions were minor– Conclusion: every triplet of nucleotides,each c o d on, codes for exactly oneamino acid in a proteinCodons• In 1964, Charles Yanofsky and Sydney Brennerproved collinearity in the order of codons withrespect to amino acids in proteins• As a result, it was incorrectly assumed that thetriplets encoding for amino acid sequences formcontiguous strips of information.GreatDiscoveryProvokingWrongAssumptionExonsandIntrons• In eukaryotes, the gene is a combination ofcoding segments (exons) that are interrupted bynon-coding segments (introns)• This makes computational gene prediction ineukaryotes even more difficult• Prokaryotes don’t have introns - Genes inprokaryotes are continuousCentralDogmaandSplicingexon1exon2 exon3intron1 intron2transcriptiontranslationsplicingexon= cod ingintron= non-codingBatzoglouGene prediction• More difficult in eukaryotes than inprokaryotes (due to introns).• In human genome, ~3% of DNAsequence is genes• Lot of “junk” DNA between genes, andeven inside genes (between exons).• Gene prediction must deal with this.Gene prediction: broadlyspeaking• Statistical approaches:look for features than appear frequentlyin genes and infrequently elsewhere• Similarity based approaches:a newly sequenced gene may be similarto a known gene.– even this is not so simple. The exonstructures may be different betweenotherwise similar genesStatistical approachesSplicingSignalsExons are interspersed with introns andtypically flanked by GT and AGSplicesitedetection5’3’Donor sitePosition%-8 … -2 -1 0 1 2 … 17A 26 … 60 9 0 1 54 … 21C 26 … 15 5 0 1 2 … 27G 25 … 12 78 99 0 41 … 27T 23 … 13 8 1 98 3 … 25From lectures by Serafim Batzoglou (Stanford)ConsensussplicesitesDonor: 7.9 bitsAcceptor: 9.4 bitsSplicing and gene prediction• Using splice sites (profiles) to predictgenes ?• Limited scope, too many falsepredictions• Detect potential coding regions by looking at ORFs– A region of length n is comprised of (n/3) codons– Stop codons break genome into segments betweenconsecutive Stop codons– The subsegments of these that start from the Start codon(ATG) are ORFsGenomic SequenceOpen reading frameATG TGAOpenReadingFrames(ORFs)ORFs• 6 reading frames in any given sequence– 6 ways to map the DNA sequence to codonsequence (+1,+2,+3,-1,-2,-3)– 3 on either strand• Look at all 6 reading frames for ORFs• Long open reading frames may be a gene– At random, we should expect one stop codonevery (64/3) ~= 21 codons– However, genes are usually much longer thanthis• A basic approach is to scan for ORFs whose lengthexceeds certain threshold– This is naïve because some genes (e.g. someneural and immune system genes) are relativelyshortLongvs.ShortORFsCodon usage• In a given sequence (e.g., an ORF), computefrequency distribution of codons (64 elementarray): codon usage array• Codon usage array for coding sequences isdifferent from that for non-coding sequences• If the codon usage array for an ORF is muchmore similar to that of coding sequences thanto that of non-coding sequences, the ORFcould be a geneCodon usage• Codons coding for “Arg” in human:– CGU: 37%, CGC: 38%, CGA: 7%, CGG:10%, AGA: 5%, AGG: 3%– In a coding sequence, codon CGC is 12times more likely than codon AGG– An ORF preferring CGC over AGG is likelyto be a geneCodonUsageinHumanGenomeCodon usage• One way to test if an ORF is a gene is tocompute– Pr(ORF sequence under a coding sequencemodel)– Pr(ORF sequence under a non-coding model)– Ratio of the two.• These methods work best in prokaryotes• The exon-intron trouble is not handled yet• Hidden Markov models that use codon usageideas and splice site ideas, all in one– We’ll see more of this in second half of coursePromoter Structure in Prokaryotes(E.Coli)Transcription startsat offset 0.• Pribnow Box (-10)• Gilbert Box (-30)• RibosomalBinding Site (+10)RibosomalBindingSiteStatistical approaches:summary• Splicing sites• Codon usage• Promoter motifs, such as -10 element,-30 element• Ribosome binding siteSimilarity based approaches• Some genomes may be very well-studied,with many genes having beenexperimentally verified.• Closely-related organisms may havesimilar genes• Unknown genes in one species may becompared to genes in some closely-related speciesThe basic approach• Given a protein sequence, and a genomicsequence, find a set of substrings of thegenomic sequence whose concatenation bestfits the protein sequence•
View Full Document