• Tues, Nov 29: Gene Finding 1
• Thurs, Dec 1: Gene Finding 2
• Tues, Dec 6:
  – PS5 due
  – Project presentations 1 (see course web site for schedule)
• Thurs, Dec 8:
  – Final papers due
  – Project presentations 2
• Monday, Dec 19, 1pm - 4pm: Final Exam, Room: HH B131
Online FCE’s: Thru Dec 12

• Pairwise sequence alignment (global and local)
• Multiple sequence alignment
  – local
  – global
• Substitution matrices
• Database searching
  – BLAST
• Evolutionary tree reconstruction
• Sequence statistics
• Prokaryotic Gene Finding
• Eukaryotic Gene Finding

What is a Gene?
• Something that encodes a heritable trait
• One gene, one enzyme
• One gene, one polypeptide
• One gene, one product (include RNA products)
• “A complete chromosomal segment responsible for making a functional product”
  – coding region
  – regulatory region
  – expressed product
  – functional product
Snyder and Gerstein, Science 2003

Prokaryotic Gene Finding
• Identify Open Reading Frames (ORFs)
• Coding Statistics
• Identify individual gene architecture features
• Assemble an integrated gene description
• Homology

Reading Frames
• Each grouping of the nucleotides into consecutive triplets constitutes a reading frame.
• Three reading frames in the 5’->3’ direction
• Three in the reverse direction on the opposite strand.

A C G T A A C T G A C T A G G T G A A T ...
  C G T A A C T G A C T A G G T G A A ..
    G T A A C T G A C T A G G T G A A T ...

Open Reading Frames
An ORF is a contiguous set of codons, each specifying an amino acid (starting with ATG).
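The reading-frame and ORF definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the course: the names `reverse_complement`, `reading_frames`, `find_orfs` and the `min_codons` parameter are hypothetical, ATG is assumed to be the only start codon, and the standard stop codons TAA, TAG and TGA are assumed.

```python
# Assumption: standard stop codons; ATG is treated as the only start codon.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read 5'->3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def reading_frames(seq):
    """Return six lists of codons: three forward frames, then three reverse frames."""
    seq = seq.upper()
    frames = []
    for strand in (seq, reverse_complement(seq)):
        for offset in range(3):
            # Keep only complete triplets.
            codons = [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            frames.append(codons)
    return frames


def find_orfs(seq, min_codons=1):
    """Return (strand, frame, codons) for each ATG ... stop run in any frame.

    `frame` is the offset (0, 1 or 2) on the strand being read, and
    `min_codons` (hypothetical parameter) is the minimum number of codons
    before the stop codon.
    """
    orfs = []
    for index, codons in enumerate(reading_frames(seq)):
        strand = "+" if index < 3 else "-"
        start = None
        for pos, codon in enumerate(codons):
            if start is None and codon == "ATG":
                start = pos  # first ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if pos - start >= min_codons:
                    orfs.append((strand, index % 3, codons[start:pos + 1]))
                start = None  # look for the next ATG after this stop
    return orfs
```

Raising `min_codons` is one simple way to discard the many short ORFs that arise by chance, which matters because an ORF found this way is only a candidate gene.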
GGGAGCATGGTGCACCTGACTCCTGAGGTGACTTAGAC
      M  V  H  L  T  P  E  V  T  Stop

All coding sequences are ORFs, but not all ORFs encode proteins.

Coding Statistics
• Codon usage: species specific
  – Determine codon (triplet) frequencies in known coding regions
  – Compare with codon frequencies in a sliding window
• Codon pair preference: species specific
• Amino acid usage: species specific
• Amino acid pair preference: species specific
• Third position: any organism
  – Correlations in third base position: the 3rd base tends to be the same much more often than chance
• CG content: species specific
Fickett and Tung, 1992; Guigo and Fickett, 1995 (Electronic reserves)

ccg cct ggc gtc gcg gtt tgt ttt tca tct ctc ttc atc tgc a
        Gly Val Ala Val Cys Phe Ser Ser

Coding Statistics continued
CG content: species specific
In E. coli:
• Coding regions are embedded in segments of uniform, 53% G+C, about 1000 bases long
• Non-coding regions are embedded in segments of uniform, 46% G+C, about 500 bases long
• aa, at, ta, tt occur more frequently than expected in coding regions
Fickett and Tung, 1992; Guigo and Fickett, 1995 (Electronic reserves)

tgccgcctggcgtcgcggtttctttttcatctctcttcatctg
acggcggaccgcagcgccaaagaaaaagtagagagaagtagacc

Look for variations in these measures in coding and non-coding regions (intergenic and intragenic).

SIGNALS IN THE E. coli lexA GENE

  1 gaattcgataaatctctggtttattgtgcagtttatggttccaaaatcgccttttgctgt
    -35: TTCCAA
 61 atatactcacagcataactgtatatacacccagggggcggaatgaaagcgttaacggcca
    -10: TATACT; mRNA start (+); +10; GGGGG: Ribosomal binding site
121 ggcaacaagaggtgtttgatctcatccgtgatcacatcagccagacaggtatgccgccga
181 cgcgtgcggaaatcgcgcagcgtttggggttccgttccccaaacgcggctgaagaacatc
241 tgaaggcgctggcacgcaaaggcgttattgaaattgtttccggcgcatcacgcgggattc
301 gtctgttgcaggaagaggaagaagggttgccgctggtaggtcgtgtggctgccggtgaac
361 cacttctggcgcaacagcatattgaaggtcattatcaggtcgatccttccttattcaagc
421 cgaatgctgatttcctgctgcgcgtcagcgggatgtcgatgaaagatatcggcattatgg
481 atggtgacttgctggcagtgcataaaactcaggatgtacgtaacggtcaggtcgttgtcg
541 cacgtattgatgacgaagttaccgttaagcgcctgaaaaaacagggcaataaagtcgaac
601 tgttgccagaaaatagcgagtttaaaccaattgtcgttgaccttcgtcagcagagcttca
661 ccattgaagggctggcggttggggttattcgcaacggcgactggctgtaacatatctctg
721 agaccgcgatgccgcctggcgtcgcggtttgtttttcatctctcttcatcaggcttgtct
781 gcatggcattcctcacttcatctgataaagcactctggcatctcgccttacccatgattt
841 tctccaatatcaccgttccgttgctgggactggtcgatacggcggtaattggtcatcttg
901 atagcccggtttatttgggcggcgtggcggttggcgcaacggcggaccagct

PATTERN                        SIGNAL
CTGNNNNNNNNNNCAG               Repressor binding site
TTGACA, TATAAT, mRNA start     Promoter sequences
GGAGG                          Ribosomal binding site
ATG…TAA                        Open reading frame

Homology
Salzberg, Nature 2003

Gene Finding Questions
• Identify protein coding
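
The signal patterns listed for the lexA gene can be searched for mechanically. The following is a minimal sketch, not the lecture's own procedure: `scan`, `count_mismatches`, `PATTERNS` and the `max_mismatches` parameter are hypothetical names, `UPSTREAM` is simply the first 120 bases of the sequence shown on the slide, and assigning TTGACA to -35 and TATAAT to -10 follows the usual E. coli promoter consensus. N in a pattern is treated as matching any base.

```python
# First 120 bases of the lexA sequence from the slide (positions 1-120).
UPSTREAM = (
    "gaattcgataaatctctggtttattgtgcagtttatggttccaaaatcgccttttgctgt"
    "atatactcacagcataactgtatatacacccagggggcggaatgaaagcgttaacggcca"
).upper()

# Patterns from the slide's legend; N matches any base.
PATTERNS = {
    "repressor binding site": "CTGNNNNNNNNNNCAG",
    "promoter -35": "TTGACA",
    "promoter -10": "TATAAT",
    "ribosomal binding site": "GGAGG",
}


def count_mismatches(window, pattern):
    """Count positions where window differs from pattern, ignoring N positions."""
    return sum(1 for base, want in zip(window, pattern) if want != "N" and base != want)


def scan(seq, pattern, max_mismatches=0):
    """Return (1-based position, matched text, mismatch count) for each hit."""
    seq, pattern = seq.upper(), pattern.upper()
    hits = []
    for start in range(len(seq) - len(pattern) + 1):
        window = seq[start:start + len(pattern)]
        found = count_mismatches(window, pattern)
        if found <= max_mismatches:
            hits.append((start + 1, window, found))
    return hits


if __name__ == "__main__":
    # Report every pattern, allowing one mismatch against the consensus.
    for name, pattern in PATTERNS.items():
        for position, window, found in scan(UPSTREAM, pattern, max_mismatches=1):
            print(f"{name:24s} {position:4d} {window} mismatches={found}")
```

Allowing mismatches matters because real signals rarely equal the consensus: the TATACT hexamer annotated as -10 on the slide differs from TATAAT at one base, so an exact search would miss it. Short patterns also match by chance, which is one reason signal hits are combined with ORFs and coding statistics rather than used alone.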