CMU CS 10810 - Computational gene prediction - D1926337

School name Carnegie Mellon University

Course CS 10810 - Computational Genomics

Pages 19

Graduate Computational Genomics
02-710 / 10-810 & MSCBIO2070

Computational gene prediction
Takis Benos
Lecture #9, February 13, 2007
Reading: hand-outs & papers

Benos 02-710/MSCBIO2070 8-FEB-2007

Outline
• The problem
• Predicting the coding parts of the genome
• Predicting splice sites
• Predicting exons
• Predicting whole gene structures
• GenScan
• TwinScan
• N-SCAN

The problem
Given a piece of genomic DNA, predict the regions that are more likely to be part of a (protein coding) gene.

Protein coding gene parts:
• (promoter region)
• 5' Untranslated Region (5' UTR)
• Open Reading Frame (ORF)
• 3' Untranslated Region (3' UTR)

Eukaryotes
• Introns
• Exons

Finding Genes in Yeast
[Figure: a yeast gene along the transcript, 5' to 3': intergenic, 5' UTR, coding, 3' UTR, intergenic]
• Start codon: ATG
• Stop codon: TAG / TGA / TAA
• Mean coding length about 1500 bp (500 codons)

Gene structure (eukaryotes)
[Figure: eukaryotic gene structure]

Codon bias
Codon usage per 1,000 codons:

Amino acid  Codon  E. coli  Yeast
Phe         UUU    22.3     26.1
Phe         UUC    16.6     18.2
Leu         UUA    13.9     26.4
Leu         UUG    13.7     27.1
Leu         CUU    11.0     12.2
Leu         CUC    11.1      5.4
Leu         CUA     3.9     13.4
Leu         CUG    52.6     10.4
Ile         AUU    30.3     30.2
Ile         AUC    25.1     17.1
Ile         AUA     4.4     17.8
Met         AUG    27.9     20.9
Val         GUU    18.3     22.0
Val         GUC    15.3     11.6
Val         GUA    10.9     11.8
Val         GUG    26.4     10.7

Coding exon finding
Measures of "coding potential" related to codon usage:
• Codon usage measure
• Hexamer measure
• Amino acid usage measure
• Di-amino acid usage measure

Codon usage measure
• Staden & McLachlan (1982)
  The "coding" probability of a window w_j of length L is the product of the codon probabilities calculated from a reference set (mRNA):

  P(w_j) = \prod_{i=1}^{L/3} P(c_i)

• Gribskov, Devereux & Burgess (1984)
  The "coding" score of a window is the product of the likelihood ratios of the in-frame vs. out-of-frame codon frequencies (equivalently, a sum of log-likelihood ratios):

  S(w_j) = \prod_{i=1}^{L/3} \frac{P(c_i \mid \text{coding})}{P(c_i \mid \text{non-coding})}

Codon usage measure (cont'd)
• Hinds & Blake (1985)
  Similar to Staden & McLachlan, but P(c_i) is calculated as the frequency of the in-frame codon, c_i, over all of its occurrences (both in-frame and out-of-frame).
• Claverie & Bougueleret (1986 & 1990)
  p(w_j) and q(w_j) are the frequencies of the hexamer w_j in exons and introns, respectively.
  Two hexamer measures were used:
  1. d1(w_j) = p(w_j) / ( p(w_j) + q(w_j) )
  2. d2(w_j) = p(w_j) - q(w_j)

Codon usage measure (cont'd)
• Borodovsky et al. (1986)
  • Construct four Markov models from the reference set: (1-3) one for each of the three frames of the coding regions, (4) one for the non-coding regions
  • For each window, calculate the (four) window probabilities given a particular model
  • Calculate the probability of each model given the window (using Bayes)
  • Markov models of order 1 through 5 were used
  • Best results were obtained with Markov models of order 5 (hexamer frequencies)

Predicting genes
• Prediction of exon(s)
• Prediction of exact splice sites
• Prediction of begin/end of the gene
• Selection of exons
[Figure: gene model running from BEGIN to END]

Splice sites
[Figure: gene along the transcript, 5' to 3': intergenic, 5' UTR, start codon ATG, intron with donor/acceptor splice sites, stop codon TAG/TGA/TAA, 3' UTR, intergenic]

Splice sites (cont'd)
Donor site, base frequencies (%) by position:

%    -8   …   -2   -1    0    1    2   …   17
A    26   …   60    9    0    0   54   …   21
C    26   …   15    5    0    1    2   …   27
G    25   …   12   78  100    0   41   …   27
T    23   …   13    8    0   99    3   …   25

Splice sites (cont'd)
[Figure: splice sites]

Modeling splice sites
• WMM: weight matrix model = PSSM (Staden 1984)
• WAM: weight array model = 1st order Markov (Zhang & Marr 1993)
• MDD: maximal dependence decomposition (Burge & Karlin 1997)
  • Decision-tree algorithm to take pairwise dependencies into account
  • Train separate WMM models for each subset

Maximal Dependence Decomposition
[Figure: maximal dependence decomposition]

Prediction of gene structure
Extrinsic information:
• Alignment with known protein sequences
• Alignment with ESTs or cDNAs
• Alignment with genomic DNA of related species

Potential sources of problems:
• Genes with no similarity in the database
• Poor database quality
• Genes expressed in specific conditions or at low levels
• Alternatively spliced genes

All together now…
[Figure: flowchart combining intrinsic methods and extrinsic methods through splice prediction, exon prediction, splice site selection and exon selection into a complete gene structure]

HMMs for Gene Finding
[Figure: the sequence GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA annotated as intergenic, exon and intron segments, each emitted by an HMM state (intergene state, first exon state, intron state)]

GENSCAN
[Figure: GENSCAN state diagram; recoverable state labels include E0, E1, E2, I0, I1, I2, Einit, Eterm, Esingle and Intergenic]

GENSCAN Characteristics
• Burge and Karlin, Stanford, 1997
• Explicit state duration HMM (with tricks)
• Intergenic and intronic regions have geometric length distribution
• Exons are only possible when correct flanking sequences are present
• Output probabilities for NC and CDS are 5th-order Markov
• Each CDS frame has its own model
• WAM models for start/stop codons and acceptor sites
• MDD model for donor sites
• Separate parameters for regions of
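The Markov-model approach attributed to Borodovsky et al. (one model per coding frame, one non-coding model, then Bayes to get the probability of each model given a window) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the published method: the function names (`train`, `log_prob`, `posteriors`), the add-one smoothing, the uniform prior over the four models, the order of 2 (the slides report order 5 as best, which needs far more training data) and the toy training sequences are all hypothetical choices made for this example.

```python
import math
from collections import defaultdict

BASES = "ACGT"
ORDER = 2  # assumption: small order so the toy data below is enough


def train(seqs, order, periodic):
    """Count context -> next-base transitions.

    periodic=True keeps one table per codon position (0, 1, 2) and assumes
    every training sequence starts in frame; periodic=False keeps one table.
    """
    n_tables = 3 if periodic else 1
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(n_tables)]
    for s in seqs:
        for i in range(order, len(s)):
            t = i % 3 if periodic else 0
            counts[t][s[i - order:i]][s[i]] += 1
    return counts


def log_prob(window, counts, order, phase=0):
    """log P(window | model) with add-one smoothing.

    The first `order` bases only serve as context. `phase` shifts which
    codon-position table is used, i.e. it selects the reading frame.
    """
    total = 0.0
    for i in range(order, len(window)):
        table = counts[(i + phase) % len(counts)]
        ctx = table.get(window[i - order:i], {})
        seen = ctx.get(window[i], 0)
        total += math.log((seen + 1) / (sum(ctx.values()) + len(BASES)))
    return total


def posteriors(window, coding, noncoding, order):
    """Bayes posterior over [frame 0, frame 1, frame 2, non-coding].

    Assumes a uniform prior over the four models.
    """
    logs = [log_prob(window, coding, order, phase) for phase in range(3)]
    logs.append(log_prob(window, noncoding, order))
    top = max(logs)
    weights = [math.exp(x - top) for x in logs]  # subtract max for stability
    z = sum(weights)
    return [w / z for w in weights]


# Hypothetical toy training data, not real genomic sequence.
coding = train(["GCTGCTGCTGCTGCTGCTGCTGCT"], ORDER, periodic=True)
noncoding = train(["ATTATAATATTTAAATATAT"], ORDER, periodic=False)

print(posteriors("GCTGCTGCTGCT", coding, noncoding, ORDER))
```

Shifting the window by one base changes which frame receives the highest posterior, which is the behaviour the three frame-specific models are meant to capture.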

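The weight matrix model (WMM) listed under "Modeling splice sites" can be illustrated with the donor-site percentages given in these notes. The sketch below uses only the five contiguous positions that the table shows in full (-2, -1, 0, 1, 2) and scores a 5-mer as a sum of per-position log-odds. The uniform background of 0.25, the pseudocount of 0.5 and the names `wmm_score` and `best_donor` are assumptions introduced for this example; the slides do not specify them.

```python
import math

# Donor-site percentages for positions -2, -1, 0, 1, 2, taken from the table
# in these notes; the other columns are elided in the source and not used.
DONOR_PCT = {
    "A": [60, 9, 0, 0, 54],
    "C": [15, 5, 0, 1, 2],
    "G": [12, 78, 100, 0, 41],
    "T": [13, 8, 0, 99, 3],
}
BACKGROUND = 0.25  # assumption: uniform background base frequencies
PSEUDO = 0.5       # assumption: pseudocount (in percent) to avoid log(0)
WIDTH = 5


def wmm_score(site):
    """Log2-odds score of a 5-mer under the donor WMM vs. the background."""
    if len(site) != WIDTH:
        raise ValueError(f"expected a {WIDTH}-mer, got {len(site)} bases")
    score = 0.0
    for pos, base in enumerate(site):
        p = (DONOR_PCT[base][pos] + PSEUDO) / (100 + 4 * PSEUDO)
        score += math.log2(p / BACKGROUND)
    return score


def best_donor(seq):
    """Return (offset, score) of the highest-scoring 5-mer in seq."""
    candidates = (
        (i, wmm_score(seq[i:i + WIDTH])) for i in range(len(seq) - WIDTH + 1)
    )
    return max(candidates, key=lambda item: item[1])


offset, score = best_donor("CCCAGGTACC")
print(offset, round(score, 2))
```

A WMM treats positions as independent; the WAM and MDD models named in the notes exist to relax that assumption and are not covered by this sketch.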