• Thurs, Nov 30: Gene Finding 2
• Tues, Dec 5: PS5 due in my office at 5pm. Project presentation preparation – no class
• Thurs, Dec 7: PS5 returned in class. Final papers due. Project presentations
• Monday, Dec 18, 8:30am – 11:30am: Final Exam, DH 1211
Online FCEs: through Dec 11

Pairwise sequence alignment (global and local)
Multiple sequence alignment (local, global)
Substitution matrices
Database searching
BLAST
Evolutionary tree reconstruction
Sequence statistics
Prokaryotic Gene Finding
Eukaryotic Gene Finding
Recent results in gene finding

Credits for slides: Serafim Batzoglou, Marina Alexandersson, Lior Pachter, Sam Gross

Gene criteria
• Open Reading Frames (ORFs) (computational)
• Sequence features (computational)
• Sequence conservation (computational)
• Evidence for transcription (experimental)
• Gene inactivation induces a phenotype (experimental)
Snyder and Gerstein, Science 2003

Sequence features
• Coding statistics (e.g. codon bias)
• Gene structure
[Figure: gene structure drawn 5’ to 3’, labeled: open reading frame, promoter region, ribosome binding site, termination sequence, start codon/stop codon, repressor site]

An HMM that finds genes in E. coli
Krogh et al, 1995 (Electronic reserves)
[Figure: HMM with an intergene model, start codons, 61 triplet models, and stop codons; triplet models use observed frequencies for E. coli genes]

Outstanding Problems
• Model cannot account for drift in CG content
• Does not take position dependencies into account
• Solution:
– kth order Markov chain
– looks back k positions
• Glimmer (Salzberg et al, 1998)
– Finds 98% of all genes in a bacterial genome.

Prokaryotic vs.
Eukaryotic Genes
• Prokaryotes
– small genomes (0.5 Mb to 10 Mb)
– high gene density (90%)
– no introns (or splicing)
– no RNA processing
– simple regulatory regions
– most long ORFs are genes
• Eukaryotes
– large genomes
– low gene density (3% - 50%)
– intron/exon structure
– splicing
– complex regulatory regions
Genomic data: Must handle multiple genes and/or gene fragments in input sequence.

Genscan
Architecture:
• Individual modules: intergenic region, promoter, 5’ UTR, exon/intron, post-translation region
• Semi Hidden Markov Model – various length distributions
• Different statistical models for each module:
– weight matrices + extensions, 3-periodic 5th-order Markov chains
Incorporates:
• Descriptions of transcriptional, translational and splicing signals
• Compositional features of exons, introns, intergenic, C+G regions
Burge and Karlin, 1997

Genscan
Larger predictive scope than previous models
• Partial genes
• Multiple genes separated by intergenic DNA
• Genes on either/both DNA strands
Proposed pipeline
• Screen for repetitive elements
• Predict protein sequences with GENSCAN
• BLAST predictions to find homologs
• Refine using spliced alignment of prediction with homolog (e.g., Gelfand, Mironov, Pevzner, 96)
• Verify experimentally
Burge and Karlin, 1997

GenScan States
• N: intergenic region
• P: promoter
• F: 5’ untranslated region
• Esngl: single exon (intronless) (translation start -> stop codon)
• Einit: initial exon (translation start -> donor splice site)
• Ek: phase k internal exon (acceptor splice site -> donor splice site)
• Eterm: terminal exon (acceptor splice site -> stop codon)
• Ik: phase k intron
• T: 3’ untranslated region
• A: poly-A signal
Fig.
3, Burge and Karlin 1997

Performance measures: Nucleotide Level
[Figure: reality and prediction tracks marking true negative, true positive, false negative, and false positive nucleotides]
Sn = TP / (TP + FN)
Burset & Guigo, 1996

Performance measures: Exon Level
• Both edges must be correctly aligned
[Figure: reality and prediction tracks showing a wrong exon, a missed exon, and a correct exon]
Sn = correct exons / actual exons
Sp = correct exons / predicted exons
Burset & Guigo, 1996

Gene prediction performance
              Nucl Sn  Exon Sn  Exon Sp  Missing Exons  Wrong Exons
Genscan       0.93     0.78     0.81     0.09           0.05
GENEPARSER3   0.86     0.56     0.58     0.14           0.09
GENEID+       0.91     0.73     0.70     0.07           0.13
FGENEH        0.77     0.61     0.64     0.15           0.12
Genes with all exons correctly predicted by Genscan: 43%
Data set: Each sequence contains exactly one gene.
Burset & Guigo, 1996; Burge and Karlin, 1997

Gene prediction performance with more challenging benchmarks
h178: Single gene data
• 178 sequences
• 1 gene/sequence
• Nucleotides in gene regions: 53%
• Coding nucleotides: 21%
Gen178: Semi-artificial genomes*
• 42 sequences
• 4.1 genes/sequence
• Nucleotides in genic regions: 8.6%
• Coding nucleotides: 2.3%
* Multiple genes interspersed with random sequence
Guigo et al, 2000.

Genscan performance
• Correct genes: 10% - 15%
• Gen178 does not contain repeats, pseudogenes, genes with huge introns, ...
Results are probably still overly optimistic
• A lot of room for improvement...
          Nucl Sn  Exon Sn  Exon Sp  Missing Exons  Wrong Exons
h178      0.93     0.78     0.75     0.08           0.10
Gen178    0.89     0.64     0.44     0.14           0.41
Guigo et al, 2000.

Innovations in gene prediction since 2000
➢ Spliced alignment with proteins or ESTs
– Genewise, Procrustes
• Dual-genome predictors
– SLAM, TWINSCAN, SGP2
• Multi-genome predictors
– PhyloHMMs (Exoniphy), NSCAN
• Also, better models of gene features (e.g., splice sites, UTRs) and better identification of pseudogenes.

Spliced alignments
Procrustes (Gelfand et al, 96), Genewise (Birney & Durbin, 97)
Align genomic DNA with proteins or cDNAs
[Figure: cDNA aligned to genomic DNA, 5’ to 3’; the cDNA matches Exon1, Exon2, and Exon3 and skips Intron1 and Intron2]
Methods include gene feature models, e.g., splice sites, frameshifts, penalise stop codons
CTCATGAGGTGAGgtgaatagt......cgtaattagGTCTTCTGGGGCCA
|||||||||||||       <-15907->        |||||||||||||
CTCATGAGGTGAG________________________GTCTTCTGGGGCCA
Spliced alignment methods are
• More accurate for known genes
• Less accurate for unknown genes

Innovations in gene prediction since 2000
• Spliced alignment with proteins or ESTs
– Genewise, Procrustes
➢ Dual-genome predictors
– SLAM, TWINSCAN, SGP2
• Multi-genome predictors
– PhyloHMMs (Exoniphy), NSCAN

Dual genome predictors
• TWINSCAN (Brent), SGP2 (Guigo)
– Predict genes in pairwise alignments
• SLAM (Pachter)
– Simultaneous alignment and gene prediction
[Figure: human and mouse sequences]

Human-mouse homology
Comparison of 1196 orthologous genes
• Sequence identity between genes in human/mouse
– exons: 84.6%
– introns: 35%
– 5’ UTRs: 67%
– 3’ UTRs: 69%
Makalowski et al., 1996

Percent identity is not uniform
Example: HoxA human-mouse

TWINSCAN
Augmented version of GENSCAN that generates aligned sequences
Fig. 3, Burge and Karlin 1997
Flicek
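
The point that percent identity is not uniform along a human-mouse alignment can be illustrated with a sliding-window calculation. What follows is only a minimal sketch, assuming two sequences that are already aligned to equal length with "-" marking gaps. The function names `percent_identity` and `windowed_identity`, the window size, and the toy sequences are hypothetical and not taken from the slides, and this is not how TWINSCAN, SGP2, or SLAM compute conservation.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of gap-free alignment columns where the two bases agree."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where neither sequence has a gap.
    columns = [(x, y) for x, y in zip(a.upper(), b.upper())
               if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


def windowed_identity(a: str, b: str, window: int = 10, step: int = 5):
    """Return (start, percent identity) for each window along the alignment."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return [
        (start, percent_identity(a[start:start + window],
                                 b[start:start + window]))
        for start in range(0, len(a) - window + 1, step)
    ]


# Toy alignment (hypothetical): a conserved block followed by a diverged block.
human = "ATGGCCAAGC" + "TTTTACGTAG"
mouse = "ATGGCCAAGC" + "TACCAG-TCA"

for start, identity in windowed_identity(human, mouse, window=10, step=10):
    print(f"columns {start}-{start + 9}: {identity:.1f}% identity")
```

On real data the window would be much larger than 10 columns. A single genome-wide average such as the exon and intron figures above hides exactly the local variation that this kind of per-window profile exposes.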