• Tues, Nov 29: Gene Finding 1
• Thurs, Dec 1: Gene Finding 2
• Tues, Dec 6: PS5 due; Project presentations 1 (see course web site for schedule)
• Thurs, Dec 8: Final papers due; Project presentations 2
• Monday, Dec 19, 1pm - 4pm: Final Exam, Room: HH B131
• Online FCEs: through Dec 12

• Pairwise sequence alignment (global and local)
• Multiple sequence alignment
– local
– global
• Substitution matrices
• Database searching
– BLAST
• Evolutionary tree reconstruction
• Sequence statistics
• Prokaryotic Gene Finding
• Eukaryotic Gene Finding

Outline
• Recap: Prokaryotic gene finding
• Eukaryotic gene finding
• The human gene complement
• Regulation

Gene Finding Questions
• Identify protein coding region
• Identify Open Reading Frame
• Predict mRNA (including UTRs)
• Predict intron/exon structure (Eukaryotes only)
• Regulatory signals
• Protein sequence

Gene criteria
• Open Reading Frames (ORFs): Computational
• Sequence features: Computational
• Sequence conservation: Computational
• Evidence for transcription: Experimental
• Gene inactivation induces a phenotype: Experimental
Snyder and Gerstein, Science 2003

Sequence features
• Coding statistics (e.g. codon bias)
• Gene structure
[Figure labels: Open Reading Frame, Promoter region, Ribosome binding site, 5’, 3’, Termination sequence, Start codon/Stop codon, Repressor site]

An HMM that finds genes in E. coli
Krogh et al., 1995 (Electronic reserves)
[Figure labels: start codons, stop codons, 61 triplet models, intergene model, observed frequencies for E. coli genes]

Outstanding Problems
• Model cannot account for drift in CG content
• Does not take position dependencies into account
• Solution:
– kth order Markov chain
– looks back k positions
• Glimmer (Salzberg et al., 1998)
– Finds 98% of all genes in a bacterial genome.

Some Problems
• Overlapping genes
aggcctatgacgcctctcccagcatgggcctgaggctcctgtcccccactagtggcctgct
• Alternate splicing
[Figure: alternative transcripts built from different combinations of exon1 through exon6]
• Pseudogenes
gcctatgacgcctctcccagcatgggcctgaggctcctgtcccccactagtggcctgctcc
gcctatgacgcctctcccagcatgagcctgaggctcctgtcccccactagtggcctgctcc
gcctatgacgcctctcccagcatgagcctgaggctcctgtcccccactagtggcctgctcc
Snyder and Gerstein, Science 2003

Gene Finding Challenges
• Small protein-coding genes (<100 aa’s)
• Non-protein-coding RNA genes
• Regulatory regions
• Genes with sparse conserved positions and little sequence similarity; e.g., beta-defensins
Salzberg, Nature, 2003
Schutte et al., PNAS, 2001

Prokaryotic vs. Eukaryotic Genes
• Prokaryotes
– small genomes (0.5Mb to 10Mb)
– high gene density (90%)
– no introns (or splicing)
– no RNA processing
– simple regulatory regions
– most long ORFs are genes
• Eukaryotes
– large genomes
– low gene density (3% - 50%)
– intron/exon structure
– splicing
– complex regulatory regions
Genomic data: Must handle multiple genes and/or gene fragments in input sequence.
Source: http://www.nslij-genetics.org/gene

Genome statistics
Organism         Size      Gene number   Density (1 gene per N bp)
Human            3300Mb    30K           100,000
Fly              180Mb     13.6K         9000
C. elegans       97Mb      19.1K         5000
Yeast            12Mb      6.3K          2000
E. coli          4.8Mb     3.2K          1400
H. influenzae    1.8Mb     1.7K          1000
http://www.ornl.gov/TechResources/Human_Genome/faq/compgen.html

Typical human gene sizes
Average gene length    30kb
Coding region          1-2kb
Exon length            150 - 200 bp
Exon count             5-6
Single exon genes      8%

GENSCAN
Architecture:
• Individual modules: intergenic region, promoter, 5’ UTR, exon/intron, post-translation region
• Semi Hidden Markov Model: various length distributions
• Different statistical models for each module:
– weight matrices + extensions, 3-periodic 5th-order Markov chains
Incorporates:
• Descriptions of transcriptional, translational and splicing signals
• Compositional features of exons, introns, intergenic, C+G regions
Burge and Karlin, 1997

GENSCAN
Larger predictive scope than previous models
• Partial genes
• Multiple genes separated by intergenic DNA
• Genes on either/both DNA strands
Proposed pipeline
• Screen for repetitive elements
• Predict protein sequences with GENSCAN
• BLAST predictions to find homologs
• Refine using spliced alignment of prediction with homolog (e.g., Gelfand, Mironov, Pevzner, 96)
• Verify experimentally
Burge and Karlin, 1997

GENSCAN States
• N: intergenic region
• P: promoter
• F: 5’ untranslated region
• Esngl: single exon (intronless) (translation start -> stop codon)
• Einit: initial exon (translation start -> donor splice site)
• Ek: phase k internal exon (acceptor splice site -> donor splice site)
• Eterm: terminal exon (acceptor splice site -> stop codon)
• Ik: phase k intron
• T: 3’ untranslated region
• A: poly-A signal
Fig. 3, Burge and Karlin 1997

How to model sequences with lengths that are not geometrically distributed?
[Figure: CODON model with a self-transition of probability 1 - p and a transition of probability p to the stop codons]
Resulting exon length distribution:
$P(l) = p\,(1-p)^{l-1}$

Semi-hidden Markov model
• Set of states: Q1, Q2, …
• Transition matrix P(Q(t)|Q(t-1))
• Initial distribution P(Q(0))
• Each state has
– a length distribution
– a sequence generating model
• Emission: each state emits a sequence of length d, where d is drawn from the state’s length distribution and the sequence is drawn from the state’s sequence generating model

Semi-hidden Markov model cont’d
• A parse φ of length L is
– A state sequence: Q1, Q2, …
– A sequence of lengths: d1, d2, d3, …
• An observed sequence, s, is scored using a modified Viterbi algorithm
$\varphi_{opt} = \arg\max_{\varphi} P(\varphi \mid s)$

GENSCAN Training Set
• 2.5M base pairs
• 142 Single Exon Genes (SEGs)
• 238 multi-exon genes
– 1492 Exons
– 1254 Introns
• An additional 1619 coding sequences (no introns)
• Promoter model based on published sources.

Initial and transition probabilities
Trained separately for four categories of G+C content
– < 43% (G+C)
– 43% - 51% (G+C)
– 51% - 57% (G+C)
– > 57% (G+C)
• Gene
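As a worked illustration of the CODON model's length distribution P(l) = p(1-p)^(l-1), the short sketch below tabulates it for one value of p. The value p = 3/64 is an assumption made only for illustration (three stop codons out of 64 equally likely triplets); it is not a figure from the slides, and the function and variable names are hypothetical.

```python
def geometric_length_pmf(length, p_stop):
    """P(l) = p (1 - p)^(l - 1): the first stop codon appears as codon number l."""
    if length < 1:
        return 0.0
    return p_stop * (1.0 - p_stop) ** (length - 1)

# Assumption for illustration only: 3 of 64 equally likely codons are stops.
P_STOP = 3 / 64

lengths = list(range(1, 2001))
pmf = [geometric_length_pmf(l, P_STOP) for l in lengths]

# A geometric distribution decreases monotonically, so its mode is always 1.
mode_length = max(lengths, key=lambda l: geometric_length_pmf(l, P_STOP))

# The mean of a geometric distribution is 1 / p (here, in codons).
mean_length = 1 / P_STOP

print("most probable length (codons):", mode_length)
print("mean length (codons):", round(mean_length, 2))
print("P(l = 1), P(l = 50):", pmf[0], pmf[49])
```

Whatever p is chosen, the most probable length under this model is a single codon and longer lengths only become less likely. A state with a plain self-transition cannot produce a length distribution that peaks away from 1, which is the motivation for giving each state its own explicit length distribution in the semi-hidden Markov model.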
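The modified Viterbi algorithm for a semi-hidden Markov model can be sketched as a dynamic program over (end position, state) that also maximizes over the length d of the last segment. Since P(s) does not depend on the parse, maximizing P(φ | s) is the same as maximizing the joint probability of parse and sequence. The code below is a generic explicit-duration Viterbi under stated assumptions, not GENSCAN's implementation: the two states "N" and "C", their emission tables, the length distributions, and all function names are hypothetical toy choices.

```python
import math

NEG_INF = -math.inf

def semi_markov_viterbi(seq, states, log_init, log_trans, log_length, log_emit, max_len):
    """Return (best log joint probability, parse) with parse = [(state, length), ...]."""
    n = len(seq)
    # best[end][q]: best log score of a parse of seq[:end] whose last segment is in state q
    best = [dict.fromkeys(states, NEG_INF) for _ in range(n + 1)]
    back = [dict.fromkeys(states) for _ in range(n + 1)]
    for end in range(1, n + 1):
        for q in states:
            for d in range(1, min(max_len, end) + 1):
                start = end - d
                seg = log_length(q, d) + log_emit(q, seq[start:end])
                if seg == NEG_INF:
                    continue
                if start == 0:
                    candidates = [(log_init[q] + seg, None)]
                else:
                    # Missing transitions (including self-transitions) count as impossible.
                    candidates = [
                        (best[start][r] + log_trans[r].get(q, NEG_INF) + seg, r)
                        for r in states
                    ]
                for score, prev in candidates:
                    if score > best[end][q]:
                        best[end][q] = score
                        back[end][q] = (d, prev)
    last = max(states, key=lambda q: best[n][q])
    if best[n][last] == NEG_INF:
        return NEG_INF, []  # no parse has nonzero probability
    parse, end, q = [], n, last
    while q is not None:
        d, prev = back[end][q]
        parse.append((q, d))
        end, q = end - d, prev
    parse.reverse()
    return best[n][last], parse

# Toy model (all numbers are illustrative assumptions).
EMIT = {
    "N": {base: 0.25 for base in "ACGT"},               # noncoding: uniform bases
    "C": {"A": 0.05, "C": 0.45, "G": 0.45, "T": 0.05},  # "coding": G+C rich
}

def log_emit(state, segment):
    return sum(math.log(EMIT[state][base]) for base in segment)

def log_length(state, d):
    if state == "N":
        return d * math.log(0.5)        # geometric lengths: P(d) = 0.5^d
    return 0.0 if d == 6 else NEG_INF   # "C" segments are exactly 6 bases long

log_init = {"N": math.log(0.5), "C": math.log(0.5)}
log_trans = {"N": {"C": 0.0}, "C": {"N": 0.0}}  # states must alternate

score, parse = semi_markov_viterbi(
    "ATATGCGCGCTATA", ["N", "C"], log_init, log_trans, log_length, log_emit, max_len=10
)
print(parse)  # [('N', 4), ('C', 6), ('N', 4)]
```

Self-transitions are left out of the transition table on purpose: how long a state persists is decided by its length distribution rather than by repeated self-loops. The running time grows with sequence length times max_len times the square of the number of states, which is why practical gene finders bound or restrict the segment lengths they consider.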