CMU BSC 03711 - lecture - D720708

School: Carnegie Mellon University

Course: BSC 03711 - Computational Molecular Biology and Genomics

Pages: 16

Schedule

• Thurs, Nov 30: Gene Finding 2
• Tues, Dec 5: PS5 due in my office at 5pm. Project presentation preparation – no class
• Thurs, Dec 7: PS5 returned in class. Final papers due. Project presentations
• Monday, Dec 18, 8:30am – 11:30am: Final Exam, DH 1211

Online FCE’s: Thru Dec 11

Course topics

• Pairwise sequence alignment (global and local)
• Multiple sequence alignment (local, global)
• Substitution matrices
• Database searching
• BLAST
• Evolutionary tree reconstruction
• Sequence statistics
• Prokaryotic Gene Finding
• Eukaryotic Gene Finding
• Recent results in gene finding

Credits for slides: Serafim Batzoglou, Marina Alexandersson, Lior Pachter, Sam Gross

Gene criteria

• Open Reading Frames (ORFs) [Computational]
• Sequence features [Computational]
• Sequence conservation [Computational]
• Evidence for transcription [Experimental]
• Gene inactivation induces a phenotype [Experimental]

Snyder and Gerstein, Science 2003

Sequence features

• Coding statistics (e.g. codon bias)
• Gene structure

[Figure: gene structure, 5’ to 3’. Labels: Promoter region, Repressor site, Ribosome binding site, Start codon / Stop codon, Open Reading Frame, Termination sequence]

An HMM that finds genes in E. coli

Krogh et al, 1995 (Electronic reserves)

[Figure: HMM architecture. Labels: start codons, stop codons, 61 triplet models (A A A, T T T, A A C, …), intergene model, observed frequencies for E. coli genes]

Outstanding Problems

• Model cannot account for drift in CG content
• Does not take position dependencies into account
• Solution:
  – kth order Markov chain
  – looks back k positions
• Glimmer (Salzberg et al, 1998)
  – Finds 98% of all genes in a bacterial genome.

Prokaryotic vs. Eukaryotic Genes

• Prokaryotes
  – small genomes (0.5Mb to 10Mb)
  – high gene density (90%)
  – no introns (or splicing)
  – no RNA processing
  – simple regulatory regions
  – most long ORF’s are genes
• Eukaryotes
  – large genomes
  – low gene density (3% - 50%)
  – intron/exon structure
  – splicing
  – complex regulatory regions

Genomic data: Must handle multiple genes and/or gene fragments in input sequence.

Genscan

Architecture:
• Individual modules: intergenic region, promoter, 5’ UTR, exon/intron, post-translation region
• Semi Hidden Markov Model – various length distributions
• Different statistical models for each module:
  – weight matrices + extensions, 3-periodic 5th order Markov chains

Incorporates:
• Descriptions of transcriptional, translational and splicing signals
• Compositional features of exons, introns, intergenic, C+G regions

Burge and Karlin, 1997

Genscan

Larger predictive scope than previous models
• Partial genes
• Multiple genes separated by intergenic DNA
• Genes on either/both DNA strands

Proposed pipeline
• Screen for repetitive elements
• Predict protein sequences with GENSCAN
• BLAST predictions to find homologs
• Refine using spliced alignment of prediction with homolog (e.g., Gelfand, Mironov, Pevzner, 96)
• Verify experimentally

Burge and Karlin, 1997

GenScan States

• N: intergenic region
• P: promoter
• F: 5’ untranslated region
• Esngl: single exon (intronless) (translation start -> stop codon)
• Einit: initial exon (translation start -> donor splice site)
• Ek: phase k internal exon (acceptor splice site -> donor splice site)
• Eterm: terminal exon (acceptor splice site -> stop codon)
• Ik: phase k intron
• T: 3’ untranslated region
• A: poly-A signal

Fig. 3, Burge and Karlin 1997

Performance measures

Burset & Guigo, 1996

Nucleotide Level

[Figure: reality vs. prediction along a sequence, with regions labelled true negative, true positive, false negative, false positive]

$Sn = \frac{TP}{TP + FN}$

Exon Level

Both edges must be correctly aligned.

[Figure: reality vs. prediction, with examples of a wrong exon, a missed exon, and a correct exon]

$Sn = \frac{\text{correct exons}}{\text{actual exons}}$

$Sp = \frac{\text{correct exons}}{\text{predicted exons}}$

Gene prediction performance

              Nucl Sn   Exon Sn   Exon Sp   Missing Exons   Wrong Exons
FGENEH        0.77      0.61      0.64      0.15            0.12
GENEID+       0.91      0.73      0.70      0.07            0.13
GENEPARSER3   0.86      0.56      0.58      0.14            0.09
Genscan       0.93      0.78      0.81      0.09            0.05

Genes with all exons correctly predicted by Genscan: 43%

Burset & Guigo, 1996; Burge and Karlin, 1997

Data set: Each sequence contains exactly one gene.

Gene prediction performance with more challenging benchmarks

h178: Single gene data
• 178 sequences
• 1 gene/sequence
• Nucleotides in gene regions: 53%
• Coding nucleotides: 21%

Gen178: Semi-artificial genomes*
• 42 sequences
• 4.1 genes/sequence
• Nucleotides in genic regions: 8.6%
• Coding nucleotides: 2.3%

* Multiple genes interspersed with random sequence

Guigo et al, 2000.

Genscan performance

• Correct genes: 10% - 15%
• Gen178 does not contain repeats, pseudogenes, huge introns with huge introns, ... Results are probably still overly optimistic
• A lot of room for improvement...

              Nucl Sn   Exon Sn   Exon Sp   Missing Exons   Wrong Exons
h178          0.93      0.78      0.75      0.08            0.10
Gen178        0.89      0.64      0.44      0.14            0.41

Guigo et al, 2000.

Innovations in gene prediction since 2000

→ Spliced alignment with proteins or ESTs
  – Genewise, Procrustes
• Dual-genome predictors
  – SLAM, TWINSCAN, SGP2
• Multi-genome predictors
  – PhyloHMMs (Exoniphy), NSCAN
• Also, better models of gene features (e.g., splice sites, UTRs) and better identification of pseudogenes.

Spliced alignments

Procrustes (Gelfand et al, 96), Genewise (Birney & Durbin, 97)

Align genomic DNA with proteins or cDNAs

[Figure: cDNA aligned to genomic DNA, 5’ to 3’. Labels: Exon1, Intron1, Exon2, Intron2, Exon3]

Methods include gene feature models, e.g., splice sites, frameshifts, penalise stop codons

CTCATGAGGTGAGgtgaatagt......cgtaattagGTCTTCTGGGGCCA
|||||||||||||       <-15907->        |||||||||||||
CTCATGAGGTGAG________________________GTCTTCTGGGGCCA

Spliced alignment methods are
• More accurate for known genes
• Less accurate for unknown genes

Innovations in gene prediction since 2000

• Spliced alignment with proteins or ESTs
  – Genewise, Procrustes
→ Dual-genome predictors
  – SLAM, TWINSCAN, SGP2
• Multi-genome predictors
  – PhyloHMMs (Exoniphy), NSCAN

Dual genome predictors

• TWINSCAN (Brent), SGP2 (Guigo)
  – Predict genes in pairwise alignments
• SLAM (Pachter)
  – Simultaneous alignment and gene prediction

[Figure: Human, Mouse]

Human-mouse homology

Comparison of 1196 orthologous genes
• Sequence identity between genes in human/mouse
  – exons: 84.6%
  – introns: 35%
  – 5’ UTRs: 67%
  – 3’ UTRs: 69%

Makalowski et al., 1996

Percent identity is not uniform

Example: HoxA human-mouse

TWINSCAN

Augmented version of GENSCAN that generates aligned sequences

Fig. 3, Burge and Karlin 1997

Flicek
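
The nucleotide-level and exon-level measures from the Performance measures slides (Burset & Guigo, 1996) can be sketched in Python as follows. This is only an illustration, not code from the lecture: the function names, the half-open (start, end) exon representation, and the toy coordinates are all assumptions made for the example. An exon counts as correct only when both of its edges match the annotation exactly.

```python
def coding_positions(exons):
    """Expand half-open (start, end) exon intervals into a set of positions."""
    return {p for start, end in exons for p in range(start, end)}


def nucleotide_sensitivity(actual_exons, predicted_exons):
    """Sn = TP / (TP + FN), counted over individual coding nucleotides."""
    actual = coding_positions(actual_exons)
    predicted = coding_positions(predicted_exons)
    tp = len(actual & predicted)   # coding in reality and in the prediction
    fn = len(actual - predicted)   # coding in reality, missed by the prediction
    return tp / (tp + fn) if (tp + fn) else 0.0


def exon_sn_sp(actual_exons, predicted_exons):
    """Exon-level Sn and Sp; an exon is correct only if both edges match."""
    actual = set(actual_exons)
    predicted = set(predicted_exons)
    correct = actual & predicted
    sn = len(correct) / len(actual) if actual else 0.0
    sp = len(correct) / len(predicted) if predicted else 0.0
    return sn, sp


# Hypothetical toy annotation and prediction (coordinates are made up).
actual = [(10, 50), (100, 150), (200, 260)]
predicted = [(10, 50), (100, 140), (300, 320)]

print("Nucleotide Sn:", nucleotide_sensitivity(actual, predicted))
print("Exon Sn, Sp:", exon_sn_sp(actual, predicted))
```

In the toy data the second predicted exon overlaps the real one but has the wrong right edge, so it adds to nucleotide-level sensitivity while counting as incorrect at the exon level. This gap between the two levels is also visible in the tables above, where Nucl Sn is consistently higher than Exon Sn.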

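The "kth order Markov chain" listed as a solution under Outstanding Problems conditions each base on the k bases before it. Below is a minimal Python sketch of that idea. The add-one pseudocount, the function names, and the toy training sequence are assumptions chosen for illustration; this is not the Krogh et al. model and not Glimmer's implementation.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(sequences, k, pseudocount=1.0):
    """Estimate P(base | previous k bases) from training sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(k, len(seq)):
            context = seq[i - k:i]          # the k positions we look back on
            counts[context][seq[i]] += 1.0

    def prob(context, base):
        # Pseudocounts (an assumption here) keep unseen contexts non-zero.
        seen = counts.get(context, {})
        total = sum(seen.values()) + pseudocount * len(ALPHABET)
        return (seen.get(base, 0.0) + pseudocount) / total

    return prob


def log_likelihood(seq, prob, k):
    """Log-probability of seq under the model, skipping the first k bases."""
    return sum(math.log(prob(seq[i - k:i], seq[i])) for i in range(k, len(seq)))


# Hypothetical toy training data: one strongly periodic sequence.
k = 2
model = train_markov(["ACGACGACGACG"], k)

print(log_likelihood("ACGACG", model, k))   # resembles the training data
print(log_likelihood("ACTTCA", model, k))   # does not
```

A gene finder along these lines would train one such model on known coding sequence and another on non-coding sequence, then compare the two log-likelihoods for a candidate region. Larger k captures more position dependence but needs more training data per context.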