CMU BSC 03711 - lecture

This preview shows page 1-2-3-4-5 out of 16 pages.

Schedule
• Thurs, Nov 30: Gene Finding 2
• Tues, Dec 5: PS5 due in my office at 5pm. Project presentation preparation – no class
• Thurs, Dec 7: PS5 returned in class. Final papers due. Project presentations
• Monday, Dec 18, 8:30am – 11:30am: Final Exam, DH 1211
Online FCEs: through Dec 11

Course topics
• Pairwise sequence alignment (global and local)
• Multiple sequence alignment
  – local
  – global
• Substitution matrices
• Database searching
  – BLAST
• Evolutionary tree reconstruction
• Sequence statistics
• Prokaryotic Gene Finding
• Eukaryotic Gene Finding
• Recent results in gene finding

Credits for slides: Serafim Batzoglou, Marina Alexandersson, Lior Pachter, Sam Gross

Gene criteria
• Open Reading Frames (ORFs) [Computational]
• Sequence features [Computational]
• Sequence conservation [Computational]
• Evidence for transcription [Experimental]
• Gene inactivation induces a phenotype [Experimental]
Snyder and Gerstein, Science 2003

Sequence features
• Coding statistics (e.g. codon bias)
• Gene structure
[Figure: gene structure drawn 5’ to 3’. Labels: open reading frame, promoter region, ribosome binding site, termination sequence, start codon / stop codon, repressor site]

An HMM that finds genes in E. coli
Krogh et al, 1995 (Electronic reserves)
[Figure: HMM architecture. Labels: codon triplets (A A A, T T T, A A C, …), stop codons, start codons, 61 triplet models, intergene model; observed frequencies for E. coli genes]

Outstanding Problems
• Model cannot account for drift in CG content
• Does not take position dependencies into account
• Solution:
  – kth order Markov chain
  – looks back k positions
• Glimmer (Salzberg et al, 1998)
  – Finds 98% of all genes in a bacterial genome.

Prokaryotic vs. Eukaryotic Genes
• Prokaryotes
  – small genomes (0.5Mb to 10Mb)
  – high gene density (90%)
  – no introns (or splicing)
  – no RNA processing
  – simple regulatory regions
  – most long ORFs are genes
• Eukaryotes
  – large genomes
  – low gene density (3% - 50%)
  – intron/exon structure
  – splicing
  – complex regulatory regions
Genomic data: Must handle multiple genes and/or gene fragments in input sequence.

Genscan
Architecture:
• Individual modules: intergenic region, promoter, 5’UTR, exon/intron, post-translation region
• Semi Hidden Markov Model – various length distributions
• Different statistical models for each module:
  – weight matrices + extensions, 3-periodic 5th-order Markov chains
Incorporates:
• Descriptions of transcriptional, translational and splicing signals
• Compositional features of exons, introns, intergenic, C+G regions
Burge and Karlin, 1997

Genscan
Larger predictive scope than previous models
• Partial genes
• Multiple genes separated by intergenic DNA
• Genes on either/both DNA strands
Proposed pipeline
• Screen for repetitive elements
• Predict protein sequences with GENSCAN
• BLAST predictions to find homologs
• Refine using spliced alignment of prediction with homolog (e.g., Gelfand, Mironov, Pevzner, 96)
• Verify experimentally
Burge and Karlin, 1997

GenScan States
• N: intergenic region
• P: promoter
• F: 5’ untranslated region
• Esngl: single exon (intronless) (translation start -> stop codon)
• Einit: initial exon (translation start -> donor splice site)
• Ek: phase k internal exon (acceptor splice site -> donor splice site)
• Eterm: terminal exon (acceptor splice site -> stop codon)
• Ik: phase k intron
• T: 3’ untranslated region
• A: poly-A signal
Fig. 3, Burge and Karlin 1997

Performance measures
Nucleotide Level
[Figure: reality vs. prediction along a sequence; each nucleotide is a true positive, true negative, false positive or false negative]
Sn = TP / (TP + FN)
Burset & Guigo, 1996

Performance measures
Exon Level
• both edges must be correctly aligned
[Figure: reality vs. prediction; examples of a correct exon, a missed exon and a wrong exon]
Sn = correct exons / actual exons
Sp = correct exons / predicted exons
Burset & Guigo, 1996

Gene prediction performance

             Nucl Sn  Exon Sn  Exon Sp  Missing Exons  Wrong Exons
Genscan      0.93     0.78     0.81     0.09           0.05
GENEPARSER3  0.86     0.56     0.58     0.14           0.09
GENEID+      0.91     0.73     0.70     0.07           0.13
FGENEH       0.77     0.61     0.64     0.15           0.12

Genes with all exons correctly predicted by Genscan: 43%
Data set: Each sequence contains exactly one gene.
Burset & Guigo, 1996; Burge and Karlin, 1997

Gene prediction performance with more challenging benchmarks
h178: Single gene data
• 178 sequences
• 1 gene/sequence
• Nucleotides in gene regions: 53%
• Coding nucleotides: 21%
Gen178: Semi-artificial genomes*
• 42 sequences
• 4.1 genes/sequence
• Nucleotides in genic regions: 8.6%
• Coding nucleotides: 2.3%
* Multiple genes interspersed with random sequence
Guigo et al, 2000.

Genscan performance
• Correct genes: 10% - 15%
• Gen178 does not contain repeats, pseudogenes, huge introns, ... Results are probably still overly optimistic
• A lot of room for improvement...

         Nucl Sn  Exon Sn  Exon Sp  Missing Exons  Wrong Exons
h178     0.93     0.78     0.75     0.08           0.10
Gen178   0.89     0.64     0.44     0.14           0.41

Guigo et al, 2000.

Innovations in gene prediction since 2000
→ Spliced alignment with proteins or ESTs
  – Genewise, Procrustes
• Dual-genome predictors
  – SLAM, TWINSCAN, SGP2
• Multi-genome predictors
  – PhyloHMMs (Exoniphy), NSCAN
• Also, better models of gene features (e.g., splice sites, UTRs) and better identification of pseudogenes.

Spliced alignments
Procrustes (Gelfand et al, 96), Genewise (Birney & Durbin, 97)
Align genomic DNA with proteins or cDNAs
[Figure: cDNA aligned to genomic DNA, 5’ to 3’; Exon1, Exon2 and Exon3 in the genomic DNA are separated by Intron1 and Intron2]

Spliced alignments
Procrustes (Gelfand et al, 96), Genewise (Birney & Durbin, 97)
Methods include gene feature models, e.g., splice sites, frameshifts, penalise stop codons

CTCATGAGGTGAGgtgaatagt......cgtaattagGTCTTCTGGGGCCA
|||||||||||||       <-15907->        |||||||||||||
CTCATGAGGTGAG________________________GTCTTCTGGGGCCA

Spliced alignment methods are
• More accurate for known genes
• Less accurate for unknown genes

Innovations in gene prediction since 2000
• Spliced alignment with proteins or ESTs
  – Genewise, Procrustes
→ Dual-genome predictors
  – SLAM, TWINSCAN, SGP2
• Multi-genome predictors
  – PhyloHMMs (Exoniphy), NSCAN

Dual genome predictors
• TWINSCAN (Brent), SGP2 (Guigo)
  – Predict genes in pairwise alignments
• SLAM (Pachter)
  – Simultaneous alignment and gene prediction
[Figure labels: Human, Mouse]

Human-mouse homology
Comparison of 1196 orthologous genes
• Sequence identity between genes in human/mouse
  – exons: 84.6%
  – introns: 35%
  – 5’ UTRs: 67%
  – 3’ UTRs: 69%
Makalowski et al., 1996

Percent identity is not uniform
Example: HoxA human-mouse

TWINSCAN
Augmented version of GENSCAN that generates aligned sequences
Fig. 3, Burge and Karlin 1997
Flicek
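
The human-mouse identity figures and the remark that percent identity is not uniform can be illustrated with a small sketch. This is not the method of Makalowski et al.; it is a minimal Python illustration that assumes the two sequences are already aligned rows of equal length, with "-" marking gaps, and that gap columns are simply skipped (one of several possible conventions). The function names `percent_identity` and `windowed_identity` are hypothetical.

```python
from typing import List, Tuple


def percent_identity(row_a: str, row_b: str) -> float:
    """Percent of compared alignment columns in which both rows agree.

    Assumption: columns containing a gap ('-') in either row are skipped.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(row_a.upper(), row_b.upper()):
        if x == "-" or y == "-":
            continue  # gap column: not counted as match or mismatch
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def windowed_identity(row_a: str, row_b: str,
                      window: int, step: int) -> List[Tuple[int, float]]:
    """Percent identity in windows of alignment columns.

    Returns (window start column, percent identity) pairs, which makes
    local variation in conservation visible.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    return [
        (start, percent_identity(row_a[start:start + window],
                                 row_b[start:start + window]))
        for start in range(0, len(row_a) - window + 1, step)
    ]


if __name__ == "__main__":
    # Toy alignment: a conserved block followed by a diverged block.
    a = "ATGGCCAAGTTTTTTTT"
    b = "ATGGCCAAGCACGACGA"
    print(percent_identity(a, b))
    print(windowed_identity(a, b, window=8, step=8))
```

On real data the windows would be placed over annotated exons, introns and UTRs, so that per-feature identities like those on the slide could be compared.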
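
The nucleotide-level and exon-level performance measures from the Burset & Guigo slides can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used in the cited benchmarks. It assumes exons are given as half-open `(start, end)` coordinate pairs on one strand. The slides give only Sn at the nucleotide level; the nucleotide-level Sp here uses TP / (TP + FP), the convention commonly used in gene-finding evaluation, and is labelled as an assumption. The function names are hypothetical.

```python
from typing import List, Set, Tuple

Exon = Tuple[int, int]  # half-open interval [start, end); an assumed representation


def _positions(exons: List[Exon]) -> Set[int]:
    """Set of nucleotide positions covered by a list of exons."""
    covered: Set[int] = set()
    for start, end in exons:
        covered.update(range(start, end))
    return covered


def nucleotide_sn_sp(actual: List[Exon], predicted: List[Exon]) -> Tuple[float, float]:
    """Nucleotide-level Sn = TP / (TP + FN) and Sp = TP / (TP + FP).

    The Sp formula is an assumption; the slide text shows only Sn.
    """
    real = _positions(actual)
    pred = _positions(predicted)
    tp = len(real & pred)   # coding and predicted coding
    fn = len(real - pred)   # coding but not predicted
    fp = len(pred - real)   # predicted but not coding
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp


def exon_sn_sp(actual: List[Exon], predicted: List[Exon]) -> Tuple[float, float]:
    """Exon-level Sn = correct / actual and Sp = correct / predicted.

    An exon counts as correct only if both edges match exactly.
    """
    real = set(actual)
    pred = set(predicted)
    correct = len(real & pred)
    sn = correct / len(real) if real else 0.0
    sp = correct / len(pred) if pred else 0.0
    return sn, sp


if __name__ == "__main__":
    actual = [(0, 10), (20, 30)]
    predicted = [(0, 10), (22, 30), (40, 45)]  # one exact, one shifted edge, one wrong
    print(nucleotide_sn_sp(actual, predicted))
    print(exon_sn_sp(actual, predicted))
```

In the toy example the second predicted exon overlaps a real exon but has a shifted left edge, so it contributes true-positive nucleotides while counting as incorrect at the exon level. This is why exon-level scores in the tables are lower than nucleotide-level scores.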

