
CMU BSC 03711 - Gene Finding2


• Tues, Nov 29: Gene Finding 1
• Thurs, Dec 1: Gene Finding 2
• Tues, Dec 6: PS5 due; Project presentations 1 (see course web site for schedule)
• Thurs, Dec 8: Final papers due; Project presentations 2
• Monday, Dec 19, 1pm - 4pm: Final Exam, Room: HH B131
Online FCEs: through Dec 12

[Figure: course topic map. Pairwise sequence alignment (global and local); Multiple sequence alignment (local, global); Substitution matrices; Database searching (BLAST); Evolutionary tree reconstruction; Sequence statistics; Prokaryotic Gene Finding; Eukaryotic Gene Finding]

Outline
• Recap: Prokaryotic gene finding
• Eukaryotic gene finding
• The human gene complement
• Regulation

Gene Finding Questions
• Identify protein coding region
• Identify Open Reading Frame
• Predict mRNA (including UTRs)
• Predict intron/exon structure (eukaryotes only)
• Regulatory signals
• Protein sequence

Gene criteria
• Open Reading Frames (ORFs): computational
• Sequence features: computational
• Sequence conservation: computational
• Evidence for transcription: experimental
• Gene inactivation induces a phenotype: experimental
Snyder and Gerstein, Science 2003

Sequence features
• Coding statistics (e.g. codon bias)
• Gene structure
[Figure: gene structure, 5’ to 3’. Labels: Promoter region, Repressor site, Ribosome binding site, Start codon/Stop codon, Open Reading Frame, Termination sequence]

An HMM that finds genes in E. coli
Krogh et al., 1995 (Electronic reserves)
[Figure: HMM with start codons, 61 triplet models (AAA, TTT, AAC, …), stop codons, and an intergene model; observed frequencies for E. coli genes]

Outstanding Problems
• Model cannot account for drift in CG content
• Does not take position dependencies into account
• Solution:
  – kth order Markov chain
  – looks back k positions
• Glimmer (Salzberg et al., 1998)
  – Finds 98% of all genes in a bacterial genome.

Some Problems
• Overlapping genes
  aggcctatgacgcctctcccagcatgggcctgaggctcctgtcccccactagtggcctgct
• Alternate splicing
  [Figure: alternative splice forms assembled from exon1 through exon6]
• Pseudogenes
  gcctatgacgcctctcccagcatgggcctgaggctcctgtcccccactagtggcctgctcc
  gcctatgacgcctctcccagcatgagcctgaggctcctgtcccccactagtggcctgctcc
  gcctatgacgcctctcccagcatgagcctgaggctcctgtcccccactagtggcctgctcc
Snyder and Gerstein, Science 2003

Gene Finding Challenges
• Small protein-coding genes (<100 aa)
• Non-protein-coding RNA genes
• Regulatory regions
• Genes with sparse conserved positions and little sequence similarity; e.g., beta-defensins
Salzberg, Nature, 2003
Schutte et al., PNAS, 2001

Prokaryotic vs. Eukaryotic Genes
• Prokaryotes
  – small genomes (0.5Mb to 10Mb)
  – high gene density (90%)
  – no introns (or splicing)
  – no RNA processing
  – simple regulatory regions
  – most long ORFs are genes
• Eukaryotes
  – large genomes
  – low gene density (3% - 50%)
  – intron/exon structure
  – splicing
  – complex regulatory regions
Genomic data: Must handle multiple genes and/or gene fragments in input sequence.
Source: http://www.nslij-genetics.org/gene

Genome statistics
                 Size      Gene number   Density (1 gene per N bp)
Human            3300Mb    30K           100,000
Fly              180Mb     13.6K         9000
C. elegans       97Mb      19.1K         5000
Yeast            12Mb      6.3K          2000
E. coli          4.8Mb     3.2K          1400
H. influenzae    1.8Mb     1.7K          1000
http://www.ornl.gov/TechResources/Human_Genome/faq/compgen.html

Typical human gene sizes
Average gene length    30kb
Coding region          1-2kb
Exon length            150 - 200 bp
Exon count             5-6
Single exon genes      8%

Genscan
Architecture:
• Individual modules: intergenic region, promoter, 5’UTR, exon/intron, post-translation region
• Semi Hidden Markov Model: various length distributions
• Different statistical models for each module:
  – weight matrices + extensions, 3-periodic 5th order Markov chains
Incorporates:
• Descriptions of transcriptional, translational and splicing signals
• Compositional features of exons, introns, intergenic, C+G regions
Burge and Karlin, 1997

Genscan
Larger predictive scope than previous models
• Partial genes
• Multiple genes separated by intergenic DNA
• Genes on either/both DNA strands
Proposed pipeline
• Screen for repetitive elements
• Predict protein sequences with GENSCAN
• BLAST predictions to find homologs
• Refine using spliced alignment of prediction with homolog (e.g., Gelfand, Mironov, Pevzner, 96)
• Verify experimentally
Burge and Karlin, 1997

GenScan States
• N: intergenic region
• P: promoter
• F: 5’ untranslated region
• Esngl: single exon (intronless) (translation start -> stop codon)
• Einit: initial exon (translation start -> donor splice site)
• Ek: phase k internal exon (acceptor splice site -> donor splice site)
• Eterm: terminal exon (acceptor splice site -> stop codon)
• Ik: phase k intron
• T: 3’ untranslated region
• A: poly-A signal
Fig. 3, Burge and Karlin 1997

How to model sequences with lengths that are not geometrically distributed?
CODON model
[Figure: codon state with self-transition probability p and a transition to the stop codons with probability 1 - p]
Resulting exon length distribution:
$P(l) = p^{l-1}(1 - p)$

Semi-hidden Markov model
• Set of states: Q1, Q2, …
• Transition matrix P(Q(t)|Q(t-1))
• Initial distribution P(Q(0))
• Each state has
  – a length distribution
  – a sequence generating model
• Emission: Each state emits a sequence, according to a particular distribution, of length d, according to a particular length frequency distribution

Semi-hidden Markov model cont’d
• A parse φ of length L is
  – A state sequence: Q1, Q2, …
  – A sequence of lengths: d1, d2, d3, …
• An observed sequence, s, is scored using a modified Viterbi algorithm
  $\phi_{\mathrm{opt}} = \arg\max_{\phi} P(\phi \mid s)$

GenScan Training Set
2.5M base pairs
142 Single Exon Genes (SEGs)
238 multi-exon genes
1492 exons
1254 introns
An additional 1619 coding sequences (no introns)
Promoter model based on published sources.

Initial and transition probabilities
Trained separately for four categories of G+C content
– < 43% (G+C)
– 43% - 51% (G+C)
– 51% - 57% (G+C)
– > 57% (G+C)
• Gene
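
The "Identify Open Reading Frame" question, together with the observation that most long prokaryotic ORFs are genes, can be illustrated with a minimal ORF scan. This is a sketch under stated assumptions, not the method of any tool named in the slides: it scans the forward strand only, uses the standard stop codons TAA, TAG and TGA with ATG as the only start codon, and the function name `find_orfs` and the `min_codons` threshold are hypothetical choices made for illustration.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code
START_CODON = "ATG"                  # assumption: ATG is the only start codon


def find_orfs(seq: str, min_codons: int = 100) -> List[Tuple[int, int, int]]:
    """Return (start, end, frame) for forward-strand ORFs.

    start is the index of the A in ATG, end is one past the stop codon,
    and frame is start % 3. Only ORFs with at least `min_codons` codons
    (start codon included, stop codon excluded) are reported.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


# Toy usage: one short ORF (ATG AAA TTT TAG) in frame 2.
print(find_orfs("CCATGAAATTTTAGGG", min_codons=3))  # prints [(2, 14, 2)]
```

The default threshold of 100 codons mirrors the "small protein-coding genes (<100 aa)" challenge listed in the slides: a length cutoff of this kind will miss exactly those genes.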
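
The CODON model slide gives the exon length distribution $P(l) = p^{l-1}(1 - p)$, which is a geometric distribution. The sketch below evaluates that formula and shows why it motivates the semi-hidden Markov model: the probability is largest at l = 1 and only decreases from there, so a single self-looping state cannot produce a length distribution that peaks at a typical exon length. The function names and the example value of p are assumptions for illustration.

```python
def geometric_length_pmf(length: int, p: float) -> float:
    """P(l) = p**(l - 1) * (1 - p): stay in the codon state l - 1 times, then leave."""
    if length < 1:
        return 0.0
    return p ** (length - 1) * (1.0 - p)


def mean_length(p: float) -> float:
    """Expected length of a geometric distribution with self-transition probability p."""
    return 1.0 / (1.0 - p)


p = 0.98  # hypothetical self-transition probability, giving a mean of about 50 codons
for l in (1, 10, 50, 100):
    print(l, geometric_length_pmf(l, p))
print("mean length:", mean_length(p))
```

Whatever value of p is chosen, the printed probabilities fall monotonically with l. Tuning p changes the mean but never moves the mode away from 1.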
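
The semi-hidden Markov model slides describe a parse as a state sequence plus a sequence of lengths, scored with a modified Viterbi algorithm. The following is a minimal sketch of one such dynamic program, assuming explicit per-state length tables and independent per-base emissions. It is not GENSCAN's implementation, and the two-state toy model, the function name `segmental_viterbi`, and all probabilities are hypothetical. Because P(s) does not depend on the parse, maximizing the joint probability of parse and sequence gives the same argmax as maximizing P(φ | s).

```python
import math

NEG_INF = float("-inf")


def safe_log(x: float) -> float:
    return math.log(x) if x > 0 else NEG_INF


def segmental_viterbi(seq, states, init, trans, length_pmf, base_pmf):
    """Return (parse, log_score); parse is a list of (state, segment_length)."""
    n = len(seq)
    # best[t][q]: best log score of a parse of seq[:t] whose last segment is in state q
    best = [{q: NEG_INF for q in states} for _ in range(n + 1)]
    back = [{q: None for q in states} for _ in range(n + 1)]
    for t in range(1, n + 1):
        for q in states:
            for d, p_len in length_pmf[q].items():
                if d > t:
                    continue
                start = t - d
                # Emission score of the whole segment seq[start:t] under state q
                emit = sum(safe_log(base_pmf[q].get(c, 0.0)) for c in seq[start:t])
                if start == 0:
                    prev_score, prev_state = safe_log(init.get(q, 0.0)), None
                else:
                    prev_score, prev_state = max(
                        (best[start][r] + safe_log(trans.get((r, q), 0.0)), r)
                        for r in states
                    )
                score = prev_score + safe_log(p_len) + emit
                if score > best[t][q]:
                    best[t][q] = score
                    back[t][q] = (start, prev_state)
    last = max(states, key=lambda q: best[n][q])
    final_score = best[n][last]
    if final_score == NEG_INF:
        raise ValueError("no parse with nonzero probability")
    parse, t, q = [], n, last
    while t > 0:  # follow back-pointers from the end of the sequence
        start, prev = back[t][q]
        parse.append((q, t - start))
        t, q = start, prev
    parse.reverse()
    return parse, final_score


# Hypothetical toy model: AT-rich and GC-rich segments of length 3 or 6.
states = ["AT", "GC"]
init = {"AT": 0.5, "GC": 0.5}
trans = {("AT", "GC"): 1.0, ("GC", "AT"): 1.0}  # no self-transitions
length_pmf = {"AT": {3: 0.5, 6: 0.5}, "GC": {3: 0.5, 6: 0.5}}
base_pmf = {
    "AT": {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1},
    "GC": {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4},
}
parse, log_score = segmental_viterbi("AATTATGCGGCC", states, init, trans, length_pmf, base_pmf)
print(parse)  # prints [('AT', 6), ('GC', 6)]
```

Unlike the standard Viterbi recursion, each step here considers every allowed segment length d ending at position t, which is what lets each state carry an arbitrary length distribution instead of a geometric one. The cost is an extra factor in running time for the loop over lengths.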

