CMU BSC 03711 - Lecture


Schedule
• Tues, Nov 30: Gene Finding 1
• Thurs, Dec 2: Gene Finding 2, PS5 due
• Tues, Dec 7: Project presentations 1
• Thurs, Dec 9: Project presentations 2, final papers due
• Tues, Dec 14: DD: Extended office hours: 2:30pm – 5:30pm, MI 650
• Wed, Dec 15: NS: office hours. DH 1321, noon – 2pm.
• Friday, Dec 17: 8:30am Final Exam, Room: TBA
Online FCEs: through Dec 10

[Figure: course overview diagram with nodes: Pairwise sequence alignment (global and local), Multiple sequence alignment (local, global), Substitution matrices, Database searching, BLAST, Evolutionary tree reconstruction, Sequence statistics, Prokaryotic Gene Finding, Eukaryotic Gene Finding]

Outline
• Recap: Prokaryotic gene finding
• Eukaryotic gene finding
• The human gene complement
• Regulation

Gene Finding Questions
• Identify protein coding region
• Identify Open Reading Frame
• Predict mRNA (including UTRs)
• Predict intron/exon structure (eukaryotes only)
• Regulatory signals
• Protein sequence

Gene criteria (Snyder and Gerstein, Science 2003)
• Open Reading Frames (ORFs): computational
• Sequence features: computational
• Sequence conservation: computational
• Evidence for transcription: experimental
• Gene inactivation induces a phenotype: experimental

Sequence features
• Coding statistics (e.g. codon bias)
• Gene structure
[Figure: gene structure drawn 5’ to 3’, with labels: Open Reading Frame, Promoter region, Ribosome binding site, Termination sequence, Start codon/Stop codon, Repressor site]

An HMM that finds genes in E. coli (Krogh et al, 1995; electronic reserves)
[Figure: HMM architecture with labels: intergene model, start codons, 61 triplet models (A A A, T T T, A A C, …), stop codons; observed frequencies for E. coli genes]

Outstanding Problems
• Model cannot account for drift in CG content
• Does not take position dependencies into account
• Solution:
  – kth order Markov chain
  – looks back k positions
• Glimmer (Salzberg et al, 1998)
  – Finds 98% of all genes in a bacterial genome.

Some Problems (Snyder and Gerstein, Science 2003)
• Overlapping genes
  aggcctatgacgcctctcccagcatgggcctgaggctcctgtcccccactagtggcctgct
• Alternative splicing
  [Figure: one gene's transcripts built from different combinations of exon1 through exon6]
• Pseudogenes
  gcctatgacgcctctcccagcatgggcctgaggctcctgtcccccactagtggcctgctcc
  gcctatgacgcctctcccagcatgagcctgaggctcctgtcccccactagtggcctgctcc
  gcctatgacgcctctcccagcatgagcctgaggctcctgtcccccactagtggcctgctcc

Gene Finding Challenges (Salzberg, Nature, 2003; Schutte et al., PNAS, 2001)
• Small protein-coding genes (<100 aa)
• RNA-coding genes
• Regulatory regions
• Genes with sparse conserved positions and little sequence similarity; e.g., beta-defensins

Prokaryotic vs. Eukaryotic Genes
• Prokaryotes
  – small genomes (0.5Mb to 10Mb)
  – high gene density (90%)
  – no introns (or splicing)
  – no RNA processing
  – simple regulatory regions
  – most long ORFs are genes
• Eukaryotes
  – large genomes
  – low gene density (3% - 50%)
  – intron/exon structure
  – splicing
  – complex regulatory regions
Genomic data: Must handle multiple genes and/or gene fragments in input sequence.

Genome statistics
Organism        Size      Gene number   Density (1 gene per N bp)
Human           3300Mb    30K           100,000
Fly             180Mb     13.6K         9000
C. elegans      97Mb      19.1K         5000
Yeast           12Mb      6.3K          2000
E. coli         4.8Mb     3.2K          1400
H. influenzae   1.8Mb     1.7K          1000
Source: http://www.ornl.gov/TechResources/Human_Genome/faq/compgen.html

Typical human gene sizes
Average gene length    30kb
Coding region          1-2kb
Exon length            150 - 200 bp
Exon count             5-6
Single exon genes      8%
Source: http://www.nslij-genetics.org/gene

Genscan (Burge and Karlin, 1997)
Architecture:
• Individual modules: intergenic region, promoter, 5’UTR, exon/intron, post-translation region
• Semi Hidden Markov Model
  – various length distributions
• Different statistical models for each module:
  – weight matrices + extensions, 3-periodic 5th order Markov chains
Incorporates:
• Descriptions of transcriptional, translational and splicing signals
• Compositional features of exons, introns, intergenic, C+G regions

Genscan (Burge and Karlin, 1997)
Larger predictive scope than previous models
• Partial genes
• Multiple genes separated by intergenic DNA
• Genes on either/both DNA strands
Proposed pipeline
• Screen for repetitive elements
• Predict protein sequences with GENSCAN
• BLAST predictions to find homologs
• Refine using spliced alignment of prediction with homolog (e.g., Gelfand, Mironov, Pevzner, 96)
• Verify experimentally

GenScan States (Fig. 3, Burge and Karlin 1997)
• N: intergenic region
• P: promoter
• F: 5’ untranslated region
• Esngl: single exon (intronless) (translation start -> stop codon)
• Einit: initial exon (translation start -> donor splice site)
• Ek: phase k internal exon (acceptor splice site -> donor splice site)
• Eterm: terminal exon (acceptor splice site -> stop codon)
• Ik: phase k intron

How to model sequences with lengths that are not geometrically distributed?
CODON model
[Figure: codon state with transitions labeled p and 1 - p, leading to stop codons]
Resulting exon length distribution:
  P(l) = p (1 - p)^{l-1}

Semi-hidden Markov model
• Set of states: Q1, Q2, …
• Transition matrix P(Q(t)|Q(t-1))
• Initial distribution P(Qi)
• Each state has
  – a length distribution
  – a sequence generating model
• Emission: each state emits a sequence of length d. The length d is drawn from the state's length frequency distribution, and the sequence from the state's sequence generating model.

Semi-hidden Markov model cont’d
• A parse ϕ of a sequence of length L is
  – A state sequence: Q1, Q2, …
  – A sequence of lengths: d1, d2, d3, …
• An observed sequence, s, is scored using a modified Viterbi algorithm
  \phi_{opt} = \arg\max_{\phi} P(\phi \mid s)

GenScan Training Set
• 2.5M base pairs
• 142 Single Exon Genes (SEGs)
• 238 multi-exon genes
• 1492 Exons
• 1254 Introns
• An additional 1619 coding sequences (no introns)
• Promoter model based on published sources.

Initial and transition probabilities
Trained separately for four categories of G+C content
– < 43% (G+C)
– 43% - 51%
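
The question raised earlier, how to model lengths that are not geometrically distributed, is easier to see with numbers. The following is a minimal sketch, not taken from the lecture, that evaluates the codon-model length distribution P(l) = p(1 - p)^(l-1). It assumes that p is the per-codon probability of leaving the coding state; the value 3/64 (three stop codons out of 64, uniform codon usage) and all function and variable names are illustrative assumptions.

```python
def geometric_length_pmf(length: int, p: float) -> float:
    """P(l) = p * (1 - p)**(l - 1): probability that the state is left
    after exactly `length` steps, assuming a fixed exit probability p."""
    if length < 1:
        return 0.0
    return p * (1.0 - p) ** (length - 1)


# Hypothetical exit probability: 3 stop codons among 64, uniform usage.
p_stop = 3 / 64

lengths = range(1, 2001)  # lengths measured in codons
pmf = [geometric_length_pmf(l, p_stop) for l in lengths]

total = sum(pmf)                                    # close to 1
mean_len = sum(l * q for l, q in zip(lengths, pmf)) # close to 1 / p_stop
mode_len = max(lengths, key=lambda l: geometric_length_pmf(l, p_stop))

print(f"total probability: {total:.6f}")
print(f"mean length (codons): {mean_len:.2f}")
print(f"most probable length (codons): {mode_len}")
```

Under these assumptions the most probable length is always a single step, because a geometric distribution decreases monotonically. Typical exon lengths of 150 - 200 bp do not fit that shape, which is one motivation for giving each state of a semi-hidden Markov model its own explicit length distribution.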

