UMD CMSC 828G - Bacterial Gene Finding and Glimmer - D117092


School: University of Maryland, College Park

Course: CMSC 828G - Advanced Topics in Information Processing: Data-Intensive Computing with MapReduce

Pages: 51

This preview shows page 1-2-3-24-25-26-27-49-50-51 out of 51 pages.



Slide titles: Bacterial Gene Finding and Glimmer (also Archaeal and viral gene finding); Outline; Step One; The Problem; Codon Composition; Bacterial Replication; Termination of Replication; Borrelia burgdorferi (Lyme disease pathogen) GC-skew plot; Codon-Composition Gene Finders; Probabilistic Methods; Fixed-Order Markov Models; GeneMark; Interpolated Markov Models (IMM); Real IMMs; Glimmer IMM; More Precisely; IMMs vs Fixed-Order Models; Interpolated Context Model (ICM); ICM; Sample ICM Model; Fixed-Length Sequences and ICMs; Overlapping Orfs; Glimmer 2.0 Overlap Comments; Glimmer3; Reverse Scoring; Finding Start Sites; Glimmer3 vs. Glimmer2; Other Glimmer3 Features; Glimmer3 Output; Finding Initial Training Set; Running Glimmer3; A novel application of Glimmer; Acknowledgements

Bacterial Gene Finding and Glimmer
(also Archaeal and viral gene finding)
Arthur L. Delcher and Steven Salzberg
Center for Bioinformatics and Computational Biology
University of Maryland

Outline
• A (very) brief overview of microbial gene-finding
  – Codon composition methods
  – GeneMark: Markov models
• Glimmer1 & 2
  – Interpolated Markov Model (IMM)
  – Interpolated Context Model (ICM)
• Glimmer3
  – Reducing false positives
  – Improving coding initiation site predictions
  – Running Glimmer3

Step One
• Find open reading frames (ORFs).
• But ORFs generally overlap …
  …TAGAAAAATGGCTCTTTAGATAAATTTCATGAAAAATATTGA…
  …ATCTTTTTACCGAGAAATCTATTTAAAGTACTTTTTATAACT…
[Figure labels: Stop codon, Stop codon (first sequence); Shifted Stop, Stop codon, Reverse strand (second sequence)]

[Figure: Campylobacter jejuni RM1221, 30.3% GC. All ORFs on both strands shown; color indicates reading frame. Longest ORFs likely to be protein-coding genes. Note the low GC content.]
[Figure: Campylobacter jejuni RM1221, 30.3% GC. Purple lines are the predicted genes. Purple ORFs show annotated ("true") genes.]
[Figure: Campylobacter jejuni RM1221 (30.3% GC) and Mycobacterium smegmatis MC2 (67.4% GC). Note what happens in a high-GC genome.]
[Figure: Campylobacter jejuni RM1221 (30.3% GC) and Mycobacterium smegmatis MC2 (67.4% GC). Purple lines show annotated genes.]

The Problem
• Need to decide which ORFs are genes.
  – Then figure out the coding start sites
• Can do homology searches, but that won't find novel genes
  – Besides, there are errors in the databases
• Generally can assume that there are some known genes to use as training set.
  – Or just find the obvious ones

Codon Composition
• Find patterns of nucleotides in known coding regions (assumed to be available).
  – Nucleotide distribution at 3 codon positions
  – Hexamers
  – GC-skew
    • (G-C)/(G+C) computed in windows of size N
  – Amino-acid composition
• Use these to decide which ORFs are genes.
  – Prefer longer ORFs
  – Must deal with overlaps

Bacterial Replication
[Figure labels: Early replication; Theta structure]

Termination of Replication
[Figure labels: E. coli; B. subtilis]

Borrelia burgdorferi (Lyme disease pathogen) GC-skew plot
[Figure: GC-skew plot]

Codon Composition
Nucleotide variation at codon position:

Campylobacter jejuni
  Codon Position    1     2     3
  a                36%   36%   36%
  c                13%   17%    9%
  g                30%   14%   10%
  t                21%   33%   44%

Mycobacterium smegmatis
  Codon Position    1     2     3
  a                19%   23%    6%
  c                27%   28%   48%
  g                42%   20%   39%
  t                12%   28%    7%

Codon-Composition Gene Finders
• ZCURVE
  – Guo, Ou & Zhang, NAR 31, 2003
  – Based on nucleotide and di-nucleotide frequency in codons
  – Uses Z-transform and Fisher linear discriminant
• MED
  – Ouyang, Zhu, Wang & She, JBCB 2(2) 2004
  – Based on amino-acid frequencies
  – Uses nearest-neighbor classification on entropies

Probabilistic Methods
• Create models that have a probability of generating any given sequence.
• Train the models using examples of the types of sequences to generate.
• The "score" of an ORF is the probability of the model generating it.
  – Can also use a negative model (i.e., a model of non-ORFs) and make the score be the ratio of the probabilities (i.e., the odds) of the two models.
  – Use logs to avoid underflow

Fixed-Order Markov Models
• A kth-order Markov model bases the probability of an event on the preceding k events.
• Example: With a 3rd-order model the probability of this sequence:
  …CTAGAT…
  would be:
  $$\cdots\; P(\text{G} \mid \text{CTA}) \cdot P(\text{A} \mid \text{TAG}) \cdot P(\text{T} \mid \text{AGA}) \;\cdots$$
[Figure labels: Context, Target]

Fixed-Order Markov Models
• Advantages:
  – Easy to train. Count frequencies of (k+1)-mers in training data.
  – Easy to compute probability of sequence.
• Disadvantages:
  – Many (k+1)-mers may be undersampled in training data.
  – Models data as fixed-length chunks.
[Figure labels: Fixed-Length Context, Target]

GeneMark
• Borodovsky & McIninch, Comp. Chem 17, 1993.
• Uses 5th-order Markov model.
• Model is 3-periodic, i.e., a separate model for each nucleotide position in the codon.
• DNA region gets 7 scores: 6 reading frames & non-coding. High score wins.
• Lukashin & Borodovsky, Nucl. Acids Res. 26, 1998 is the HMM version.

Interpolated Markov Models (IMM)
• Introduced in Glimmer 1.0: Salzberg, Delcher, Kasif & White, NAR 26, 1998.
• Probability of the target position depends on a variable number of previous positions (sometimes 2 bases, sometimes 3, 4, etc.)
• How many is determined by the specific context.
• E.g., for context ggtta the next position might depend on the previous 3 bases, tta. But for context catta all 5 bases might be used.

Real IMMs
• Model has additional probabilities, λ, that determine which parts of the context to use.
• E.g., the probability of g occurring after context atca is:
  $$\lambda(\text{atca})\,P(\text{g} \mid \text{atca}) + (1 - \lambda(\text{atca}))\,\bigl[\lambda(\text{tca})\,P(\text{g} \mid \text{tca}) + (1 - \lambda(\text{tca}))\,\bigl[\lambda(\text{ca})\,P(\text{g} \mid \text{ca}) + (1 - \lambda(\text{ca}))\,\bigl[\lambda(\text{a})\,P(\text{g} \mid \text{a}) + (1 - \lambda(\text{a}))\,P(\text{g})\bigr]\bigr]\bigr]$$

Real IMMs
• Result is a linear combination of different Markov orders:
  $$b_4\,P(\text{g} \mid \text{atca}) + b_3\,P(\text{g} \mid \text{tca}) + b_2\,P(\text{g} \mid \text{ca}) + b_1\,P(\text{g} \mid \text{a}) + b_0\,P(\text{g})$$
  where
  $$b_0 + b_1 + b_2 + b_3 + b_4 = 1$$
• Can view this as interpolating the results of different-order models.
• The probability of a sequence is still the product of the probabilities of the bases in the sequence.

Real IMMs
• Problem: How to determine the λ's (or equivalently the b_j's)?
• Traditionally done with EM algorithm using cross-validation (deleted estimation).
  – Slow
  – Hard to understand results
  – Overtraining can be a problem
• We will cover EM later as part of HMMs

Glimmer IMM
• Glimmer assumes:
  – Longer context is always better
  – Only reason not to use it is undersampling in training data.
• If sequence occurs frequently enough in training data, use it, i.e., [equation missing in source]
• Otherwise, use frequency and χ² significance to set λ.
• Interpolation is always between only 2
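
The nested λ formula and the linear-combination form on the Real IMMs slides can be checked against each other numerically. The sketch below is illustrative only: the λ values and conditional probabilities are made-up numbers, not trained Glimmer parameters, and the function names `interpolated_prob` and `linear_weights` are hypothetical. It does not implement Glimmer's own rule for choosing λ.

```python
from typing import Dict, List, Tuple


def interpolated_prob(context: str, base: str,
                      lam: Dict[str, float],
                      prob: Dict[Tuple[str, str], float]) -> float:
    """Nested form: lam(c) * P(base | c) + (1 - lam(c)) * [same for shorter c]."""
    if context == "":
        return prob[(base, "")]            # order-0 probability P(base)
    shorter = interpolated_prob(context[1:], base, lam, prob)
    return lam[context] * prob[(base, context)] + (1.0 - lam[context]) * shorter


def linear_weights(context: str, lam: Dict[str, float]) -> List[float]:
    """Equivalent weights b_k, ..., b_0 (longest context first)."""
    weights = []
    remaining = 1.0                        # weight not yet assigned to a longer context
    while context:
        weights.append(remaining * lam[context])
        remaining *= 1.0 - lam[context]
        context = context[1:]              # drop the oldest base
    weights.append(remaining)              # b_0, the weight on P(base)
    return weights


# Hypothetical parameters for the slide's example: P(g | atca).
lam = {"atca": 0.5, "tca": 0.8, "ca": 1.0, "a": 1.0}
prob = {("g", "atca"): 0.40, ("g", "tca"): 0.30, ("g", "ca"): 0.25,
        ("g", "a"): 0.20, ("g", ""): 0.22}

nested = interpolated_prob("atca", "g", lam, prob)
b = linear_weights("atca", lam)            # [b_4, b_3, b_2, b_1, b_0]
orders = [prob[("g", c)] for c in ("atca", "tca", "ca", "a", "")]
linear = sum(w * p for w, p in zip(b, orders))

print(nested, linear, sum(b))
```

With these assumed numbers both forms give approximately 0.345 and the weights sum to 1. Setting λ(ca) = 1 also shows why the shorter contexts a and the order-0 model receive zero weight: once a context is fully trusted, nothing shorter contributes.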
