U of I CS 498 - Statistical Methods

Outline
Gene Prediction: Statistical Methods
Approaches to Gene Prediction
Gene Prediction Analogy
A Few Basic Questions
Statistical Characteristics of a Gene
Gene Structure
Six Frames in a DNA Sequence
The “Sly Fox” & Effect of Base Deletion
Splicing Signals and Exon Boundaries
Splicing mechanism
Donor and Acceptor Sites: GT and AG dinucleotides
Donor and Acceptor Sites: Motif Logos
Codon Usage in Human Genome
Codon Frequencies
Dicodon Frequencies
General Steps for Gene Prediction
Step 1: Identify ORFs
Long vs. Short ORFs
Prediction of Translation Starts
Step 2: Identify Exon Boundaries
Prediction of Splice Junction Sites
Prediction of Acceptor Sites
Prediction of Donor Sites
Prediction of Donors/Acceptors
Prediction of Exons
Step 3: Classify Candidate Exons
Testing Exons: Codon Usage
Codon Usage and Likelihood Ratio
Codon Usage in Mouse Genome
Exon Prediction Method 1: TestCode
TestCode
TestCode Statistics
TestCode Statistics (cont’d)
TestCode Sample Output
Exon Prediction Method 2: Likelihood Ratio/Supervised Learning
Prediction of Coding Regions
Prediction of Exons
Gene Prediction in a New Genome
Popular Gene Prediction Algorithms
What You Should Know

Gene Prediction: Statistical Methods
(Lecture for CS498-CXZ Algorithms in Bioinformatics)
Sept. 20, 2005
ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
Many slides are taken/adapted from http://www.bioalgorithms.info/slides.htm and Ying Xu’s lecture

Approaches to Gene Prediction
• Similarity-based approaches:
– Exploit the fact that many genes are conserved across species
– Can be highly reliable
– Only good for finding genes similar to known genes
• Statistical approaches:
– Exploit statistical characteristics of coding regions and non-coding regions and other knowledge about genes
– Can potentially detect new genes
– May not be reliable
• They can/should be combined
– Currently no principled approaches for doing this
– Given a new genome, identify “known genes” first
– Learn from “known genes” to identify new genes

Gene Prediction Analogy
• Newspaper written in unknown language
– Certain pages contain an encoded message, say 99 letters on page 7, 30 on page 12 and 63 on page 15.
• How do you recognize the message? You could probably distinguish between the ads and the story (ads often contain the “$” sign)
• The statistics-based approach to gene prediction tries to make similar distinctions between exons and introns.

Statistical Approach: Metaphor in Unknown Language
Noting the differing frequencies of symbols (e.g. ‘%’, ‘.’, ‘-’) and numerical symbols, could you distinguish between a story and the stock report in a foreign newspaper?

A Few Basic Questions
• What exactly is a gene for the purpose of prediction?
– In prokaryotes, gene = mRNA → Protein
– In eukaryotes, gene = exon (coding region)
• What does a gene look like?
– Where does it start?
– Where does it end?
– What is the codon usage inside a gene (exon)?
– What is the codon usage outside a gene (intron)?
– …
• How do we exploit such knowledge to identify genes?

Statistical Characteristics of a Gene
• Gene starts with a start codon
• Gene ends at a stop codon
• Splicing signals
• Codon usage distributions
• …

Gene Structure

Genetic Code and Stop Codons
UAA, UAG and UGA are the 3 stop codons that (together with the start codon AUG, written ATG in DNA) delineate Open Reading Frames

Six Frames in a DNA Sequence
• stop codons
• start codons
Each strand is read in three frames, giving six frames in total.
Forward strand:
GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG
Complementary strand:
CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC

The “Sly Fox” & Effect of Base Deletion
• In the following string: THE SLY FOX AND THE SHY DOG
• Delete 1, 2, and 3 nucleotides after the first ‘S’:
– THE SYF OXA NDT HES HYD OG
– THE SFO XAN DTH ESH YDO G
– THE SOX AND THE SHY DOG
• Which of the above makes the most sense?

Splicing Signals and Exon Boundaries
Exons are interspersed with introns; an intron typically begins with GT and ends with AG

Splicing mechanism
(http://genes.mit.edu/chris/)

Donor and Acceptor Sites: GT and AG dinucleotides
• The beginning and end of exons are signaled by donor and acceptor sites that usually have GT and AG dinucleotides
• Detecting these sites is difficult, because GT and AG appear very often
exon 1 | GT (donor site) … AG (acceptor site) | exon 2

Donor and Acceptor Sites: Motif Logos
(http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)
Donor: 7.9 bits
Acceptor: 9.4 bits
(Stephens & Schneider, 1996)

Codon Usage in Human Genome
Biased codon usage in exons allows us to distinguish exons from introns

Codon Frequencies
• Coding sequences are translated into protein sequences
• We found the following: the dimer frequency in protein sequences is NOT evenly distributed
– The average frequency is 5%
– Some amino acids prefer to be next to each other
– Some other amino acids prefer not to be next to each other
(Shewanella)

Dicodon Frequencies
• The biased (uneven) dimer frequencies are the foundation of many gene finding programs!
• Basic idea of gene finding: if a dimer has a lower than average frequency, proteins prefer not to have such dimers in their sequences; otherwise proteins prefer to have such dimers
• Hence if we see a dicodon encoding a low-frequency dimer, we may want to bet against this dicodon being in a coding region!

General Steps for Gene Prediction
• Identify candidate exons in Open Reading Frames (ORFs)
– Determine ORFs: an ORF starts with a start codon and ends at a stop codon
– Determine sites for acceptors/donors
• Evaluate the potential of a candidate exon for coding (exploit codon usage)

Step 1: Identify ORFs

Long vs. Short ORFs
• A long open reading frame may be a gene
– At random, we should expect one stop codon every (64/3) ~= 21 codons
– However, genes are usually much longer than this
• A basic approach is to scan for ORFs whose length exceeds a certain threshold
– This is naïve because some genes (e.g. some neural and immune system genes) are relatively short

Prediction of Translation Starts
• Translation start: ATG
• How to predict a translation start
• Collect a

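As a rough illustration of "Step 1: Identify ORFs", the sketch below scans all six reading frames for ORFs (an ATG through the next in-frame stop codon) and keeps those at or above a length threshold. It is a minimal sketch, not the method of any particular gene finder. The function names (`six_frame_orfs`, `orfs_in_frame`, `chance_orf_probability`) and the `min_codons` parameter are introduced here for illustration, and the chance calculation assumes that all 64 codons are equally likely and independent.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def orfs_in_frame(seq: str, frame: int, min_codons: int) -> List[Tuple[int, int]]:
    """Find ORFs in one frame as (start, end) offsets, end exclusive.

    An ORF here runs from the first ATG after the previous stop to the next
    in-frame stop codon (stop included). Its length in codons, counting both
    the start and the stop codon, must be at least min_codons.
    """
    found = []
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == START_CODON:
            start = i
        elif start is not None and codon in STOP_CODONS:
            if (i + 3 - start) // 3 >= min_codons:
                found.append((start, i + 3))
            start = None
    return found


def six_frame_orfs(seq: str, min_codons: int = 3) -> List[Tuple[str, int, int, int]]:
    """Scan three frames on each strand.

    Returns (strand, frame, start, end); for the "-" strand the offsets are
    relative to the reverse-complemented sequence.
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame, min_codons):
                results.append((strand, frame, start, end))
    return results


def chance_orf_probability(n_codons: int) -> float:
    """Probability that n random codons contain no stop codon (61 of 64 are non-stop)."""
    return (61 / 64) ** n_codons


if __name__ == "__main__":
    print(six_frame_orfs("CCATGAAATGACC"))
    print(chance_orf_probability(100))
```

Under the uniform-codon assumption, a run of 100 stop-free codons has probability below 1%, which is why long ORFs are taken as evidence of a gene. A length threshold alone would still miss the short genes mentioned above.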

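The idea of exploiting biased codon usage to evaluate a candidate exon can be sketched as a log-likelihood ratio between a "coding" codon model and a "non-coding" one. This is a minimal sketch under stated assumptions: the training strings in the demo are toy sequences invented for illustration (not real genomic data), add-one smoothing is an assumed choice, and the function names are hypothetical rather than taken from the lecture.

```python
import math
from collections import Counter
from itertools import product
from typing import Dict, Iterable, List

ALL_CODONS = ["".join(p) for p in product("ACGT", repeat=3)]
_CODON_SET = set(ALL_CODONS)


def codons(seq: str, frame: int = 0) -> List[str]:
    """Split a DNA string into complete codons, starting at the given frame offset."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def codon_frequencies(seqs: Iterable[str], pseudocount: float = 1.0) -> Dict[str, float]:
    """Estimate smoothed codon frequencies from in-frame training sequences.

    The pseudocount (an assumed smoothing choice) keeps unseen codons
    from getting probability zero.
    """
    counts = Counter()
    for s in seqs:
        counts.update(c for c in codons(s) if c in _CODON_SET)
    total = sum(counts.values()) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}


def log_likelihood_ratio(seq: str,
                         coding: Dict[str, float],
                         noncoding: Dict[str, float],
                         frame: int = 0) -> float:
    """Sum of log(P_coding(codon) / P_noncoding(codon)) over the codons of seq.

    Positive scores favor the coding model, negative scores the non-coding one.
    Codons containing characters other than A, C, G, T are skipped.
    """
    return sum(math.log(coding[c] / noncoding[c])
               for c in codons(seq, frame) if c in _CODON_SET)


if __name__ == "__main__":
    # Toy training data, invented for illustration only.
    coding_model = codon_frequencies(["GCCGCCGCC"])
    noncoding_model = codon_frequencies(["ATAATAATA"])
    print(log_likelihood_ratio("GCCGCC", coding_model, noncoding_model))
    print(log_likelihood_ratio("ATAATA", coding_model, noncoding_model))
```

In practice the two models would be estimated from known exons and known non-coding regions of the genome at hand, and the same scheme extends to dicodon (codon-pair) frequencies.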