Stanford CS 262 - Lecture 19 - Gene Recognition and Motif Finding - D2800037

School name Stanford University

Course CS 262 - Computational Genomics

Pages 8

Gene Recognition and Motif Finding
Lecture 19

Introduction

Only the exons (coding regions) of a gene are ultimately expressed. The whole gene is transcribed into RNA, but the introns are spliced out before translation. The remaining exons are read in triplets called codons.

Gene finding is a core bioinformatics problem that has been approached with a range of methodologies, from purely statistical methods to combinations of statistical, expert-based, and experimental methods. Here we look at three popular approaches:

1. Ab initio: these methods look only at the genomic DNA of the target genome and infer gene structures from the statistical properties of gene sequences. Example: the coding sequence of a gene starts with ATG and ends with TGA, TAG, or TAA. Exons and introns differ in the statistics of the subsequences they contain. Boundaries between exons and introns carry motifs with a certain information content. All these signals can be integrated, often in a hidden Markov model, to predict gene structures.

2. De novo: these methods look at the target sequence together with a set of informant sequences from the genomes of related species. They exploit the fact that most genes are shared among related organisms (e.g., mammalian genes).
Example:

Human      tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta
Macaque    tttcttagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta
Mouse      ttgcttagACTTTAAAGTTGTCAAGCCGCGTTCTTGATAAAATAAGTATTGGACAACTTGTTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-cca
Rat        ttgcttagACTTTAAAGTTGTCAAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTATTAGTCTTCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccca
Rabbit     t--attagACTTTAAAGCTGTCAAGCCGTGTTCTAGATAAAATAAGTATTGGGCAACTTATTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtaccta
Dog        t-cattagACTTTAAAGCTGTCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTCGATGAAgtatgtaccta
Cow        t-cattagACTTTGAAGCTATCAAGCCGTGTTCTGGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgta-cta
Armadillo  gca--tagACCTTAAAACTGTCAAGCCGTGTTTTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtgccta
Elephant   gct-ttagACTTTAAAACTGTCCAGCCGTGTTCTTGATAAAATAAGTATTGGACAACTTGTCAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta
Tenrec     tc-cttagACTTTAAAACTTTCGAGCCGGGTTCTAGATAAAATAAGTATTGGACAACTTGTTAGTCTCCTTTCCAACAACCTGAACAAATTTGATGAAgtatgtatcta
Opossum    ---tttagACCTTAAAACTGTCAAGCCGTGTTCTAGATAAAATAAGCACTGGACAGCTTATCAGTCTCCTTTCCAACAATCTGAACAAGTTTGATGAAgtatgtagctg
Chicken    ----ttagACCTTAAAACTGTCAAGCAAAGTTCTAGATAAAATAAGTACTGGACAATTGGTCAGCCTTCTTTCCAACAATCTGAACAAATTCGATGAGgtatgtt--tg

Above is an exon from the human cystic fibrosis gene aligned to the corresponding sequence in a number of related species. Both the sequence and the exon boundaries are strongly conserved. De novo methods are more accurate than ab initio methods, but their disadvantage is that they miss genes that are not shared between the target and the informants. Example: we cannot expect to find the genes responsible for a skeleton by comparing human and yeast.

3. Combined methods: there are also combined approaches that use experimental evidence, such as sequenced gene products from cells.
These are the most accurate when such evidence is available, but they miss genes that are hard to capture experimentally, for example genes that are expressed in low quantity or only at specific developmental time points. The good news is that once we have accurate gene predictions, it is relatively straightforward to verify them experimentally with RT-PCR. For that reason, many researchers, including Michael Brent, who has developed some of the best gene finding methods, believe that the best way to obtain a complete catalog of human genes is to run an accurate de novo gene predictor and then verify its predictions experimentally with RT-PCR.

In light of these three methods, note that some genetic landmarks are consistently important for gene finding:

Signals for Gene Finding

1. Regular gene structure
2. Exon/intron lengths
3. Codon composition
4. Motifs at the boundaries of exons, introns, etc. (start codons, stop codons, splice sites)
5. Patterns of conservation
6. Sequenced mRNAs
7. (PCR for verification)

Other hints in gene finding:

• Intron lengths follow a geometric distribution. Exon lengths follow smoothed density curves.
• Base composition in coding sequence is characteristic because of the genetic code. Codons occur at characteristic frequencies, and this can be used for gene recognition.

Splice Site Detection

[Figure: sequence logo of a splice site. The height of each letter denotes how frequent that base is at that position of the splice site.]

To detect splice sites, slide a fixed-width window over the sequence and compute, at each position, the probability that the window sits on a splice site. The table below models this method and can be used to determine intron-exon and exon-intron transitions.

[Table: not included in this text.]

HMMs for Single Species Gene Finding

Traditionally the problems of gene finding and alignment have been treated separately, but it is becoming more and more apparent that they are closely connected.
Gene finding can be helped by aligning the sequences, and alignment can be helped by first locating the genes. Our approach is to combine the two, performing them simultaneously.

Recall that introns are spliced out of RNA. Intergenic segments are the stretches of sequence between genes.

We can combine gene finding and alignment by using a hidden Markov model. In the context of gene finding, we observe the DNA sequence, and the state sequence we want to determine consists of exon, intron, and intergenic states. That is, we want to determine for each DNA base which state it belongs to, and thus predict the exon boundaries.

[Figure: state diagram with exon, intron, and intergene states.]

This is the state space of the generalized HMM (used by GENSCAN) performing single-species gene finding. The hidden states are the exons (in red) and the introns and intergenic regions (in green). There are many states because the sequence is translated in triplets, so there are three different ways to translate the same sequence, that is, three different reading frames. The final coding sequence must have a length divisible by three, and if one exon ends in the middle of a codon, that codon has to be finished at the beginning of the next exon. Thus in each exon we would have to
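
The labeling task described in this section, assigning each base to an exon, intron, or intergenic state, can be sketched with a toy Viterbi decoder. This is only an illustration under strong assumptions: it is a plain three-state HMM, not the generalized HMM used by GENSCAN, it ignores reading frames and length distributions, and every probability in START, TRANS, and EMIT is a made-up value chosen for the example.

```python
import math

# Toy three-state HMM. All probabilities are hypothetical illustration
# values, not parameters estimated from any genome.
STATES = ("intergenic", "exon", "intron")

START = {"intergenic": 0.98, "exon": 0.01, "intron": 0.01}

# TRANS[a][b] = probability of moving from state a to state b.
# Intergenic <-> intron moves are forbidden (probability 0).
TRANS = {
    "intergenic": {"intergenic": 0.95, "exon": 0.05, "intron": 0.00},
    "exon":       {"intergenic": 0.05, "exon": 0.90, "intron": 0.05},
    "intron":     {"intergenic": 0.00, "exon": 0.05, "intron": 0.95},
}

# EMIT[s][base] = probability that state s emits the base.
EMIT = {
    "intergenic": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
    "exon":       {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
    "intron":     {"A": 0.25, "C": 0.15, "G": 0.15, "T": 0.45},
}


def safe_log(p):
    """Natural log that maps probability 0 to -infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(seq):
    """Return the most probable state path for a DNA string (A, C, G, T only)."""
    seq = seq.upper()
    if not seq:
        return []
    # score[i][s] = best log-probability of any path ending in state s at base i
    score = [{s: safe_log(START[s]) + safe_log(EMIT[s][seq[0]]) for s in STATES}]
    back = []  # back[i][s] = best predecessor of state s at base i + 1
    for base in seq[1:]:
        prev = score[-1]
        col, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: prev[r] + safe_log(TRANS[r][s]))
            col[s] = prev[best] + safe_log(TRANS[best][s]) + safe_log(EMIT[s][base])
            ptr[s] = best
        score.append(col)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path
```

Calling viterbi on a DNA string returns one state label per base; runs of "exon" labels in the output correspond to predicted exon boundaries. Working in log space avoids numerical underflow on long sequences.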

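The fixed-width window scan from the Splice Site Detection section can be sketched as follows. This is a hedged illustration, not the table from the lecture: DONOR_PWM is a hypothetical five-position matrix for an exon-intron (donor) boundary with invented probabilities, and the score is a log-odds ratio against a uniform background, which is one common way to turn per-position base probabilities into a window score.

```python
import math

# Hypothetical per-position base probabilities for a 5-base window around an
# exon-intron (donor) boundary. Invented for illustration, not trained on data.
DONOR_PWM = [
    {"A": 0.30, "C": 0.10, "G": 0.50, "T": 0.10},  # last exon base
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},  # intron base 1 (G of GT)
    {"A": 0.01, "C": 0.01, "G": 0.01, "T": 0.97},  # intron base 2 (T of GT)
    {"A": 0.60, "C": 0.05, "G": 0.30, "T": 0.05},  # intron base 3
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},  # intron base 4
]

# Assumed background model: all four bases equally likely.
BACKGROUND = {base: 0.25 for base in "ACGT"}


def window_scores(seq, pwm=DONOR_PWM, background=BACKGROUND):
    """Return the log-odds score (site model vs. background) of every window."""
    seq = seq.upper()
    width = len(pwm)
    scores = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        scores.append(
            sum(math.log2(col[b] / background[b]) for col, b in zip(pwm, window))
        )
    return scores


def best_site(seq):
    """Return (window start index, score) of the highest-scoring window."""
    scores = window_scores(seq)
    start = max(range(len(scores)), key=scores.__getitem__)
    return start, scores[start]
```

For example, best_site("GATGAAgtatgtaccta") scans the 3' end of the human exon shown in the alignment; with these invented probabilities the top window starts at the last exon base, directly before the gt of the intron. A real detector would estimate the matrix from known splice sites and pick a score threshold rather than taking only the maximum.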