Stanford CS 262 - Lecture 13--Gene Recognition - D1732441

School: Stanford University

Course: CS 262 - Computational Genomics

Lecture # 13--Gene Recognition
CS 262 Computational Genomics
Prof. Batzoglou
Lecture # 13, May 15th 2003
Scribed by Clifford Bryant

Overview:
The human genome has approximately 3 billion base pairs and contains perhaps 30,000 genes. With the human genome, or any genome for that matter, simply knowing the base sequence is not enough. We must identify the genes. A few features make genes recognizable. For one thing, all genes that code for proteins must have a start codon (ATG). Note that the presence of ATG does not define a gene; it is only that all genes begin their protein-coding regions with this sequence.

Genes consist of introns and exons. Introns are portions that are spliced out prior to translation to protein, and exons are the actual protein-coding regions. The very feature that defines introns, namely that they are spliced out and the flanking exons are ligated back together, means that they are flanked by endonuclease-sensitive (splice) sites. These sites have a defined sequence, which is another marker for recognition.

Exons, the parts that get translated and show up in the actual phenotype, are highly conserved because successful motifs tend to stay the same. This implies that a comparison between two analogous stretches of DNA from different organisms will give clues to the location of exons through localized sequence conservation.

Fig. 1: Schematic stretch of human genome with putative exons as equal signs

A                              B                                      C
|____=_____==____=___=_________|___=_____=_______=_____=_=__________=_|

Questions:
- Where is the start/end of the gene?
- How many genes are there? (Say, AB and BC are genes, or perhaps AC is a gene.)
- Is there a gene we have missed (a false negative)?
- Have we really identified the introns and exons correctly?

Approaches to Gene Finding:
i) Homology: use a search (such as BLAST) to compare our sequence to a database of other sequences that have been annotated, i.e., where the genes have been defined.
One method is Procrustes, which combines homology search with a model and expresses, as a percentage, its confidence that a certain sequence is a gene.
ii) Ab initio methods, such as HMMs, which are of computational interest; see below.
iii) Hybrids: not reliant on a database. Use two sequences from related organisms and look for regions of high conservation, which are likely to be genes.

Hidden Markov Models for Genes:
Gene organization leads naturally to HMMs. We always enter an intron from an exon, and this can be modeled as a state transition. In general:
i) Exon state:
- encodes protein
- shows positional bias because the triplet code is enforced
- contains no stop codons until the end. We see many triplets, yet none happens to be a stop codon (TAA, TAG, or TGA), so we can make a probabilistic argument that a scarcity of stop codons suggests an exon.
ii) Intron state:
- flanked by at least one exon
- higher sequence variability

GHMM for Gene Finding:
- ~10% of the genome is genes
- ~2% is protein coding
- ~90% is intergenic

The illustration shown in the lecture slides is considered (despite the proliferation of arrows and circles) to be a simple model. Features:
- 4 exon states: Single, Beginning, Intermediate, Ending. All have different boundary properties (see above).
- 3 different intron states.
- A shifted frame gives an entirely different protein sequence. The triplet code is enforced by having a different state for each position with respect to the beginning of the frame: index divisible by 3, index 1 mod 3, or index 2 mod 3. Each of the 3 positions has different statistical properties.
- Deciding which state you are in: if you encounter a stop codon, you had better be in the "exon final" state.

Observed Duration:
For how many residues do you stay in a given state? The intron length distribution looks like a decaying exponential: the number of observations decreases with length. This suggests that we jump out of the intron state with a fixed probability at each step, as in a Markov model. Exons show an optimum length. Their length distribution rises to a peak and then falls, with the bulk on the left-hand side.
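The decaying intron curve is what a Markov state with a self-loop produces: if the state is left with probability p at every step, the number of residues spent in it is geometrically distributed, the discrete analogue of a decaying exponential. The following is a minimal Python sketch of that relationship. The function names and the exit probability of 0.01 are illustrative assumptions, not values from the lecture.

```python
import random


def geometric_pmf(length, p_exit):
    """Probability of staying exactly `length` residues in a state
    that is left with probability `p_exit` at each step."""
    return (1.0 - p_exit) ** (length - 1) * p_exit


def sample_duration(p_exit, rng):
    """Simulate the self-loop: count residues emitted before leaving."""
    length = 1
    while rng.random() >= p_exit:  # stay with probability 1 - p_exit
        length += 1
    return length


p_exit = 0.01  # hypothetical exit probability, chosen only for illustration
rng = random.Random(0)
samples = [sample_duration(p_exit, rng) for _ in range(20000)]
mean_len = sum(samples) / len(samples)

# The formula and the simulation should agree: the expected duration is 1 / p_exit.
print(f"simulated mean length: {mean_len:.1f}, expected: {1 / p_exit:.1f}")
```

Because this distribution only ever decreases with length, a single self-looping state cannot reproduce a peaked length distribution such as the one observed for exons.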
The exon length distribution may be modeled by a negative binomial (similar to Poisson). Length distributions and their implications have influenced modeling decisions in the creation of gene-finding algorithms. It is possible to implement the length distribution as a mathematical formula or, alternatively, to run a simulation and resample data. The Viterbi algorithm is run at the end to obtain the most likely state sequence.

Splicing Complex:
Donor and acceptor sites in the splicing complex are shown with a probabilistic analysis. Look at the residues by position and at the overall sequence. A bar is set 2 bits high. If the height of a letter in the figure is as high as the bar, we know that the position has probability 1.00 of being that letter.

The total amount of information is the sum of the letter heights over the entire sequence. It is calculated for both sites. The stretch with 7.9 bits, according to this scale, specifies slightly fewer than 4 letters (7.9 / 2 ≈ 3.95, which is less than the sequence length).

The height of the letters is determined by statistical analysis of occurrence patterns correlated to neighborhood. In turn, transition probabilities in a Markov model can be correlated to splice-site probability. Bayesian models can be constructed involving factors of P(exon->intron) and P(in splice site).

Coding Potential:
Some triplets are preferred. There are tRNAs for each of the 20 amino acids (AA). Not all triplets are bound equally by a given tRNA, and not all triplets occur equally frequently. We can use this information to identify exons because they may bear distinctive triplet distributions.

Interspecies:
Human and mouse are very well conserved, with 98-99% of genes shared, meaning that shared genes have the exact same number of exons of the same lengths. There is a high level of sequence homology, to the tune of 85% within exons but only 35% for introns, reflecting the exon conservation noted earlier. More detailed and recent analysis has shown local-level rearrangements, but on the macro level homology is high. Less important regions evolve faster.
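The contrast between exon and intron conservation suggests a simple way to look for candidate exons in an alignment: slide a window along two aligned sequences and flag the windows whose percent identity is high. The following is a rough Python sketch of that idea. The function names, the toy sequences, the window size, and the 0.8 threshold are all illustrative assumptions rather than anything given in the lecture.

```python
def percent_identity(a, b):
    """Fraction of aligned columns where both sequences carry the same base.
    Columns containing a gap character ('-') never count as matches."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    if not a:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return matches / len(a)


def conserved_windows(a, b, window, threshold):
    """Start offsets of all windows whose identity is at least `threshold`."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return [
        i
        for i in range(len(a) - window + 1)
        if percent_identity(a[i:i + window], b[i:i + window]) >= threshold
    ]


# Toy alignment (hypothetical data): a shared 10-base prefix stands in for a
# conserved exon, and a fully divergent 6-base tail stands in for an intron.
human = "ATGGCCAAGT" + "TTTTTT"
mouse = "ATGGCCAAGT" + "ACGCAG"

hits = conserved_windows(human, mouse, window=8, threshold=0.8)
print(hits)  # prints [0, 1, 2, 3]
```

Only the windows that lie mostly inside the shared prefix pass the threshold. Real hybrid gene finders use far richer models, so this sketch illustrates the conservation signal only.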
A notable exception (Man, Monkey, … Chicken): the HOX-A gene has much variability, for instance between Man and Mouse. HOX-A is a regulatory gene implicated in development. Perhaps this is why we look different from them.

Twinscan:
Twinscan is an augmented Genscan. How so? Homology: sequence conservation is modeled a little bit.
Algorithm:
- Align two sequences, one human, one other.
- Mark each human base with a gap (-), mismatch (+), or match (|).
Here, we are creating a new alphabet, to wit the Cartesian product of the {A, C,
