Stanford CS 262 - Comparative Gene Recognition, Suffix Trees


CS262 Winter 2005 Lecture 12, 2/10/05
Lecturer: Serafim Batzoglou
Scribe: William Liu

Comparative Gene Recognition, Suffix Trees

1 Background

When asked how we can find genes using our evolutionary relatives, we should first notice the following: all mammals, and indeed all eukaryotic organisms, share a common biology. Every cell has a nucleus in which DNA is packed, and depending on the environment and other factors, specific regions of the DNA are unpacked and genes are expressed. Even more appealing, mammals share approximately the same genes. Of the approximately 22,000 genes in mammals, about 95% or more are similar across mammals. It therefore seems intuitive to determine regions of similarity across different mammalian species as a first step towards locating genes.

We have already discussed the process of using HMMs for finding genes. Specifically, we talked about Genscan and its ability to annotate a given sequence with the regions it believes are intergenic, coding exons, or introns. Many indicators give insight into where these regions reside in the sequence: start and stop codons for gene regions, splice site signals that determine the exon/intron boundaries, coding biases in codon triplet pairs (each triplet codes for a particular amino acid), and many more.

In Genscan's HMM specifically, intron duration is modeled using a geometric distribution, while exon duration follows a general model. That is to say, the exon duration distribution is obtained through training on actual data.

2 Comparison-Based Method

The goal of comparison-based methods for gene finding is the following: given two or more genomes, together with their alignments, we want to deduce the gene structures common to these organisms.
This method works because when we take genomes that are relatively “close” to each other, the gene structure will be preserved between the two genomes. Moreover, the parts of the sequences that code for proteins (the regions we are interested in) are much more similar than the parts that do not. Therefore, if we are given two genomes that are evolutionarily similar, we can use sequence similarity as a signal for gene/coding regions.

For the particular example of the human and mouse genomes, Makalowski et al. (1996) give the following numbers:

Sequence identity between genes in human/mouse
Exons: 84.6%
Proteins: 85.4%
Introns: 35%
5' UTRs: 67%
3' UTRs: 69%
* 27 proteins were 100% identical in the two genomes

We can clearly see that coding regions are conserved more strongly than non-coding regions. Exons and their associated proteins are approximately 85% identical between the two genomes, compared with 35% for introns. In a similar manner, the plot below (Fig. A) describes the similarity between human and multiple other species: macaque, pig, rabbit, mouse, rat, and finally chicken. The blue regions show the exons and the red regions the conserved non-coding regions. The plot clearly shows that exon regions, for the most part, are conserved across the different species. It also shows that, as expected, we are more similar to our close evolutionary relatives (e.g. the macaque) than to distant species (e.g. the chicken, with which we have almost nothing in common).

This close similarity can also cause problems for comparative methods. If all the species chosen were evolutionarily very similar, there would be so many regions of similarity that it would be extremely difficult to determine which regions are coding and which are non-coding. Species selection is therefore extremely important in these comparative methods.
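To make the idea of using sequence identity as a coding signal concrete, here is a minimal sketch that computes percent identity over annotated column ranges of a pairwise alignment. The function name `region_identity`, the toy alignment rows, and the exon/intron coordinates are all hypothetical and chosen only for illustration; they are not taken from Makalowski et al. or from any real annotation.

```python
def region_identity(human_row, other_row, intervals):
    """Percent of aligned columns inside `intervals` where both rows carry the same base.

    Columns with a gap in exactly one row count as non-identical;
    columns that are gaps in both rows are skipped.
    """
    matches = 0
    compared = 0
    for start, end in intervals:  # half-open [start, end) column ranges
        for h, o in zip(human_row[start:end], other_row[start:end]):
            if h == "-" and o == "-":
                continue
            compared += 1
            if h == o:
                matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Hypothetical toy alignment: exon (9 columns), intron (6 columns), exon (9 columns).
human = "ATGGCCAAA" + "GTACGT" + "TAGCTGGAA"
mouse = "ATGGCTAAA" + "GTT-CA" + "TAGCTGGAA"

exons = [(0, 9), (15, 24)]   # assumed annotation, for illustration only
introns = [(9, 15)]

exon_identity = region_identity(human, mouse, exons)
intron_identity = region_identity(human, mouse, introns)
print(f"exon identity:   {exon_identity:.1f}%")
print(f"intron identity: {intron_identity:.1f}%")
```

On this toy input the exon columns come out far more identical than the intron columns, which is the same qualitative pattern as in the human/mouse table. A real comparative gene finder would not know the exon coordinates in advance; it would use the similarity signal to help infer them.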
From the given VISTA plot, rabbit appears to be the best species to compare with human because it yields the cleanest representation of the conserved exon regions.

Figure A: VISTA plot of homologous regions in 6 different organisms (macaque, pig, rabbit, mouse, rat, and chicken, respectively) versus human.

However, we cannot discount possible conservation in non-coding/intron regions. Even in Fig. A we can see a non-coding region (the red portion) that is conserved throughout most of the species. Standard comparative gene finders have many problems annotating regions where non-coding introns are conserved as well as the coding exons. An infamous example of a region that shows this pathology is the HoxA region in human versus mouse (Fig. B). Luckily, there are ways to model these conserved non-coding regions so that we do not falsely annotate them as exon regions (we will see this later in the discussion of SLAM).

Figure B: Visual alignment of the HoxA region in human vs. mouse. The region contains 11 genes with two exons each (blue). Notice that the non-coding intron regions (red) are just as similar as the exon regions.

2.1 TwinScan

Twinscan is a gene-structure prediction system that extends the original HMM from Genscan. As in Genscan's HMM, exon duration is modeled from actual training data and intron duration is modeled using a geometric distribution. The change that Twinscan introduces is simply to take an alignment of two sequences as the input to the Genscan HMM. Because the input to the HMM changes, the alphabet of the HMM needs to change as well.
The authors of the Twinscan paper augmented the alphabet in the following manner:

Given the alignment of two sequences (e.g. human vs. mouse), mark each human base as aligned with: a gap ( - ), a mismatch ( : ), or a match ( | ).

Thus the new alphabet is the following (4 x 3 = 12 letters):

{A-, A:, A|, C-, C:, C|, G-, G:, G|, T-, T:, T|}

Since we have a new alphabet, we also need to adjust the emission distributions of the given states. These parameters (e.g. e_k(b)) are trained on a database of real genes from human/mouse. Again, we can use the maximum likelihood method to estimate these parameters (EM Algorithm). Intuitively,
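The alphabet construction and emission training described in this section can be sketched as follows. This is only an illustration under stated assumptions, not Twinscan's actual code: the function names `conservation_symbols` and `estimate_emissions` are hypothetical, alignment columns where the human row is a gap are simply skipped (the notes do not say how they are handled), and the state labels are assumed to be known so that maximum likelihood reduces to counting with a pseudocount, which is a simplification in place of the EM procedure mentioned above.

```python
from collections import Counter

BASES = "ACGT"
MARKS = "-:|"  # gap, mismatch, match
ALPHABET = [b + m for b in BASES for m in MARKS]  # the 12 two-character letters


def conservation_symbols(human_row, mouse_row):
    """Return one symbol per human base: the base followed by '-', ':' or '|'."""
    symbols = []
    for h, m in zip(human_row, mouse_row):
        if h == "-":
            continue  # assumption: no human base in this column, so nothing to label
        if m == "-":
            mark = "-"   # human base aligned with a gap
        elif h == m:
            mark = "|"   # match
        else:
            mark = ":"   # mismatch
        symbols.append(h + mark)
    return symbols


def estimate_emissions(symbols, states, pseudocount=1.0):
    """Estimate e_k(b) by relative frequency, assuming each symbol's state k is known."""
    counts = {}
    for sym, state in zip(symbols, states):
        counts.setdefault(state, Counter())[sym] += 1
    emissions = {}
    for state, c in counts.items():
        total = sum(c.values()) + pseudocount * len(ALPHABET)
        emissions[state] = {a: (c[a] + pseudocount) / total for a in ALPHABET}
    return emissions


# Hypothetical toy alignment and state labels, for illustration only.
symbols = conservation_symbols("ACG-T", "AC-GA")
states = ["exon", "exon", "intron", "intron"]
emissions = estimate_emissions(symbols, states)
print(symbols)
print(emissions["exon"]["A|"])
```

With real training data one would expect exon states to put more probability on the match letters (A|, C|, G|, T|) than intron states do, which is what lets the HMM use conservation as evidence for coding regions.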
