Scanning Biological Sequences for Interesting Features

The scan of a single biological sequence is one of the basic applications of statistics in biology. It comes up in many forms:

1. Scanning the genome for CpG islands.
2. Scanning protein sequences for charge clusters.
3. Scanning protein sequences for transmembrane domains.

One third of all proteins are membrane proteins embedded within the cell’s fatty outer layer. For example, here is a cartoon of the protein integrin. These proteins contain a hydrophobic transmembrane domain.

Given the protein sequence, can the transmembrane domain be computationally identified?

From Brendel and Karlin, 1992.

Comparing genomes

Comparisons at the genome level are a much harder computational and theoretical problem.

From International Human Genome Sequencing Consortium (2001), Nature.

At a finer scale, we can start to see patterns.

From Gregory et al. (2002), Nature.

Within the genome of a single species, there are many duplications, translocations, and inversions.

From The Arabidopsis Genome Initiative (2000), Nature.

How genomes evolve through duplication.

From Deonier, Tavaré and Waterman, 2005.

How much of the genome is functional? How much of the genome is conserved?

- The yeast genome contains 70% coding sequences.
- The human genome contains 1.2% protein-coding sequence.

Does the stationarity assumption work?

From Venter J.C. et al., 2001, Science.

Definition of Terms

- Homology (of genes) = similarity due to common ancestry. There are two types of homology; the distinction depends on the ordering of speciation and gene duplication dates.
- Orthologues = the “same” gene in different organisms, that is, common ancestry goes back to a speciation event.
- Paralogues = different genes in the same organism, that is, common ancestry goes back to a gene duplication.
- There are other forms of homology, such as lateral gene transfer.

Synteny

- Linked genes = genes that reside on the same chromosome.
- Conserved synteny = a group of linked genes that are highly conserved and hypothesized to be homologous.
- Syntenic segment = a group of landmarks that appear in the same order on a single chromosome in each of the two species.
- Syntenic block = a set of adjacent syntenic segments.

Genome Alignment

1. To align a whole genome we assume that the syntenic regions have already been found through homologous genes. Next, the vast non-coding regions need to be aligned.
2. Alignment of non-coding regions is much harder, due to the low conservation.
3. To combine speed and sensitivity, most programs use an anchored-alignment approach: in a first step, a fast search tool is used to identify a chain of high-scoring sequence similarities. These similarities are then used as anchor points for the final alignment, where a more sensitive method aligns the regions that are left over between the identified anchor points.
4. This is what the fast pairwise alignment algorithms BLAST and FASTA do. For genome alignment, the programs differ in the details of how the anchors are strung together, how many anchors to use, etc.

For example, CHAOS, which was developed here by Batzoglou’s group, uses the following seed-and-extension scheme.

Scoring functions for DNA sequences

Continuous time Markov chains

Questions to think about

1. How should one frame the null hypothesis in genome alignment, or is it even relevant?
2. How should one choose the parameters for the alignment?
3. How sensitive is the “optimal” alignment to the alignment parameters?
4. What does “homology” mean when it applies to non-coding regions? What is the unit of measurement? Can it possibly be inferred at the nucleotide
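
The anchored-alignment approach described under Genome Alignment can be illustrated with a small sketch. It is a toy illustration under stated assumptions, not the scheme used by CHAOS, BLAST, or FASTA: seeds are taken to be exact k-mer matches, the chain is simply the longest set of non-overlapping seeds that increase in both sequences (found by a quadratic-time dynamic program), and the "more sensitive method" for the left-over regions is not implemented. The function names `find_seeds`, `chain_anchors`, and `gaps_between` are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def find_seeds(a, b, k):
    """Return all (i, j) with a[i:i+k] == b[j:j+k] (exact k-mer matches).

    Assumption: exact matches stand in for the 'fast search tool'.
    """
    index = defaultdict(list)
    for j in range(len(b) - k + 1):
        index[b[j:j + k]].append(j)
    seeds = []
    for i in range(len(a) - k + 1):
        for j in index.get(a[i:i + k], []):
            seeds.append((i, j))
    return seeds


def chain_anchors(seeds, k):
    """Pick a longest chain of non-overlapping seeds increasing in both sequences."""
    seeds = sorted(seeds)
    n = len(seeds)
    if n == 0:
        return []
    best = [1] * n    # best[t] = length of the longest chain ending at seed t
    prev = [-1] * n   # back-pointer to the previous seed in that chain
    for t in range(n):
        i, j = seeds[t]
        for s in range(t):
            pi, pj = seeds[s]
            # Seed s may precede seed t only if it ends before t starts
            # in both sequences.
            if pi + k <= i and pj + k <= j and best[s] + 1 > best[t]:
                best[t] = best[s] + 1
                prev[t] = s
    t = max(range(n), key=best.__getitem__)
    chain = []
    while t != -1:
        chain.append(seeds[t])
        t = prev[t]
    return chain[::-1]


def gaps_between(chain, k, len_a, len_b):
    """Return ((a_start, a_end), (b_start, b_end)) for each left-over region.

    These are the regions a slower, more sensitive aligner would handle.
    """
    regions = []
    a_pos, b_pos = 0, 0
    for i, j in chain:
        regions.append(((a_pos, i), (b_pos, j)))
        a_pos, b_pos = i + k, j + k
    regions.append(((a_pos, len_a), (b_pos, len_b)))
    return regions
```

A typical call sequence would be `seeds = find_seeds(a, b, k)`, then `chain = chain_anchors(seeds, k)`, then `gaps_between(chain, k, len(a), len(b))`. The choice of `k` and of how many anchors to keep corresponds to the alignment parameters raised in the questions above; real genome aligners use far more efficient chaining and inexact seeds.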


Stanford STATS 345 - Scanning Biological Sequences
