Comparing Mouse vs Human Genomes

Comparisons at the genome level are a much harder computational and theoretical problem.
From International Human Genome Sequencing Consortium (2001), Nature.

At the finer scale, we can start to see patterns.
From Gregory et al. (2002), Nature.

Within the genome of a single species, there are many duplications, translocations, and inversions.
From The Arabidopsis Genome Initiative (2000), Nature.

How genomes evolve through duplication.
From Deonier, Tavaré and Waterman, 2005.

How much of the genome is conserved?
- Yeast genome contains 70% coding sequences.
- Human genome contains 1.2% protein coding sequence.

Does the stationarity assumption work?
From Venter J.C. et al, 2001 Science.

Definition of Terms
- Homology (of genes) = similarity due to common ancestry. There are two types of homology; the distinction depends on the order of the speciation and gene duplication events.
- Orthologues = the "same" gene in different organisms, that is, common ancestry goes back to a speciation event.
- Paralogues = different genes in the same organism, that is, common ancestry goes back to a gene duplication.
- There are other forms of homology, such as lateral gene transfer.

Synteny
- linked genes = genes that reside on the same chromosome.
- conserved synteny = a group of linked genes that are highly conserved and hypothesized to be homologous.
- syntenic segment = a group of landmarks that appear in the same order on a single chromosome in each of the two species.
- syntenic block = a set of adjacent syntenic segments.

Genome Alignment
1. To align a whole genome we assume that the syntenic regions have already been found through homologous genes. Next, the vast non-coding regions need to be aligned.
2. Alignment of non-coding regions is much harder, due to the low conservation.
3. To combine speed and sensitivity, most programs use an anchored-alignment approach: in a first step, a fast search tool is used to identify a chain of high-scoring sequence similarities. These similarities are then used as anchor points for the final alignment, where a more sensitive method aligns those regions that are left over between the identified anchor points.
4. This is what the fast pairwise alignment algorithms BLAST and FASTA do. For genome alignment, the programs differ in the details of how the anchors are strung together, how many anchors to use, etc.

For example, CHAOS, which was developed here by Batzoglou's group, uses a seed-and-extension scheme.

Questions to think about
1. How should one frame the null hypothesis in genome alignment? Is it relevant?
2. How should one choose the parameters for the alignment?
3. How sensitive is the "optimal" alignment to the alignment parameters?
4. What does "homology" mean when it applies to non-coding regions? What is the unit of measurement? Can it possibly be inferred at the nucleotide level?
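The anchored-alignment approach described under Genome Alignment (find fast similarities, chain them into anchors, then hand the leftover regions to a more sensitive aligner) can be illustrated with a small sketch. This is not the CHAOS algorithm or any published tool: it assumes exact k-mer matches as seeds, performs no seed extension, and chains seeds with a simple quadratic dynamic program. The function names `find_seeds`, `chain_seeds` and `regions_between_anchors`, and the toy sequences, are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def find_seeds(a, b, k):
    """Return sorted (i, j) pairs with a[i:i+k] == b[j:j+k] (exact k-mer seeds)."""
    index = defaultdict(list)
    for j in range(len(b) - k + 1):
        index[b[j:j + k]].append(j)
    seeds = []
    for i in range(len(a) - k + 1):
        for j in index.get(a[i:i + k], ()):
            seeds.append((i, j))
    return sorted(seeds)


def chain_seeds(seeds, k):
    """Pick a longest chain of non-overlapping seeds that is increasing
    in both sequences (O(n^2) dynamic program over sorted seeds)."""
    if not seeds:
        return []
    best = [1] * len(seeds)    # best[t] = length of best chain ending at seed t
    prev = [-1] * len(seeds)   # back-pointers for recovering the chain
    for t, (i, j) in enumerate(seeds):
        for s in range(t):
            pi, pj = seeds[s]
            if pi + k <= i and pj + k <= j and best[s] + 1 > best[t]:
                best[t] = best[s] + 1
                prev[t] = s
    t = max(range(len(seeds)), key=best.__getitem__)
    chain = []
    while t != -1:
        chain.append(seeds[t])
        t = prev[t]
    return chain[::-1]


def regions_between_anchors(a, b, chain, k):
    """Return the substring pairs left between consecutive anchors; these are
    what a slower, more sensitive method would align in the final step."""
    regions = []
    pa = pb = 0
    for i, j in chain:
        regions.append((a[pa:i], b[pb:j]))
        pa, pb = i + k, j + k
    regions.append((a[pa:], b[pb:]))
    return regions


# Toy example (made-up sequences, k chosen arbitrarily).
a = "GATTACATTCCGGTA"
b = "GATTACAGGGCCGGTA"
anchors = chain_seeds(find_seeds(a, b, k=5), k=5)
print(anchors)
print(regions_between_anchors(a, b, anchors, k=5))
```

Even in this toy setting the choice of `k` matters: a smaller `k` yields more seeds, some of them spurious, while a larger `k` misses diverged regions. That trade-off is one concrete form of the parameter-sensitivity question raised above.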