CS 2427 Algorithms in Molecular Biology
Lecture 13, 1 March 2006
Presenter: Dr. Tim Hughes
Scribe notes by: Pingzhao Hu

1 Introduction to Microarrays

Gene expression profiling using nucleic acid hybridization-based methods has become popular in medical and biological research and development. These methods use high-density microarrays that allow the researcher to screen for the expression of thousands of genes simultaneously in a single experiment. These arrays fall into two main categories: the Affymetrix GeneChip microarray (also called a one-color microarray) and the cDNA microarray (also called a two-color microarray).

The Affymetrix GeneChip microarray uses oligonucleotides of length 25 base pairs. Typically, an mRNA molecule of interest (a gene) is represented by a probe set composed of 11-20 probe pairs of these oligonucleotides. Each probe pair is composed of a perfect match (PM) probe, which is a section of the mRNA molecule of interest, and a mismatch (MM) probe. The MM probe is identical to the PM probe except that its middle (13th) base has been changed, so that it measures non-specific binding. These probe pairs together make up a probe set. An example of a probe pair:

Perfect match: AGGCTATCGCACTCCAGTGG
Mismatch:      AGGCTATCGTACTCCAGTGG

Because the oligos are short, a gene can be represented by many probe sets on the array. RNA samples are prepared, labeled, and hybridized to the arrays.

A cDNA microarray contains a probe from a known gene on each spot. The probes on the array are longer pieces of DNA that are complementary to the genes under study. Usually the cDNA probes for making the array are produced from a commercially available cDNA library, so that a closer representation of the entire genome of an organism can be made on the array. It is also possible to use PCR to amplify specific genes from genomic DNA to generate the cDNA probes. The resulting cDNA probes are then mechanically spotted onto a glass slide. Because cDNA probes are much longer than the oligos on an Affymetrix GeneChip, a probe that hybridizes successfully on a cDNA microarray corresponds almost exactly to one gene.

Usually two samples, an experimental sample and a control sample (also called the baseline sample), are prepared and hybridized to the same array. The control sample can be labeled with a green fluorescing dye called Cy3 and the experimental sample with a red fluorescing dye called Cy5. If there is more of an mRNA transcript in the control sample than in the experimental sample, more Cy3 will bind to the probe on the array and the spot will fluoresce green. If there is more in the experimental sample, the spot will fluoresce red. In many cases the two samples contain the same amount of transcript, and the spot will fluoresce yellow.

Besides these two types of arrays, other types include ink-jet arrays and maskless arrays, but they are not widely used.

Although there are many different types of microarrays, the analysis of the data obtained from them is divided into two main levels. One is low-level analysis (also called preprocessing), which focuses on summarizing and normalizing the data. The other is high-level analysis, which focuses on mining the summarized and normalized data. The two levels are discussed in detail below.

2 Preprocessing (low-level analysis) of microarray expression data

Preprocessing microarray gene expression data involves many issues, and the preprocessing methods differ considerably between types of microarray. Here we focus on two issues: summarization and normalization. Summarization produces a single expression value for each probe (a spot with two colors) on a cDNA microarray, or for each probe set (11-20 probe pairs) on an Affymetrix microarray. Normalization makes all of the data directly comparable.

For cDNA microarray data, summarization and normalization are two separate steps. Usually we use log2(Cy3/Cy5) as the expression value for each spot. There are many normalization methods for this type of microarray, such as variance stabilizing normalization (vsn).

For Affymetrix microarray data, summarization and normalization are usually a unified step. Methods for this purpose include:

- MAS5, developed by Affymetrix, which uses PM-MM to adjust for non-specific binding and background noise.
- The model-based method developed by Li and Wong (2001), which uses PM-MM values and a non-linear normalization, and takes multi-array summaries into account to detect and remove outliers.

3 High-level analysis of multiple experiments

Multiple microarray experiments can be analyzed with different goals. The following goals were discussed in the course:

1. Identifying differentially expressed genes by comparing experiments from different biological conditions
2. Clustering experiments and transcripts
3. Predicting gene functions from microarray experiments

3.1 Identifying differentially expressed genes

Identifying differentially expressed genes means comparing the expression levels of genes in samples obtained from different biological conditions, such as people with cancer and people without cancer. One useful tool for this purpose is a scatter plot in which the x axis shows the expression values in one condition (say, with cancer) and the y axis shows the expression values in the other condition (say, without cancer). The genes that are highly expressed in one condition and lowly expressed in the other are the most interesting.

Fig. 1 (modified from Prof. Hughes's slide) illustrates a similar example in which each condition (Cup5 and Vma8, respectively; note that WT is the baseline) has only one array. As we can see, gene A is highly expressed in Cup5 but lowly expressed in Vma8, while gene B is highly expressed in Vma8 but lowly expressed in Cup5. Most other genes are either highly expressed or lowly expressed in both conditions. Therefore genes A and B are the most interesting genes.

[Figure 1: Scatter plot of expression values in two conditions. Axes are labeled Log10 Cup5 and Log10 Vma8 (relative to WT); genes A and B are marked.]

The drawback of this method is that it cannot tell how significant the selected set of genes is. A more precise conclusion can be reached by calculating a p-value for each gene using a supervised statistical test such as the t-test, or a model-based method such as a mixture model. For example, if we have more microarrays in the above experiments, say n_Cup5 and n_Vma8 arrays
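As an illustration of the per-gene test mentioned above, the following is a minimal Python sketch for the case of several arrays per condition. It is not taken from the lecture. It assumes the data are log10 ratios against WT, and it uses made-up values for gene A and for a hypothetical unchanged gene C. The function names `welch_t` and `permutation_p_value` are introduced here for illustration only. To stay free of external libraries, the p-value is obtained by exhaustively relabeling the arrays (an exact permutation test on the t statistic) rather than from the t distribution; this is a substitute technique, named as such, and a library routine for the classical t-test could be used instead.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Two-sample t statistic that does not assume equal variances."""
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    diff = mean(x) - mean(y)
    if se == 0.0:  # degenerate relabeling: no spread within either group
        return 0.0 if diff == 0.0 else float("inf")
    return diff / se


def permutation_p_value(x, y, tol=1e-9):
    """Two-sided p-value: fraction of array relabelings with |t| >= observed."""
    pooled = list(x) + list(y)
    observed = abs(welch_t(x, y))
    hits = total = 0
    # Enumerate every way of assigning len(x) of the pooled arrays to group x.
    for idx in combinations(range(len(pooled)), len(x)):
        chosen = set(idx)
        group_x = [pooled[i] for i in idx]
        group_y = [v for i, v in enumerate(pooled) if i not in chosen]
        total += 1
        if abs(welch_t(group_x, group_y)) >= observed - tol:
            hits += 1
    return hits / total


# Hypothetical log10(condition / WT) ratios, four arrays per condition.
# These numbers are invented for the example, not measured data.
genes = {
    "A": ([1.1, 0.9, 1.2, 1.0], [-0.8, -1.0, -0.9, -1.1]),  # (Cup5, Vma8)
    "C": ([0.2, -0.1, 0.05, 0.0], [0.1, 0.15, -0.2, -0.05]),
}

p_values = {
    name: permutation_p_value(cup5, vma8) for name, (cup5, vma8) in genes.items()
}

# Report genes from most to least significant.
for name, p in sorted(p_values.items(), key=lambda item: item[1]):
    print(f"gene {name}: p = {p:.3f}")
```

With four arrays per condition there are only 70 possible relabelings, so the smallest two-sided p-value this sketch can report is 2/70 (about 0.029). More replicate arrays would be needed to reach stricter significance thresholds, particularly when thousands of genes are tested at once.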