CSC 2427: Algorithms in Molecular Biology
Lecture 14
Lecturer: Michael Brudno
Scribe Note: Hyonho Lee
Department of Computer Science, University of Toronto
03 March 2006

Microarrays Revisited

In the last lecture, the guest lecturer, Tim Hughes, talked about microarrays and gene expression. A microarray is basically a two-dimensional array to which each gene (or a set of genes) is attached. Using a microarray, we can measure the expression of a certain gene under various circumstances. Note that a gene is not always expressed: it is sometimes on and sometimes off, depending on circumstances such as the type of cell, the external conditions, or cell division. The exception is a housekeeping gene, which is always on under any circumstance, since it is needed for the maintenance of the cell.

After doing a series of microarray experiments, we obtain a two-dimensional array in which each row represents a gene and each column represents an experiment. For example, we can measure gene expression in various types of cells. In these experiments, each column represents a cell type, for example a brain cell, a liver cell, or a cancer cell. We could also measure gene expression over time. For example, the gene expression of an embryo changes over time as the embryo develops. In this case, the columns of the array represent the time line, so we can see in which period each gene is expressed. Each entry of the array shows the expression of one gene in one experiment.

When a gene is expressed, mRNA from the cell binds to the DNA on the microarray. In a microarray, many copies of the same DNA are attached to each location. So if many mRNAs bind to the DNA of a gene, the microarray shows a high level of expression for that gene.

There are two main types of microarrays: 1-channel microarrays (e.g., Affymetrix) and 2-channel microarrays (e.g., cDNA microarrays). In a 1-channel microarray, we only prepare the test cell. The result shows how much each gene is expressed by the test cell. Usually, a greener image shows a higher level of gene expression. In a 2-channel microarray, we prepare both the control cell and the test cell. The control is usually a mix of all kinds of cell tissues. The mRNA of the control sample is dyed green and the mRNA of the test cell is dyed red. If a gene is more expressed in the control sample than in the test cell, the microarray result shows green. If the gene is more expressed in the test cell than in the control sample, the microarray shows red. If the gene is equally expressed in both, the result is yellow.

After getting the result of a microarray experiment, we normalize it to make the results comparable. Using the normalized microarray data, we can cluster genes that have similar expression patterns or similar functions. In other words, we investigate which genes work together. There are many techniques for clustering, such as principal component analysis (PCA), independent component analysis (ICA), and Bayesian networks. One way to measure the correlation of two genes is the Pearson correlation, which is

$$\frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{(n-1)\, S_X S_Y}$$

where $X$ and $Y$ are the microarray data of the two genes over $n$ experiments, $\bar{X}$ and $\bar{Y}$ are their means, and $S_X$ and $S_Y$ are the standard deviations of $X$ and $Y$, respectively.

Gibbs sampling for motif finding

In the promoter of a gene, there is a transcription factor binding site (TFBS), which binds transcription factors when the gene is expressed. A transcription factor is a protein, and without its binding, RNA polymerase does not transcribe the DNA. Since a specific transcription factor binds a specific binding site, it is very useful to find the binding sites in the promoter. One way to find a binding site is phylogenetic footprinting. Since functional sequences are usually better conserved than nonfunctional sequences, we can predict the binding site using footprinting. This will be covered in the next lecture.

In this lecture, we focus on finding regulatory motifs. Since many genes usually participate in the same process at the same time, many genes tend to be co-expressed.
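
Co-expression of two genes can be quantified with the Pearson correlation given in the microarray section. The following is a minimal Python sketch of that formula, assuming each gene is represented by a list of normalized expression values, one per experiment. The function name `pearson` and the three example profiles are hypothetical and chosen only for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample standard deviations S_X and S_Y (divide by n - 1).
    s_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / (n - 1))
    s_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / (n - 1))
    if s_x == 0 or s_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    # Numerator: sum of products of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cov / ((n - 1) * s_x * s_y)

# Hypothetical normalized expression values across four experiments.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]  # rises whenever gene_a rises
gene_c = [4.0, 3.0, 2.0, 1.0]  # falls whenever gene_a rises

r_ab = pearson(gene_a, gene_b)  # close to +1: same expression pattern
r_ac = pearson(gene_a, gene_c)  # close to -1: opposite expression pattern
```

A value near +1 suggests that two genes are expressed together, a value near -1 that they are expressed in opposite conditions, and a value near 0 that their expression is not linearly related.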
Given this co-expression, it is believed that a short motif that is widespread among many genes may play an important role in binding transcription factors. Regulatory motifs usually have a short, fixed length. They are repetitive, even within a single gene, but very variable. Thus, we want to find a pattern rather than a fixed sequence. For example, our target motif would be like GC T T G AAT C.

One solution for finding a motif is Gibbs sampling. Gibbs sampling is basically a special case of the Markov chain Monte Carlo (MCMC) method. Suppose we want to find a motif of length $K$ given $t$ DNA sequences $X_1, X_2, \ldots, X_t$. Then the Gibbs sampling algorithm is an iterative algorithm, described as follows.

1. After each iteration, we are given $t$ locations $a_1, \ldots, a_t$ for $X_1, \ldots, X_t$, respectively. Let $x_j$ be the substring of $X_j$ of length $K$ starting at $a_j$.

2. We randomly choose one gene $X_i$ from $X_1, \ldots, X_t$.

3. We calculate a $4 \times K$ position weight matrix (PWM) from the remaining $t-1$ sequences $X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_t$. Each entry $M_{a,b}$ of the PWM indicates the frequency of nucleotide $a$ at the $b$-th position of the $x_j$'s.

4. We also calculate the background probability for each nucleotide. Let $X_j - x_j$ be the subsequence of $X_j$ after removing $x_j$. The background probability $B_a$ is the proportion of occurrences of nucleotide $a$ in all $t$ subsequences $X_1 - x_1, \ldots, X_t - x_t$.

5. Now we are ready to pick a new $K$-long substring from $X_i$ based on the PWM and the background probability. For each $K$-long substring $y_j$ starting at position $j$ of $X_i$, we calculate two probabilities: the probability from the PWM and the probability from the background. For example, if $y_j = \mathrm{ACGT}$, then we calculate
$$P(y_j \mid \text{motif}) = M_{A,1} \times M_{C,2} \times M_{G,3} \times M_{T,4}$$
and
$$P(y_j \mid \text{Background}) = B_A \times B_C \times B_G \times B_T.$$
Then we calculate $\frac{P(y_j \mid \text{motif})}{P(y_j \mid \text{Background})}$ for each $y_j$.

6. We select a position $k$ in $X_i$ based on the odds ratio $\frac{P(y_j \mid \text{motif})}{P(y_j \mid \text{Background})}$. Hence, if a position $p$ has a higher value of $\frac{P(y_p \mid \text{motif})}{P(y_p \mid \text{Background})}$, then $p$ is more likely to be chosen as $k$. We will use $a_1, \ldots, a_{i-1}, k, a_{i+1}, \ldots, a_t$ for the next iteration.

The Gibbs sampling algorithm is very similar to the expectation-maximization (EM) algorithm. If we run the Gibbs sampling algorithm infinitely, then it guarantees that we will find the best motif. We normally run the Gibbs sampling algorithm for a certain number of steps.

In the Gibbs sampling algorithm, we choose a new motif based on the PWM and the background probability. So if one entry of the PWM is zero, then we never choose a motif that includes that entry. For example, if $M_{A,1} = 0$, then any motif starting with A is never chosen. We do not want this kind of situation since even
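
The iteration described in steps 1 to 6 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the lecturer's implementation: the function names (`build_pwm`, `background`, `odds_ratios`, `gibbs_motif_search`), the random initial positions, the fixed iteration count, and the example sequences are all hypothetical. To keep every PWM entry nonzero, the sketch adds a pseudocount to each nucleotide count; this is one common remedy for the zero-entry problem and is an assumption here, not something stated in these notes.

```python
import random

NUCLEOTIDES = "ACGT"  # sequences are assumed to contain only these letters

def build_pwm(motifs, k, pseudocount=1.0):
    """PWM as a list of K dicts: pwm[b][a] is the frequency of a at position b."""
    pwm = []
    for b in range(k):
        # Assumption: a pseudocount keeps every entry strictly positive.
        counts = {a: pseudocount for a in NUCLEOTIDES}
        for m in motifs:
            counts[m[b]] += 1
        total = sum(counts.values())
        pwm.append({a: counts[a] / total for a in NUCLEOTIDES})
    return pwm

def background(seqs, starts, k, pseudocount=1.0):
    """B_a: nucleotide proportions outside the current motif windows."""
    counts = {a: pseudocount for a in NUCLEOTIDES}
    for s, a_j in zip(seqs, starts):
        for ch in s[:a_j] + s[a_j + k:]:  # X_j with x_j removed
            counts[ch] += 1
    total = sum(counts.values())
    return {a: counts[a] / total for a in NUCLEOTIDES}

def odds_ratios(seq, pwm, bg, k):
    """P(y_j | motif) / P(y_j | Background) for every K-long window of seq."""
    ratios = []
    for j in range(len(seq) - k + 1):
        ratio = 1.0
        for b, ch in enumerate(seq[j:j + k]):
            ratio *= pwm[b][ch] / bg[ch]
        ratios.append(ratio)
    return ratios

def gibbs_motif_search(seqs, k, iterations=1000, seed=0):
    """Return one start position per sequence (each sequence needs length >= k)."""
    rng = random.Random(seed)
    # Assumption: initial locations a_1..a_t are chosen uniformly at random.
    starts = [rng.randrange(len(s) - k + 1) for s in seqs]
    for _ in range(iterations):
        i = rng.randrange(len(seqs))                      # step 2
        motifs = [s[a:a + k] for s, a in zip(seqs, starts)]
        pwm = build_pwm(motifs[:i] + motifs[i + 1:], k)   # step 3
        bg = background(seqs, starts, k)                  # step 4
        weights = odds_ratios(seqs[i], pwm, bg, k)        # step 5
        # Step 6: sample position k with probability proportional to its odds ratio.
        starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# Hypothetical promoter sequences for illustration only.
example_seqs = ["GGCATTATAATGC", "TTATAATCCGGAC", "ACGGTATAATTCA", "CCTATAATGGTTG"]
example_starts = gibbs_motif_search(example_seqs, k=6, iterations=200)
```

Because the new position is sampled rather than chosen greedily, different seeds can give different answers; in practice one would run the sampler several times and compare the resulting motifs.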