DOC PREVIEW
Stanford CS 262 - Regulation of Genes

This preview shows page 1-2-3 out of 8 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 8 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 8 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 8 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 8 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Regulation of genesFinding Regulatory MotifsMOTIF FINDINGScribed by Madhura SharangpaniIn the previous lecture we studied : Regulation of genesRegulation of GenesGeneRegulatory ElementRNA polymerase(Protein)Transcription Factor(Protein)DNAGene regulation is the process responsible for the dynamic behavior of cells.Regulations of genes is controlled from tiny sequences ie regulatory motifs. Themotifs can be located far away from genes, although most of times a good chunkappear upstream of the genes a few hundred bases apart. The regulatory elementscontrol when RNA transcription will be initiated and what protein products will itproduce. We use micro array technology to measure transcription level of all genes ata given time and location in a cell. This initiates the process of finding regulatorymotifs for genes.In today’s lecture we will study about finding regulatory motifs. Finding Regulatory MotifsDefinition: Given a collection of sequences of genes, possibly with commonexpressions, to find a short, maybe constant size sequence that is shared by all thesequences.Motif finding is a very difficult process since the sequence of letters that we arelooking at is not always conserved in their respective locations. We see a pattern ofconservation where each letter is conserved a certain percentage of time.Problem definition:The problem of motif finding basically falls into two broad categories: Probabilisticand Combinatorial.Finding Regulatory MotifsGiven a collection of genes with common expression,Find the TF-binding motif in common...1. Probabilistic definition:In probabilistic view we can think of a motif as a matrix of size 4xW (W being thelength of the motif) of probability values, each entry giving the probability of theletters A, C, G, T occurring in that column position.So the problem is, given the motif ie the matrix M and a set of input sequences, findthe best occurrence of the motif in each sequence. Total score of the motif is theprobability of best score of the motif in each sequence. We need to find the motifmatrix M that maximizes the motif score.2. Combinatorial definition:The motif is a consensus sequence. M: m1…..mwWhere: each mi is the letter A, C, G or T.If you try to find the occurrence of the best motif in each sequence, ie the occurrencewith smallest number of differences, the total number of differences in all sequencesshould be minimal.The problem basically involves finding best local alignment (of length W) of a largenumber of sequences.Depending on the problem definition there are various approaches to solve thisproblem.Probabilistic approaches:1. Expectation maximization: MEME.2. Gibbs sampling:(Implemented in AlignACE and later in BioProspector)Combinatorial approaches:1. Consensus2. Teiresias3. SP-STAR4. othersDiscrete Formulation of the problem:Given input sequences S = { x1, x2, …….. xn }each with given lengthA motif W is a consensus string w1…wk.Find the motif W with best match with x1….xn.Definition of Best:d(W,xr) = minimum hamming distance between W and a word in x1.D(W,S) = d(W,xr). rCombinatorial Methods:Few approaches to deal with this problem in combinatorial view are:1. Exhaustive Searches2. Consensus3. Multiprofiler, Teiresias, SP-STAR, Winnower.1] Exhaustive Search:For W = AAA…A upto TTT…T, ie for every single letter W (there are 4^k suchwords) we need to find the value of d(W,S) .Running Time: O (K*N*4^k)Where N = summation of lengths of all sequences.Advantage: The algorithm finds the best motif W.Disadvantage: Time requirement is very high.1. Sample driven algorithm:For W = a k long word in some sequence xi, find d(W,S).Report W = argmin(d(W,S)) or report a local improvement of W.Running Time: O (k*N^2)Advantage: Time it takes is less as compared to previous approach.Disadvantage: If a true motif does not occur in the data and if true motif is weak, thenrandom motif may score better than any instance of true motif.2] ConsensusConsensus is an algorithm to find motifs heuristically. Number of cycles in the algorithm is equal to number of sequences – 1.Cycle 1:For every pair of words in two different sequences, compute best alignment.For each word W in SFor each word W’ in SCreate alignment (gap free) of W, W’.Keep C1 best alignments A1, …. , Ac1.Cycle t: For each word W in SFor each alignment A1 from cycle t = 1Create alignment ( gap free ) of W, A1.Keep Ct best alignments A1,…….,Act.Essentially we match every word in the sequence with every alignment that we havekept in the memory in previous cycles.Running Time:O(N^2) + O(N*C1) + O(N*C2) + ….. + O(N*Cn)= O(N^2 + NCtotal)WhereCtotal =  Ci, C being the number of motifs you keep from one cycle to another. typically O(N*C), where C is a big constant.One advantage of the algorithm is that you can control the total running time bycontrolling the value of C.3] Multiprofiler:Extended sample driven approachGiven a K-long word W, define:Na(W) = words W’ in S s.t. d(W, W’) <= aie words with atmost ‘a’ difference.Idea:Assuming W is the occurrence of true motif W*, we use Na(W) to correct the ‘errors’in W ie the differences betweens W and the true motif.Algorithm:We basically reduce our problem to find the best K-long motifs.For each word W in S:For L = 1 to LmaxFind all strong L long wordlets G in Na(W) -> W’.Modify W by wordlet GCompute d(W’,S)Report W* = argmin d(W’,S)Probabilistic Methods:Expectation Maximization in Motif FindingExpectation maximization is a classical computer science algorithm that can be applied tomotif finding. We will see how it is applied in the MEME system that is a very old andsuccessful motif finder.Algorithm:1. Given genomic sequences find all K-long words. 2. Divide the words into two categories or bags, motif or background in a way thatmaximizes the likelihood of this classification. Notice that we are trying to find astrongest motif word can occur many times in a sequence.3. Find likeliest motif and background models and the best classification of words.One additional parameter that we will study in this algorithm is , the a priory probabilityof occurrence of a motif.  tells you of all the word that you have in the bag, how manywords you expect to be motif words. You multiply motif probability with , thebackground probability with (1-) and then you divide them into two bags.


View Full Document

Stanford CS 262 - Regulation of Genes

Documents in this Course
Lecture 8

Lecture 8

38 pages

Lecture 7

Lecture 7

27 pages

Lecture 4

Lecture 4

12 pages

Lecture 1

Lecture 1

11 pages

Biology

Biology

54 pages

Lecture 7

Lecture 7

45 pages

Load more
Download Regulation of Genes
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Regulation of Genes and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Regulation of Genes 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?