DOC PREVIEW
U of I CS 498 - Motif finding with Gibbs sampling

This preview shows page 1-2-3-25-26-27 out of 27 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 27 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 27 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 27 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 27 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 27 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 27 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 27 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Motif finding with GibbssamplingCS 498 SSRegulatory networks• Genes are switches, transcription factors are(one type of) input signals, proteins areoutputs• Proteins (outputs) may be transcriptionfactors and hence become signals for othergenes (switches)• This may be the reason why humans have sofew genes (the circuit, not the number ofswitches, carries the complexity)Decoding the regulatory network• Find patterns (“motifs”) in DNA sequencethat occur more often than expected bychance– These are likely to be binding sites fortranscription factors– Knowing these can tell us if a gene isregulated by a transcription factor (i.e., the“switch”)GENEACAGT GATRANSCRIPT IONFACTORPROTEINTranscriptional regulationGENEACAGT GATRANSCRIPT IONFACTORPROTEINTranscriptional regulationA motif model• To define a motif, lets say we know wherethe motif starts in the sequence• The motif start positions in their sequencescan be represented as s = (s1,s2,s3,…,st)Genes regulatedby same transcription factorMotifs: Matrices and Consensus a G g t a c T t C c A t a c g tAlignment a c g t T A g t a c g t C c A t C c g t a c g G _________________ A 3 0 1 0 3 1 1 0Matrix C 2 4 0 0 1 4 0 0 G 0 1 4 0 0 0 3 1 T 0 0 0 5 1 0 1 4 _________________Consensus A C G T A C G T• Line up the patterns bytheir start indexes s = (s1, s2, …, st)• Construct “position weightmatrix” with frequencies ofeach nucleotide in columns• Consensus nucleotide ineach position has thehighest frequency incolumnPosition weight matrices• Suppose there were t sequences to begin with• Consider a column of a position weight matrix• The column may be (t, 0, 0, 0)– A perfectly conserved column• The column may be (t/4, t/4, t/4, t/4)– A completely uniform column• “Good” profile matrices should have moreconserved columnsInformation Content• In a PWM, convert frequencies to probabilities• PWM W: Wβk = frequency of base β at position k• qβ = frequency of base β by chance• Information content of W:! W"klogW"kq""#{A,C,G,T }$k$Information Content• If Wβk is always equal to qβ, i.e., if W issimilar to random sequence, informationcontent of W is 0.• If W is different from q, information contentis high.Detecting Subtle SequenceSignals: a Gibbs SamplingStrategy for Multiple AlignmentLawrence et al. 1993Motif Finding Problem• Given a set of sequences, find the motifshared by all or most sequences, while itsstarting position in each sequence isunknown• Assumption:– Each motif appears exactly once in onesequence– The motif has fixed lengthGenerative Model• Suppose the sequences are aligned, thealigned regions are generated from a motifmodel• Motif model is a PWM. A PWM is aposition-specific multinomial distribution.– For each position i, a multinomial distributionon (A,C,G,T): qiA,qiC,qiG,qiT• The unaligned regions are generated froma background model: pA,pC,pG,pTNotations• Set of symbols:• Sequences: S = {S1, S2, …, SN}• Starting positions of motifs: A = {a1, a2, …, aN}• Motif model ( ) : qij = P(symbol at the i-th position = j)• Background model: pj = P(symbol = j)• Count of symbols in each column: cij= count of symbol, j,in the i-th column in the aligned region!!Probability of data given model!!="==Wi jcijijqASP1||1),|(#!!="==Wi jcjijpASP1||10),|(#Scoring Function• Maximize the log-odds ratio:• Is greater than zero if the data is a bettermatch to the motif model than to thebackground model!!="==Wi jcijijqASP1||1),|(#!!="==Wi jcjijpASP1||10),|(#!!="===Wi jjijijpqcASPASPF1||10log),|(),|(log##Optimization and Sampling• To maximize a function, f(x):– Brute force method: try all possible x– Sample method: sample x from probabilitydistribution: p(x) ~ f(x)– Idea: suppose xmax is argmax of f(x), then it isalso argmax of p(x), thus we have a highprobability of selecting xmaxMarkov Chain sampling• To sample from a probability distribution p(x),we set up a Markov chain s.t. each staterepresents a value of x and for any two states,x and y, the transitional probabilities satisfy:)()(1lim xpxCNN=!"• This would then imply that if the Markov chainis “run” for “long enough”, the probabilitythereafter of being in state x will be p(x)! p(x)Pr(x " y) = p(y)Pr(y " x)Gibbs sampling to maximize F• Gibbs sampling is a special type of Markov chainsampling algorithm• Our goal is to find the optimal A = (a1,…aN)• The Markov chain we construct will only havetransitions from A to alignments A’ that differ from Ain only one of the ai• In round-robin order, pick one of the ai to replace• Consider all A’ formed by replacing ai with someother starting position ai’ in sequence Si• Move to one of these A’ probabilistically• Iterate the last three stepsAlgorithmRandomly initialize A0;Repeat:(1) randomly choose a sequence z from S;A* = At \ az; compute θt from A*;(2) sample az according to P(az = x), which isproportional to Qx/Px; update At+1 = A* ∪ x;Select At that maximizes F;Qx: the probability of generating x according to θt; Px: the probability of generating x according to the background model! qij=cijcikk"AlgorithmCurrent solution AtAlgorithmChoose one az to replaceAlgorithmFor each candidate sitex in sequence z, calculate Qx and Px:Probabilities of samplingx from motif model andbackground model resp.xAlgorithmAmong all possible candidates, choose one(say x) with probabilityproportional to Qx/PxxAlgorithmSet At+1 = A* ∪ xxAlgorithmRepeatxLocal optima• The algorithm may not find the “global” ortrue maximum of the scoring function• Once “At” contains many similarsubstrings, others matching these will bechosen with higher probability• Algorithm will “get locked” into a “localoptimum”– all neighbors have poorer scores, hence lowchance of moving out of this


View Full Document

U of I CS 498 - Motif finding with Gibbs sampling

Documents in this Course
Lecture 5

Lecture 5

13 pages

LECTURE

LECTURE

39 pages

Assurance

Assurance

44 pages

LECTURE

LECTURE

36 pages

Pthreads

Pthreads

29 pages

Load more
Download Motif finding with Gibbs sampling
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view Motif finding with Gibbs sampling and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view Motif finding with Gibbs sampling 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?