U of I CS 498 - Motif finding with Gibbs sampling - D1325504

Home> Schools> University of Illinois> Computer Science (CS) > CS 498> Motif finding with Gibbs sampling

DOC PREVIEW

U of I CS 498 - Motif finding with Gibbs sampling

School name University of Illinois

Course Cs 498- Special Topics

Pages 27

This preview shows page 1-2-3-25-26-27 out of 27 pages.

Save

View full document

Premium Document

Do you want full access? Go Premium and unlock all 27 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 27 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 27 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 27 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 27 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

View full document

Premium Document

Do you want full access? Go Premium and unlock all 27 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Premium Document

Do you want full access? Go Premium and unlock all 27 pages.

Access to all documents

Download any document

Ad free experience

Subscribe for instant access Get instant access

Unformatted text preview:

Motif finding with GibbssamplingCS 498 SSRegulatory networks• Genes are switches, transcription factors are(one type of) input signals, proteins areoutputs• Proteins (outputs) may be transcriptionfactors and hence become signals for othergenes (switches)• This may be the reason why humans have sofew genes (the circuit, not the number ofswitches, carries the complexity)Decoding the regulatory network• Find patterns (“motifs”) in DNA sequencethat occur more often than expected bychance– These are likely to be binding sites fortranscription factors– Knowing these can tell us if a gene isregulated by a transcription factor (i.e., the“switch”)GENEACAGT GATRANSCRIPT IONFACTORPROTEINTranscriptional regulationGENEACAGT GATRANSCRIPT IONFACTORPROTEINTranscriptional regulationA motif model• To define a motif, lets say we know wherethe motif starts in the sequence• The motif start positions in their sequencescan be represented as s = (s1,s2,s3,…,st)Genes regulatedby same transcription factorMotifs: Matrices and Consensus a G g t a c T t C c A t a c g tAlignment a c g t T A g t a c g t C c A t C c g t a c g G _________________ A 3 0 1 0 3 1 1 0Matrix C 2 4 0 0 1 4 0 0 G 0 1 4 0 0 0 3 1 T 0 0 0 5 1 0 1 4 _________________Consensus A C G T A C G T• Line up the patterns bytheir start indexes s = (s1, s2, …, st)• Construct “position weightmatrix” with frequencies ofeach nucleotide in columns• Consensus nucleotide ineach position has thehighest frequency incolumnPosition weight matrices• Suppose there were t sequences to begin with• Consider a column of a position weight matrix• The column may be (t, 0, 0, 0)– A perfectly conserved column• The column may be (t/4, t/4, t/4, t/4)– A completely uniform column• “Good” profile matrices should have moreconserved columnsInformation Content• In a PWM, convert frequencies to probabilities• PWM W: Wβk = frequency of base β at position k• qβ = frequency of base β by chance• Information content of W:! W"klogW"kq""#{A,C,G,T }$k$Information Content• If Wβk is always equal to qβ, i.e., if W issimilar to random sequence, informationcontent of W is 0.• If W is different from q, information contentis high.Detecting Subtle SequenceSignals: a Gibbs SamplingStrategy for Multiple AlignmentLawrence et al. 1993Motif Finding Problem• Given a set of sequences, find the motifshared by all or most sequences, while itsstarting position in each sequence isunknown• Assumption:– Each motif appears exactly once in onesequence– The motif has fixed lengthGenerative Model• Suppose the sequences are aligned, thealigned regions are generated from a motifmodel• Motif model is a PWM. A PWM is aposition-specific multinomial distribution.– For each position i, a multinomial distributionon (A,C,G,T): qiA,qiC,qiG,qiT• The unaligned regions are generated froma background model: pA,pC,pG,pTNotations• Set of symbols:• Sequences: S = {S1, S2, …, SN}• Starting positions of motifs: A = {a1, a2, …, aN}• Motif model ( ) : qij = P(symbol at the i-th position = j)• Background model: pj = P(symbol = j)• Count of symbols in each column: cij= count of symbol, j,in the i-th column in the aligned region!!Probability of data given model!!="==Wi jcijijqASP1||1),|(#!!="==Wi jcjijpASP1||10),|(#Scoring Function• Maximize the log-odds ratio:• Is greater than zero if the data is a bettermatch to the motif model than to thebackground model!!="==Wi jcijijqASP1||1),|(#!!="==Wi jcjijpASP1||10),|(#!!="===Wi jjijijpqcASPASPF1||10log),|(),|(log##Optimization and Sampling• To maximize a function, f(x):– Brute force method: try all possible x– Sample method: sample x from probabilitydistribution: p(x) ~ f(x)– Idea: suppose xmax is argmax of f(x), then it isalso argmax of p(x), thus we have a highprobability of selecting xmaxMarkov Chain sampling• To sample from a probability distribution p(x),we set up a Markov chain s.t. each staterepresents a value of x and for any two states,x and y, the transitional probabilities satisfy:)()(1lim xpxCNN=!"• This would then imply that if the Markov chainis “run” for “long enough”, the probabilitythereafter of being in state x will be p(x)! p(x)Pr(x " y) = p(y)Pr(y " x)Gibbs sampling to maximize F• Gibbs sampling is a special type of Markov chainsampling algorithm• Our goal is to find the optimal A = (a1,…aN)• The Markov chain we construct will only havetransitions from A to alignments A’ that differ from Ain only one of the ai• In round-robin order, pick one of the ai to replace• Consider all A’ formed by replacing ai with someother starting position ai’ in sequence Si• Move to one of these A’ probabilistically• Iterate the last three stepsAlgorithmRandomly initialize A0;Repeat:(1) randomly choose a sequence z from S;A* = At \ az; compute θt from A*;(2) sample az according to P(az = x), which isproportional to Qx/Px; update At+1 = A* ∪ x;Select At that maximizes F;Qx: the probability of generating x according to θt; Px: the probability of generating x according to the background model! qij=cijcikk"AlgorithmCurrent solution AtAlgorithmChoose one az to replaceAlgorithmFor each candidate sitex in sequence z, calculate Qx and Px:Probabilities of samplingx from motif model andbackground model resp.xAlgorithmAmong all possible candidates, choose one(say x) with probabilityproportional to Qx/PxxAlgorithmSet At+1 = A* ∪ xxAlgorithmRepeatxLocal optima• The algorithm may not find the “global” ortrue maximum of the scoring function• Once “At” contains many similarsubstrings, others matching these will bechosen with higher probability• Algorithm will “get locked” into a “localoptimum”– all neighbors have poorer scores, hence lowchance of moving out of this

View Full Document