Stanford CS 262 - Gibbs Sampling

CS262: Computational Genomics, Lecture 15, February 22nd, 2005. Scribe: Aditya Krishnan

In this session we cover a set of topics that fall under regulatory motif finding. We begin with Gibbs sampling for motif finding: the algorithm is presented, followed by its advantages and disadvantages and a suggested improvement concerning repeats and a better background model. The second part deals with phylogenetic footprinting, where the exact algorithm is illustrated with a small example. Rapid global alignment forms the third part of the lecture. The fourth and final component is a discussion of methods for chaining local alignments, which is essentially sparse dynamic programming in O(N log N) time.

In the previous lecture we saw the problem of finding regulatory motifs and, among other approaches, the expectation-maximization (EM) algorithm. We now turn to Gibbs sampling.

Gibbs Sampling

Problem Formulation:
Gibbs sampling is a Markov chain Monte Carlo (MCMC) method widely used for Bayesian inference problems. It copes well with incomplete information and is often suggested for applications such as motif finding. The first statistical motif finder based on Gibbs sampling is AlignACE; its improved version is called BioProspector. The problem is formulated as follows:
• Given: (a) sequences x1, …, xN, (b) motif length K, and (c) background model B
• Find: (i) a model M, and (ii) locations a1, …, aN in x1, …, xN
• Maximizing the log-odds likelihood ratio:

  F = ∑_{i=1..N} ∑_{k=1..K} log [ M(k, x_{ai+k}) / B(x_{ai+k}) ]

Algorithm:
Given the problem, the basic steps of the algorithm are:
• Initialization: select random locations in sequences x1, …, xN, and compute an initial model M from these locations.
• Sampling iterations:
  i) Predictive step: remove one sequence xi, then recompute the model from the remaining sequences;
  ii) Sampling step: pick a new motif location in xi according to the probability that each location is a motif occurrence.

Here are some details of the algorithm. In the initialization step, we select random locations a1, …, aN in x1, …, xN and compute the model M:

  Mkj = (1/N) ∑_{i=1..N} 1(x_{ai+k} = j)

That is, Mkj is the number of occurrences of letter j at motif position k, divided by the total number of sequences.

In the predictive step, we select a sequence x = xi from x1, x2, …, xN, remove it, and recompute the model from the remaining N − 1 sequences:

  Mkj = (1 / (N − 1 + B)) [ ∑_{s ≠ i} 1(x_{as+k} = j) + βj ]

where the βj are pseudocounts to avoid zeros, and B = ∑j βj.

In the sampling step, for every K-long word xj, …, xj+K−1 in x we define two quantities, Qj and Pj, computed under the new model:

  Qj = Prob[ word | motif ]      = M(1, xj) × … × M(K, xj+K−1)
  Pj = Prob[ word | background ] = B(xj) × … × B(xj+K−1)

Let

  Aj = (Qj / Pj) / ∑_{j' = 1..|x|−K+1} (Qj' / Pj')

The new motif location in xi is then chosen according to the probabilities Aj just calculated.

When applying Gibbs sampling we usually run the algorithm several times rather than once, i.e. we perform multiple initializations, run Gibbs sampling to convergence after each, and finally report the common motif. Gibbs sampling is very similar to the EM algorithm; in fact, it is a generalized EM algorithm.
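To make the sampler concrete, here is a minimal Python sketch of the procedure described above (initialization, predictive step, sampling step). This is not the lecture's implementation: the names build_model and gibbs_motif_search, the NumPy usage, and the fixed iteration count are assumptions made for illustration, and the background is passed as simple 0th-order letter frequencies.

import random
import numpy as np

ALPHABET = "ACGT"
IDX = {c: i for i, c in enumerate(ALPHABET)}

def build_model(seqs, starts, K, beta=1.0):
    """Profile M[k, j]: frequency of letter j at motif position k over the
    given sequences, with pseudocount beta in every cell to avoid zeros."""
    counts = np.full((K, 4), beta)
    for seq, a in zip(seqs, starts):
        for k in range(K):
            counts[k, IDX[seq[a + k]]] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def gibbs_motif_search(seqs, K, background, iters=1000, beta=1.0, seed=0):
    """One run of the sampler: random initialization, then repeated
    predictive/sampling steps.  `background` maps each letter to its
    0th-order background probability."""
    rng = random.Random(seed)
    N = len(seqs)
    # Initialization: random start positions a_1 .. a_N.
    starts = [rng.randrange(len(s) - K + 1) for s in seqs]
    for _ in range(iters):
        i = rng.randrange(N)  # sequence whose motif location is resampled
        # Predictive step: rebuild the model from the other N-1 sequences.
        others = [s for t, s in enumerate(seqs) if t != i]
        other_starts = [a for t, a in enumerate(starts) if t != i]
        M = build_model(others, other_starts, K, beta)
        # Sampling step: weight every K-long window j by A_j proportional to Q_j / P_j.
        x = seqs[i]
        weights = []
        for j in range(len(x) - K + 1):
            word = x[j:j + K]
            Q = np.prod([M[k, IDX[c]] for k, c in enumerate(word)])
            P = np.prod([background[c] for c in word])
            weights.append(Q / P)
        starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts, build_model(seqs, starts, K, beta)

In practice one would call gibbs_motif_search from several random seeds, score each result with the log-odds ratio F above, and report the motif found in common across the best runs.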
The advantages and disadvantages of Gibbs sampling are as follows:

Advantages:
• The algorithm is easier to implement
• Less dependent on initial parameters
• More versatile, easier to enhance with heuristics

Disadvantages:
• More dependent on all sequences exhibiting the motif
• Less systematic search of the initial parameter space

Repeats, and a Better Background Model

Another problem with Gibbs sampling is that repeats may be reported as motifs, especially low-complexity sequences such as CACACA…, AAAAA…, etc. The solution is to use a better background model, i.e. a higher-order, more elaborate background model:
• 0th order: B = { pA, pC, pG, pT }
• 1st order: B = { P(A|A), P(A|C), …, P(T|T) }
• ……
• Kth order: B = { P(X | b1…bK); X, bi ∈ {A, C, G, T} }

This approach has been applied to the EM and Gibbs algorithms with background models up to 3rd order.

Phylogenetic Footprinting

What is Phylogenetic Footprinting?
According to Tagle et al. 1988, functional sequences evolve more slowly than nonfunctional ones. If we consider a set of orthologous sequences from different species, we therefore want to identify unusually well conserved regions.

Problem formulation:
Given:
• a phylogenetic tree T,
• a set of orthologous sequences at the leaves of T,
• the motif length k,
• a threshold d,
we want to find every set S of k-mers, one k-mer from each leaf, such that the "parsimony" score of S in T is at most d. Note that the problem is NP-hard. A sketch of the parsimony score itself is given after the alignment example below.

Example: we want to find a motif of length k = 4 given the following sequences at the leaves of a phylogenetic tree:

AGTCGTACGTGAC... (Human)
AGTAGACGTGCCG... (Chimp)
ACGTGAGATACGT... (Rabbit)
GAACGGAGTACGT... (Mouse)
TCGTGACGGTGAT... (Rat)

The chosen k-mers (ACGG, ACGT, ACGT, ACGT, …) give a solution with parsimony score 1, i.e. a single mutation separates the occurrences.

CLUSTALW multiple sequence alignment (rbcS gene):

Cotton     ACGGTT-TCCATTGGATGA---AATGAGATAAGAT---CACTGTGC---TTCTTCCACGTG--GCAGGTTGCCAAAGATA-------AGGCTTTACCATT
Pea        GTTTTT-TCAGTTAGCTTA---GTGGGCATCTTA----CACGTGGC---ATTATTATCCTA--TT-GGTGGCTAATGATA-------AGG--TTAGCACA
Tobacco    TAGGAT-GAGATAAGATTA---CTGAGGTGCTTTA---CACGTGGC---ACCTCCATTGTG--GT-GACTTAAATGAAGA-------ATGGCTTAGCACC
Ice-plant  TCCCAT-ACATTGACATAT---ATGGCCCGCCTGCGGCAACAAAAA---AACTAAAGGATA--GCTAGTTGCTACTACAATTC--CCATAACTCACCACC
Turnip     ATTCAT-ATAAATAGAAGG---TCCGCGAACATTG--AAATGTAGATCATGCGTCAGAATT--GTCCTCTCTTAATAGGA-------A-------GGAGC
Wheat      TATGAT-AAAATGAAATAT---TTTGCCCAGCCA-----ACTCAGTCGCATCCTCGGACAA--TTTGTTATCAAGGAACTCAC--CCAAAAACAAGCAAA
Duckweed   TCGGAT-GGGGGGGCATGAACACTTGCAATCATT-----TCATGACTCATTTCTGAACATGT-GCCCTTGGCAACGTGTAGACTGCCAACATTAATTAAA
Larch
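To make the parsimony objective above concrete, here is a small Python sketch that scores one candidate set of k-mers on a tree with Sankoff-style dynamic programming (unit substitution cost). The nested-tuple tree encoding, the function name parsimony_score, and the particular tree topology and k-mer assignment in the example are assumptions chosen only to reproduce the score of 1 from the lecture's example; the lecture's exact algorithm, which searches over all k-mer choices, is not reproduced here.

ALPHABET = "ACGT"

def parsimony_score(tree, leaf_kmers, K):
    """Minimum number of substitutions needed to explain the k-mers at the
    leaves of `tree` (Sankoff dynamic programming, unit cost per mismatch).
    A leaf is a species-name string; an internal node is a tuple of subtrees."""
    def column_cost(node, pos):
        # Best subtree cost if this node's character at motif position
        # `pos` is fixed to each possible letter.
        if isinstance(node, str):                       # leaf
            c = leaf_kmers[node][pos]
            return {a: (0 if a == c else float("inf")) for a in ALPHABET}
        child_costs = [column_cost(child, pos) for child in node]
        return {a: sum(min(cc[b] + (b != a) for b in ALPHABET)
                       for cc in child_costs)
                for a in ALPHABET}
    return sum(min(column_cost(tree, pos).values()) for pos in range(K))

# Hypothetical tree topology and k-mer assignment for the lecture's example:
# a single substitution (T -> G in the last position) explains all occurrences.
tree = ((("Human", "Chimp"), "Rabbit"), ("Mouse", "Rat"))
kmers = {"Human": "ACGT", "Chimp": "ACGT", "Rabbit": "ACGT",
         "Mouse": "ACGT", "Rat": "ACGG"}
print(parsimony_score(tree, kmers, K=4))   # -> 1

Evaluating this score over every possible combination of leaf k-mers is what makes the general problem hard; the sketch only scores one fixed choice.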

