Computational Genomics: CS 262 Lecture Notes for May 27th, 2003
Scribed by Sumedh Pathak

1. Gibbs Sampling in Motif Finding

Gibbs sampling is another algorithm used to find motifs in sequences. AlignACE and BioProspector are two motif-finding tools that use Gibbs sampling techniques; BioProspector was developed at Stanford and improves on the earlier AlignACE technique.

In Gibbs sampling we are given N sequences x_1, ..., x_N, a motif length K, and a background model B, a set of probabilities that lets us compare the likelihood of a residue under the background model versus the true (motif) model. We must find a model M, which gives the probability of each residue at each position of the motif, and positions a_1, ..., a_N in the sequences x_1, ..., x_N. These locations signify the start of the motif in each sequence, so we can obtain an alignment of the motifs by finding the positions a_1, ..., a_N that maximize the log-odds likelihood ratio

    F(a_1, ..., a_N) = \sum_{i=1}^{N} \sum_{k=1}^{K} \log \frac{M(k, x_{a_i+k})}{B(x_{a_i+k})}

Intuitively, we would like to select the most similar words so as to maximize this sum of log-likelihood ratios; this gives us the optimal motif in the sequences.

2. Gibbs Sampling: Algorithm

There are two basic steps to the Gibbs sampling algorithm: an initialization step, followed by a predictive update step that involves iterative sampling.

Initialization:
a. Select random locations a_1, ..., a_N in the sequences x_1, ..., x_N.
b. Compute an initial model M from these locations:

    M_{kj} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{ x_{a_i+k} = j \}

i.e., M_{kj} is the number of occurrences of letter j (from the alphabet {A, C, G, T}) at motif position k, divided by the total number of sequences.

Sampling iterations (predictive update):
a. Remove a sequence x = x_i.
b. Recalculate the model M.
c. Pick a new location for the motif in x_i according to the probability that each location is a motif occurrence.

In the predictive update step, the model is recomputed as

    M_{kj} = \frac{1}{N - 1 + B} \Big( \beta_j + \sum_{s=1,\, s \neq i}^{N} \mathbf{1}\{ x_{a_s+k} = j \} \Big)

where the \beta_j are pseudocounts, added to avoid zeros, and B = \sum_j \beta_j.

The sampling step compares the probability of a word being generated by the motif model versus the background model:

    Q_j = Prob[ word | motif ]      = M(1, x_j) \times ... \times M(K, x_{j+K-1})
    P_j = Prob[ word | background ] = B(x_j) \times ... \times B(x_{j+K-1})

Let

    A_j = \frac{Q_j / P_j}{\sum_{j'=1}^{|x|-K+1} Q_{j'} / P_{j'}}

A_j can be thought of as the probability that the word of length K starting at position j is a motif, relative to the words starting at every other position; the values A_1, ..., A_{|x|-K+1} form a probability mass function over the positions of the sequence x_i. We sample a random new position a_i according to these probabilities.

Thus, running Gibbs sampling to find motifs involves three main steps (a code sketch appears at the end of this section):
a. Initialize.
b. Run predictive update and sampling until convergence.
c. Repeat steps a and b several times and report the common motifs.

This algorithm is very similar to EM, and it has some advantages and disadvantages.

Advantages:
a. Easier to implement than EM.
b. Less dependent on initial parameters.
c. More versatile and easier to enhance with heuristics, e.g., using cutoffs so that words are sampled as motifs only if they match or increase the maximum score already found.

Disadvantages:
a. More dependent on all sequences exhibiting the motif.
b. Less systematic search of the initial parameter space.
c. Repeat DNA can be confused for a motif: e.g., low-complexity repeats such as CACACA... or AAAA....

The repeat problem can be addressed by using a more elaborate background model, e.g.:

    0th order: B = { p_A, p_C, p_G, p_T }
    1st order: B = { P(A|A), P(A|C), ..., P(T|T) }
    ...
    Kth order: B = { P(X | b_1 ... b_K) ; X, b_i \in {A, C, G, T} }
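To make the background-model orders concrete, here is a minimal sketch in Python of scoring a word under a 0th-order versus a 1st-order background. The probability values, dictionary layout, and function names are illustrative placeholders, not taken from the notes or from any tool:

    # 0th order: independent per-letter frequencies B = {p_A, p_C, p_G, p_T}.
    B0 = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}          # made-up values

    # 1st order: conditional probabilities B1[prev][X] = P(X | prev).
    B1 = {
        "A": {"A": 0.2, "C": 0.3, "G": 0.2, "T": 0.3},
        "C": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},     # C->A transitions common
        "G": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
        "T": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    }

    def prob_word_0th(word):
        """P(word) under the 0th-order background: product of letter frequencies."""
        p = 1.0
        for ch in word:
            p *= B0[ch]
        return p

    def prob_word_1st(word):
        """P(word) under the 1st-order background: the first letter uses the
        0th-order frequencies; each later letter is conditioned on its predecessor."""
        p = B0[word[0]]
        for prev, ch in zip(word, word[1:]):
            p *= B1[prev][ch]
        return p

    print(prob_word_0th("CACACA"), prob_word_1st("CACACA"))

A higher-order background trained on the actual genome assigns a larger probability P_j to low-complexity words such as CACACA, which shrinks the ratio Q_j / P_j and makes such words less likely to be sampled as motifs.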
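Putting the pieces of Section 2 together, here is a minimal sketch of one run of the sampler, assuming a 0th-order background model B given as a dictionary of letter frequencies. All function and variable names are my own (this is not the AlignACE or BioProspector code), and positions are 0-based:

    import random

    ALPHABET = "ACGT"

    def build_model(sequences, positions, K, exclude, beta=1.0):
        """Predictive update: recompute M with uniform pseudocounts beta_j = beta,
        counting letters at each motif position over all sequences except x_i."""
        counts = [{a: beta for a in ALPHABET} for _ in range(K)]
        n = 0
        for s, (seq, pos) in enumerate(zip(sequences, positions)):
            if s == exclude:
                continue
            n += 1
            for k in range(K):
                counts[k][seq[pos + k]] += 1
        denom = n + beta * len(ALPHABET)   # (N - 1) plus the total pseudocounts
        return [{a: counts[k][a] / denom for a in ALPHABET} for k in range(K)]

    def sample_position(seq, M, B, K):
        """Sample a new motif start j with probability A_j proportional to Q_j / P_j."""
        ratios = []
        for j in range(len(seq) - K + 1):
            q = p = 1.0
            for k in range(K):
                q *= M[k][seq[j + k]]      # Q_j = M(1, x_j) ... M(K, x_{j+K-1})
                p *= B[seq[j + k]]         # P_j = B(x_j) ... B(x_{j+K-1})
            ratios.append(q / p)
        return random.choices(range(len(ratios)), weights=ratios)[0]

    def gibbs_motif(sequences, K, B, iterations=1000):
        """One run: random initialization, then iterated predictive update and
        sampling. In practice the whole run is restarted several times and the
        common motifs are reported."""
        positions = [random.randrange(len(x) - K + 1) for x in sequences]
        for _ in range(iterations):
            i = random.randrange(len(sequences))                    # remove x_i
            M = build_model(sequences, positions, K, exclude=i)     # recompute M
            positions[i] = sample_position(sequences[i], M, B, K)   # resample a_i
        return positions

For example, gibbs_motif(seqs, K=8, B={"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}) returns one candidate motif start per sequence; the motif model M can then be rebuilt from the returned positions.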
3. Example Application of Gibbs Sampling: Motifs in Yeast

Motifs in yeast were studied by Tavazoie et al. in George Church's lab at Harvard University, who used AlignACE to find motifs in the yeast genome. They used data gathered by Cho et al., which provided the expression of yeast genes over 15 time points. The 3,000 genes with the most variable expression were selected and clustered by common expression pattern using K-means clustering, yielding 30 clusters with around 50-190 genes per cluster; the clusters correlate with well-known gene functions. AlignACE was then run on upstream regions 600 bases long, with around 50 regions per trial, since Gibbs sampling does not work well on a large number of sequences at a time. [The results were shown as graphs, which are not reproduced in this text version.]

4. Phylogenetic Footprinting as a Motif-Finding Application

Phylogenetic footprinting is based on the premise that functional sequences evolve much more slowly than non-functional ones; such sequences can also be considered evolutionarily constrained regions. We can therefore assume that functional motifs will be well conserved compared to non-motif sequences. This motivates the substring parsimony problem: given a phylogenetic tree T, a set of orthologous sequences at the leaves of T, a motif length k, and a threshold d, find each set S of k-mers, one k-mer from each leaf, such that the parsimony score of S in T is at most d. The parsimony score can be thought of as the number of mutations required to change one chosen k-mer into another along the tree. This problem is NP-hard.

Example (orthologous sequences at the leaves of a phylogenetic tree; the original figure showed them arranged on a tree):

    AGTCGTACGTGAC ...
    AGTAGACGTGCCG ...
    ACGTGAGATACGT ...
    GAACGGAGTACGT ...
    TCGTGACGGTGAT ...

Selecting the 4-mer ACGT from the first four leaves and ACGG from the fifth gives a parsimony score of 1 in this tree, since it takes one mutation to change ACGG to ACGT.

An exact algorithm has been proposed to solve this problem, generalizing Sankoff and Rousseau (1975). We define:

    W_u[s] = best parsimony score for the subtree rooted at node u, if u is labeled with the string s.

Each node of the tree thus carries a table of 4^k entries, one per length-k string. At a leaf, W_u[s] = 0 if s occurs in the leaf's sequence and +\infty otherwise. [The original figure showed the example tree annotated with these tables at each node, e.g. a leaf with W_u[ACGT] = 0 and W_u[ACGG] = +\infty.] We calculate W_u[s] for all strings s of length k for each subtree, and pick the best string s at the root. The recurrence is

    W_u[s] = \sum_{v : child of u} \min_t ( W_v[t] + d(s, t) )

i.e., we calculate W_u[s] as the sum, over all children v of u, of the minimum over all length-k strings t of W_v[t] plus the distance d(s, t).
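As a concrete rendering of this recurrence, here is a minimal sketch of the dynamic program in Python, assuming d(s, t) is the Hamming distance and representing the tree as nested lists whose leaves are sequence strings. The names and the tree topology below are illustrative (the original figure's topology is not reproduced), and the naive enumeration of all 4^k strings makes this practical only for small k:

    from itertools import product

    ALPHABET = "ACGT"

    def hamming(s, t):
        """d(s, t): the number of positions at which s and t differ."""
        return sum(a != b for a, b in zip(s, t))

    def parsimony_table(node, k):
        """Compute the table W_u[s] for every length-k string s at node u.
        A leaf is a plain sequence string; an internal node is a list of children."""
        kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
        if isinstance(node, str):
            # Leaf: W_u[s] = 0 if s occurs in the leaf's sequence, else +infinity.
            present = {node[j:j + k] for j in range(len(node) - k + 1)}
            return {s: 0 if s in present else float("inf") for s in kmers}
        # Internal node: W_u[s] = sum over children v of min_t (W_v[t] + d(s, t)).
        child_tables = [parsimony_table(child, k) for child in node]
        return {s: sum(min(Wv[t] + hamming(s, t) for t in kmers)
                       for Wv in child_tables)
                for s in kmers}

    # The visible prefixes of the example's leaf sequences, on an assumed topology:
    tree = [["AGTCGTACGTGAC", "AGTAGACGTGCCG"],
            [["ACGTGAGATACGT", "GAACGGAGTACGT"], "TCGTGACGGTGAT"]]
    W = parsimony_table(tree, 4)
    print(min(W.values()))   # best parsimony score over all labels of the root

Each internal node costs O(4^k x 4^k) string comparisons in this naive form; practical implementations typically prune the tables and the set of candidate strings.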

