CS260: Machine Learning Theory
Lecture 9: Irrelevant Features & the Winnow Algorithm
October 24, 2011
Lecturer: Jennifer Wortman Vaughan

(All CS260 lecture notes build on the scribes' notes written by UCLA students in the Fall 2010 offering of this course. Although they have been carefully reviewed, it is entirely possible that some of them contain errors. If you spot an error, please email Jenn.)

1 Learning Majority Functions

Recall that in the previous lecture we proved the following mistake bound for the Perceptron algorithm.

Theorem 1. Suppose there exists a vector u of unit length and values γ > 0 and D > 0 such that for all t, y_t(x_t · u) ≥ γ and ‖x_t‖ ≤ D. Then the number of mistakes made by the Perceptron algorithm is no more than (D/γ)².

Let's see an example of how we can apply this bound.

Suppose that every day we would like to make a prediction about a binary event, for example, whether or not it will rain. Each day, we have access to the binary predictions of a set of n experts (e.g., weather forecasters), some of which are more accurate than others. Suppose that there exists some subset of k experts such that the majority opinion of these k experts is always correct, but we don't know which experts are part of that subset. Over time, we would like to learn which experts to follow.

We can represent this problem as an online classification problem. At each round t, let x_t ∈ {−1, +1}^n denote the vector of predictions of the n experts. Let w be a vector of length n where w_i = 1 if expert i is in the special subset and w_i = 0 otherwise. The correct prediction for us to make at time t is 1 if w · x_t ≥ 0 and 0 otherwise.

We could run the Perceptron algorithm to learn the subset of special experts. Assume that k is odd so that the majority is always well-defined. Let's examine how to apply the mistake bound above in this case. First, we need a perfect target function of unit length. To get this, we can simply scale the vector w and let u = w/√k. Next, we must figure out the margin γ, which is the smallest possible value of y_t(x_t · u). Since each component of x_t is either +1 or −1, and k is odd, the dot product x_t · w is a nonzero integer, so x_t · u is a nonzero multiple of 1/√k. Thus, γ ≥ 1/√k. Clearly, D = √n.

Applying Theorem 1, we can guarantee that the Perceptron makes no more than (D/γ)² = nk mistakes. Note that we do not need to know k to run the algorithm.

(Exercise: Verify that we would get the same mistake bound if we scaled the predictions of the experts.)

This example brings out the fact that the Perceptron mistake bound can sometimes have a "hidden" dependence on the dimension of the problem. In this case, the bound also depends on k, which can be viewed as the number of "relevant" features of the problem.
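The notes themselves contain no code, but the nk guarantee is easy to sanity-check empirically. The minimal Python sketch below simulates the expert-majority setting with randomly drawn ±1 predictions and a randomly chosen special subset, runs the standard additive Perceptron update, and prints the mistake count next to the nk bound. The example distribution, the parameter values, and the learner's tie-breaking rule are illustrative assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumptions, not from the notes): n experts,
# k of them relevant with k odd, and T rounds of i.i.d. uniform ±1 predictions.
n, k, T = 50, 5, 5000

relevant = rng.choice(n, size=k, replace=False)   # the hidden special subset
w_star = np.zeros(n)
w_star[relevant] = 1.0                            # the vector w from the notes

w = np.zeros(n)                                   # Perceptron weight vector
mistakes = 0
for t in range(T):
    x = rng.choice([-1.0, 1.0], size=n)           # expert predictions in {-1, +1}^n
    y = 1.0 if w_star @ x > 0 else -1.0           # majority vote of the k experts (never 0: k is odd)
    y_hat = 1.0 if w @ x > 0 else -1.0            # learner's prediction (ties broken toward -1 here)
    if y_hat != y:                                # on a mistake, standard additive Perceptron update
        w += y * x
        mistakes += 1

print(f"mistakes = {mistakes}, Theorem 1 bound nk = {n * k}")
```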
2 Winnow

The mistake bound that we derived above has a linear dependence on the total number of experts or features. We will now introduce the Winnow algorithm, which was designed to have better performance in cases in which the total number of features is large but the number of relevant features is much smaller. Winnow looks similar to the Perceptron algorithm, but uses multiplicative weight updates instead of additive ones.

Like the Perceptron, Winnow maintains a current weight vector w_t at each round t. We will use w_{i,t} to denote the weight on feature i at round t, and x_{i,t} to denote the ith component of x_t. We will go back to having y_t ∈ {0, 1} instead of {−1, +1}, and will assume that each x_t is a binary string in {0, 1}^n.

The Winnow Algorithm (with parameter β)

1. Initialize all the weights to one: w_{1,1} = w_{2,1} = ... = w_{n,1} = 1
2. For each example x_t:
   • If w_t · x_t ≥ n, then output 1; else output 0.
   • If the algorithm makes a mistake, then
     – If y_t = 1, then for all i such that x_{i,t} = 1, w_{i,t+1} ← w_{i,t}(1 + β)
     – If y_t = 0, then for all i such that x_{i,t} = 1, w_{i,t+1} ← w_{i,t}/(1 + β)
     – In both cases (y_t = 1 or 0), for all i such that x_{i,t} = 0, w_{i,t+1} ← w_{i,t}
   • Else (no mistake), w_{t+1} ← w_t

As we can see, if we make a mistake on a positive (or negative) example at time t, we increase (or decrease) the weights of all features for which x_{i,t} = 1. These are precisely the features that contribute to w_t · x_t. Because the updates are multiplicative, the weights grow faster than in the Perceptron algorithm, and it turns out that Winnow makes fewer mistakes than the Perceptron when the number of relevant features is much less than the total number of features. (A runnable sketch of these updates appears at the end of these notes.)

2.1 Learning Majority Functions with Winnow

It can be shown that Winnow can learn majority functions with a mistake bound of 2k² log n. Clearly, the Winnow algorithm gives a better mistake bound than the Perceptron for this class when the number of relevant features is much smaller than the total number of features.

The proof of this result is a little complicated, so we will instead show a mistake bound for a slightly simpler case: learning disjunctions with Winnow. The intuitions are similar.

2.2 Learning Disjunctions with Winnow

We consider an example of learning monotone disjunctions using linear thresholds with Winnow. We show that the algorithm will make at most O(k log n) mistakes, where n is the total number of variables (i.e., the dimension of x_t) and k is the number of variables that appear in the disjunction.

Recall that a monotone disjunction is a disjunction in which no literals appear negated. It is easy to extend this result to the class of all disjunctions; the n in the mistake bound simply increases to 2n. (Exercise: Figure out how to do this.)

Theorem 2. Suppose there exists a monotone disjunction f of k variables such that for all t, f(x_t) = y_t. Then the Winnow algorithm with β = 1 makes at most 2 + 3k(1 + log n) mistakes.

Proof: We will first bound the number of mistakes that are made on positive examples, and then separately bound the number of mistakes made on negative examples. Adding these bounds will yield the result. We will refer to the weights of variables that appear in the disjunction as "relevant weights."

Step 1: Bound the number of mistakes on positive examples (y_t = 1)

We start by making a few observations:

1. When we make a mistake on a positive example, at least one relevant weight is doubled. If no relevant weight were doubled, this would imply that all relevant features were 0, which would in turn imply that y_t = 0, a contradiction.
2. The relevant weights never decrease. Suppose that a relevant weight did decrease. This would imply that some relevant feature was 1 when y_t = 0, also a contradiction.
3. Once a relevant feature i has a weight w_{i,t} ≥ n, it can no longer be updated. If w_{i,t} ≥ n
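The pseudocode for Winnow translates almost line for line into Python. The sketch below is one possible implementation, assuming each x_t is supplied as a NumPy 0/1 vector; the class name `Winnow` and the method names are incidental choices for this sketch rather than anything prescribed in the notes.

```python
import numpy as np

class Winnow:
    """Winnow with threshold n and multiplicative parameter beta, as described in Section 2."""

    def __init__(self, n, beta=1.0):
        self.n = n
        self.beta = beta
        self.w = np.ones(n)              # step 1: every weight starts at 1

    def predict(self, x):
        # step 2: output 1 iff w_t . x_t >= n
        return 1 if self.w @ x >= self.n else 0

    def update(self, x, y):
        """Observe x_t (a 0/1 NumPy vector) and the true label y; returns True iff a mistake was made."""
        if self.predict(x) == y:
            return False                 # no mistake: w_{t+1} = w_t
        active = (x == 1)                # exactly the features with x_{i,t} = 1
        if y == 1:                       # mistake on a positive example: promote active weights
            self.w[active] *= (1.0 + self.beta)
        else:                            # mistake on a negative example: demote active weights
            self.w[active] /= (1.0 + self.beta)
        return True
```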


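As an illustration of the setting in Theorem 2 (not part of the original notes), the following snippet reuses the `Winnow` class sketched above with β = 1, labels random sparse Boolean examples with a hidden monotone disjunction over k of the n variables, and prints the empirical mistake count alongside 2 + 3k(1 + log n). The example distribution, the parameter values, and the base-2 reading of the log are assumptions made only for this demonstration.

```python
import math
import numpy as np

rng = np.random.default_rng(1)

n, k, T = 100, 4, 20000                              # illustrative sizes, not from the notes
relevant = rng.choice(n, size=k, replace=False)      # variables of the hidden monotone disjunction

learner = Winnow(n, beta=1.0)                        # Winnow class from the previous sketch
mistakes = 0
for t in range(T):
    x = (rng.random(n) < 0.05).astype(int)           # sparse random example in {0,1}^n (an assumption)
    y = int(x[relevant].any())                       # label = OR of the relevant variables
    if learner.update(x, y):
        mistakes += 1

bound = 2 + 3 * k * (1 + math.log2(n))               # Theorem 2's bound; log read as base 2 here (an assumption)
print(f"mistakes = {mistakes}, bound = {bound:.1f}")
```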