Statistical Learning Theory

Why learning doesn't always work
• Unrealizability – f may not be in H, or may not be easily represented in H
• Variance – there may be many ways to represent f; the result depends on the specific training set
• Noise/stochasticity – elements that cannot be predicted: missing attributes or a stochastic process
• Complexity – finding f may be intractable

Regularization
• Force solutions to be simple
– Add a penalty for complex models
– E.g., accuracy + size of tree
– Number of stored samples in (thinned) k-NN
– Sum of weights or number of nonzero weights (number of connections) in a neural network
• Minimum Description Length (MDL)

Example: Smart Investing
• Task: pick a stock analyst based on past performance.
• Experiment:
– Have each analyst predict "next day up/down" for 10 days.
– Pick the analyst that makes the fewest errors.
• Situation 1: 1 stock analyst {A1}; A1 makes 5 errors.
• Situation 2: 3 stock analysts {A1, B1, B2}; B2 is best with 1 error.
• Situation 3: 1003 stock analysts {A1, B1, B2, C1, …, C1000}; C543 is best with 0 errors.
• Which analyst are you most confident in: A1, B2, or C543?

Outline
• Questions in Statistical Learning Theory:
– How good is the learned rule after n examples?
– How many examples do I need before the learned rule is accurate?
– What can be learned and what cannot?
– Is there a universally best learning algorithm?
• In particular, we will address: what is the true error of h if we only know the training error of h?
– Finite hypothesis spaces and zero training error
– (Finite hypothesis spaces and non-zero training error)

Game: Randomized 20-Questions
• Game: 20-Questions
– I think of an object f
– For i = 1 to 20: you ask a yes/no question about f, and I have to answer truthfully
– You make a guess h
– You win if f = h
• Game: Randomized 20-Questions
– I pick a function f ∈ H, where f: X → {−1, +1}
– For i = 1 to 20: the world delivers an instance x ∈ X with probability P(x), and I have to tell you f(x)
– You form a hypothesis h ∈ H, trying to guess my f ∈ H
– You win if f(x) = h(x) with probability at least 1 − ε for x drawn according to P(x)

Inductive Learning Model
• Probably Approximately Correct (PAC) Learning Model:
– Take any function f from H
– Draw n training examples Dtrain from P(x), labeled as y = f(x)
– Run the learning algorithm on Dtrain to produce h from H
– Gather test examples Dtest from P(x)
– Apply h to Dtest and measure the fraction (probability) of examples where h(x) ≠ f(x)
– How likely is it that the error probability is less than some threshold ε (for any f from H)?
[Diagram: a real-world process with distribution P(x) and labels y = f(x) generates the training data Dtrain = (x1, y1), …, (xn, yn), from which the learner produces h; the same process generates the test data Dtest = (xn+1, yn+1), …]

What are the chances of a wrong hypothesis making correct predictions?
[Diagram: the hypothesis space H, containing the target f and the subset Hbad of hypotheses whose true error exceeds ε.]

Useful Formulas
• Binomial distribution: the probability of observing x heads in a sample of n independent coin tosses, where in each toss the probability of heads is p, is
$P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}$
• Union bound: $P(A_1 \cup \dots \cup A_m) \le P(A_1) + \dots + P(A_m)$
• Unnamed: $(1 - \varepsilon) \le e^{-\varepsilon}$

Chances of getting it wrong
• Chances that hb ∈ Hbad is consistent with N examples:
– ErrorRate(hb) > ε, so the chance that it agrees with one example is at most (1 − ε)
– The chance that it agrees with all N examples is at most (1 − ε)^N
• P(Hbad contains a consistent hypothesis) ≤ |Hbad| (1 − ε)^N ≤ |H| (1 − ε)^N
• We want to reduce this below some probability δ, so we require |H| (1 − ε)^N ≤ δ
• Given (1 − ε) ≤ e^{−ε}, we get
$N \ge \frac{1}{\varepsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)$
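This bound is easy to turn into a back-of-the-envelope calculator. Below is a minimal Python sketch (the function name pac_sample_bound and the ε, δ values are my own choices, not from the slides); applied to the Smart Investing example, it suggests that 10 days of predictions are far too few to trust a perfect record once 1003 analysts are in play.

```python
import math

def pac_sample_bound(h_size, epsilon, delta):
    """A number of examples N sufficient to make |H| * (1 - epsilon)**N
    fall below delta, via the relaxation (1 - epsilon) <= exp(-epsilon):
    N >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# With epsilon = 0.1 and delta = 0.05 (illustrative values):
print(pac_sample_bound(h_size=3, epsilon=0.1, delta=0.05))     # 3 analysts -> 41 days
print(pac_sample_bound(h_size=1003, epsilon=0.1, delta=0.05))  # 1003 analysts -> 100 days
```

Note how the required N grows only logarithmically in |H|, which is why the bound stays usable even for fairly large hypothesis spaces.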
Size of hypothesis space |H|
• How many possible Boolean functions are there on n binary attributes?
• A = n
• B = 2n
• C = $2^{2n}$
• D = $2^{2^n}$
• E = $n^{2^n}$

Size of hypothesis space |H|
x1 x2 x3 | Function
 1  1  1 | y0
 1  1  0 | y1
 1  0  1 | y2
 1  0  0 | y3
 0  1  1 | y4
 0  1  0 | y5
 0  0  1 | y6
 0  0  0 | y7
• n = 3: $|H| = 2^{2^3} = 256$
• n = 10: $|H| = 2^{2^{10}} \approx 1.8 \times 10^{308}$

All Boolean functions
• If $|H| = 2^{2^n}$, then
$N \ge \frac{1}{\varepsilon}\left(\ln\frac{1}{\delta} + 2^n \ln 2\right) = O(2^n)$
• So we need to see essentially the entire instance space to determine the function reliably

Approach
• Look for the simplest hypothesis
• Limit the size of |H| by only looking at a simple (limited) subspace

Example: Decision lists
• A list of tests, executed serially
• k-DL: each test contains at most k literals
• Includes as a subset k-DT, all decision trees of depth at most k

Example: Decision lists
• The number of possible tests of size at most k from n attributes is
$\mathrm{Conj}(n, k) = \sum_{i=0}^{k} C(n, i) = O(n^k)$
• The total size of the hypothesis space is
$|k\text{-DL}(n)| \le 3^{\mathrm{Conj}(n,k)}\,\mathrm{Conj}(n,k)! = 2^{O(n^k \log(n^k))}$
– Each test can yield Yes, No, or be absent (the factor $3^{\mathrm{Conj}(n,k)}$)
– Tests can be ordered in any sequence (the factor $\mathrm{Conj}(n,k)!$)
• Therefore the number of training samples,
$N \ge \frac{1}{\varepsilon}\left(\ln\frac{1}{\delta} + O\!\left(n^k \log(n^k)\right)\right)$,
is reasonable for small k

Example: Decision lists
• Search for simple (small k) tests that correctly classify a large portion of the data
• Add the test to the list, and remove the data points it classifies
• Repeat with the remaining data (a rough sketch of this greedy loop follows below)
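Here is a minimal Python sketch of that greedy loop, in the spirit of Rivest's decision-list learner but not a transcription of it; all names and the data encoding are my own assumptions.

```python
from itertools import combinations, product

def learn_decision_list(examples, n_attrs, k):
    """Greedy k-DL construction following the slide: find a conjunction
    of at most k literals whose matching examples all share one label,
    append it to the list, drop those examples, and repeat.
    examples: list of (x, y) pairs, x a tuple of 0/1 attribute values,
    y in {0, 1}. Returns [(literals, label), ...] where literals is a
    tuple of (attribute_index, required_value) pairs."""
    remaining = list(examples)
    decision_list = []
    while remaining:
        best = None  # (literals, matched_examples, label) covering most points
        for size in range(k + 1):
            for attrs in combinations(range(n_attrs), size):
                for values in product((0, 1), repeat=size):
                    literals = tuple(zip(attrs, values))
                    matched = [(x, y) for x, y in remaining
                               if all(x[a] == v for a, v in literals)]
                    labels = {y for _, y in matched}
                    if matched and len(labels) == 1 and (
                            best is None or len(matched) > len(best[1])):
                        best = (literals, matched, labels.pop())
        if best is None:
            raise ValueError("no k-DL test is consistent with the data")
        literals, matched, label = best
        decision_list.append((literals, label))
        remaining = [e for e in remaining if e not in matched]
    return decision_list

# Tiny usage example: XOR on two attributes needs tests of size k = 2.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
print(learn_decision_list(data, n_attrs=2, k=2))
```

The enumeration over all conjunctions is exactly why the analysis above matters: there are only $O(n^k)$ candidate tests per round when k is small.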
Inductive bias
• The inductive bias of a learning algorithm is the set of assumptions that the learner uses to predict outputs given inputs that it has not encountered (Mitchell, 1980)
– No Free Lunch theorems (Mitchell, Wolpert, …)
– Bias-free learning is futile*
*Wolpert and Macready have proved that there are free lunches in coevolutionary optimization

Generalization Error Bound: Finite H, Zero Training Error
• Model and Learning Algorithm
– Learning algorithm A with a finite hypothesis space H
– A sample of n labeled examples Dtrain, drawn according to P(x)
– Target function f ∈ H, so at least one h ∈ H has zero training error ErrDtrain(h)
– Learning algorithm A returns a zero-training-error hypothesis ĥ
• What is the probability that the true prediction error of ĥ is larger than ε?

Generalization Error Bound: Finite H, Non-Zero Training Error
• Model and Learning Algorithm
– A sample of n labeled examples Dtrain
– An unknown (random) fraction of the examples in Dtrain is mislabeled (noise)
– Learning algorithm A with a finite hypothesis space H
– A returns the hypothesis ĥ = A(Dtrain) with the lowest training error
• What is the probability that the prediction error of ĥ exceeds the fraction of training errors by more than ε?

Overfitting vs. Underfitting [Mitchell, 1997]
• With probability at least (1 − δ):
$Err_P(\hat h) \le Err_{D_{train}}(\hat h) + \sqrt{\frac{\ln|H| + \ln\frac{2}{\delta}}{2n}}$

VC-Dimension
• The capacity of a hypothesis space H
– The maximum number of points with arbitrary labelings that can be separated ("shattered")
– The VC dimension of linear classifiers in the plane is 3

Representational power
• A machine f can shatter a set of points x1, x2, …, xr if, for every one of the $2^r$ possible labelings of the points, some setting of f achieves it (a brute-force check of this definition is sketched below)
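To make the definition concrete, here is a small Python sketch (function and variable names are mine) that checks shattering by brute force. It tests a family of 1-D threshold classifiers, whose VC dimension is 1, so a single point can be shattered but two points cannot.

```python
from itertools import product

def shatters(hypotheses, points):
    """Brute-force check of the shattering definition: the class
    shatters the points iff every one of the 2**r labelings is
    realized by some hypothesis. hypotheses: callables point -> 0/1."""
    achieved = {tuple(h(x) for x in points) for h in hypotheses}
    return all(labeling in achieved
               for labeling in product((0, 1), repeat=len(points)))

# 1-D threshold classifiers h_t(x) = 1 iff x > t. The four thresholds
# below realize every behavior a threshold can have on the test points,
# so the finite check is exact in this instance.
thresholds = [lambda x, t=t: int(x > t) for t in (-1.0, 0.5, 1.5, 3.0)]
print(shatters(thresholds, [1.0]))        # True: one point is shattered
print(shatters(thresholds, [1.0, 2.0]))   # False: labeling (1, 0) is unachievable
```

The failing labeling (1, 0) would require the smaller point to be positive and the larger negative, which no monotone threshold can do; this is exactly the combinatorial limitation that VC-dimension quantifies.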