Foundations of Artificial Intelligence
Statistical Learning Theory
CS472 – Fall 2007
Thorsten Joachims

Outline
Questions in Statistical Learning Theory:
– How good is the learned rule after n examples?
– How many examples do I need before the learned rule is accurate?
– What can be learned and what cannot?
– Is there a universally best learning algorithm?
In particular, we will address: what is the true error of h if we only know the training error of h?
– Finite hypothesis spaces and zero training error
– (Finite hypothesis spaces and non-zero training error)

Game: 20-Questions
– I think of an object f
– For i = 1 to 20:
  • You get to ask a yes/no question about f, and I have to answer truthfully
– You make a guess h
– You win if f = h

Game: Randomized 20-Questions
– I pick a function f ∈ H, where f: X → {−1,+1}
– For i = 1 to 20:
  • The world delivers an instance x ∈ X with probability P(x), and I have to tell you f(x)
– You form a hypothesis h ∈ H, trying to guess my f ∈ H
– You win if f(x) = h(x) with probability at least 1−ε for x drawn according to P(x)

Inductive Learning Model
• Probably Approximately Correct (PAC) Learning Model:
– Take any function f from H
– Draw n training examples Dtrain from P(x), labeled as y = f(x)
– Run the learning algorithm on Dtrain to produce h from H
– Gather test examples Dtest from P(x)
– Apply h to Dtest and measure the fraction (probability) of examples with h(x) ≠ f(x)
– How likely is it that this error probability is less than some threshold ε (for any f from H)?

[Figure: the real-world process P(x), y = f(x) generates the training data Dtrain = (x1,y1), …, (xn,yn); the learner produces h; the test data Dtest = (xn+1,yn+1), … is drawn from the same process.]

Measuring Prediction Performance

Generalization Error Bound: Finite H, Zero Training Error
• Model and Learning Algorithm
– Learning algorithm A with a finite hypothesis space H
– Sample of n labeled examples Dtrain drawn according to P(x)
– Target function f ∈ H, so at least one h ∈ H has zero training error Err_Dtrain(h)
– Learning algorithm A returns a zero-training-error hypothesis ĥ
• What is the probability δ that the
prediction error of ĥ is larger than ε?

Useful Formulas
• Binomial distribution: the probability of observing x heads in a sample of n independent coin tosses, where in each toss the probability of heads is p, is
  P(X = x) = (n choose x) p^x (1−p)^(n−x)
• Union bound:
  P(X1 ∨ X2 ∨ … ∨ Xk) ≤ P(X1) + P(X2) + … + P(Xk)
• Unnamed:
  (1 − ε) ≤ e^(−ε), for 0 ≤ ε ≤ 1

Sample Complexity: Finite H, Zero Training Error
• Model and Learning Algorithm
– Sample of n labeled examples Dtrain
– Learning algorithm A with a finite hypothesis space H
– At least one h ∈ H has zero training error Err_Dtrain(h)
– Learning algorithm A returns a zero-training-error hypothesis ĥ
• How many training examples does A need so that, with probability at least (1−δ), it learns an ĥ with prediction error less than ε?

Example: Smart Investing
Task: pick a stock analyst based on past performance.
Experiment:
– Have each analyst predict "next day up/down" for 10 days.
– Pick the analyst who makes the fewest errors.
Situation 1:
– 1 stock analyst {A1}; A1 makes 5 errors
Situation 2:
– 3 stock analysts {A1, B1, B2}; B2 is best with 1 error
Situation 3:
– 1003 stock analysts {A1, B1, B2, C1, …, C1000}; C543 is best with 0 errors
Which analyst are you most confident in: A1, B2, or C543?

Useful Formula
• Hoeffding/Chernoff bound: for any distribution P(X) where X can take the values 0 and 1, the probability that an average of an i.i.d.
sample deviates from its mean p by more than ε is bounded as
  P(|p̂ − p| > ε) ≤ 2e^(−2nε²)

Generalization Error Bound: Finite H, Non-Zero Training Error
• Model and Learning Algorithm
– Sample of n labeled examples Dtrain
– An unknown (random) fraction of the examples in Dtrain is mislabeled (noise)
– Learning algorithm A with a finite hypothesis space H
– A returns the hypothesis ĥ = A(Dtrain) with the lowest training error
• What is the probability δ that the prediction error of ĥ exceeds the fraction of training errors by more than ε?

Overfitting vs. Underfitting
[Figure from Mitchell, 1997]

With probability at least (1−δ):
  Err_P(ĥ) ≤ Err_Dtrain(ĥ) + √((ln|H| + ln(2/δ)) / (2n))

Generalization Error Bound: Infinite H, Non-Zero Training Error
• Model and Learning Algorithm
– Sample of n labeled examples Dtrain
– Learning algorithm A with a hypothesis space H with VCDim(H) = d
– A returns the hypothesis ĥ = A(Dtrain) with the lowest training error
• Given a hypothesis space H with VCDim(H) = d and a training sample Dtrain of size n, with probability at least (1−δ) it holds that
  Err_P(ĥ) ≤ Err_Dtrain(ĥ) + √((d(ln(2n/d) + 1) + ln(4/δ)) / n)
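The two generalization bounds above are straightforward to evaluate numerically. A minimal Python sketch — the function names and example parameter values are my own illustrations, not from the slides:

```python
import math

def finite_h_bound(train_err, n, h_size, delta):
    """Finite H, non-zero training error:
    Err_P(h) <= Err_train(h) + sqrt((ln|H| + ln(2/delta)) / (2n))."""
    return train_err + math.sqrt((math.log(h_size) + math.log(2 / delta)) / (2 * n))

def vc_bound(train_err, n, d, delta):
    """Infinite H with VCDim(H) = d:
    Err_P(h) <= Err_train(h) + sqrt((d*(ln(2n/d) + 1) + ln(4/delta)) / n)."""
    return train_err + math.sqrt((d * (math.log(2 * n / d) + 1) + math.log(4 / delta)) / n)

# Hypothetical numbers: 1000 examples, |H| = 2^20, 3% training error, delta = 0.05
print(finite_h_bound(0.03, 1000, 2**20, 0.05))
print(vc_bound(0.03, 1000, 10, 0.05))
```

Plugging in different n makes the qualitative behavior visible: both bounds shrink toward the training error as the sample grows, and grow with ln|H| or with the VC dimension d.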
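The Hoeffding/Chernoff bound from the "Useful Formula" slide can also be checked empirically by simulation. A small Python sketch; the parameter values are illustrative, not from the lecture:

```python
import math
import random

def hoeffding_violation_rate(p, n, eps, trials=20000, seed=0):
    """Fraction of i.i.d. 0/1 samples of size n whose empirical mean
    deviates from the true mean p by more than eps.
    Hoeffding bounds this probability by 2*exp(-2*n*eps^2)."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        avg = sum(rng.random() < p for _ in range(n)) / n
        if abs(avg - p) > eps:
            bad += 1
    return bad / trials

p, n, eps = 0.5, 100, 0.1
rate = hoeffding_violation_rate(p, n, eps)
bound = 2 * math.exp(-2 * n * eps**2)
print(rate, bound)  # the observed rate stays below the bound
```

Since Hoeffding's inequality is worst-case over distributions, the observed violation rate is typically far below the bound; the bound's value is that it holds for every p and requires no knowledge of the distribution beyond the 0/1 range.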