Statistical Learning Theory

Why learning doesn't always work
• Unrealizability – f may not be in H, or may not be easily represented in H
• Variance – there may be many ways to represent f; the result depends on the specific training set
• Noise/stochasticity – elements that cannot be predicted: missing attributes or a stochastic process
• Complexity – finding f may be intractable

Regularization
• Force solutions to be simple
– Add a penalty for complex models
– E.g., accuracy + size of tree
– Number of stored samples in (thinned) k-NN
– Sum of weights or number of nonzero weights (number of connections) in a neural network
• Minimum Description Length (MDL)

Example: Smart Investing
• Task: pick a stock analyst based on past performance.
• Experiment:
– Have each analyst predict "next day up/down" for 10 days.
– Pick the analyst that makes the fewest errors.
• Situation 1: 1 stock analyst {A1}; A1 makes 5 errors.
• Situation 2: 3 stock analysts {A1, B1, B2}; B2 is best with 1 error.
• Situation 3: 1003 stock analysts {A1, B1, B2, C1, …, C1000}; C543 is best with 0 errors.
• Which analyst are you most confident in: A1, B2, or C543?

Outline
• Questions in Statistical Learning Theory:
– How good is the learned rule after n examples?
– How many examples do I need before the learned rule is accurate?
– What can be learned and what cannot?
– Is there a universally best learning algorithm?
• In particular, we will address: what is the true error of h if we only know the training error of h?
– Finite hypothesis spaces and zero training error
– (Finite hypothesis spaces and non-zero training error)

Game: Randomized 20-Questions
• Game: 20-Questions
– I think of an object f
– For i = 1 to 20: you ask a yes/no question about f, and I have to answer truthfully
– You make a guess h
– You win if f = h
• Game: Randomized 20-Questions
– I pick a function f ∈ H, where f: X → {−1, +1}
– For i = 1 to 20: the world delivers an instance x ∈ X with probability P(x), and I have to tell you f(x)
– You form a hypothesis h ∈ H, trying to guess my f ∈ H
– You win if f(x) = h(x) with probability at least 1 − ε for x drawn according to P(x)

Inductive Learning Model
• Probably Approximately Correct (PAC) Learning Model:
– Take any function f from H
– Draw n training examples Dtrain from P(x), labeled as y = f(x)
– Run the learning algorithm on Dtrain to produce h from H
– Gather test examples Dtest from P(x)
– Apply h to Dtest and measure the fraction (probability) of examples where h(x) ≠ f(x)
– How likely is it that the error probability is less than some threshold ε (for any f from H)?
[Diagram: a real-world process with distribution P(x) and labels y = f(x) generates the training data Dtrain = (x1, y1), …, (xn, yn), from which the learner produces h; the same process generates the test data Dtest = (xn+1, yn+1), …]

What are the chances of a wrong hypothesis making correct predictions?
[Diagram: the hypothesis space H, containing the target f and the subset Hbad of hypotheses whose true error exceeds ε.]

Useful Formulas
• Binomial distribution: the probability of observing x heads in a sample of n independent coin tosses, where in each toss the probability of heads is p, is
$P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}$
• Union bound: $P(A_1 \cup \dots \cup A_m) \le P(A_1) + \dots + P(A_m)$
• Unnamed: $(1 - \varepsilon) \le e^{-\varepsilon}$

Chances of getting it wrong
• Chances that hb ∈ Hbad is consistent with N examples:
– ErrorRate(hb) > ε, so the chance that it agrees with one example is at most (1 − ε)
– The chance that it agrees with all N examples is at most (1 − ε)^N
• P(Hbad contains a consistent hypothesis) ≤ |Hbad| (1 − ε)^N ≤ |H| (1 − ε)^N
• We want to reduce this below some probability δ, so we require |H| (1 − ε)^N ≤ δ
• Given (1 − ε) ≤ e^{−ε}, we get
$N \ge \frac{1}{\varepsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)$
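This bound is easy to turn into a back-of-the-envelope calculator. Below is a minimal Python sketch (the function name pac_sample_bound and the ε, δ values are my own choices, not from the slides); applied to the Smart Investing example, it suggests that 10 days of predictions are far too few to trust a perfect record once 1003 analysts are in play.

```python
import math

def pac_sample_bound(h_size, epsilon, delta):
    """A number of examples N sufficient to make |H| * (1 - epsilon)**N
    fall below delta, via the relaxation (1 - epsilon) <= exp(-epsilon):
    N >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# With epsilon = 0.1 and delta = 0.05 (illustrative values):
print(pac_sample_bound(h_size=3, epsilon=0.1, delta=0.05))     # 3 analysts -> 41 days
print(pac_sample_bound(h_size=1003, epsilon=0.1, delta=0.05))  # 1003 analysts -> 100 days
```

Note how the required N grows only logarithmically in |H|, which is why the bound stays usable even for fairly large hypothesis spaces.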
Size of hypothesis space |H|
• How many possible Boolean functions are there on n binary attributes?
• A = n
• B = 2n
• C = $2^{2n}$
• D = $2^{2^n}$
• E = $n^{2^n}$

Size of hypothesis space |H|
x1 x2 x3 | Function
 1  1  1 | y0
 1  1  0 | y1
 1  0  1 | y2
 1  0  0 | y3
 0  1  1 | y4
 0  1  0 | y5
 0  0  1 | y6
 0  0  0 | y7
• n = 3: $|H| = 2^{2^3} = 256$
• n = 10: $|H| = 2^{2^{10}} \approx 1.8 \times 10^{308}$

All Boolean functions
• If $|H| = 2^{2^n}$, then
$N \ge \frac{1}{\varepsilon}\left(\ln\frac{1}{\delta} + 2^n \ln 2\right) = O(2^n)$
• So we need to see essentially the entire instance space to determine the function reliably

Approach
• Look for the simplest hypothesis
• Limit the size of |H| by only looking at a simple (limited) subspace

Example: Decision lists
• A list of tests, executed serially
• k-DL: each test contains at most k literals
• Includes as a subset k-DT, all decision trees of depth at most k

Example: Decision lists
• The number of possible tests of size at most k from n attributes is
$\mathrm{Conj}(n, k) = \sum_{i=0}^{k} C(n, i) = O(n^k)$
• The total size of the hypothesis space is
$|k\text{-DL}(n)| \le 3^{\mathrm{Conj}(n,k)}\,\mathrm{Conj}(n,k)! = 2^{O(n^k \log(n^k))}$
– Each test can yield Yes, No, or be absent (the factor $3^{\mathrm{Conj}(n,k)}$)
– Tests can be ordered in any sequence (the factor $\mathrm{Conj}(n,k)!$)
• Therefore the number of training samples,
$N \ge \frac{1}{\varepsilon}\left(\ln\frac{1}{\delta} + O\!\left(n^k \log(n^k)\right)\right)$,
is reasonable for small k

Example: Decision lists
• Search for simple (small k) tests that correctly classify a large portion of the data
• Add the test to the list, and remove the data points it classifies
• Repeat with the remaining data (a rough sketch of this greedy loop follows below)
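Here is a minimal Python sketch of that greedy loop, in the spirit of Rivest's decision-list learner but not a transcription of it; all names and the data encoding are my own assumptions.

```python
from itertools import combinations, product

def learn_decision_list(examples, n_attrs, k):
    """Greedy k-DL construction following the slide: find a conjunction
    of at most k literals whose matching examples all share one label,
    append it to the list, drop those examples, and repeat.
    examples: list of (x, y) pairs, x a tuple of 0/1 attribute values,
    y in {0, 1}. Returns [(literals, label), ...] where literals is a
    tuple of (attribute_index, required_value) pairs."""
    remaining = list(examples)
    decision_list = []
    while remaining:
        best = None  # (literals, matched_examples, label) covering most points
        for size in range(k + 1):
            for attrs in combinations(range(n_attrs), size):
                for values in product((0, 1), repeat=size):
                    literals = tuple(zip(attrs, values))
                    matched = [(x, y) for x, y in remaining
                               if all(x[a] == v for a, v in literals)]
                    labels = {y for _, y in matched}
                    if matched and len(labels) == 1 and (
                            best is None or len(matched) > len(best[1])):
                        best = (literals, matched, labels.pop())
        if best is None:
            raise ValueError("no k-DL test is consistent with the data")
        literals, matched, label = best
        decision_list.append((literals, label))
        remaining = [e for e in remaining if e not in matched]
    return decision_list

# Tiny usage example: XOR on two attributes needs tests of size k = 2.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
print(learn_decision_list(data, n_attrs=2, k=2))
```

The enumeration over all conjunctions is exactly why the analysis above matters: there are only $O(n^k)$ candidate tests per round when k is small.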
Inductive bias
• The inductive bias of a learning algorithm is the set of assumptions that the learner uses to predict outputs given inputs that it has not encountered (Mitchell, 1980)
– No Free Lunch theorems (Mitchell, Wolpert, …)
– Bias-free learning is futile*
*Wolpert and Macready have proved that there are free lunches in coevolutionary optimization

Generalization Error Bound: Finite H, Zero Training Error
• Model and Learning Algorithm
– Learning algorithm A with a finite hypothesis space H
– A sample of n labeled examples Dtrain, drawn according to P(x)
– Target function f ∈ H, so at least one h ∈ H has zero training error ErrDtrain(h)
– Learning algorithm A returns a zero-training-error hypothesis ĥ
• What is the probability that the true prediction error of ĥ is larger than ε?

Generalization Error Bound: Finite H, Non-Zero Training Error
• Model and Learning Algorithm
– A sample of n labeled examples Dtrain
– An unknown (random) fraction of the examples in Dtrain is mislabeled (noise)
– Learning algorithm A with a finite hypothesis space H
– A returns the hypothesis ĥ = A(Dtrain) with the lowest training error
• What is the probability that the prediction error of ĥ exceeds the fraction of training errors by more than ε?

Overfitting vs. Underfitting [Mitchell, 1997]
• With probability at least (1 − δ):
$Err_P(\hat h) \le Err_{D_{train}}(\hat h) + \sqrt{\frac{\ln|H| + \ln\frac{2}{\delta}}{2n}}$

VC-Dimension
• The capacity of a hypothesis space H
– The maximum number of points with arbitrary labelings that can be separated ("shattered")
– The VC dimension of linear classifiers in the plane is 3

Representational power
• A machine f can shatter a set of points x1, x2, …, xr if, for every one of the $2^r$ possible labelings of the points, some setting of f achieves it (a brute-force check of this definition is sketched below)
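To make the definition concrete, here is a small Python sketch (function and variable names are mine) that checks shattering by brute force. It tests a family of 1-D threshold classifiers, whose VC dimension is 1, so a single point can be shattered but two points cannot.

```python
from itertools import product

def shatters(hypotheses, points):
    """Brute-force check of the shattering definition: the class
    shatters the points iff every one of the 2**r labelings is
    realized by some hypothesis. hypotheses: callables point -> 0/1."""
    achieved = {tuple(h(x) for x in points) for h in hypotheses}
    return all(labeling in achieved
               for labeling in product((0, 1), repeat=len(points)))

# 1-D threshold classifiers h_t(x) = 1 iff x > t. The four thresholds
# below realize every behavior a threshold can have on the test points,
# so the finite check is exact in this instance.
thresholds = [lambda x, t=t: int(x > t) for t in (-1.0, 0.5, 1.5, 3.0)]
print(shatters(thresholds, [1.0]))        # True: one point is shattered
print(shatters(thresholds, [1.0, 2.0]))   # False: labeling (1, 0) is unachievable
```

The failing labeling (1, 0) would require the smaller point to be positive and the larger negative, which no monotone threshold can do; this is exactly the combinatorial limitation that VC-dimension quantifies.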