1 Machine Learning 10-701 Tom M. Mitchell Machine Learning Department Carnegie Mellon University January 13, 2011 Today: • The Big Picture • Overfitting • Review: probability Readings: Decision trees, overfiting • Mitchell, Chapter 3 Probability review • Bishop Ch. 1 thru 1.2.3 • Bishop, Ch. 2 thru 2.2 • Andrew Moore’s online tutorial Function Approximation: Decision Tree Learning Problem Setting: • Set of possible instances X – each instance x in X is a feature vector x = < x1, x2 … xn> • Unknown target function f : XY!– Y is discrete valued • Set of function hypotheses H={ h | h : XY }!– each hypothesis h is a decision tree Input: • Training examples {<x(i),y(i)>} of unknown target function f!Output: • Hypothesis h ∈ H that best approximates target function f!2 Function approximation as Search for the best hypothesis • ID3 performs heuristic search through space of decision trees Function Approximation: The Big Picture3 Which Tree Should We Output? • ID3 performs heuristic search through space of decision trees • It stops at smallest acceptable tree. Why? Occam’s razor: prefer the simplest hypothesis that fits the data Why Prefer Short Hypotheses? (Occam’s Razor) Arguments in favor: Arguments opposed:4 Why Prefer Short Hypotheses? (Occam’s Razor) Argument in favor: • Fewer short hypotheses than long ones a short hypothesis that fits the data is less likely to be a statistical coincidence highly probable that a sufficiently complex hypothesis will fit the data Argument opposed: • Also fewer hypotheses containing a prime number of nodes and attributes beginning with “Z” • What’s so special about “short” hypotheses?567 Split data into training and validation set!Create tree that classifies training set correctly!89 What you should know: • Well posed function approximation problems: – Instance space, X – Sample of labeled training data { <x(i), y(i)>} – Hypothesis space, H = { f: XY } • Learning is a search/optimization problem over H – Various objective functions • minimize training error (0-1 loss) • among hypotheses that minimize training error, select smallest (?) – But inductive learning without some bias is futile ! • Decision tree learning – Greedy top-down learning of decision trees (ID3, C4.5, ...) – Overfitting and tree/rule post-pruning – Extensions… Extra slides extensions to decision tree learning1011 Questions to think about (1) • ID3 and C4.5 are heuristic algorithms that search through the space of decision trees. Why not just do an exhaustive search?12 Questions to think about (2) • Consider target function f: <x1,x2> y, where x1 and x2 are real-valued, y is boolean. What is the set of decision surfaces describable with decision trees that use each attribute at most once? Questions to think about (3) • Why use Information Gain to select attributes in decision trees? What other criteria seem reasonable, and what are the tradeoffs in making this choice?13 Machine Learning 10-701 Tom M. Mitchell Machine Learning Department Carnegie Mellon University January 13, 2011 Today: • Review: probability Readings: Probability review • Bishop Ch. 1 thru 1.2.3 • Bishop, Ch. 2 thru 2.2 • Andrew Moore’s online tutorial many of these slides are derived from William Cohen, Andrew Moore, Aarti Singh, Eric Xing. Thanks! The Problem of Induction • David Hume (1711-1776): pointed out 1. Empirically, induction seems to work 2. Statement (1) is an application of induction. • This stumped people for about 200 years14 Probability Overview • Events – discrete random variables, continuous random variables, compound events • Axioms of probability – What defines a reasonable theory of uncertainty • Independent events • Conditional probabilities • Bayes rule and beliefs • Joint probability distribution • Expectations • Independence, Conditional independence Random Variables • Informally, A is a random variable if – A denotes something about which we are uncertain – perhaps the outcome of a randomized experiment • Examples – A = True if a randomly drawn person from our class is female – A = The hometown of a randomly drawn person from our class – A = True if two randomly drawn persons from our class have same birthday • Define P(A) as “the fraction of possible worlds in which A is true” or “the fraction of times A holds, in repeated runs of the random experiment” – the set of possible worlds is called the sample space, S – A random variable A is a function defined over S A: S {0,1}15 A little formalism More formally, we have • a sample space S (e.g., set of students in our class) – aka the set of possible worlds • a random variable is a function defined over the sample space – Gender: S { m, f } – Height: S Reals • an event is a subset of S – e.g., the subset of S for which Gender=f – e.g., the subset of S for which (Gender=m) AND (eyeColor=blue) • we’re often interested in probabilities of specific events • and of specific events conditioned on other specific events Visualizing A Sample space of all possible worlds Its area is 1 Worlds in which A is False Worlds in which A is true P(A) = Area of reddish oval16 The Axioms of Probability • 0 <= P(A) <= 1 • P(True) = 1 • P(False) = 0 • P(A or B) = P(A) + P(B) - P(A and B) The Axioms of Probability • 0 <= P(A) <= 1 • P(True) = 1 • P(False) = 0 • P(A or B) = P(A) + P(B) - P(A and B) [di Finetti 1931]: when gambling based on “uncertainty formalism A” you can be exploited by an opponent iff your uncertainty formalism A violates these axioms17 Interpreting the axioms • 0 <= P(A) <= 1 • P(True) = 1 • P(False) = 0 • P(A or B) = P(A) + P(B) - P(A and B) The area of A can’t get any smaller than 0 And a zero area would mean no world could ever have A true Interpreting the axioms • 0 <= P(A) <= 1 • P(True) = 1 • P(False) = 0 • P(A or B) = P(A) + P(B) - P(A and B) The area of A can’t get any bigger than 1 And an area of 1 would mean all worlds will have A true18 Interpreting the axioms • 0 <= P(A)
View Full Document