Machine Learning 10-701
Tom M. Mitchell
Machine Learning Department
Carnegie Mellon University
September 15, 2011

Today:
• The Big Picture
• Overfitting
• Review: probability

Readings:
Decision trees, overfitting
• Mitchell, Chapter 3
Probability review
• Bishop Ch. 1 thru 1.2.3
• Bishop, Ch. 2 thru 2.2
• Andrew Moore's online tutorial

Function Approximation: Problem Setting
• Set of possible instances X
• Unknown target function f : X → Y
• Set of function hypotheses H = { h | h : X → Y }
Input:
• Training examples {<x(i), y(i)>} of unknown target function f
Output:
• Hypothesis h ∈ H that best approximates target function f

Function Approximation: Decision Tree Learning
Problem Setting:
• Set of possible instances X
– each instance x in X is a feature vector x = <x1, x2 … xn>
• Unknown target function f : X → Y
– Y is discrete valued
• Set of function hypotheses H = { h | h : X → Y }
– each hypothesis h is a decision tree
Input:
• Training examples {<x(i), y(i)>} of unknown target function f
Output:
• Hypothesis h ∈ H that best approximates target function f

Top-down induction of decision trees: start with node = Root and grow the tree greedily [ID3, C4.5, Quinlan]

Information Gain
Information Gain (also called mutual information) between input attribute A and target variable Y.
Information Gain is the expected reduction in entropy of target variable Y for data sample S, due to sorting on variable A:
Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)

Function approximation as search for the best hypothesis
• ID3 performs heuristic search through the space of decision trees

Function Approximation: The Big Picture

Which Tree Should We Output?
• ID3 performs heuristic search through the space of decision trees
• It stops at the smallest acceptable tree. Why?
Occam's razor: prefer the simplest hypothesis that fits the data

Why Prefer Short Hypotheses? (Occam's Razor)
Argument in favor:
• Fewer short hypotheses than long ones: a short hypothesis that fits the data is less likely to be a statistical coincidence
Arguments opposed:
• There are also fewer hypotheses containing a prime number of nodes and attributes beginning with "Z"
• What's so special about "short" hypotheses, instead of "prime number of nodes"?

Overfitting
Consider a hypothesis h and its
• Error rate over training data: error_train(h)
• True error rate over all data: error_true(h)
We say h overfits the training data if error_true(h) > error_train(h)
Amount of overfitting = error_true(h) - error_train(h)

Avoiding overfitting: post-pruning
• Split data into training and validation set
• Create tree that classifies training set correctly

What you should know:
• Well-posed function approximation problems:
– Instance space, X
– Sample of labeled training data { <x(i), y(i)> }
– Hypothesis space, H = { f : X → Y }
• Learning is a search/optimization problem over H
– Various objective functions
• minimize training error (0-1 loss)
• among hypotheses that minimize training error, select smallest (?)
– But inductive learning without some bias is futile!
• Decision tree learning
– Greedy top-down learning of decision trees (ID3, C4.5, ...)
– Overfitting and tree post-pruning
– Extensions…

Extra slides: extensions to decision tree learning

Questions to think about (1)
• ID3 and C4.5 are heuristic algorithms that search through the space of decision trees. Why not just do an exhaustive search?

Questions to think about (2)
• Consider target function f : <x1, x2> → y, where x1 and x2 are real-valued and y is boolean. What is the set of decision surfaces describable with decision trees that use each attribute at most once?

Questions to think about (3)
• Why use Information Gain to select attributes in decision trees? What other criteria seem reasonable, and what are the tradeoffs in making this choice?
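Question (3) asks why Information Gain is used as the splitting criterion. As a concrete reference point, here is a minimal sketch (not from the lecture) of how Entropy(S) and Gain(S, A) are computed for categorical attributes; the toy dataset and the attribute names "outlook" and "windy" are invented purely for illustration.

from collections import Counter
from math import log2

def entropy(labels):
    # Entropy of the target variable Y over a sample S, given as a list of labels.
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    # Gain(S, A) = Entropy(S) - sum over values v of A of (|S_v|/|S|) * Entropy(S_v),
    # where S_v is the subset of the sample on which attribute A takes value v.
    n = len(labels)
    remainder = 0.0
    for v in set(x[attribute] for x in examples):
        subset = [yi for xi, yi in zip(examples, labels) if xi[attribute] == v]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

# Invented toy sample: 5 instances, 2 categorical attributes, boolean-valued Y.
X = [{"outlook": "sunny", "windy": "false"},
     {"outlook": "sunny", "windy": "true"},
     {"outlook": "rain",  "windy": "false"},
     {"outlook": "rain",  "windy": "true"},
     {"outlook": "sunny", "windy": "false"}]
y = ["yes", "yes", "no", "yes", "yes"]

# ID3 would split the root on the attribute with the largest gain.
for a in ("outlook", "windy"):
    print(a, round(information_gain(X, y, a), 3))

On this toy sample the split on outlook has the larger gain, so a greedy learner such as ID3 would place it at the root.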
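The overfitting quantities defined above can likewise be estimated empirically by holding out a validation set, as in the post-pruning recipe. The sketch below is one way to do this, assuming scikit-learn is available (the lecture itself does not prescribe any particular library); the synthetic noisy data and the depth-3 comparison tree are illustrative choices only.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

# Synthetic data (invented for this sketch): a simple linear concept plus 15%
# label noise, so a fully grown tree fits the training sample almost perfectly
# but generalizes worse than a shorter tree.
X = rng.randn(1000, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
flip = rng.rand(1000) < 0.15
y = np.where(flip, 1 - y, y)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)                # grown until leaves are pure
small = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)  # a much "shorter" hypothesis

for name, h in [("full tree", full), ("depth-3 tree", small)]:
    err_train = 1 - h.score(X_train, y_train)   # error_train(h)
    err_val = 1 - h.score(X_val, y_val)         # held-out estimate of error_true(h)
    print(f"{name}: train error {err_train:.3f}, "
          f"validation error {err_val:.3f}, "
          f"overfitting estimate {err_val - err_train:.3f}")

With label noise present, the fully grown tree typically drives its training error near zero while its validation error stays noticeably higher; that gap is exactly what "amount of overfitting" measures.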
Machine Learning 10-701
Tom M. Mitchell
Machine Learning Department
Carnegie Mellon University
September 15, 2011

Today:
• Review: probability

Readings:
Probability review
• Bishop Ch. 1 thru 1.2.3
• Bishop, Ch. 2 thru 2.2
• Andrew Moore's online tutorial

many of these slides are derived from William Cohen, Andrew Moore, Aarti Singh, Eric Xing. Thanks!

Probability Overview
• Events
– discrete random variables, continuous random variables, compound events
• Axioms of probability
– What defines a reasonable theory of uncertainty
• Independent events
• Conditional probabilities
• Bayes rule and beliefs
• Joint probability distribution
• Expectations
• Independence, Conditional independence

Random Variables
• Informally, A is a random variable if
– A denotes something about which we are uncertain
– perhaps the outcome of a randomized experiment
• Examples
A = True if a randomly drawn person from our class is female
A = The hometown of a randomly drawn person from our class
A = True if two randomly drawn persons from our class have the same birthday
• Define P(A) as "the fraction of possible worlds in which A is true" or "the fraction of times A holds, in repeated runs of the random experiment"
– the set of possible worlds is called the sample space, S
– A random variable A is a function defined over S: A : S → {0, 1}

A little formalism
More formally, we have
• a sample space S (e.g., the set of students in our class)
– aka the set of possible worlds
• a random variable is a function defined over the sample space
– Gender: S → {m, f}
– Height: S → Reals
• an event is a subset of S
– e.g., the subset of S for which Gender = f
– e.g., the subset of S for which (Gender = m) AND (eyeColor = blue)
• we're often interested in probabilities of specific events
• and of specific events conditioned on other specific events

Visualizing A
• Sample space of all possible worlds; its area is 1
• Some worlds are those in which A is true, the rest are those in which A is false
• P(A) = area of the region (the reddish oval in the figure) containing the worlds in which A is true

The Axioms of Probability
• 0 <= P(A) <= 1
• P(True) = 1
• P(False) = 0
• P(A or B) = P(A) + P(B) - P(A and B)
[de Finetti 1931]: when gambling based on "uncertainty formalism A" you can be exploited by an opponent iff your uncertainty formalism A violates these axioms

Interpreting the axioms
• 0 <= P(A) <= 1
• P(True) = 1
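The fourth axiom (inclusion-exclusion) can be checked directly on a small finite sample space, using the "fraction of possible worlds in which A is true" definition of P(A) from the Random Variables slide. The sketch below is illustrative only: the five-person roster is invented, and Gender and eyeColor simply play the role of the random variables from the "A little formalism" slide.

from fractions import Fraction

# A tiny invented sample space S: each "possible world" is one student,
# described by the random variables Gender and eyeColor.
S = [
    {"gender": "f", "eyeColor": "blue"},
    {"gender": "f", "eyeColor": "brown"},
    {"gender": "m", "eyeColor": "blue"},
    {"gender": "m", "eyeColor": "brown"},
    {"gender": "m", "eyeColor": "green"},
]

def P(event):
    # P(A) = fraction of possible worlds in which the event A is true.
    return Fraction(sum(1 for w in S if event(w)), len(S))

A = lambda w: w["gender"] == "f"         # event: Gender = f
B = lambda w: w["eyeColor"] == "blue"    # event: eyeColor = blue

p_or = P(lambda w: A(w) or B(w))
p_and = P(lambda w: A(w) and B(w))

print("P(A) =", P(A), " P(B) =", P(B))
print("P(A or B) =", p_or)
print("P(A) + P(B) - P(A and B) =", P(A) + P(B) - p_and)
assert p_or == P(A) + P(B) - p_and   # holds on any finite sample space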