Machine Learning 10-701
Tom M. Mitchell
Machine Learning Department, Carnegie Mellon University
January 13, 2011

Today:
- The big picture
- Overfitting
- Review probability

Readings:
- Decision trees, overfitting: Mitchell, Chapter 3
- Probability review: Bishop Ch. 1 thru 1.2.3; Bishop Ch. 2 thru 2.2; Andrew Moore's online tutorial

Function Approximation: Decision Tree Learning

Problem setting:
- Set of possible instances X, where each instance x in X is a feature vector x = <x1, x2, ..., xn>
- Unknown target function f : X → Y, where Y is discrete-valued
- Set of function hypotheses H = { h | h : X → Y }, where each hypothesis h is a decision tree

Input: training examples {<x(i), y(i)>} of the unknown target function f
Output: a hypothesis h in H that best approximates the target function f

Function Approximation as Search for the Best Hypothesis

ID3 performs a heuristic search through the space of decision trees.

Function Approximation: The Big Picture

[Figure: the big picture of function approximation.]

Which Tree Should We Output?

ID3 performs a heuristic search through the space of decision trees and stops at the smallest acceptable tree. Why? Occam's razor: prefer the simplest hypothesis that fits the data. But why prefer simplicity?

Why Prefer Short Hypotheses? (Occam's Razor)

Argument in favor:
- There are fewer short hypotheses than long ones, so a short hypothesis that fits the data is less likely to be a statistical coincidence; it is highly probable that some sufficiently complex hypothesis will fit the data by chance.

Argument opposed:
- There are also fewer hypotheses containing a prime number of nodes whose attributes all begin with "Z". What's so special about short hypotheses?

Reduced-Error Pruning

- Split data into a training set and a validation set
- Create a tree that classifies the training set correctly
- Then, until further pruning is harmful, evaluate the impact on the validation set of pruning each possible node and greedily remove the one that most improves validation-set accuracy

What You Should Know

Well-posed function approximation problems:
- instance space X
- sample of labeled training data {<x(i), y(i)>}
- hypothesis space H = { f : X → Y }

Learning is a search/optimization problem over H:
- various objective functions: minimize training error (0-1 loss); among hypotheses that minimize training error, select the smallest
- but inductive learning without some bias is futile!

Decision tree learning:
- greedy top-down learning of decision trees (ID3, C4.5, ...)
- overfitting and tree/rule post-pruning
- extensions (see the extra slides on extensions to decision tree learning)

Questions to Think About

1. ID3 and C4.5 are heuristic algorithms that search through the space of decision trees. Why not just do an exhaustive search?
2. Consider a target function f : <x1, x2> → y, where x1 and x2 are real-valued and y is boolean. What is the set of decision surfaces describable with decision trees that use each attribute at most once?
3. Why use information gain to select attributes in decision trees? What other criteria seem reasonable, and what are the tradeoffs in making this choice? (See the sketch below for how information gain is computed.)
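To make the greedy search concrete (and to ground question 3), here is a minimal Python sketch of ID3-style top-down induction using information gain as the split criterion. It assumes categorical features stored as dicts; the function names, the dict-of-dicts tree representation, and the toy data are illustrative choices, not code from the lecture.

```python
import math
from collections import Counter

def entropy(labels):
    # H(Y) = -sum_y p(y) log2 p(y), estimated from the empirical label counts
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, feature):
    # Gain(S, F) = H(S) - sum over values v of |S_v|/|S| * H(S_v)
    n = len(labels)
    remainder = 0.0
    for value in {x[feature] for x in examples}:
        subset = [y for x, y in zip(examples, labels) if x[feature] == value]
        remainder += (len(subset) / n) * entropy(subset)
    return entropy(labels) - remainder

def id3(examples, labels, features):
    # Greedy top-down search: commit to the single best attribute, recurse.
    if len(set(labels)) == 1:        # pure node: return the single label
        return labels[0]
    if not features:                 # out of attributes: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: information_gain(examples, labels, f))
    tree = {best: {}}
    for value in {x[best] for x in examples}:
        idx = [i for i, x in enumerate(examples) if x[best] == value]
        tree[best][value] = id3([examples[i] for i in idx],
                                [labels[i] for i in idx],
                                [f for f in features if f != best])
    return tree

# Hypothetical toy data, in the spirit of Mitchell's PlayTennis example:
X = [{"outlook": "sunny", "humidity": "high"},
     {"outlook": "sunny", "humidity": "normal"},
     {"outlook": "rain",  "humidity": "high"}]
y = ["no", "yes", "yes"]
print(id3(X, y, ["outlook", "humidity"]))
```

The greedy structure is also part of an answer to question 1: each recursive call commits to a single attribute instead of enumerating the exponentially many possible trees, trading optimality for tractability.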
Machine Learning 10-701
Tom M. Mitchell
Machine Learning Department, Carnegie Mellon University
January 13, 2011

Today:
- Review probability

(Many of these slides are derived from William Cohen, Andrew Moore, Aarti Singh, and Eric Xing. Thanks!)

Readings:
- Probability review: Bishop Ch. 1 thru 1.2.3; Bishop Ch. 2 thru 2.2; Andrew Moore's online tutorial

The Problem of Induction

David Hume (1711-1776) pointed out:
1. Empirically, induction seems to work.
2. Statement (1) is itself an application of induction.
This stumped people for about 200 years.

Probability Overview

- Events: discrete random variables, continuous random variables, compound events
- Axioms of probability: what defines a reasonable theory of uncertainty
- Independent events
- Conditional probabilities
- Bayes rule and beliefs
- Joint probability distribution
- Expectations
- Independence and conditional independence

Random Variables

Informally, A is a random variable if A denotes something about which we are uncertain, perhaps the outcome of a randomized experiment.

Examples:
- A = True if a randomly drawn person from our class is female
- A = the hometown of a randomly drawn person from our class
- A = True if two randomly drawn persons from our class have the same birthday

Define P(A) as "the fraction of possible worlds in which A is true," or "the fraction of times A holds in repeated runs of the random experiment."
- The set of possible worlds is called the sample space, S.
- A random variable A is a function defined over S: A : S → {0, 1}.

A Little Formalism

More formally, we have:
- a sample space S (e.g., the set of students in our class), aka the set of possible worlds
- random variables, which are functions defined over the sample space:
    Gender : S → {m, f}
    Height : S → Reals
- events, which are subsets of S:
    e.g., the subset of S for which Gender = f
    e.g., the subset of S for which (Gender = m) AND (eyeColor = blue)

We're often interested in probabilities of specific events, and of specific events conditioned on other specific events.

Visualizing A

[Figure: the sample space of all possible worlds, drawn as a box of area 1; the worlds in which A is true form an oval, so P(A) = area of the oval, and the worlds outside the oval are those in which A is false.]

The Axioms of Probability

1. 0 ≤ P(A) ≤ 1
2. P(True) = 1
3. P(False) = 0
4. P(A or B) = P(A) + P(B) - P(A and B)

de Finetti (1931): when gambling based on an uncertainty formalism, you can be exploited by an opponent iff your formalism violates these axioms.

Interpreting the axioms:
- The area of A can't get any smaller than 0, and a zero area would mean no world could ever have A true.
- The area of A can't get any bigger than 1, and an area of 1 would mean all worlds have A true.

Theorems from the Axioms

P(not A) = P(¬A) = 1 - P(A). Proof from the axioms:
- P(A or ¬A) = 1 and P(A and ¬A) = 0
- P(A or ¬A) = P(A) + P(¬A) - P(A and ¬A)
- so 1 = P(A) + P(¬A) - 0

Elementary Probability in Pictures

[Figure: the sample space split into the regions A and ¬A, illustrating P(A) + P(¬A) = 1.]

Another Useful Theorem

P(A) = P(A and B) + P(A and ¬B). Proof from the axioms:
- A = [A and (B or ¬B)] = [(A and B) or (A and ¬B)]
- P(A) = P(A and B) + P(A and ¬B) - P[(A and B) and (A and ¬B)]
- the last term is P(False) = 0, so P(A) = P(A and B) + P(A and ¬B)
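Because every result above follows from the axioms together with the definition "P(A) is the fraction of possible worlds in which A is true," the theorems can be checked mechanically on a small finite sample space. Below is a minimal Python sketch; the (gender, eye color) worlds and the assumption that all worlds are equally likely are made up for illustration, echoing the Gender/eyeColor events used earlier.

```python
from fractions import Fraction
from itertools import product

# Sample space S: each possible world is a (gender, eye_color) pair,
# assumed equally likely for this illustration.
S = list(product(["m", "f"], ["blue", "brown", "green"]))

def P(event):
    # P(A) = fraction of possible worlds in which event A is true
    return Fraction(sum(1 for world in S if event(world)), len(S))

A = lambda w: w[0] == "f"        # the event Gender = f
B = lambda w: w[1] == "blue"     # the event eyeColor = blue

# Axiom: P(A or B) = P(A) + P(B) - P(A and B)
assert P(lambda w: A(w) or B(w)) == P(A) + P(B) - P(lambda w: A(w) and B(w))

# Theorem: P(not A) = 1 - P(A)
assert P(lambda w: not A(w)) == 1 - P(A)

# Theorem: P(A) = P(A and B) + P(A and not B)
assert P(A) == P(lambda w: A(w) and B(w)) + P(lambda w: A(w) and not B(w))

print("all identities hold on this sample space")
```

Using exact Fractions rather than floats keeps the identities exact, which matches the picture of probabilities as areas of subsets of S.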