Artificial Intelligence: Representation and Problem Solving
15-381, April 10, 2007
Introduction to Learning & Decision Trees
Michael S. Lewicki, Carnegie Mellon

Introduction to learning

• What is learning?
  - more than just memorizing facts
  - learning the underlying structure of the problem or data
• A fundamental aspect of learning is generalization:
  - given a few examples, can you generalize to others?
• Learning is ubiquitous:
  - medical diagnosis: identify new disorders from observations
  - loan applications: predict the risk of default
  - prediction (climate, stocks, etc.): predict the future from current and past data
  - speech/object recognition: from examples, generalize to others
• aka: regression, pattern recognition, machine learning, data mining

Representation

• How do we model or represent the world?
• All learning requires some form of representation.
• Learning: adjust the model parameters {θ1, . . . , θn} to match the data.
[Diagram: world (or data) ↔ model {θ1, . . . , θn}]

The complexity of learning

• Fundamental trade-off in learning: the complexity of the model vs. the amount of data required to learn its parameters.
• The more complex the model, the more it can describe, but the more data it requires to constrain the parameters.
• Consider a hypothesis space of N models:
  - How many bits would it take to identify which of the N models is ‘correct’?
  - log2(N) in the worst case
• Want simple models that explain the examples and generalize to others
  - Ockham’s (some say Occam) razor

Complex learning example: curve fitting

• Data: t = sin(2πx) + noise
[Figure: noisy samples (x_n, t_n) of sin(2πx) on [0, 1], with a fitted curve y(x_n, w); example from Bishop (2006), Pattern Recognition and Machine Learning]
• How do we model the data?
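The curve-fitting example can be tried directly. Below is a minimal sketch: the sample size, noise level, and polynomial orders are illustrative assumptions, not values taken from the slides. It generates noisy samples of sin(2πx) and fits polynomials of increasing order M by minimizing the sum-of-squares error E(w).

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples of the target function t = sin(2*pi*x).
# N, the noise level, and the orders tried below are illustrative assumptions.
N = 10
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)

# Fit y(x, w) = sum_j w_j x^j by least squares, i.e. minimize
# E(w) = (1/2) * sum_n [y(x_n, w) - t_n]^2.
E = {}
for M in (0, 1, 3, 9):
    w = np.polyfit(x, t, deg=M)          # coefficients, highest order first
    E[M] = 0.5 * np.sum((np.polyval(w, x) - t) ** 2)
    print(f"M = {M}: training error E(w) = {E[M]:.4f}")
```

With 10 data points, the order-9 polynomial drives the training error to essentially zero by passing through every point, yet it oscillates wildly between them: low training error alone does not mean the model has learned the underlying structure.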
Polynomial curve fitting

• Model: a polynomial of order M,
  y(x, w) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j
• Fit by minimizing the sum-of-squares error
  E(w) = \frac{1}{2} \sum_{n=1}^{N} [y(x_n, w) - t_n]^2
[Figure: fits for several orders M on the same data; example from Bishop (2006), Pattern Recognition and Machine Learning]

More data are needed to learn the correct model

[Figure: the high-order fit with a small vs. a large data set; example from Bishop (2006), Pattern Recognition and Machine Learning]
• Fitting a high-order polynomial to too few data points is overfitting.

Types of learning

[Diagrams: in each, a model {θ1, . . . , θn} is adapted to the world (or data)]
• supervised: the model is trained to match a desired output {y1, . . . , yn}
• unsupervised: the model adapts to the data with no target output
• reinforcement: the model’s output earns a reinforcement signal

Decision Trees

Decision trees: classifying from a set of attributes

Predicting credit risk:

  <2 years at current job? | missed payments? | defaulted?
  N | N | N
  Y | N | Y
  N | N | N
  N | N | N
  N | Y | Y
  Y | N | N
  N | Y | N
  N | Y | Y
  Y | N | N
  Y | N | N

The tree (root: bad: 3, good: 7):

  missed payments?
    Y → bad: 2, good: 1
    N → bad: 1, good: 6 → <2 years at current job?
          Y → bad: 1, good: 3
          N → bad: 0, good: 3

• each level splits the data according to a different attribute
• goal: achieve perfect classification with a minimal number of decisions
  - not always possible due to noise or inconsistencies in the data

Observations

• Any boolean function can be represented by a decision tree.
• Not good for all functions, e.g.:
  - parity function: return 1 iff an even number of inputs are 1
  - majority function: return 1 if more than half of the inputs are 1
• Best when a small number of attributes provide a lot of information
• Note: finding the optimal tree for arbitrary data is NP-hard.
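The two hard cases above are easy to state as code; a sketch, assuming the inputs arrive as a tuple of 0/1 bits:

```python
def parity(bits):
    """Return 1 iff an even number of the inputs are 1 (as defined on the slide)."""
    return 1 if sum(bits) % 2 == 0 else 0

def majority(bits):
    """Return 1 if more than half of the inputs are 1."""
    return 1 if sum(bits) > len(bits) / 2 else 0

# Flipping any single input flips the parity, so no attribute ever settles
# the answer early: a decision tree for n-bit parity must test every
# attribute on every path, giving 2^n leaves.
print(parity((1, 1, 0)))    # two ones -> even -> 1
print(majority((1, 1, 0)))  # 2 of 3 inputs are 1 -> 1
```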
Decision trees with continuous values

Predicting credit risk:

  years at current job | # missed payments | defaulted?
  7    | 0 | N
  0.75 | 0 | Y
  3    | 0 | N
  9    | 0 | N
  4    | 2 | Y
  0.25 | 0 | N
  5    | 1 | N
  8    | 4 | Y
  1.0  | 0 | N
  1.75 | 0 | N

[Figure: the examples plotted in the (years at current job, # missed payments) plane, with axis-aligned decision boundaries, e.g. # missed payments > 1.5]
• Now the tree corresponds to the order and placement of the boundaries
• General case:
  - arbitrary number of attributes: binary, multi-valued, or continuous
  - output: binary, multi-valued (decision or axis-aligned classification trees), or continuous (regression trees)

Examples

• loan applications
• medical diagnosis
• movie preferences (Netflix contest)
• spam filters
• security screening
• many real-world systems, and an AI success
• In each case, we want
  - accurate classification, i.e. minimize error
  - efficient decision making, i.e. the fewest # of decisions/tests
• the decision sequence could be further complicated
  - may want to minimize false negatives in medical diagnosis, or minimize the cost of the test sequence
  - don’t want to miss important email

Decision Trees

• a simple example of inductive learning:
  1. learn a decision tree from training examples
  2. predict classes for novel testing examples
• Generalization is how well we do on the testing examples.
• Only works if we can learn the underlying structure of the data.
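The two-step recipe above can be made concrete with the credit-risk examples from the earlier slide. The tree structure and counts come from the slides; labelling each leaf by its majority class is an assumption, and hard-coding the tree is of course a stand-in for actually learning it:

```python
# The ten credit-risk examples from the slides:
# (<2 years at current job, missed payments, defaulted)
data = [
    (False, False, False), (True, False, True), (False, False, False),
    (False, False, False), (False, True, True), (True, False, False),
    (False, True, False), (False, True, True), (True, False, False),
    (True, False, False),
]

def predict(short_job, missed):
    # The two-level tree from the slides, each leaf labelled by its
    # majority class (an assumption; the slides only give the counts).
    if missed:
        return True      # leaf: bad 2, good 1 -> predict default
    if short_job:
        return False     # leaf: bad 1, good 3 -> predict no default
    return False         # leaf: bad 0, good 3 -> predict no default

correct = sum(predict(s, m) == d for s, m, d in data)
print(f"training accuracy: {correct}/{len(data)}")   # 8/10
```

Even on its own training data the tree misclassifies two examples; the quantity that actually matters, though, is accuracy on held-out testing examples.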
Choosing the attributes

• How do we find a decision tree that agrees with the training data?
• Could just choose a tree that has one path to a leaf for each example
  - but this just memorizes the observations (assuming the data are consistent)
  - we want it to generalize to new examples
• Ideally, the best attribute would partition the data into positive and negative examples
• Strategy (greedy):
  - choose the attribute that gives the best partition first
• Want correct classification with the fewest number of tests

Problems

[Same credit-risk data and tree as above: root bad: 3, good: 7, split on missed payments?, then <2 years at current job?]
• How do we know which attribute or value to split on?
• When should we stop splitting?
• What do we do when we can’t achieve perfect classification?
• What if the tree is too large? Can we approximate it with a smaller tree?
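One simple way to answer "which attribute do we split on?" is to score each candidate by how many training examples its split leaves misclassified, labelling every branch with its majority class. This scoring rule is an illustrative choice (the slides do not commit to one), but on the credit-risk data it picks the same root attribute the earlier tree uses:

```python
from collections import Counter

# (<2 years at current job, missed payments, defaulted) -- data from the slides
data = [
    (False, False, False), (True, False, True), (False, False, False),
    (False, False, False), (False, True, True), (True, False, False),
    (False, True, False), (False, True, True), (True, False, False),
    (True, False, False),
]

def split_errors(attr_index):
    """Misclassifications left after splitting on one attribute,
    with each branch labelled by its majority class."""
    errors = 0
    for value in (False, True):
        branch = [row[2] for row in data if row[attr_index] == value]
        if branch:
            majority_label = Counter(branch).most_common(1)[0][0]
            errors += sum(label != majority_label for label in branch)
    return errors

scores = {"<2 years at current job": split_errors(0),
          "missed payments": split_errors(1)}
best = min(scores, key=scores.get)
print(scores)   # {'<2 years at current job': 3, 'missed payments': 2}
print(best)     # 'missed payments' -- the root used on the slides
```

A greedy learner applies this scoring recursively to each branch; note that greedy choices are not guaranteed to yield the globally smallest tree, which is consistent with the NP-hardness remark above.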