Artificial Intelligence: Representation and Problem Solving
15-381, April 17, 2007
Probabilistic Learning
Michael S. Lewicki, Carnegie Mellon

Reminder
• No class on Thursday - spring carnival.

Recall the basic algorithm for learning decision trees:
1. start with the whole training data
2. select the attribute or value along a dimension that gives the "best" split, using information gain or another criterion
3. create child nodes based on the split
4. recurse on each child using the child's data until a stopping criterion is reached:
   • all examples have the same class
   • the amount of data is too small
   • the tree is too large
• Does this capture probabilistic relationships?

Quantifying the certainty of decisions
Predicting credit risk
[Table: loan-applicant records with Y/N attributes "<2 years at current job?", "missed payments?", and "defaulted?".]
• Suppose that instead of a yes-or-no answer, we want some estimate of how strongly we believe a loan applicant is a credit risk.
• This might be useful if we want some flexibility in adjusting our decision criteria.
   - E.g., suppose we're willing to take more risk if times are good.
   - Or, we may want to examine cases we believe are higher risks more carefully.

The mushroom data
• Or suppose we wanted to know how likely a mushroom is to be safe to eat.
• Do decision trees give us that information?

   #    EDIBLE?      CAP-SHAPE   CAP-SURFACE   …
   1    edible       flat        fibrous       …
   2    poisonous    convex      smooth        …
   3    edible       flat        fibrous       …
   4    edible       convex      scaly         …
   5    poisonous    convex      smooth        …
   6    edible       convex      fibrous       …
   7    poisonous    flat        scaly         …
   8    poisonous    flat        scaly         …
   9    poisonous    convex      fibrous       …
   10   poisonous    convex      fibrous       …
   11   poisonous    flat        smooth        …
   12   edible       convex      smooth        …
   13   poisonous    knobbed     scaly         …
   14   poisonous    flat        smooth        …
   15   poisonous    flat        fibrous       …
   …
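One way to answer the question above is to let each decision-tree leaf report the empirical class fractions of the training examples that reach it, rather than only a majority label. A minimal sketch (the function name is my own illustration, not code from the lecture):

```python
# Illustrative sketch: a decision-tree leaf can report the empirical
# class distribution of the training examples that reach it, rather
# than only the majority label.
from collections import Counter

def leaf_class_probabilities(labels):
    """Estimate p(class) at a leaf from the labels that reach it."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}

# Class labels of the 15 mushroom examples shown above:
labels = ["edible", "poisonous", "edible", "edible", "poisonous",
          "edible", "poisonous", "poisonous", "poisonous", "poisonous",
          "poisonous", "edible", "poisonous", "poisonous", "poisonous"]
probs = leaf_class_probabilities(labels)
print(probs)  # edible ≈ 1/3, poisonous ≈ 2/3
```

These fractions are exactly the probabilistic information a plain classification tree discards when it outputs only the winning class.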
Fisher's iris data
[Figure: petal width (cm) vs. petal length (cm) for Iris setosa, Iris versicolor, and Iris virginica.]
In which example would you be more confident about the class?
Decision trees provide a classification but not uncertainty.

The general classification problem
Given data, we want to learn a model that can correctly classify novel observations.
• input: a set of T observations, D = {x_1, ..., x_T}, each an N-dimensional vector (binary, discrete, or continuous)
• desired output: a binary classification vector y = {y_1, ..., y_K}, where y_i = 1 if x ∈ C_i (class i), and 0 otherwise
• model (e.g., a decision tree), defined by M parameters θ = {θ_1, ..., θ_M}
How do we approach this probabilistically?

The answer to all questions of uncertainty
• Let's apply Bayes' rule to infer the most probable class given the observation:

   p(C_k|x) = p(x|C_k) p(C_k) / p(x) = p(x|C_k) p(C_k) / Σ_k p(x|C_k) p(C_k)

• This is the answer, but what does it mean?
• How do we specify the terms?
   - p(C_k) is the prior probability of the different classes.
   - p(x|C_k) is the data likelihood, i.e., the probability of x given class C_k.
• How should we define this?
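The Bayes'-rule computation above is straightforward to sketch in code. The function below is my own illustration (not the lecture's code): it multiplies each likelihood by its prior and normalizes by the evidence p(x):

```python
# Illustrative sketch of Bayes'-rule classification. Given priors
# p(C_k) and likelihoods p(x|C_k), the posterior is
#   p(C_k|x) = p(x|C_k) p(C_k) / sum_j p(x|C_j) p(C_j).

def posterior(likelihoods, priors):
    """likelihoods[k] = p(x|C_k), priors[k] = p(C_k); returns p(C_k|x) for all k."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    evidence = sum(joint)              # p(x), by the law of total probability
    return [j / evidence for j in joint]

# Two equally likely classes; class 0 explains x four times better:
post = posterior([0.8, 0.2], [0.5, 0.5])
print(post)  # ≈ [0.8, 0.2]
```

Note that with equal priors the posterior ratio equals the likelihood ratio, which is exactly the simplification used on the following slides.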
What classifier would give "optimal" performance?
• Consider the iris data again.
• How would we minimize the number of future misclassifications?
• We would need to know the true distribution of the classes.
• Assume the class-conditional distributions are Gaussian.
• The number of samples in each class is the same (50), so assume p(C_k) is equal for all classes.
• Because p(x) is the same for all classes, we have:

   p(C_k|x) = p(x|C_k) p(C_k) / p(x) ∝ p(x|C_k) p(C_k)

[Figure: Gaussian likelihoods p(petal length | C2) and p(petal length | C3) fit to the iris data.]

Where do we put the boundary?
[Figure: the two likelihoods p(petal length | C2) and p(petal length | C3) with a candidate decision boundary. R32 is the region where C3 is misclassified as C2; R23 is the region where C2 is misclassified as C3.]
• Shifting the boundary trades off the two errors.
• The misclassification error is defined by

   p(error) = ∫_{R32} p(C3|x) dx + ∫_{R23} p(C2|x) dx

   which in our case is proportional to the data likelihood.
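The trade-off between the two error integrals can be made concrete numerically. The sketch below assumes 1-D Gaussian class-conditionals with equal priors; the means and standard deviations are illustrative placeholders, not the actual iris fits:

```python
# Illustrative sketch: the two error regions for a 1-D decision
# boundary b between two Gaussian class-conditionals. The means and
# standard deviations below are placeholders, not the actual iris fits.
from math import erf, sqrt

def gauss_cdf(x, mu, sigma):
    """Cumulative distribution of a Gaussian with mean mu, std sigma."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def p_error(b, mu2, s2, mu3, s3, prior=0.5):
    """p(error) for boundary b: classify x < b as C2, x >= b as C3."""
    r23 = 1.0 - gauss_cdf(b, mu2, s2)  # C2 samples landing in the C3 region
    r32 = gauss_cdf(b, mu3, s3)        # C3 samples landing in the C2 region
    return prior * r23 + prior * r32

# Shifting the boundary trades off the two errors; with equal variances
# the total error is smallest at the midpoint of the means:
errs = {b: p_error(b, 4.3, 0.5, 5.6, 0.5) for b in (4.5, 4.95, 5.4)}
print(errs)
```

Moving b left shrinks R32 but grows R23, and vice versa; only the crossing point minimizes their sum.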
• One region would yield p(C3|x) > p(C2|x), but we're still classifying that region as C2!

The optimal decision boundary
• The misclassification error is minimal at the point where

   p(C3|x) = p(C2|x)
   ⇒ p(x|C3) p(C3) / p(x) = p(x|C2) p(C2) / p(x)
   ⇒ p(x|C3) = p(x|C2)   (since the priors are equal)

[Figure: posteriors p(C2 | petal length) and p(C3 | petal length) crossing at the optimal decision boundary.]
• Note: this assumes we have only two classes.

Bayesian classification for more complex models
• Recall the class-conditional probability: p(C_k|x).
• How do we define the data likelihood p(x|C_k), i.e., the probability of x given class C_k?
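The optimal-boundary condition p(x|C3) = p(x|C2) derived above can be located numerically. A sketch with illustrative Gaussian parameters (again, not the actual iris fits), using bisection on the likelihood difference:

```python
# Illustrative sketch: with equal priors, the optimal boundary is where
# the likelihoods cross, p(x|C2) = p(x|C3). Parameters are placeholders;
# bisection finds the crossing between the two means.
from math import exp, pi, sqrt

def gauss_pdf(x, mu, sigma):
    """Density of a Gaussian with mean mu and standard deviation sigma."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2.0 * pi))

def crossing(mu2, s2, mu3, s3, lo, hi, tol=1e-9):
    """Bisect f(x) = p(x|C2) - p(x|C3) on [lo, hi] to locate the boundary."""
    f = lambda x: gauss_pdf(x, mu2, s2) - gauss_pdf(x, mu3, s3)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# Equal variances: the crossing is the midpoint of the two means.
b = crossing(4.3, 0.5, 5.6, 0.5, lo=4.3, hi=5.6)
print(round(b, 3))  # 4.95
```

With unequal variances or priors the crossing shifts toward the tighter or less probable class, which is why the general condition is stated on the posteriors, not the likelihoods.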