Decision Theory, Naïve Bayes, ROC Curves

Generative vs Discriminative Methods
• Logistic regression learns a mapping h: x → y.
• When we only learn a mapping x → y, the method is called discriminative.
• Generative methods learn p(x, y) = p(x|y) p(y), i.e. for every class we learn a model of the input distribution.
• Advantage: this acts as a regularizer for small datasets (but when N is large, discriminative methods tend to work better).
• Advantage: we can easily combine various sources of information. Say we have learned a model for attribute I and now receive additional information about attribute II; then:
  p(xI, xII | y) ≈ p(xI | y) p(xII | y)
• This assumption is called "conditional independence of x given y", and the corresponding classifier is called the "Naïve Bayes classifier".
• Disadvantage: you model more than necessary for making decisions, and the input space (x-space) can be very high dimensional.

Naïve Bayes: decisions
• The posterior distribution is
  p(y | xI, xII) = p(xI | y) p(xII | y) p(y) / p(xI, xII)
                 = p(xI | y) p(xII | y) p(y) / Σ_y' p(xI | y') p(xII | y') p(y')
• It can be used to decide which label to assign to a new data-case.
• Note that to make a decision you do not need the denominator.
• If we compute the posterior p(y | xI) first, we can use it as a new prior for the new information xII (prove this at home):
  p(y | xI, xII) ∝ p(xII | y) p(y | xI)

Naïve Bayes: learning
• What do we need to learn from data?
  – p(y)
  – p(xk | y) for all attributes k
• A very simple rule is to use the frequencies in the data (assuming discrete states):
  p(y) = [nr. of data-cases with label y] / [total nr. of data-cases]
  p(xk = i | y) = [nr. of data-cases with xk = i and label y] / [nr. of data-cases with label y]
• To regularize, we imagine that each state i starts out with a small fractional count c of data-cases (K = total nr. of states of xk):
  p(xk = i | y) = [c + nr. of data-cases with xk = i and label y] / [Kc + nr. of data-cases with label y]
• What difficulties do you expect if we do not assume conditional independence?
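The counting rules above can be sketched in code. This is an illustrative sketch, assuming discrete attribute states; the function name `fit_nb`, the toy data, and the pseudo-count argument `c` are my own, not from the slides.

```python
from collections import Counter

def fit_nb(X, y, c=1.0):
    """Estimate p(y) and smoothed p(x_k = i | y) from labelled data.

    X: list of attribute tuples, y: list of labels,
    c: fractional pseudo-count per state (c = 1 is Laplace smoothing).
    """
    n = len(y)
    labels = sorted(set(y))
    n_attrs = len(X[0])
    # States observed for each attribute k (K in the smoothing formula).
    states = [sorted({x[k] for x in X}) for k in range(n_attrs)]

    prior = {lab: y.count(lab) / n for lab in labels}        # p(y)
    cond = {}                                                # p(x_k = i | y)
    for lab in labels:
        rows = [x for x, lab2 in zip(X, y) if lab2 == lab]
        for k in range(n_attrs):
            counts = Counter(x[k] for x in rows)
            K = len(states[k])
            for i in states[k]:
                # [c + nr. cases with x_k = i and label y] / [Kc + nr. cases with label y]
                cond[(k, i, lab)] = (c + counts[i]) / (K * c + len(rows))
    return prior, cond

# Toy data: two binary attributes, two classes.
X = [(0, 1), (0, 0), (1, 1), (1, 1)]
y = [0, 0, 1, 1]
prior, cond = fit_nb(X, y, c=1.0)
```

Note that the smoothed estimates still sum to one over the states i of each attribute, which is why the denominator adds Kc rather than just c.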
• Does NB over-estimate or under-estimate the uncertainty of its predictions?
• Practical guideline: work in the log-domain to avoid numerical underflow:
  Π_j p(xj | y) → Σ_j log p(xj | y)

Loss functions
• What if it is much more costly to make an error when predicting y=1 than y=0?
• Example: y=1 means "patient has cancer", y=0 means "patient is healthy".
• Introduce the expected loss:
  E[L] = Σ_k Σ_j L_kj ∫_{Rj} p(y = k, x) dx
  Here ∫_{Rj} p(y = k, x) dx is the total probability of predicting class j while the true class is k, and Rj is the region of x-space where an example is assigned to class j.
• Example loss matrix (rows: true class, columns: predicted class):

                 predict cancer   predict healthy
  true cancer          0                1000
  true healthy         1                   0

Decision surface
• How shall we choose the regions Rj?
• Solution: minimize E[L] over {Rj}.
• Take an arbitrary point x, compute Σ_k L_kj p(y = k | x) for all j, and pick the j that minimizes it.
• Since we minimize for every x separately, the total integral E[L] is minimal.
• Places where the decision switches belong to the "decision surface".
• What matrix L corresponds to the earlier decision rule of picking the class with the highest posterior?

ROC Curve
[Figure: class-conditional densities for y=0 and y=1 along x, with a moving decision threshold.]
• Assume 2 classes and 1 attribute.
• Plot the class-conditional densities p(x | y).
• Shift the decision boundary from right to left.
• As you move it the loss changes, so you want to find the point where it is minimized.
• If L = [0 1; 1 0], where is the expected loss minimal?
• As you shift, the true positive rate (TP) and the false positive rate (FP) change.
• By plotting the entire curve you can see the tradeoffs.
• This is easily generalized to more attributes if you can find a decision threshold to vary.

Evaluation: ROC curves
[Figure: densities of class 1 (positives) and class 0 (negatives) with a moving threshold.]
• TP = true positive rate = (# positives classified as positive) / (# positives)
• FP = false positive rate = (# negatives classified as positive) / (# negatives)
• TN = true negative rate = (# negatives classified as negative) / (# negatives)
• FN = false negative rate = (# positives classified as negative) / (# positives)
• Identify a threshold in your classifier that you can shift.
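The minimum-expected-loss decision rule can be illustrated with a small sketch using the cancer/healthy loss matrix from the slides; the function name `decide` and the example posteriors are my own.

```python
# Rows: true class, columns: predicted class (class 0 = cancer, 1 = healthy).
L = [[0, 1000],   # true cancer:  predicting cancer costs 0, healthy costs 1000
     [1,    0]]   # true healthy: predicting cancer costs 1, healthy costs 0

def decide(posterior, L):
    """Return the class j minimizing the expected loss sum_k L[k][j] p(y=k|x)."""
    n_classes = len(L)
    risks = [sum(L[k][j] * posterior[k] for k in range(n_classes))
             for j in range(n_classes)]
    return min(range(n_classes), key=lambda j: risks[j])

# Even with only 1% posterior probability of cancer, the asymmetric loss
# makes "predict cancer" (class 0) the optimal decision:
print(decide([0.01, 0.99], L))   # -> 0
```

With this L, predicting healthy is only optimal once the posterior probability of cancer drops below roughly 1/1001, which shows how the loss matrix shifts the decision surface away from the point where the posteriors are equal.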
Plot the ROC curve (TP rate vs. FP rate) as you shift that threshold.
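The threshold-sweeping procedure above can be sketched as follows; the function name `roc_points` and the toy scores are illustrative, not from the slides.

```python
def roc_points(scores, labels):
    """Return (FP rate, TP rate) pairs as the threshold sweeps over the scores.

    scores: classifier outputs (higher = more likely positive),
    labels: true labels (1 = positive, 0 = negative).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    # Sweep the threshold over every observed score, from high to low.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 0)
        points.append((fp / n_neg, tp / n_pos))
    # Prepend (0, 0): the point where the threshold is above every score.
    return [(0.0, 0.0)] + points

# Toy example: well-separated scores pass through the ideal corner (0, 1).
pts = roc_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
```

Every ROC curve starts at (0, 0) (threshold above all scores: nothing predicted positive) and ends at (1, 1) (threshold below all scores: everything predicted positive); the tradeoff lies in the points between.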