Decision Theory, Naïve Bayes, ROC Curves

Outline:
• Generative vs Discriminative Methods
• Naïve Bayes: decisions
• Naïve Bayes: learning
• Loss functions
• Decision surface
• ROC Curve
• Evaluation: ROC curves

Generative vs Discriminative Methods

• Logistic regression learns a mapping h: x → y.
• When we only learn a mapping x → y, the method is called discriminative.
• Generative methods learn p(x, y) = p(x | y) p(y), i.e. for every class we learn a model of the input distribution.
• Advantage: this acts as a regularizer for small datasets (but when N is large, discriminative methods tend to work better).
• Disadvantage: you model more than is necessary for making decisions, and the input space (x-space) can be very high-dimensional.
• Advantage: we can easily combine various sources of information. Say we have learned a model for attribute I and now receive additional information about attribute II; then we assume

    p(x_I, x_{II} \mid y) \approx p(x_I \mid y) \, p(x_{II} \mid y)

• This assumption is called "conditional independence of x given y".
• The corresponding classifier is called the "Naïve Bayes classifier".

Naïve Bayes: decisions

    p(y \mid x_I, x_{II}) = \frac{p(x_I \mid y) \, p(x_{II} \mid y) \, p(y)}{p(x_I, x_{II})}
                          = \frac{p(x_I \mid y) \, p(x_{II} \mid y) \, p(y)}{\sum_{y'} p(x_I \mid y') \, p(x_{II} \mid y') \, p(y')}

• This is the "posterior distribution", and it can be used to decide which label to assign to a new data-case.
• Note that to make a decision you do not need the denominator: it is the same for every y, so it does not change which label wins.
• If we compute the posterior p(y | x_I) first, we can use it as a new prior for the new information x_{II} (prove this at home):

    p(y \mid x_I, x_{II}) \propto p(x_{II} \mid y) \, p(y \mid x_I)

Naïve Bayes: learning

• What do we need to learn from data?
  – p(y)
  – p(x_k | y) for all attributes k
• A very simple rule is to use the frequencies in the data (assuming discrete states):

    p(y) = (nr. data-cases with label y) / (total nr. data-cases)
    p(x_k = i | y) = (nr. data-cases with x_k = i and label y) / (nr. data-cases with label y)

• To regularize, we imagine that each state i starts with a small fractional count c of data-cases (K = the number of states of x_k, so the smoothed estimates still sum to one over i):

    p(x_k = i | y) = (c + nr. data-cases with x_k = i and label y) / (Kc + nr. data-cases with label y)

• What difficulties do you expect if we do not assume conditional independence?
• Does NB over-estimate or under-estimate the uncertainty of its predictions?
• Practical guideline: work in the log-domain to avoid numerical underflow (a code sketch follows the Decision surface section below):

    \prod_j p(x_j \mid y) \;\rightarrow\; \sum_j \log p(x_j \mid y)

Loss functions

• What if it is much more costly to make an error on predicting y = 1 than on predicting y = 0?
• Example: y = 1 means "patient has cancer", y = 0 means "patient is healthy".
• Introduce the "expected loss":

    E[L] = \sum_k \sum_j L_{kj} \int_{R_j} p(y = k, x) \, dx

  Here \int_{R_j} p(y = k, x) \, dx is the total probability of predicting class j while the true class is k, and R_j is the region of x-space where an example is assigned to class j.
• Example loss matrix L_{kj}:

                       predict cancer    predict healthy
    true cancer              0                1000
    true healthy             1                   0

Decision surface

    E[L] = \sum_k \sum_j L_{kj} \int_{R_j} p(y = k, x) \, dx

• How shall we choose the regions R_j? Solution: minimize E[L] over {R_j}.
• Take an arbitrary point x. Compute

    \sum_k L_{kj} \, p(y = k \mid x)

  for all j and minimize over j.
• Since we minimize at every x separately, the total integral is minimal.
• Places where the decision switches belong to the "decision surface".
• What matrix L corresponds to the decision rule on slide 2 that uses the posterior? (The second sketch below checks one candidate.)
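The counting rules and the log-domain guideline from the Naïve Bayes sections above are easy to state in code. Here is a minimal NumPy sketch under my own naming (fit_naive_bayes and predict are not from the slides); it assumes discrete attributes encoded as integers 0..K-1 and labels 0..C-1.

```python
import numpy as np

def fit_naive_bayes(X, y, n_states, n_classes, c=1.0):
    """Counting estimates of p(y) and p(x_k = i | y), smoothed with fractional count c."""
    N, D = X.shape
    class_counts = np.bincount(y, minlength=n_classes)
    prior = class_counts / N                       # p(y) = nr. cases with label y / N
    cond = np.full((D, n_states, n_classes), c)    # each state starts with c pseudo-cases
    for n in range(N):
        for k in range(D):
            cond[k, X[n, k], y[n]] += 1.0          # count cases with x_k = i and label y
    cond /= c * n_states + class_counts            # (c + n_iy) / (Kc + n_y), normalized over i
    return prior, cond

def predict(x, prior, cond):
    """Pick argmax_y of log p(y) + sum_k log p(x_k | y) -- the log-domain product."""
    log_post = np.log(prior) + sum(np.log(cond[k, i, :]) for k, i in enumerate(x))
    return np.argmax(log_post)

# Toy usage: 4 binary attributes, 2 classes
X = np.array([[0, 1, 1, 0], [1, 1, 0, 0], [0, 0, 1, 1],
              [1, 0, 1, 1], [0, 1, 0, 1], [1, 1, 1, 0]])
y = np.array([0, 0, 1, 1, 1, 0])
prior, cond = fit_naive_bayes(X, y, n_states=2, n_classes=2)
print(predict([0, 1, 1, 0], prior, cond))
```

Because the joint factorizes over attributes, a later attribute can be folded in by simply adding its log p(x_new | y) term to the stored log-posterior — exactly the posterior-as-prior update from the decisions slide.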
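To illustrate the decision rule just derived, here is a small sketch (my own example code, with the loss matrix taken from the cancer/healthy table above): for a given x, evaluate \sum_k L_{kj} p(y = k | x) for every j and pick the minimizing j.

```python
import numpy as np

# L[k, j]: loss of deciding class j when the true class is k
# (the table above, with class 0 = cancer, class 1 = healthy)
L = np.array([[0.0, 1000.0],
              [1.0,    0.0]])

def decide(posterior, L):
    """Pick the class j minimizing the expected loss sum_k L[k, j] * p(y = k | x)."""
    return np.argmin(posterior @ L)   # (posterior @ L)[j] = sum_k posterior[k] * L[k, j]

# Even a 1% cancer probability triggers a "cancer" decision under this L:
print(decide(np.array([0.01, 0.99]), L))   # -> 0 (predict cancer)

# Candidate answer to the slide's question: with the 0/1 loss L = 1 - I, the
# expected loss of j is 1 - p(y = j | x), so argmin equals argmax of the posterior.
L01 = np.ones((2, 2)) - np.eye(2)
assert decide(np.array([0.3, 0.7]), L01) == np.argmax([0.3, 0.7])
```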
ROC Curve

• Assume 2 classes and 1 attribute.
• Plot the class-conditional densities p(x | y).
• Shift the decision boundary from right to left.
• As you move it, the expected loss changes, so you want to find the point where it is minimized.
• If L = [0 1; 1 0], where is the expected loss minimal?
• As you shift the boundary, the true positive rate (TP) and the false positive rate (FP) change.
• By plotting the entire curve you can see the tradeoffs.
• This is easily generalized to more attributes if you can find a decision threshold to vary.

[Figure: class-conditional densities p(x | y = 0) and p(x | y = 1) over x, with the decision boundary moving along x]

Evaluation: ROC curves

[Figure: score distributions for class 1 (positives) and class 0 (negatives), with a moving threshold]

• TP = true positive rate = (# positives classified as positive) / (# positives)
• FP = false positive rate = (# negatives classified as positive) / (# negatives)
• TN = true negative rate = (# negatives classified as negative) / (# negatives)
• FN = false negative rate = (# positives classified as negative) / (# positives)
• Identify a threshold in your classifier that you can shift, and plot the ROC curve while you shift that threshold.
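As a concrete version of the threshold sweep described above, here is a sketch (illustrative; roc_curve and the Gaussian toy data are my own, not from the slides):

```python
import numpy as np

def roc_curve(scores, labels):
    """Sweep a decision threshold over the scores; return (FP rate, TP rate) per setting."""
    thresholds = np.unique(scores)[::-1]            # from strict (right) to lenient (left)
    P, N = np.sum(labels == 1), np.sum(labels == 0)
    fp_rate, tp_rate = [], []
    for t in thresholds:
        pred = scores >= t                          # classify as positive above the threshold
        tp_rate.append(np.sum(pred & (labels == 1)) / P)   # positives classified as positive
        fp_rate.append(np.sum(pred & (labels == 0)) / N)   # negatives classified as positive
    return np.array(fp_rate), np.array(tp_rate)

# Toy data mimicking the two class-conditional densities in the figure:
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0, 1, 500), rng.normal(2, 1, 500)])
labels = np.concatenate([np.zeros(500, dtype=int), np.ones(500, dtype=int)])
fpr, tpr = roc_curve(scores, labels)
```

Plotting tpr against fpr traces the ROC curve; each point corresponds to one position of the moving threshold.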