UCI ICS 273A - Decision Theory

Decision Theory, Naïve Bayes, ROC Curves

Outline:
• Generative vs Discriminative Methods
• Naïve Bayes: decisions
• Naïve Bayes: learning
• Loss functions
• Decision surface
• ROC Curve
• Evaluation: ROC curves

Generative vs Discriminative Methods
• Logistic regression learns a mapping h: x → y.
• When we only learn a mapping x → y, it is called a discriminative method.
• Generative methods learn p(x, y) = p(x|y) p(y), i.e. for every class we learn a model over the input distribution.
• Advantage: this acts as regularization for small datasets (but when N is large, discriminative methods tend to work better).
• Disadvantage: you model more than is necessary for making decisions, and the input space (x-space) can be very high dimensional.
• We can easily combine various sources of information: say we have learned a model for attribute I, and now receive additional information about attribute II; then
  p(xI, xII | y) ≈ p(xI | y) p(xII | y)
• This is called "conditional independence of x given y".
• The corresponding classifier is called the "Naïve Bayes classifier".

Naïve Bayes: decisions
  p(y | xI, xII) = p(xI | y) p(xII | y) p(y) / p(xI, xII)
                 = p(xI | y) p(xII | y) p(y) / Σ_y' p(xI | y') p(xII | y') p(y')
• This is the "posterior distribution", and it can be used to decide what label to assign to a new data-case.
• Note that to make a decision you do not need the denominator.
• If we computed the posterior p(y | xI) first, we can use it as a new prior for the new information xII (prove this at home):
  p(y | xI, xII) ∝ p(xII | y) p(y | xI)

Naïve Bayes: learning
• What do we need to learn from data?
• p(y)
• p(xk | y) for all attributes k
• A very simple rule is to use the frequencies in the data (assuming discrete states):
• p(y) = [nr. of data-cases with label y] / [total nr. of data-cases]
• p(xk = i | y) = [nr. of data-cases with xk = i and label y] / [nr. of data-cases with label y]
• To regularize, we imagine that each state i starts with a small fractional count c of data-cases (K = total nr. of states of attribute xk):
  p(xk = i | y) = [c + nr. of data-cases with xk = i and label y] / [Kc + nr. of data-cases with label y]
• What difficulties do you expect if we do not assume conditional independence?
• Does NB over-estimate or under-estimate the uncertainty of its predictions?
• Practical guideline: work in the log-domain:
  Π_j p(xj | y)  →  Σ_j log p(xj | y)
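The counting, smoothing, and log-domain rules above translate directly into a short program. Below is a minimal sketch in Python/NumPy, assuming discrete attributes encoded as integers 0..K-1 (the same K for every attribute) and class labels 0..n_classes-1; the class and method names (CategoricalNaiveBayes, fit, log_posterior, predict) are illustrative choices, not part of the slides.

import numpy as np

# Minimal categorical Naive Bayes following the counting rules above.
# Assumption: every attribute takes values in {0, ..., n_states - 1}.
class CategoricalNaiveBayes:
    def __init__(self, n_states, n_classes, c=1.0):
        self.n_states = n_states    # K: number of states per attribute
        self.n_classes = n_classes
        self.c = c                  # small fractional count used to regularize

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        N, D = X.shape
        # p(y) = [nr. of data-cases with label y] / [total nr. of data-cases]
        self.log_prior = np.log(np.bincount(y, minlength=self.n_classes) / N)
        # p(xk = i | y) = (c + count(xk = i, y)) / (K*c + count(y))
        self.log_cond = np.zeros((D, self.n_states, self.n_classes))
        for k in range(D):
            for cls in range(self.n_classes):
                counts = np.bincount(X[y == cls, k], minlength=self.n_states)
                probs = (self.c + counts) / (self.n_states * self.c + counts.sum())
                self.log_cond[k, :, cls] = np.log(probs)
        return self

    def log_posterior(self, x):
        # Work in the log-domain: log p(y) + sum_k log p(xk | y).
        # The normalizing denominator is not needed to make a decision.
        return self.log_prior + sum(self.log_cond[k, x[k], :] for k in range(len(x)))

    def predict(self, x):
        return int(np.argmax(self.log_posterior(x)))

Because the per-attribute log-likelihoods simply add up, receiving a new attribute later amounts to adding one more term to the log-posterior, which is the sequential-update property p(y | xI, xII) ∝ p(xII | y) p(y | xI) noted above.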
Loss functions
• What if it is much more costly to make an error when predicting y=1 than y=0?
• Example: y=1 means "patient has cancer", y=0 means "patient is healthy".
• Introduce the "expected loss function":
  E[L] = Σ_k Σ_j L_kj ∫_{Rj} p(y = k, x) dx
• ∫_{Rj} p(y = k, x) dx is the total probability of predicting class j while the true class is k.
• Rj is the region of x-space where an example is assigned to class j.
• Example loss matrix L (rows: true class, columns: prediction):

  True \ Predict   cancer   healthy
  cancer              0       1000
  healthy             1          0

Decision surface
  E[L] = Σ_k Σ_j L_kj ∫_{Rj} p(y = k, x) dx
• How shall we choose the regions Rj?
• Solution: minimize E[L] over {Rj}.
• Take an arbitrary point x.
• Compute Σ_k L_kj p(y = k | x) for all j and minimize over j (see the sketch after these slides).
• Since we minimize for every x separately, the total integral is minimal.
• Places where the decision switches belong to the "decision surface".
• What matrix L corresponds to the decision rule on the "Naïve Bayes: decisions" slide, which picks the class with the largest posterior?

ROC Curve
• Assume 2 classes and 1 attribute.
• Plot the class-conditional densities p(x | y).
• Shift the decision boundary from right to left.
• As you move it, the expected loss changes, so you want to find the point where it is minimized.
• If L = [0 1; 1 0], where is the expected loss minimal?
• As you shift the boundary, the true positive rate (TP) and the false positive rate (FP) change.
• By plotting the entire curve you can see the tradeoffs.
• Easily generalized to more attributes if you can find a decision threshold to vary.
[Figure: class-conditional densities for y=0 and y=1 along x, with a moving decision threshold.]

Evaluation: ROC curves
[Figure: scores for class 1 (positives) and class 0 (negatives), with a moving threshold.]
• TP = true positive rate = # positives classified as positive / # positives
• FP = false positive rate = # negatives classified as positive / # negatives
• TN = true negative rate = # negatives classified as negative / # negatives
• FN = false negative rate = # positives classified as negative / # positives
• Identify a threshold in your classifier that you can shift, and plot the ROC curve while you shift that threshold (a sketch of this sweep follows below).
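The decision-surface argument above says: assign x to the class j that minimizes Σ_k L_kj p(y = k | x). Here is a minimal sketch of that rule, assuming the posterior is already available as a vector; the function name min_risk_decision, the posterior values, and the way the loss table is encoded are illustrative, not from the slides.

import numpy as np

# Minimum-expected-loss decision: pick the column j of the loss matrix
# that minimizes sum_k L[k, j] * p(y = k | x).
def min_risk_decision(posterior, L):
    expected_loss = posterior @ L   # entry j is sum_k p(y = k | x) * L[k, j]
    return int(np.argmin(expected_loss))

# Loss table from the slides (rows = true class, columns = prediction),
# with class 0 = cancer, class 1 = healthy (an assumed encoding).
L = np.array([[0.0, 1000.0],    # true cancer:  predict cancer = 0, predict healthy = 1000
              [1.0,    0.0]])   # true healthy: predict cancer = 1, predict healthy = 0

# Even a small posterior probability of cancer triggers the "cancer" decision.
print(min_risk_decision(np.array([0.01, 0.99]), L))   # -> 0 ("cancer")

Note that with the 0/1 loss L = [0 1; 1 0], this rule reduces to picking the class with the largest posterior, which answers the question on the Decision surface slide.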
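The threshold sweep described in the ROC slides can also be written down directly. Below is a minimal sketch, assuming a single continuous score per example (higher means "more likely class 1") and 0/1 labels; the function name roc_sweep, the synthetic Gaussian data, and the default loss matrix are illustrative assumptions, not from the slides.

import numpy as np

# Sweep a decision threshold over a 1-d score, recording (FP, TP) pairs for the
# ROC curve and the empirical expected loss for a given loss matrix L.
def roc_sweep(scores, labels, L=np.array([[0.0, 1.0], [1.0, 0.0]])):
    thresholds = np.sort(np.unique(scores))[::-1]   # shift threshold right to left
    n_pos = np.sum(labels == 1)
    n_neg = np.sum(labels == 0)
    curve = []
    for t in thresholds:
        pred = (scores >= t).astype(int)
        tp = np.sum((pred == 1) & (labels == 1)) / n_pos   # true positive rate
        fp = np.sum((pred == 1) & (labels == 0)) / n_neg   # false positive rate
        # Empirical E[L]: average L[true class, predicted class] over the data.
        loss = np.mean(L[labels, pred])
        curve.append((fp, tp, t, loss))
    return curve

# Synthetic 1-attribute example: class 0 ~ N(0,1), class 1 ~ N(2,1), equal priors.
# With the 0/1 loss, the best threshold sits where the class-conditional
# densities cross, as the ROC Curve slide suggests.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0, 1, 500), rng.normal(2, 1, 500)])
labels = np.concatenate([np.zeros(500, dtype=int), np.ones(500, dtype=int)])
curve = roc_sweep(scores, labels)
best = min(curve, key=lambda point: point[3])
print("threshold with minimal empirical loss:", best[2])

Plotting the recorded (FP, TP) pairs gives the ROC curve; changing the loss matrix (e.g. to the cancer/healthy table above) moves the minimum-loss point along that curve.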

