Decision Theory, Naïve Bayes, ROC Curves

Outline:
• Generative vs Discriminative Methods
• Naïve Bayes: decisions
• Naïve Bayes: learning
• Loss functions
• Decision surface
• ROC Curve
• Evaluation: ROC curves

Generative vs Discriminative Methods

• Logistic regression learns a mapping h: x → y.
• When we only learn a mapping x → y, the method is called discriminative.
• Generative methods learn p(x, y) = p(x | y) p(y), i.e. for every class we learn a model of the input distribution.
• Advantage: this acts as a regularizer for small datasets (but when N is large, discriminative methods tend to work better).
• Disadvantage: you model more than is necessary for making decisions, and the input space (x-space) can be very high-dimensional.
• Advantage: we can easily combine various sources of information. Say we have learned a model for attribute I and now receive additional information about attribute II; then we assume

    p(x_I, x_{II} \mid y) \approx p(x_I \mid y) \, p(x_{II} \mid y)

• This assumption is called "conditional independence of x given y".
• The corresponding classifier is called the "Naïve Bayes classifier".

Naïve Bayes: decisions

    p(y \mid x_I, x_{II}) = \frac{p(x_I \mid y) \, p(x_{II} \mid y) \, p(y)}{p(x_I, x_{II})}
                          = \frac{p(x_I \mid y) \, p(x_{II} \mid y) \, p(y)}{\sum_{y'} p(x_I \mid y') \, p(x_{II} \mid y') \, p(y')}

• This is the "posterior distribution", and it can be used to decide which label to assign to a new data-case.
• Note that to make a decision you do not need the denominator: it is the same for every y, so it does not change which label wins.
• If we compute the posterior p(y | x_I) first, we can use it as a new prior for the new information x_{II} (prove this at home):

    p(y \mid x_I, x_{II}) \propto p(x_{II} \mid y) \, p(y \mid x_I)

Naïve Bayes: learning

• What do we need to learn from data?
  – p(y)
  – p(x_k | y) for all attributes k
• A very simple rule is to use the frequencies in the data (assuming discrete states):

    p(y) = (nr. data-cases with label y) / (total nr. data-cases)
    p(x_k = i | y) = (nr. data-cases with x_k = i and label y) / (nr. data-cases with label y)

• To regularize, we imagine that each state i starts with a small fractional count c of data-cases (K = the number of states of x_k, so the smoothed estimates still sum to one over i):

    p(x_k = i | y) = (c + nr. data-cases with x_k = i and label y) / (Kc + nr. data-cases with label y)

• What difficulties do you expect if we do not assume conditional independence?
• Does NB over-estimate or under-estimate the uncertainty of its predictions?
• Practical guideline: work in the log-domain to avoid numerical underflow (a code sketch follows the Decision surface section below):

    \prod_j p(x_j \mid y) \;\rightarrow\; \sum_j \log p(x_j \mid y)

Loss functions

• What if it is much more costly to make an error on predicting y = 1 than on predicting y = 0?
• Example: y = 1 means "patient has cancer", y = 0 means "patient is healthy".
• Introduce the "expected loss":

    E[L] = \sum_k \sum_j L_{kj} \int_{R_j} p(y = k, x) \, dx

  Here \int_{R_j} p(y = k, x) \, dx is the total probability of predicting class j while the true class is k, and R_j is the region of x-space where an example is assigned to class j.
• Example loss matrix L_{kj}:

                       predict cancer    predict healthy
    true cancer              0                1000
    true healthy             1                   0

Decision surface

    E[L] = \sum_k \sum_j L_{kj} \int_{R_j} p(y = k, x) \, dx

• How shall we choose the regions R_j? Solution: minimize E[L] over {R_j}.
• Take an arbitrary point x. Compute

    \sum_k L_{kj} \, p(y = k \mid x)

  for all j and minimize over j.
• Since we minimize at every x separately, the total integral is minimal.
• Places where the decision switches belong to the "decision surface".
• What matrix L corresponds to the decision rule on slide 2 that uses the posterior? (The second sketch below checks one candidate.)
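The counting rules and the log-domain guideline from the Naïve Bayes sections above are easy to state in code. Here is a minimal NumPy sketch under my own naming (fit_naive_bayes and predict are not from the slides); it assumes discrete attributes encoded as integers 0..K-1 and labels 0..C-1.

```python
import numpy as np

def fit_naive_bayes(X, y, n_states, n_classes, c=1.0):
    """Counting estimates of p(y) and p(x_k = i | y), smoothed with fractional count c."""
    N, D = X.shape
    class_counts = np.bincount(y, minlength=n_classes)
    prior = class_counts / N                       # p(y) = nr. cases with label y / N
    cond = np.full((D, n_states, n_classes), c)    # each state starts with c pseudo-cases
    for n in range(N):
        for k in range(D):
            cond[k, X[n, k], y[n]] += 1.0          # count cases with x_k = i and label y
    cond /= c * n_states + class_counts            # (c + n_iy) / (Kc + n_y), normalized over i
    return prior, cond

def predict(x, prior, cond):
    """Pick argmax_y of log p(y) + sum_k log p(x_k | y) -- the log-domain product."""
    log_post = np.log(prior) + sum(np.log(cond[k, i, :]) for k, i in enumerate(x))
    return np.argmax(log_post)

# Toy usage: 4 binary attributes, 2 classes
X = np.array([[0, 1, 1, 0], [1, 1, 0, 0], [0, 0, 1, 1],
              [1, 0, 1, 1], [0, 1, 0, 1], [1, 1, 1, 0]])
y = np.array([0, 0, 1, 1, 1, 0])
prior, cond = fit_naive_bayes(X, y, n_states=2, n_classes=2)
print(predict([0, 1, 1, 0], prior, cond))
```

Because the joint factorizes over attributes, a later attribute can be folded in by simply adding its log p(x_new | y) term to the stored log-posterior — exactly the posterior-as-prior update from the decisions slide.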
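To illustrate the decision rule just derived, here is a small sketch (my own example code, with the loss matrix taken from the cancer/healthy table above): for a given x, evaluate \sum_k L_{kj} p(y = k | x) for every j and pick the minimizing j.

```python
import numpy as np

# L[k, j]: loss of deciding class j when the true class is k
# (the table above, with class 0 = cancer, class 1 = healthy)
L = np.array([[0.0, 1000.0],
              [1.0,    0.0]])

def decide(posterior, L):
    """Pick the class j minimizing the expected loss sum_k L[k, j] * p(y = k | x)."""
    return np.argmin(posterior @ L)   # (posterior @ L)[j] = sum_k posterior[k] * L[k, j]

# Even a 1% cancer probability triggers a "cancer" decision under this L:
print(decide(np.array([0.01, 0.99]), L))   # -> 0 (predict cancer)

# Candidate answer to the slide's question: with the 0/1 loss L = 1 - I, the
# expected loss of j is 1 - p(y = j | x), so argmin equals argmax of the posterior.
L01 = np.ones((2, 2)) - np.eye(2)
assert decide(np.array([0.3, 0.7]), L01) == np.argmax([0.3, 0.7])
```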
ROC Curve

• Assume 2 classes and 1 attribute.
• Plot the class-conditional densities p(x | y).
• Shift the decision boundary from right to left.
• As you move it, the expected loss changes, so you want to find the point where it is minimized.
• If L = [0 1; 1 0], where is the expected loss minimal?
• As you shift the boundary, the true positive rate (TP) and the false positive rate (FP) change.
• By plotting the entire curve you can see the tradeoffs.
• This is easily generalized to more attributes if you can find a decision threshold to vary.

[Figure: class-conditional densities p(x | y = 0) and p(x | y = 1) over x, with the decision boundary moving along x]

Evaluation: ROC curves

[Figure: score distributions for class 1 (positives) and class 0 (negatives), with a moving threshold]

• TP = true positive rate = (# positives classified as positive) / (# positives)
• FP = false positive rate = (# negatives classified as positive) / (# negatives)
• TN = true negative rate = (# negatives classified as negative) / (# negatives)
• FN = false negative rate = (# positives classified as negative) / (# positives)
• Identify a threshold in your classifier that you can shift, and plot the ROC curve while you shift that threshold.
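As a concrete version of the threshold sweep described above, here is a sketch (illustrative; roc_curve and the Gaussian toy data are my own, not from the slides):

```python
import numpy as np

def roc_curve(scores, labels):
    """Sweep a decision threshold over the scores; return (FP rate, TP rate) per setting."""
    thresholds = np.unique(scores)[::-1]            # from strict (right) to lenient (left)
    P, N = np.sum(labels == 1), np.sum(labels == 0)
    fp_rate, tp_rate = [], []
    for t in thresholds:
        pred = scores >= t                          # classify as positive above the threshold
        tp_rate.append(np.sum(pred & (labels == 1)) / P)   # positives classified as positive
        fp_rate.append(np.sum(pred & (labels == 0)) / N)   # negatives classified as positive
    return np.array(fp_rate), np.array(tp_rate)

# Toy data mimicking the two class-conditional densities in the figure:
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0, 1, 500), rng.normal(2, 1, 500)])
labels = np.concatenate([np.zeros(500, dtype=int), np.ones(500, dtype=int)])
fpr, tpr = roc_curve(scores, labels)
```

Plotting tpr against fpr traces the ROC curve; each point corresponds to one position of the moving threshold.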