Naïve Bayes Classifiers
William W. Cohen

Announcement
Please use Piazza for general questions (https://piazza.com/class/hjrdb0ci34531x), or see the class wiki.

Outline
- Review of the review of probabilities
- Bayes rule
- Smoothing probability estimates
- MLE vs. MAP estimates
- Estimating a joint distribution
- How to classify with a joint distribution
- Playing with Matlab
- Naïve Bayes

Review²: Probability theory
- Comes from a small set of axioms.
- Models games of chance, nondeterministic experiments, and/or uncertain beliefs.
- Important concepts for ML: independence, conditional probability.
Example: A = Joe's cheating, B = 4 critical hits.
Bayes rule: $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$
Odds ratio: $\frac{P(A \mid B)}{P(\neg A \mid B)} = \frac{P(B \mid A)\,P(A)/P(B)}{P(B \mid \neg A)\,P(\neg A)/P(B)} = \frac{P(B \mid A)\,P(A)}{P(B \mid \neg A)\,P(\neg A)}$
Slide annotations: uncertain beliefs; conditional probabilities.

Review²
I proved a few things two ways: with pictures and with axioms. Why?
[Figure: Venn diagram of A and B]
Theorem: $P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$
Proof:
- $E_1$ = A and not (A and B)
- $E_2$ = (A and B)
- $E_3$ = B and not (A and B)
- $E_1$ or $E_2$ or $E_3$ = A or B, and $E_1$, $E_2$, $E_3$ are disjoint, so $P(A \text{ or } B) = P(E_1) + P(E_2) + P(E_3)$.
- Further, $P(A) = P(E_1) + P(E_2)$ and $P(B) = P(E_3) + P(E_2)$.
- Therefore $P(A) + P(B) - P(A \text{ and } B) = P(E_1) + P(E_2) + P(E_3) + P(E_2) - P(E_2) = P(A \text{ or } B)$.

Review²
Some notation we will try to use consistently:
- Capital letters ($A$) are random variables.
- Lower-case letters ($a$) are values the corresponding variable can take, e.g. {true, false} or {1, 2, ..., 20} or {aaliyah, aardvark, ..., zynga}.
- Boldface letters are a vector of values, e.g. $\mathbf{A}$, $\mathbf{a}$.
- Many equations with just random variables are really shortcuts: $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$ really means $\forall a, b:\ P(A=a \mid B=b) = \frac{P(B=b \mid A=a)\,P(A=a)}{P(B=b)}$.

SMOOTHING AND BAYES RULE

Some practical problems
I bought a loaded d20 on eBay, but it didn't come with any specs. How can I find out how it behaves?
1. Collect some data (20 rolls).
2. Estimate $\Pr(i) = \frac{C(\text{rolls of } i)}{C(\text{any roll})}$

One solution: the MLE
$P(1) = 0$, $P(2) = 0$, $P(3) = 0$, $P(4) = 0.1$, ..., $P(19) = 0.25$, $P(20) = 0.2$
MLE = maximum likelihood estimate.
But: do I really think it's impossible to roll a 1, 2, or 3?

Another solution: smoothing the observed counts
I bought a loaded d20 on eBay, but it didn't come
with any specs. How can I find out how it behaves?
0. Imagine some data (20 rolls, each i shows up 1x).
1. Collect some data (20 rolls).
2. Estimate $\Pr(i) = \frac{C(\text{rolls of } i)}{C(\text{any roll})}$

Smoothing
$P(1) = 1/40$, $P(2) = 1/40$, $P(3) = 1/40$, $P(4) = (2+1)/40$, ..., $P(19) = (5+1)/40$, $P(20) = (4+1)/40 = 1/8$
$\hat{\Pr}(i) = \frac{C(i) + 1}{C(\text{ANY}) + C(\text{IMAGINED})}$
0.25 vs. 0.125: really different! Maybe I should imagine less data?

Q: What if I used m rolls with a probability of $q = 1/20$ of rolling any i?
$\hat{\Pr}(i) = \frac{C(i) + 1}{C(\text{ANY}) + C(\text{IMAGINED})}$ becomes $\hat{\Pr}(i) = \frac{C(i) + mq}{C(\text{ANY}) + m}$
I can use this formula with $m > 20$, or even with $m < 20$. Say, with $m = 1$: $\hat{\Pr}(i) = \frac{C(i) + q}{C(\text{ANY}) + 1}$

A better solution
Q: What if I used m rolls with a probability of $q = 1/20$ of rolling any i?
$\hat{\Pr}(i) = \frac{C(i) + mq}{C(\text{ANY}) + m}$
- If $m \gg C(\text{ANY})$, then your imagination $q$ rules.
- If $m \ll C(\text{ANY})$, then your data rules.
- BUT you never ever ever end up with $\Pr(i) = 0$.

Terminology
- This is called a uniform Dirichlet prior.
- $C(i)$, $C(\text{ANY})$ are sufficient statistics.
- $\hat{\Pr}(i) = \frac{C(i) + mq}{C(\text{ANY}) + m}$
- MLE = maximum likelihood estimate
- MAP = maximum a posteriori estimate

MAP vs. MLE
- MLE: pick parameters $q$ that maximize $\Pr(D \mid q)$.
- MAP: pick parameters $q$ that maximize the posterior probability of $q$ given $D$, a prior over $q$, and Bayes rule:
$\arg\max_q P(q \mid D) = \arg\max_q \frac{P(D \mid q)\,P(q)}{P(D)}$

Why we call this a MAP
Smoothing is an application of Bayes rule.
Simple case: replace the die with a coin.
- Now there's one parameter: $q = P(\text{Heads})$.
- I start with a prior over $q$, $P(q)$.
- I get some data: $D = \{D_1 = H, D_2 = T, \ldots\}$
- I compute the maximum of the posterior of $q$:
$P(q \mid D) = \frac{P(D \mid q)\,P(q)}{P(D)} = \frac{P(D \mid q)\,P(q)}{\int_0^1 P(D \mid r)\,P(r)\,dr}$

Why we call this a MAP
Simple case: replace the die with a coin.
- Now there's one parameter: $q = P(H)$.
- I start with a prior over $q$, $P(q)$.
- I get some data: $D = \{D_1 = H, D_2 = T, \ldots\}$
- I compute the posterior of $q$.
The math works nicely if the pdf is $P(q) = \frac{q^{\alpha - 1}(1 - q)^{\beta - 1}}{B(\alpha, \beta)}$
- $\alpha$, $\beta$ are imaginary counts of pos/neg examples.
- $B$ is something to make the integral equal 1.
- I can combine these with the observed counts $a$, $b$:
$P(q \mid D) = \frac{q^{a + \alpha - 1}(1 - q)^{b + \beta - 1}}{B(a + \alpha,\ b + \beta)}$

Why we call this a MAP
$\alpha$, $\beta$ are
imaginary pos/neg examples.

Terminology: conjugate priors
- So given $D$ with counts $a$, $b$ we go from prior $f(\alpha, \beta)$ to posterior $f(\alpha + a, \beta + b)$.
- When $P(q)$ and $P(q \mid D)$ are in the same functional form $f$, then $P(q)$ is a conjugate prior.
- The Beta prior is conjugate to the Binomial distribution.
- Conjugate priors are widely used in ML. Usually they look like pseudo-counts.

Why we call this a MAP
The generalization to multinomials is called a Dirichlet distribution.
Parameters are $\alpha_1, \ldots, \alpha_K$:
$f(x_1, \ldots, x_K;\ \alpha_1, \ldots, \alpha_K) = \frac{1}{B(\alpha)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}$

CLASSIFICATION WITH BAYES RULE AND THE JOINT DISTRIBUTION

$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$
- A: Joe's cheating; this patient has cancer; this email is spam; this transaction is fraud
- B: 4 critical hits; measurements/observations of this patient, this email
$P(Y \mid X_1, \ldots, X_n) = \frac{P(X_1, \ldots, X_n \mid Y)\,P(Y)}{P(X_1, \ldots, X_n)} = \frac{P(X_1, \ldots, X_n \mid Y)\,P(Y)}{\text{const}}$

Terminology
- $X_1, \ldots, X_n$ are features; $Y$ is the class label.
- A joint distribution $P(X_1, \ldots, X_n, Y)$ assigns a probability to every combination of values $\langle x_1, \ldots, x_n, y \rangle$ for $X_1, \ldots, X_n, Y$.
- A combination of feature values $\langle x_1, \ldots, x_n \rangle$ is an instance.
- An instance plus a class label $y$ is an example.

Joint distributions are powerful
I have 1 standard d6 die and 2 loaded d6 dice, one loaded hi, one loaded …
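
Looking back at the loaded-d20 slides, the smoothed estimate $\hat{\Pr}(i) = \frac{C(i) + mq}{C(\text{ANY}) + m}$ can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `m_estimate` and the dictionary `observed` are hypothetical, and the counts are the ones implied by the MLE values quoted on the slides (2 rolls of 4, 5 rolls of 19, 4 rolls of 20, none of 1, out of 20 rolls). Exact fractions are used so the results can be compared directly with the slide values.

```python
from fractions import Fraction


def m_estimate(count_i, count_any, m, q):
    """Smoothed estimate (C(i) + m*q) / (C(ANY) + m).

    m is the number of imagined rolls and q the imagined probability
    of each face; m = 0 reduces to the plain MLE C(i) / C(ANY).
    """
    if count_any + m == 0:
        raise ValueError("need at least one real or imagined roll")
    return (count_i + m * q) / (count_any + m)


Q = Fraction(1, 20)   # uniform imagined probability for a d20
C_ANY = 20            # number of real rolls collected

# Hypothetical container; counts are those implied by the slide's MLE values.
observed = {1: 0, 4: 2, 19: 5, 20: 4}

for face, count in observed.items():
    mle = m_estimate(count, C_ANY, 0, Q)        # no imagined data
    heavy = m_estimate(count, C_ANY, 20, Q)     # 20 imagined rolls
    light = m_estimate(count, C_ANY, 1, Q)      # 1 imagined roll
    print(f"face {face}: MLE={mle}, m=20 -> {heavy}, m=1 -> {light}")
```

With `m = 20` the estimate for face 20 is (4 + 1)/40 = 1/8, matching the smoothing slide, and with any `m > 0` a face that was never rolled still gets a nonzero probability.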