CHAPTER 3
GENERATIVE AND DISCRIMINATIVE CLASSIFIERS: NAIVE BAYES AND LOGISTIC REGRESSION

Machine Learning
Copyright © 2015. Tom M. Mitchell. All rights reserved.
DRAFT OF January 31, 2015
PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR'S PERMISSION

This is a rough draft chapter intended for inclusion in the upcoming second edition of the textbook Machine Learning, T. M. Mitchell, McGraw Hill. You are welcome to use this for educational purposes, but do not duplicate or repost it on the internet. For online copies of this and other materials related to this book, visit the web site www.cs.cmu.edu/~tom/mlbook.html. Please send suggestions for improvements, or suggested exercises, to Tom.Mitchell@cmu.edu.

1 Learning Classifiers based on Bayes Rule

Here we consider the relationship between supervised learning, or function approximation problems, and Bayesian reasoning. We begin by considering how to design learning algorithms based on Bayes rule.

Consider a supervised learning problem in which we wish to approximate an unknown target function $f : X \to Y$, or equivalently $P(Y \mid X)$. To begin, we will assume $Y$ is a boolean-valued random variable and $X$ is a vector containing $n$ boolean attributes. In other words, $X = \langle X_1, X_2, \ldots, X_n \rangle$, where $X_i$ is the boolean random variable denoting the $i$th attribute of $X$.

Applying Bayes rule, we see that $P(Y = y_i \mid X)$ can be represented as

$$P(Y = y_i \mid X = x_k) = \frac{P(X = x_k \mid Y = y_i)\,P(Y = y_i)}{\sum_j P(X = x_k \mid Y = y_j)\,P(Y = y_j)} \qquad (1)$$

where $y_m$ denotes the $m$th possible value for $Y$, $x_k$ denotes the $k$th possible vector value for $X$, and where the summation in the denominator is over all legal values of the random variable $Y$.

One way to learn $P(Y \mid X)$ is to use the training data to estimate $P(X \mid Y)$ and $P(Y)$. We can then use these estimates, together with Bayes rule above, to determine $P(Y \mid X = x_k)$ for any new instance $x_k$.
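To make Equation (1) concrete, here is a minimal Python sketch (not part of the chapter) that applies Bayes rule to hypothetical estimates of $P(X = x_k \mid Y)$ and $P(Y)$ for a boolean $Y$; all numbers are invented purely for illustration.

```python
# Hypothetical estimates for a boolean label Y (values 0 and 1) and one
# particular instance x_k; the numbers are invented for illustration only.
p_y = {0: 0.7, 1: 0.3}            # estimates of P(Y = y)
p_x_given_y = {0: 0.02, 1: 0.10}  # estimates of P(X = x_k | Y = y)

# Bayes rule (Equation 1): the posterior is proportional to likelihood * prior,
# normalized by summing over all legal values of Y.
unnormalized = {y: p_x_given_y[y] * p_y[y] for y in p_y}
evidence = sum(unnormalized.values())
posterior = {y: unnormalized[y] / evidence for y in p_y}

print(posterior)  # here P(Y = 1 | X = x_k) comes out to roughly 0.682
```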
A NOTE ON NOTATION: We will consistently use upper case symbols (e.g., $X$) to refer to random variables, including both vector and non-vector variables. If $X$ is a vector, then we use subscripts (e.g., $X_i$) to refer to each random variable, or feature, in $X$. We use lower case symbols to refer to values of random variables (e.g., $X_i = x_{ij}$ may refer to random variable $X_i$ taking on its $j$th possible value). We will sometimes abbreviate by omitting variable names, for example abbreviating $P(X_i = x_{ij} \mid Y = y_k)$ to $P(x_{ij} \mid y_k)$. We will write $E[X]$ to refer to the expected value of $X$. We use superscripts to index training examples (e.g., $X_i^j$ refers to the value of the random variable $X_i$ in the $j$th training example). We use $\delta(x)$ to denote an indicator function whose value is 1 if its logical argument $x$ is true, and whose value is 0 otherwise. We use the $\#D\{x\}$ operator to denote the number of elements in the set $D$ that satisfy property $x$. We use a "hat" to indicate estimates; for example, $\hat{\theta}$ indicates an estimated value of $\theta$.

1.1 Unbiased Learning of Bayes Classifiers is Impractical

If we are going to train a Bayes classifier by estimating $P(X \mid Y)$ and $P(Y)$, then it is reasonable to ask how much training data will be required to obtain reliable estimates of these distributions. Let us assume training examples are generated by drawing instances at random from an unknown underlying distribution $P(X)$, then allowing a teacher to label this example with its $Y$ value. A hundred independently drawn training examples will usually suffice to obtain a maximum likelihood estimate of $P(Y)$ that is within a few percent of its correct value[1], when $Y$ is a boolean variable. However, accurately estimating $P(X \mid Y)$ typically requires many more examples. To see why, consider the number of parameters we must estimate when $Y$ is boolean and $X$ is a vector of $n$ boolean attributes. In this case, we need to estimate a set of parameters

$$\theta_{ij} \equiv P(X = x_i \mid Y = y_j)$$

where the index $i$ takes on $2^n$ possible values (one for each of the possible vector values of $X$), and $j$ takes on 2 possible values. Therefore, we will need to estimate approximately $2^{n+1}$ parameters. To calculate the exact number of required parameters, note that for any fixed $j$, the sum over $i$ of $\theta_{ij}$ must be one. Therefore, for any particular value $y_j$, and the $2^n$ possible values of $x_i$, we need compute only $2^n - 1$ independent parameters. Given the two possible values for $Y$, we must estimate a total of $2(2^n - 1)$ such $\theta_{ij}$ parameters. Unfortunately, this corresponds to two distinct parameters for each of the distinct instances in the instance space for $X$. Worse yet, to obtain reliable estimates of each of these parameters we will need to observe each of these distinct instances multiple times. This is clearly unrealistic in most practical learning domains. For example, if $X$ is a vector containing 30 boolean features, then we will need to estimate more than 2 billion parameters.

[1] Why? See Chapter 5 of edition 1 of Machine Learning.

2 Naive Bayes Algorithm

Given the intractable sample complexity for learning Bayesian classifiers, we must look for ways to reduce this complexity. The Naive Bayes classifier does this by making a conditional independence assumption that dramatically reduces the number of parameters to be estimated when modeling $P(X \mid Y)$, from our original $2(2^n - 1)$ to just $2n$.
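As a quick arithmetic check on that reduction, the following small Python sketch (not from the chapter; the function names are my own) compares the number of independent parameters for the unrestricted model of $P(X \mid Y)$ with the Naive Bayes count, for boolean $Y$ and $n$ boolean attributes.

```python
def full_joint_params(n: int) -> int:
    """Independent parameters for an unrestricted P(X | Y):
    2^n - 1 per class, for each of the 2 classes."""
    return 2 * (2 ** n - 1)

def naive_bayes_params(n: int) -> int:
    """Independent parameters under the Naive Bayes assumption:
    one Bernoulli parameter P(X_i = 1 | Y = y) per attribute per class."""
    return 2 * n

for n in (2, 10, 30):
    print(n, full_joint_params(n), naive_bayes_params(n))
# For n = 30: 2,147,483,646 parameters for the full model versus 60 under Naive Bayes.
```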
2.1 Conditional Independence

Definition: Given random variables $X$, $Y$ and $Z$, we say $X$ is conditionally independent of $Y$ given $Z$, if and only if the probability distribution governing $X$ is independent of the value of $Y$, given $Z$; that is,

$$(\forall i, j, k)\quad P(X = x_i \mid Y = y_j, Z = z_k) = P(X = x_i \mid Z = z_k)$$

As an example, consider three boolean random variables to describe the current weather: Rain, Thunder and Lightning. We might reasonably assert that Thunder is independent of Rain given Lightning. Because we know Lightning causes Thunder, once we know whether or not there is Lightning, no additional information about Thunder is provided by the value of Rain. Of course there is a clear dependence of Thunder on Rain in general, but there is no conditional dependence once we know the value of Lightning.

2.2 Derivation of Naive Bayes Algorithm

The Naive Bayes algorithm is a classification algorithm based on Bayes rule, that assumes the attributes $X_1 \ldots X_n$ are all conditionally independent of one another, given $Y$. The value of this assumption is that it dramatically simplifies the representation of $P(X \mid Y)$, and the problem of estimating it from the training data. Consider, for example, the case where $X = \langle X_1, X_2 \rangle$. In this case

$$\begin{aligned}
P(X \mid Y) &= P(X_1, X_2 \mid Y) \\
&= P(X_1 \mid X_2, Y)\,P(X_2 \mid Y) \\
&= P(X_1 \mid Y)\,P(X_2 \mid Y)
\end{aligned}$$

where the second line follows from …
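To show how this factorization is typically put to work, here is a minimal, self-contained Python sketch of a Naive Bayes classifier over boolean attributes: it estimates $P(Y)$ and each $P(X_i \mid Y)$ by relative-frequency counting, then classifies a new instance with the factored product. This is an illustrative sketch under the chapter's assumptions (boolean $Y$, boolean attributes, maximum likelihood estimates without smoothing), not the chapter's own algorithm statement; the toy data and function names are invented.

```python
from collections import defaultdict

def train_naive_bayes(examples):
    """Estimate P(Y = y) and P(X_i = 1 | Y = y) by relative-frequency counting.
    `examples` is a list of (x, y) pairs, where x is a tuple of 0/1 attribute
    values and y is a 0/1 class label."""
    n = len(examples[0][0])
    class_counts = defaultdict(int)
    feature_counts = defaultdict(lambda: [0] * n)  # counts of X_i = 1, per class
    for x, y in examples:
        class_counts[y] += 1
        for i, xi in enumerate(x):
            feature_counts[y][i] += xi
    total = len(examples)
    p_y = {y: class_counts[y] / total for y in class_counts}
    p_xi_given_y = {y: [feature_counts[y][i] / class_counts[y] for i in range(n)]
                    for y in class_counts}
    return p_y, p_xi_given_y

def predict(x, p_y, p_xi_given_y):
    """Return the class y maximizing P(Y = y) * prod_i P(X_i = x_i | Y = y),
    i.e. the decision rule implied by the factorization above."""
    scores = {}
    for y in p_y:
        score = p_y[y]
        for i, xi in enumerate(x):
            p1 = p_xi_given_y[y][i]
            score *= p1 if xi == 1 else (1.0 - p1)
        scores[y] = score
    return max(scores, key=scores.get)

# Invented toy data: x = (X_1, X_2), y = label.
data = [((1, 1), 1), ((1, 0), 1), ((0, 1), 0), ((0, 0), 0), ((1, 1), 1), ((0, 0), 0)]
p_y, p_xi_given_y = train_naive_bayes(data)
print(predict((1, 0), p_y, p_xi_given_y))  # most probable label for a new instance
```

Because these are plain relative-frequency estimates, an attribute value never observed with some class drives that class's score to zero; smoothed estimates are a common remedy in practice.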