UT Dallas CS 6375 - 06 - NBayesLogReg

CHAPTER 3
GENERATIVE AND DISCRIMINATIVE CLASSIFIERS: NAIVE BAYES AND LOGISTIC REGRESSION

Machine Learning
Copyright © 2015. Tom M. Mitchell. All rights reserved.
*DRAFT OF January 31, 2015*
*PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR'S PERMISSION*

This is a rough draft chapter intended for inclusion in the upcoming second edition of the textbook Machine Learning, T.M. Mitchell, McGraw Hill. You are welcome to use this for educational purposes, but do not duplicate or repost it on the internet. For online copies of this and other materials related to this book, visit the web site www.cs.cmu.edu/~tom/mlbook.html. Please send suggestions for improvements, or suggested exercises, to [email protected].

1 Learning Classifiers based on Bayes Rule

Here we consider the relationship between supervised learning, or function approximation problems, and Bayesian reasoning. We begin by considering how to design learning algorithms based on Bayes rule.

Consider a supervised learning problem in which we wish to approximate an unknown target function f : X → Y, or equivalently P(Y|X). To begin, we will assume Y is a boolean-valued random variable and X is a vector containing n boolean attributes. In other words, X = ⟨X_1, X_2, ..., X_n⟩, where X_i is the boolean random variable denoting the ith attribute of X.

Applying Bayes rule, we see that P(Y = y_i | X = x_k) can be represented as

    P(Y = y_i \mid X = x_k) = \frac{P(X = x_k \mid Y = y_i)\, P(Y = y_i)}{\sum_j P(X = x_k \mid Y = y_j)\, P(Y = y_j)}

where y_i denotes the ith possible value for Y, x_k denotes the kth possible vector value for X, and the summation in the denominator is over all legal values of the random variable Y.

One way to learn P(Y|X) is to use the training data to estimate P(X|Y) and P(Y). We can then use these estimates, together with Bayes rule above, to determine P(Y | X = x_k) for any new instance x_k.

A NOTE ON NOTATION: We will consistently use upper case symbols (e.g., X) to refer to random variables, including both vector and non-vector variables. If X is a vector, then we use subscripts (e.g., X_i) to refer to each random variable, or feature, in X. We use lower case symbols to refer to values of random variables (e.g., X_i = x_ij may refer to random variable X_i taking on its jth possible value). We will sometimes abbreviate by omitting variable names, for example abbreviating P(X_i = x_ij | Y = y_k) to P(x_ij | y_k). We will write E[X] to refer to the expected value of X. We use superscripts to index training examples (e.g., X_i^j refers to the value of the random variable X_i in the jth training example). We use δ(x) to denote an "indicator" function whose value is 1 if its logical argument x is true, and whose value is 0 otherwise. We use the #D{x} operator to denote the number of elements in the set D that satisfy property x. We use a "hat" to indicate estimates; for example, θ̂ indicates an estimated value of θ.
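To make this concrete, here is a minimal sketch (in Python, not from the chapter) of the brute-force approach just described: it estimates P(Y) and P(X = x | Y = y) by direct counting over entire attribute vectors, then classifies a new instance with Bayes rule. The data layout, function names, and use of raw maximum likelihood counts are illustrative assumptions.

    from collections import Counter, defaultdict

    def train_brute_force_bayes(examples):
        # examples: list of (x, y) pairs; x is a tuple of n boolean attribute
        # values, y is a boolean label (this layout is an illustrative assumption).
        label_counts = Counter(y for _, y in examples)      # #D{Y = y}
        joint_counts = defaultdict(Counter)                  # joint_counts[y][x] = #D{X = x, Y = y}
        for x, y in examples:
            joint_counts[y][x] += 1

        total = len(examples)
        prior = {y: c / total for y, c in label_counts.items()}               # estimate of P(Y = y)
        likelihood = {y: {x: c / label_counts[y] for x, c in counts.items()}  # estimate of P(X = x | Y = y)
                      for y, counts in joint_counts.items()}
        return prior, likelihood

    def classify_brute_force(x, prior, likelihood):
        # Bayes rule: argmax_y P(X = x | Y = y) P(Y = y); the denominator is
        # the same for every y, so it can be dropped. Attribute vectors never
        # seen with class y during training get probability zero.
        return max(prior, key=lambda y: likelihood[y].get(x, 0.0) * prior[y])

Note that the likelihood table needs a separate entry, and enough data to estimate it, for every distinct attribute vector under every class; any vector not seen during training receives probability zero. This is precisely the sample-complexity problem examined next.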
1.1 Unbiased Learning of Bayes Classifiers is Impractical

If we are going to train a Bayes classifier by estimating P(X|Y) and P(Y), then it is reasonable to ask how much training data will be required to obtain reliable estimates of these distributions. Let us assume training examples are generated by drawing instances at random from an unknown underlying distribution P(X), then allowing a teacher to label each example with its Y value.

A hundred independently drawn training examples will usually suffice to obtain a maximum likelihood estimate of P(Y) that is within a few percent of its correct value when Y is a boolean variable (why? see Chapter 5 of edition 1 of Machine Learning). However, accurately estimating P(X|Y) typically requires many more examples. To see why, consider the number of parameters we must estimate when Y is boolean and X is a vector of n boolean attributes. In this case, we need to estimate a set of parameters

    \theta_{ij} \equiv P(X = x_i \mid Y = y_j)

where the index i takes on 2^n possible values (one for each of the possible vector values of X), and j takes on 2 possible values. Therefore, we will need to estimate approximately 2^{n+1} parameters. To calculate the exact number of required parameters, note that for any fixed j, the sum over i of θ_ij must be one. Therefore, for any particular value y_j, and the 2^n possible values of x_i, we need compute only 2^n − 1 independent parameters. Given the two possible values for Y, we must estimate a total of 2(2^n − 1) such θ_ij parameters. Unfortunately, this corresponds to two distinct parameters for each of the distinct instances in the instance space for X. Worse yet, to obtain reliable estimates of each of these parameters, we will need to observe each of these distinct instances multiple times! This is clearly unrealistic in most practical learning domains. For example, if X is a vector containing 30 boolean features, then we will need to estimate more than two billion parameters: 2(2^30 − 1) ≈ 2.1 × 10^9.

2 Naive Bayes Algorithm

Given the intractable sample complexity for learning Bayesian classifiers, we must look for ways to reduce this complexity. The Naive Bayes classifier does this by making a conditional independence assumption that dramatically reduces the number of parameters to be estimated when modeling P(X|Y), from our original 2(2^n − 1) down to just 2n.

2.1 Conditional Independence

Definition: Given random variables X, Y and Z, we say X is conditionally independent of Y given Z if and only if the probability distribution governing X is independent of the value of Y given Z; that is,

    (\forall i, j, k) \; P(X = x_i \mid Y = y_j, Z = z_k) = P(X = x_i \mid Z = z_k)

As an example, consider three boolean random variables describing the current weather: Rain, Thunder and Lightning. We might reasonably assert that Thunder is independent of Rain given Lightning: because we know Lightning causes Thunder, once we know whether or not there is Lightning, no additional information about Thunder is provided by the value of Rain. Of course there is a clear dependence of Thunder on Rain in general, but there is no conditional dependence once we know the value of Lightning.

2.2 Derivation of Naive Bayes Algorithm

The Naive Bayes algorithm is a classification algorithm based on Bayes rule that assumes the attributes X_1 ... X_n are all conditionally independent of one another, given Y. The value of this assumption is that it dramatically simplifies the representation of P(X|Y), and the problem of estimating it from the training data. Consider, for example, the case where X = ⟨X_1, X_2⟩. In this case

    P(X \mid Y) = P(X_1, X_2 \mid Y)
                = P(X_1 \mid X_2, Y)\, P(X_2 \mid Y)
                = P(X_1 \mid Y)\, P(X_2 \mid Y)

where the second line follows from the chain rule of probability, and the third from the conditional independence of X_1 and X_2 given Y.
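To see what this assumption buys us computationally, here is a minimal sketch (again Python, illustrative rather than the chapter's own code) of a Naive Bayes learner for boolean attributes: it estimates P(Y = y) and the per-attribute parameters P(X_i = 1 | Y = y) by counting, then classifies by maximizing P(Y = y) Π_i P(X_i = x_i | Y = y). Plain maximum likelihood estimates are used, with no smoothing.

    from collections import Counter

    def train_naive_bayes(examples):
        # examples: list of (x, y) pairs; x is a tuple of n boolean attribute
        # values, y is a boolean label (this layout is an illustrative assumption).
        n = len(examples[0][0])
        label_counts = Counter(y for _, y in examples)     # #D{Y = y}
        ones = {y: [0] * n for y in label_counts}          # ones[y][i] = #D{X_i = 1, Y = y}
        for x, y in examples:
            for i, xi in enumerate(x):
                ones[y][i] += int(xi)

        total = len(examples)
        prior = {y: c / total for y, c in label_counts.items()}       # estimate of P(Y = y)
        theta = {y: [ones[y][i] / label_counts[y] for i in range(n)]  # estimate of P(X_i = 1 | Y = y)
                 for y in label_counts}
        return prior, theta

    def classify_nb(x, prior, theta):
        # argmax_y P(Y = y) * prod_i P(X_i = x_i | Y = y), using the
        # factorization licensed by the conditional independence assumption.
        def score(y):
            p = prior[y]
            for i, xi in enumerate(x):
                p *= theta[y][i] if xi else 1.0 - theta[y][i]
            return p
        return max(prior, key=score)

For n boolean attributes this model stores only 2n conditional-probability parameters plus the prior, in contrast to the 2(2^n − 1) parameters required without the independence assumption.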

