Chapter 3 HYPOTHESIS TESTING The purpose of pattern recognition is to determine to which category or class a given sample belongs. Through an observation or measurement pro- cess, we obtain a set of numbers which make up the observation vector. The observation vector serves as the input to a decision rule by which we assign the sample to one of the given classes. Let us assume that the observation vector is a random vector whose conditional density function depends on its class. If the conditional density function for each class is known, then the pattern recog- nition problem becomes a problem in statistical hypothesis testing. 3.1 Hypothesis Tests for Two Classes In this section, we discuss two-class problems, which arise because each sample belongs to one of two classes, o1 or 02. The conditional density func- tions and the a priori probabilities are assumed to be known. The Bayes Decision Rule for Minimum Error Bayes test: Let X be an observation vector, and let it be our purpose to determine whether X belongs to o1 or 02. A decision rule based simply on probabilities may be written as follows: 5152 Introduction to Statistical Pattern Recognition where qi(X) is a posteriori probability of 0; given X. Equation (3.1) indicates that, if the probability of o1 given X is larger than the probability of 02, X is classified to o1 , and vice versa. The a posteriori probability q;(X) may be cal- culated from the a priori probability Pi and the conditional density function pi(X), using Bayes theorem, as (3.2) where p (X) is the mixture density function. Since p (X) is positive and com- mon to both sides of the inequality, the decision rule of (3.1) can be expressed as or (3.3) (3.4) The term [(X) is called the likelihood ratio and is the basic quantity in hypothesis testing. We call P21P the threshold value of the likelihood ratio for the decision. Sometimes it is more convenient to write the minus-log likeli- hood ratio rather than writing the likelihood ratio itself. In that case, the deci- sion rule of (3.4) becomes P, 02 p2 h(X)=-lnt(X)=-InpI(X)+lnp2(X) 3 In -. (3.5) The direction of the inequality is reversed because we have used the negative logarithm. The term h (X) is called the discriminant function. Throughout this book, we assume P I = P 2, and set the threshold In P IIP = 0 for simplicity, unless otherwise stated. Equation (3.1), (3.4), or (3.5) is called the Bayes test for minimum error. Bayes error: In general, the decision rule of (3.3, or any other decision rule, does not lead to perfect classification. In order to evaluate the perfor- mance of a decision rule, we must calculate the probability of error, that is, the probability that a sample is assigned to the wrong class. The conditional error given X, r(X), due to the decision rule of (3.1) is either 9 I (X) or q*(X) whichever smaller. That is,3 Hypothesis Testing 53 r-(X) = mink1(X),q2(X)I . (3.4) The total error, which is called the Bayes error, is computed by E { r(X)]. where Equation (3.7) shows several ways to express the Bayes error, E. is the definition of E. The second line is obtained by inserting The first line (3.6) into the first line and applying the Bayes theorem of (3.2). The integral regions L and L2 of the third line are the regions where X is classified to o1 and o2 by this decision rule, and they are called the ol- and o;?-regions. In LI, P IpI (X) > P 2p2(X), and therefore r (X) = P2p2(X)/p (X). Likewise, r-(X) = P Ip I (X)/p (X) in L2 because P lp I (X) < P g2(X) in L2. In (3.8), we distinguish two types of errors: one results from misclassifying samples from w1 and the other results from misclassifying samples from 02. The total error is a weighted sum of these errors. Figure 3-1 shows an example of this decision rule for a simple one- dimensional case. The decision boundary is set at x=r where P lp I (x) = P 2p2(x), and s < r and x > t are designated to L I and L2 respec- tively. The resulting errors are P = R + C, P 2~2 = A, and E = A + B + C, where A, B, and C indicate the areas, for example, B = I' P Ip (8) dx. This decision rule gives the smallest probability of error. This may be demonstrated easily from the one-dimensional example of Fig. 3- 1. Suppose that the boundary is moved from r to t', setting up the new wI - and o2-regions as L; and L;. Then, the resulting errors are P ]E; = C, P 2~i = A + B + D, and 6 =A + B + C + D, which is larger than E by D. The same is true when the54 Introduction to Statistical Pattern Recognition I *L2 I b* +--1-->c2 Fig. 3-1 Bayes decision rule for minimum error. boundary is shifted to the left. This argument can be extended to a general n- dimensional case. The computation of the Bayes error is a very complex problem except in some special cases. This is due to the fact that E is obtained by integrating high-dimensional density functions in complex regions as seen in (3.8). There- fore, it is sometimes more convenient to integrate the density function of h = h (X) of (3.5), which is one-dimensional: (3.9) (3.10) where ph(h I mi) is the conditional density of h for mi. However, in general, the density function of h is not available, and very difficult to compute. Example 1: When the pi(X)'s are normal with expected vectors Mi and covariance matrices C;, the decision rule of (3.5) becomes h (X) = - In 1(X) (3.1 1) Equation (3.1 1) shows that the decision boundary is given by a quadratic form in X. When CI = C2 = C, the boundary becomes a linear function of X as3 Hypothesis Testing 55 x2 x2 / Fig. 3-2 Decision boundaries for normal distributions: (a) XI f &; (b) XI = &. Figure 3-2 shows two-dimensional examples for XI zC2 and C I =&. Example 2: Let us study a special case of (3.1 1) where Mi =O and Xi = 1 p; ... py-' Pi 1 ' Pi py-' . . . Pi 1 (3.12) (3.13) This type of covariance matrix is often seen, for example, when stationary run- dom processes are time-sampled to form random vectors. The explicit expres- sions for Z;' and I Zi I are known for this covariance matrix as56 Introduction to Statistical Pattern Recognition -p; 0 . . . 1+p: -p; .. , 0 1 2 l+Pi -pi IZ; I = (1 - py . (3.14) (3.15) Therefore, the quadratic equation of (3.11) becomes 1-p: P1 l-p2 o? p2 - ~xixi+l + (n-1) In 7 >< In - , (3.16) where the second term shows the edge effect of terminating the observation of random processes within a finite length, and this effect
View Full Document