Chapter 6

NONPARAMETRIC DENSITY ESTIMATION

So far we have been discussing the estimation of parameters. Thus, if we can assume that we have a density function characterized by a set of parameters, we can design a classifier using estimates of the parameters. Unfortunately, we often cannot assume a parametric form for the density function, and in order to apply the likelihood ratio test we somehow have to estimate the density functions using an unstructured approach. This type of approach is called nonparametric estimation, while the former is called parametric estimation. Since, in nonparametric approaches, the density function is estimated locally by a small number of neighboring samples, the estimate is far less reliable, with larger bias and variance, than its parametric counterpart.

There are two kinds of nonparametric estimation techniques available: one is called the Parzen density estimate and the other the k-nearest neighbor density estimate. They are fundamentally very similar, but exhibit somewhat different statistical properties. Both are discussed in this chapter.

It is extremely difficult to obtain an accurate density estimate nonparametrically, particularly in high-dimensional spaces. However, our goal here is not to get an accurate estimate. Our goal is, by using these estimates, to design a classifier and evaluate its performance. For this reason, the accuracy of the estimate is not necessarily a crucial issue. Classification and performance evaluation will be discussed in Chapter 7. The intention of this chapter is to make the reader familiar with the fundamental mathematical properties related to nonparametric density estimation, in preparation for the material presented in Chapter 7.

6.1 Parzen Density Estimate

In order to estimate the value of a density function at a point X, we may set up a small local region L(X) around X.
Then the probability coverage (or probability mass) of L(X) may be approximated by p(X)v, where v is the volume of L(X). This probability may be estimated by drawing a large number of samples, N, from p(X), counting the number of samples, k, falling in L(X), and computing k/N. Equating these two probabilities, we obtain an estimate of the density function:

\hat{p}(X)\,v = \frac{k(X)}{N} \qquad \text{or} \qquad \hat{p}(X) = \frac{k(X)}{Nv} .    (6.1)

Note that, with a fixed v, k is a random variable and is dependent on X. A fixed v does not imply the same v throughout the entire space, and v could still vary with X. However, v is a preset value and is not a random variable.

Kernel expression: The estimate of (6.1) has another interpretation. Suppose that three samples, X_3, X_4, and X_5, are found in L(X) as shown in Fig. 6-1. With v and N given, \hat{p}(X) becomes 3/Nv. On the other hand, if we set up a uniform kernel function, \kappa(\cdot), with volume v and height 1/v around every existing sample, the average of the values of these kernel functions at X is also 3/Nv. That is [1-4],

\hat{p}(X) = \frac{1}{N} \sum_{i=1}^{N} \kappa(X - X_i) .    (6.2)

As seen in Fig. 6-1, only the kernel functions around the three samples X_3, X_4, and X_5 contribute to the summation of (6.2).

[Fig. 6-1: Parzen kernel density estimate.]

Once (6.2) is adopted, the shape of the kernel function can be selected more freely, under the condition \int \kappa(X)\,dX = 1. For one-dimensional cases, we may seek optimality and select a complex shape. However, in a high-dimensional space, because of its complexity, the practical selection of the kernel function is limited to either a normal or a uniform kernel. In this book, we will use a kernel which includes both normal and uniform kernels as special cases,

\kappa(X) = c_0 \exp\{ -[\,c_1\, X^T (r^2 A)^{-1} X\,]^m \} ,    (6.3)

where the normalizing coefficients c_0 and c_1 involve the gamma function \Gamma(\cdot), and m is a parameter determining the shape of the kernel. It may be verified that, for any value of m, the covariance matrix of the kernel density (6.3) is r^2 A. The parameter m determines the rate at which the kernel function drops off. For m = 1, (6.3) reduces to a simple normal kernel. As m becomes large, (6.3) approaches a uniform (hyperelliptical) kernel, always with a smooth roll-off. The matrix A determines the shape of the hyperellipsoid, and r controls the size or volume of the kernel. The other coefficients are selected to satisfy the two conditions mentioned previously:

\int \kappa(X)\,dX = 1 \qquad \text{and} \qquad \Sigma_\kappa = r^2 A ,

where \Sigma_\kappa is the covariance matrix of \kappa(X).

Convolution expression: Equation (6.2) can be rewritten in convolution form as

\hat{p}(X) = p_s(X) * \kappa(X) ,    (6.4)

where p_s is an impulsive density function with impulses at the locations of the N existing samples,

p_s(X) = \frac{1}{N} \sum_{i=1}^{N} \delta(X - X_i) .    (6.5)

That is, the estimated density \hat{p}(X) is obtained by feeding p_s(X) through a linear (noncausal) filter whose impulse response is given by \kappa(X). Therefore, \hat{p}(X) is a smoothed version of p_s(X).

Moments of \hat{p}(X): The first and second order moments of (6.4) can be easily computed. First, let us compute the expected value of p_s(X):

E\{p_s(X)\} = \frac{1}{N} \sum_{i=1}^{N} \int \delta(X - Z)\,p(Z)\,dZ = \frac{1}{N} \sum_{i=1}^{N} p(X) = p(X) .    (6.6)

That is, p_s(X) is an unbiased estimate of p(X). Then the expected value of \hat{p}(X) of (6.4) may be computed as

E\{\hat{p}(X)\} = E\{p_s(X)\} * \kappa(X) = p(X) * \kappa(X) .    (6.7)

Also,

E\{\hat{p}^2(X)\} = \frac{1}{N}\, p(X) * \kappa^2(X) + \left(1 - \frac{1}{N}\right) [\,p(X) * \kappa(X)\,]^2 .    (6.8)

Therefore, the variance of \hat{p}(X) is

Var\{\hat{p}(X)\} = \frac{1}{N} \left\{ p(X) * \kappa^2(X) - [\,p(X) * \kappa(X)\,]^2 \right\} .    (6.9)

Approximations of moments: In order to approximate the moments of \hat{p}(X), let us expand p(Y) around X by a Taylor series up to the second order terms:

p(Y) \cong p(X) + \nabla p^T(X)(Y - X) + \frac{1}{2} \operatorname{tr}\{\nabla^2 p(X)(Y - X)(Y - X)^T\} .    (6.10)

Then p(X) * \kappa(X) may be approximated by

p(X) * \kappa(X) = \int p(Y)\,\kappa(Y - X)\,dY \cong p(X) \int \kappa(Y - X)\,dY + \frac{1}{2} \operatorname{tr}\left\{ \nabla^2 p(X) \int (Y - X)(Y - X)^T \kappa(Y - X)\,dY \right\} ,    (6.11)

where the first order term disappears because \kappa(\cdot) is a symmetric function. Since \int \kappa(Y - X)\,dY = 1 and \int (Y - X)(Y - X)^T \kappa(Y - X)\,dY = r^2 A for \kappa(\cdot) of (6.3), (6.11) can be expressed as

p(X) * \kappa(X) \cong p(X) + \frac{r^2}{2}\,\alpha(X) ,    (6.12)

where

\alpha(X) = \operatorname{tr}\{\nabla^2 p(X)\, A\} .    (6.13)

Similarly,

p(X) * \kappa^2(X) \cong p(X) \int \kappa^2(Y - X)\,dY + \frac{1}{2} \operatorname{tr}\left\{ \nabla^2 p(X) \int (Y - X)(Y - X)^T \kappa^2(Y - X)\,dY \right\} .    (6.14)

Although \kappa(\cdot) is a density function, \kappa^2(\cdot) is not. Therefore, \int \kappa^2(Y)\,dY has a value not equal to 1.
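To make the second-order approximation concrete, the following sketch (a hypothetical one-dimensional example, not from the text) compares the exactly smoothed density p * \kappa with the approximation p(X) + (r^2/2) tr{\nabla^2 p(X) A} of (6.12). It takes p = N(0,1), a normal kernel (m = 1) with A = 1, and r = 0.3; for these choices p * \kappa is exactly the N(0, 1 + r^2) density, and p''(x) = (x^2 - 1) p(x).

```python
import numpy as np

def normal_pdf(x, var=1.0):
    """Zero-mean univariate normal density."""
    return np.exp(-x**2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

r = 0.3                         # kernel size parameter; A = 1 in one dimension
x = np.linspace(-2.0, 2.0, 9)   # points at which to compare the two sides

# Exact smoothed density: N(0,1) convolved with N(0, r^2) is N(0, 1 + r^2).
exact = normal_pdf(x, var=1.0 + r**2)

# Second-order approximation (6.12): p + (r^2/2) p'',
# where p''(x) = (x^2 - 1) p(x) for the standard normal density.
p = normal_pdf(x)
approx = p + 0.5 * r**2 * (x**2 - 1.0) * p

print(np.max(np.abs(exact - approx)))  # much smaller than max|exact - p|
```

For small r the residual is of order r^4 (the next term of the Taylor series), so the approximation error is far smaller than the smoothing bias itself.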
Let

w = \int \kappa^2(Y)\,dY .    (6.15)

Then \kappa^2(\cdot)/w becomes a density function. Therefore, (6.14) becomes

p(X) * \kappa^2(X) \cong w \left[ p(X) + \frac{r^2}{2}\,\beta(X) \right] ,    (6.16)

where

\beta(X) = \operatorname{tr}\{\nabla^2 p(X)\, B\}    (6.17)

and r^2 B is the covariance matrix of \kappa^2(X)/w. Substituting (6.12) and (6.16) into (6.7) and (6.9), the moments of \hat{p}(X) are

E\{\hat{p}(X)\} \cong p(X) + \frac{r^2}{2}\,\alpha(X) ,    (6.18)

Var\{\hat{p}(X)\} \cong \frac{1}{N} \left\{ w \left[ p(X) + \frac{r^2}{2}\,\beta(X) \right] - \left[ p(X) + \frac{r^2}{2}\,\alpha(X) \right]^2 \right\} .    (6.19)
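The moment formulas can also be checked by simulation. The sketch below (an illustrative setup, not from the text; the density N(0,1), the evaluation point X = 0, r = 0.3, N = 1000, and the trial count are all arbitrary choices) implements the Parzen estimate (6.2) with a one-dimensional normal kernel and shows that the average of \hat{p}(0) over repeated trials settles near the smoothed value (p * \kappa)(0), not near p(0), exactly as the bias term (r^2/2) \alpha(X) predicts.

```python
import numpy as np

rng = np.random.default_rng(0)

def parzen_estimate(x, samples, r):
    """Parzen estimate (6.2) with a one-dimensional normal kernel (m = 1)."""
    u = (x - samples) / r
    return np.mean(np.exp(-0.5 * u**2) / (r * np.sqrt(2.0 * np.pi)))

N, r, trials = 1000, 0.3, 500   # illustrative choices
estimates = np.array([
    parzen_estimate(0.0, rng.standard_normal(N), r) for _ in range(trials)
])

# E{p_hat(0)} is (p * kappa)(0): the N(0, 1 + r^2) density at 0,
# not p(0) itself -- the Parzen estimate is biased by the kernel smoothing.
smoothed = 1.0 / np.sqrt(2.0 * np.pi * (1.0 + r**2))
true_p = 1.0 / np.sqrt(2.0 * np.pi)
print(estimates.mean(), smoothed, true_p)
```

The spread of the trial estimates also shrinks roughly as 1/N, in line with the variance expression (6.9).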