CS 59000 Statistical Machine Learning
Lecture 9
Alan Qi

Outline
- Review of Parzen windows
- K-nearest-neighbour classification
- Linear regression with basis functions
- Ridge regression and the lasso
- Bayesian model selection
- Bayes factors
- Empirical Bayes

Nonparametric Methods (4)
Assume the observations are drawn from a density p(x), and consider a small region R containing x such that

    P = ∫_R p(x) dx.

The probability that K of the N observations lie inside R is Bin(K | N, P), and if N is large,

    K ≈ N P.

If the volume V of R is sufficiently small, p(x) is approximately constant over R, so

    P ≈ p(x) V.

Thus

    p(x) ≈ K / (N V).

What is the relation to the histogram method?

Nonparametric Methods (5)
Kernel density estimation: fix V and estimate K from the data. Let R be a hypercube of side h centred on x, and define the kernel function (Parzen window)

    k(u) = 1 if |u_i| ≤ 1/2 for i = 1, …, D, and 0 otherwise.

It follows that

    K = Σ_{n=1}^N k((x − x_n) / h),

and hence

    p(x) = (1/N) Σ_{n=1}^N (1/h^D) k((x − x_n) / h).

What is the relation to the histogram method, and what is its drawback?

Nonparametric Methods (5, cont.)
To avoid discontinuities in p(x), use a smooth kernel, e.g. a Gaussian:

    p(x) = (1/N) Σ_{n=1}^N (2πh²)^{−D/2} exp( −‖x − x_n‖² / (2h²) ).

Any kernel k(u) such that

    k(u) ≥ 0 and ∫ k(u) du = 1

will work; h acts as a smoother.

Nonparametric Methods (6)
Nearest-neighbour density estimation: fix K and estimate V from the data. Consider a hypersphere centred on x, and let it grow to a volume V* that includes K of the given N data points. Then

    p(x) ≈ K / (N V*).

K acts as a smoother.

K-Nearest-Neighbours for Classification (1)
Given a data set with N_k data points in class C_k, so that Σ_k N_k = N, we have

    p(x | C_k) = K_k / (N_k V),

and correspondingly

    p(x) = K / (N V).

Since p(C_k) = N_k / N, Bayes' theorem gives

    p(C_k | x) = p(x | C_k) p(C_k) / p(x) = K_k / K.

How, then, do we classify a new data point?

K-Nearest-Neighbours for Classification (2)
[Figure: decision regions on the same data set for K = 1 and K = 3.]

K-Nearest-Neighbours for Classification (3)
- K acts as a smoother.
- As N → ∞, the error rate of the 1-nearest-neighbour classifier is never more than twice the optimal error rate (the Bayes error obtained from the true class-conditional distributions).

Nonparametric vs Parametric
Nonparametric models (other than histograms) require storing, and computing with, the entire data set.
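The two estimators above can be sketched in a few lines of NumPy. This is a minimal illustration, not course code: the density estimate uses the Gaussian-kernel formula above, and the K-NN classifier takes a majority vote among the K nearest points, which is exactly the argmax of K_k / K.

```python
import numpy as np

def parzen_gaussian(x_query, X, h):
    """Gaussian-kernel density estimate:
    p(x) = (1/N) sum_n (2 pi h^2)^{-D/2} exp(-||x - x_n||^2 / (2 h^2))."""
    N, D = X.shape
    sq_dists = np.sum((x_query - X) ** 2, axis=1)   # ||x - x_n||^2 for each n
    norm = (2.0 * np.pi * h**2) ** (D / 2.0)
    return np.mean(np.exp(-sq_dists / (2.0 * h**2)) / norm)

def knn_classify(x_query, X, y, K):
    """K-nearest-neighbour classification: majority vote among the K closest
    training points, i.e. argmax_k of the posterior estimate K_k / K."""
    sq_dists = np.sum((X - x_query) ** 2, axis=1)
    nearest = np.argsort(sq_dists)[:K]              # indices of the K neighbours
    labels, counts = np.unique(y[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```

Note that both functions keep the whole training set `X` around at prediction time, which is the storage cost flagged in the nonparametric-vs-parametric comparison above.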
Parametric models, once fitted, are much more efficient in terms of storage and computation.

Linear Regression
- Basis Functions
- Examples of Basis Functions (1), (2)
- Maximum Likelihood Estimation (1), (2)
- Sequential Estimation
- Regularized Least Squares
- More Regularizers
- Visualization of Regularized Regression
- Bayesian Linear Regression
- Posterior Distributions of Parameters
- Predictive Posterior Distribution
- Examples of the Predictive Distribution

Question
Suppose we use Gaussian basis functions. What happens to the predictive distribution if we evaluate it at places far from all the training data points?

Equivalent Kernel
Given the posterior mean

    m_N = β S_N Φᵀ t,

the predictive mean is

    y(x, m_N) = m_Nᵀ φ(x) = β φ(x)ᵀ S_N Φᵀ t = Σ_{n=1}^N β φ(x)ᵀ S_N φ(x_n) t_n = Σ_{n=1}^N k(x, x_n) t_n,

where

    k(x, x') = β φ(x)ᵀ S_N φ(x')

is the equivalent kernel.

[Figure: equivalent kernels induced by Gaussian, polynomial, and sigmoidal basis functions.]

Covariance Between Two Predictions

    cov[ y(x), y(x') ] = φ(x)ᵀ S_N φ(x') = β^{−1} k(x, x').

The predictive means at nearby points are highly correlated, whereas for more distant pairs of points the correlation is smaller.
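A compact NumPy sketch of Bayesian linear regression ties the pieces above together: the posterior N(w | m_N, S_N), the predictive mean and variance, and the equivalent-kernel identity. The data set, the values of α and β, and the basis centres below are illustrative assumptions, not from the slides.

```python
import numpy as np

def gaussian_basis(x, centres, s):
    """Design matrix: a bias column plus Gaussian basis functions exp(-(x-c)^2 / 2s^2)."""
    phi = np.exp(-((x[:, None] - centres[None, :]) ** 2) / (2.0 * s**2))
    return np.hstack([np.ones((x.shape[0], 1)), phi])

def posterior(Phi, t, alpha, beta):
    """Posterior N(w | m_N, S_N) under the prior N(w | 0, alpha^{-1} I):
    S_N^{-1} = alpha I + beta Phi^T Phi,   m_N = beta S_N Phi^T t."""
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

def predictive(Phi_star, m_N, S_N, beta):
    """Predictive mean phi(x)^T m_N and variance 1/beta + phi(x)^T S_N phi(x),
    evaluated for each query row of Phi_star."""
    mean = Phi_star @ m_N
    var = 1.0 / beta + np.einsum('ij,jk,ik->i', Phi_star, S_N, Phi_star)
    return mean, var

def equivalent_kernel(Phi_a, Phi_b, S_N, beta):
    """k(x, x') = beta * phi(x)^T S_N phi(x'); the predictive mean is then
    the kernel-weighted sum of targets, sum_n k(x, x_n) t_n."""
    return beta * Phi_a @ S_N @ Phi_b.T
```

This also answers the "Question" slide: far from the training data every Gaussian basis function decays to zero, so the predictive variance collapses toward the constant 1/β plus the bias-weight term, understating uncertainty rather than growing with distance, which is a known weakness of localized basis functions.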