Introduction to Neural Networks
U. Minn. Psy 5038
Bias/variance

Initialize

‡ Read in Add-in packages:

Off[General::"spell1"];
<< "ErrorBarPlots`";
<< "MultivariateStatistics`";

Make sure the SVM package MathSVMv7 is downloaded to the default directory:

<< MathSVMv7`

SVMs and Kernel methods

As in Lecture 27 (see discriminant functions), we assume a simple perceptron TLU that classifies vector data x into one of two classes depending on the sign of g(x):

decision(x) = sign(w.x + b).

Given g(x) = w.x + b, recall that g(x)/||w|| is the distance of a data point x from the plane defined by g(x) = 0. In support vector machines, the goal is to find the separating plane (i.e. find w and b) that is as far as possible from any of the data points. The intuition is that this will minimize the probability of making wrong classifications when given new data at some future point. Formally, we want to solve

(1)    max_{w,b} ( min_i d(P_{w,b}, x_i) ),

where d(P_{w,b}, x_i) is the distance of the point x_i from the plane P_{w,b} defined by g(x) = 0:

d(P_{w,b}, x_i) = g(x_i)/||w|| = (w.x_i + b)/||w||.

[Figure: Two-class data (black and grey dots), their optimal separating hyperplane (continuous line), and support vectors (circled in blue). This is an example output of the SVMPlot function in MathSVM. The width of the "corridor" defined by the two dotted lines connecting the support vectors is the margin of the optimal separating hyperplane. (From Nilsson et al., 2006)]

‡ The Primal Problem

It can be shown that the optimal separating hyperplane solving (1) can be found as the solution to the equivalent optimization problem

(2)    min_{w,b} (1/2) ||w||^2    subject to    y_i (w^T x_i + b) ≥ 1.

Typically, equality will hold for only a relatively small number of the data vectors. These data are termed support vectors. The solution (w, b) depends only on these specific points, which in effect contain all the information needed for the decision rule.

‡ The "dual problem"

A Simple Linear SVM Example (from Nilsson et al. 2006)

Here's a demo of a simple SVM problem. It uses the add-on package MathSVMv7 written by Nilsson et al.

len = 20;
X = Join[RandomReal[NormalDistribution[-2, 1], {len/2, 2}],
    RandomReal[NormalDistribution[2, 1], {len/2, 2}]];
y = Join[Table[1, {len/2}], Table[-1, {len/2}]];

We use the simple SVM formulation provided in MathSVM by the SeparableSVM function.

t = 0.01;
a = SeparableSVM[X, y, t]

{0, 0.222806, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.126342, 0.0964638, 0, 0, 0, 0, 0, 0, 0}

In the output figure below, the solid line marks the optimal hyperplane, and the dotted lines mark the width of the corridor that joins the support vectors (highlighted in blue).

SVMPlot[a, X, y]

[Figure: output of SVMPlot[a, X, y] for the linearly separable example.]

A Nonlinear Example: Using Kernels (from Nilsson et al. 2006)

What if the data are not linearly separable? The essential idea is to map the data (through some non-linear mapping, e.g. a polynomial) into a higher-dimensional "feature" space and find the optimal hyperplane separating the data there. The dot product gets replaced by a non-linear kernel function. For example, the polynomial kernel is given by:

k(x, y) = (1 + x.y)^d

If d = 1, we have (up to an additive constant) the standard dot product, but for d = 2, 3, etc. we have polynomial functions of the elements of the vectors x and y. See Nilsson et al., and the paper by Jäkel (2009), for more information on kernels. Here is a demo of an application to nonlinear classification.
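Before running the demo, it may help to verify the kernel/feature-space equivalence directly. The short check below is a sketch of our own (phi and k2 are illustrative helper definitions, not part of MathSVM): for 2-D inputs, the second-degree polynomial kernel (1 + x.y)^2 equals an ordinary dot product between explicitly mapped 6-D feature vectors.

(* Our own check, not part of MathSVM: an explicit feature map whose ordinary *)
(* dot product reproduces the second-degree polynomial kernel on 2-D inputs.  *)
phi[{x1_, x2_}] := {1, Sqrt[2] x1, Sqrt[2] x2, x1^2, x2^2, Sqrt[2] x1 x2};
k2[x_, y_] := (1 + x.y)^2;
(* The identity holds for arbitrary symbolic 2-D vectors; this returns True. *)
Simplify[k2[{u1, u2}, {v1, v2}] == phi[{u1, u2}].phi[{v1, v2}]]

This is the sense in which an SVM with a polynomial kernel is still just finding a separating hyperplane, only in a higher-dimensional feature space that never has to be constructed explicitly.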
We'll use a second-degree kernel:

PolynomialKernel[x, y, 2]

(1 + x.{1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1})^2

(Note that y here still holds the ±1 class labels defined in the linear example above, which is why that list appears in the evaluated kernel expression.)

Some synthetic data which are not linearly separable:

len = 50;
X = Join[RandomReal[NormalDistribution[0, 0.03], {len/2, 2}],
    Table[{RandomReal[NormalDistribution[i/len - 1/4, 0.01]],
        Random[NormalDistribution[(2 i/len - 1/2)^2 - 1/6, 0.01]]},
      {i, len/2}]];
y = Join[Table[1, {len/2}], Table[-1, {len/2}]];

SVMDataPlot[X, y, PlotRange -> All]

[Figure: SVMDataPlot output showing the two classes of the synthetic data.]

We use the KernelFunction option to specify the kernel type and run SeparableSVM[].

t = 0.01;
pk = PolynomialKernel[#1, #2, 2] &;
a = SeparableSVM[X, y, t, KernelFunction -> pk]

{0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4397.44, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 238.421, 0, 0, 736.733, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3022.5, 0, 0, 0, 399.786, 0}

When visualizing the results, SVMPlot can use the kernel function to draw the nonlinear decision curves.

SVMPlot[a, X, y, KernelFunction -> pk]

[Figure: SVMPlot output with the nonlinear (second-degree polynomial) decision boundary.]

Clear[len, X, y, a, pk]

‡ More information

The wiki entry for SVMs has a fairly good introduction (as of 12/14/2009). To go to the source, see Vapnik (1995).

http://svm.first.gmd.de/

Demo links:

http://svm.dcs.rhbnc.ac.uk/pagesnew/GPat.shtml

And for applications of kernel methods more generally to cognitive science and neuroscience, see the reviews by Jäkel et al. (2006; 2009). The links to the pdfs are in the course syllabus.

Statistical learning, model selection & the bias/variance dilemma

In Lecture 26, we summarized optimal rules for minimizing risk in Bayesian statistical decision theory, assuming that we know the distributions of the generative model. But what if we don't? Consider the regression problem: fitting data that may be a complex function of the input.

The problem in general is how to choose a function that both remembers the relationship between x and y and generalizes to new values of x. At first one might think that it should be as general as possible, to allow all kinds of maps. For example, if one is fitting a curve, you might wish to use a very high-order polynomial, or a back-prop network with lots of hidden units. There is a drawback, however, to the flexibility afforded by extra degrees of freedom in fitting the data. We can get drastically different fits for different sets of data that are randomly drawn from the same underlying process. The fact that we get different fit parameters (e.g. the slope of a regression line) each time means that, although we may fit the data exactly each time, we introduce variation between the average fit (over all data sets) and the fits obtained for the individual data sets in the ensemble. We could get around this problem with a huge amount of data, but the amount of data required can grow exponentially with the order of the fit--an example of the so-called "curse of dimensionality". On the other hand, if the function is restrictive (e.g. straight lines through the origin), then we will get similar fits for different data sets, because all of the fits are constrained to the same small family of functions.
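As a rough illustration of this tradeoff, here is a small sketch of our own (not part of the lecture code; the underlying sine function, noise level, sample size, and polynomial degrees are arbitrary choices). It draws several small data sets from the same noisy process and compares a flexible ninth-degree polynomial fit with a restrictive straight-line fit:

(* Our own sketch of the bias/variance tradeoff; the target function, noise, *)
(* sample size, and degrees below are arbitrary illustrative choices.        *)
SeedRandom[1];
trueF[x_] := Sin[2 Pi x];
makeData[n_] := Table[{x, trueF[x] + RandomReal[NormalDistribution[0, 0.2]]},
    {x, RandomReal[1, n]}];
polyFit[data_, d_] := Fit[data, Table[x^k, {k, 0, d}], x];   (* least-squares polynomial in x *)
fits[d_] := Table[polyFit[makeData[15], d], {5}];            (* refit on 5 independent data sets *)
Plot[Evaluate[fits[9]], {x, 0, 1}, PlotRange -> {-2, 2}]     (* flexible: fits differ wildly *)
Plot[Evaluate[fits[1]], {x, 0, 1}, PlotRange -> {-2, 2}]     (* restrictive: fits nearly identical *)

The ninth-degree fits chase the noise in each particular sample and so scatter widely from one data set to the next (high variance), while the straight-line fits barely change across data sets but consistently miss the curvature of the underlying function (high bias). Choosing the model class is a compromise between these two sources of error.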

