
Chapter 1: Introduction

1.1: Example - Polynomial Curve Fitting

We have an input variable $x$ that will be used to predict a target variable $t$. The data for this example is generated from the function $\sin(2\pi x)$, with random noise included in the target values.

We are given a training set with $N$ observations of $x$, written $\mathbf{x} = (x_1, \ldots, x_N)^T$, with corresponding target values $\mathbf{t} = (t_1, \ldots, t_N)^T$. We may fit the data using a polynomial function

$$y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \ldots + w_M x^M = \sum_{j=0}^{M} w_j x^j \tag{1.1}$$

Although the polynomial is a nonlinear function of $x$, it is a linear function of the coefficients $\mathbf{w}$. Functions that are linear in the unknown parameters are known as linear models.

The values of the coefficients are determined by fitting the polynomial to the training data, which can be done by minimizing an error function, written as

$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 \tag{1.2}$$

We solve the curve-fitting problem by choosing the value of $\mathbf{w}$ that minimizes the error function. We may choose the order $M$ of the polynomial via model selection.

At $M = 9$, the polynomial passes through each point exactly, providing an excellent fit to the training data. However, it is clearly a poor fit to the sinusoidal function because it oscillates wildly; this is known as overfitting.

We define the root-mean-square (RMS) error as

$$E_{\mathrm{RMS}} = \sqrt{2 E(\mathbf{w}^*) / N} \tag{1.3}$$

where $\mathbf{w}^*$ denotes the coefficients that minimize $E(\mathbf{w})$. The division by $N$ allows us to compare data sets of different sizes on an equal footing. The square root ensures that the error is measured on the same scale and in the same units as $t$. We may graph the RMS error for both the training and test sets across different values of $M$.

At $M = 9$, the training error goes to 0. This is because the polynomial has 10 coefficients, and hence 10 degrees of freedom, so it can be tuned exactly to the 10 data points.

That the higher-order polynomial fits the underlying function worse may seem paradoxical, because a polynomial of a given order contains all lower-order polynomials as special cases. However, we see that as $M$ increases, the magnitudes of the coefficients typically also get larger. At $M = 9$, the coefficients have developed large positive and negative values to match each data point exactly. Essentially, the more flexible polynomials are increasingly tuned to the random noise on the target values.

We may use regularization to control the overfitting phenomenon. Regularization adds a penalty term to the error function in Equation 1.2 to discourage the coefficients from reaching large values. The simplest penalty term takes the form of a sum of squares of all the coefficients, leading to

$$\tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2 \tag{1.4}$$

where $\lambda$ governs the relative importance of the penalty term. This error function can be minimized exactly in closed form. Such methods are known as shrinkage methods because they reduce the values of the coefficients.

1.2: Probability Theory

In what follows, $N$ is the total number of trials, $n_{ij}$ is the number of trials in which $X = x_i$ and $Y = y_j$, and $c_i$ is the number of trials in which $X = x_i$.

The joint probability of events is

$$p(X = x_i, Y = y_j) = \frac{n_{ij}}{N} \tag{1.5}$$

The marginal probability of $X$ taking a specific value irrespective of $Y$ is

$$p(X = x_i) = \frac{c_i}{N} \tag{1.6}$$

$$p(X = x_i) = \sum_{j=1}^{L} p(X = x_i, Y = y_j) \tag{1.7}$$

The conditional probability is

$$p(Y = y_j \mid X = x_i) = \frac{n_{ij}}{c_i} \tag{1.8}$$

We can now derive the product rule of probability:

p(X = x_i, Y = y_j) = p(Y = y_j | X = x_i) p(X
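
The counting definitions in Equations 1.5 to 1.8 can be checked numerically on a small table. The sketch below is illustrative only: the function name `probabilities_from_counts` and the count values are made up for this example, and it assumes NumPy is available and that every row count $c_i$ is nonzero.

```python
import numpy as np


def probabilities_from_counts(counts):
    """Return the joint p(X, Y), marginal p(X), and conditional p(Y | X).

    counts[i, j] is n_ij, the number of trials with X = x_i and Y = y_j.
    Assumes every row has at least one trial, so that c_i > 0.
    """
    counts = np.asarray(counts, dtype=float)
    N = counts.sum()                    # total number of trials
    c = counts.sum(axis=1)              # c_i: trials with X = x_i
    joint = counts / N                  # Eq. 1.5: n_ij / N
    marginal_x = c / N                  # Eq. 1.6: c_i / N
    conditional = counts / c[:, None]   # Eq. 1.8: row i holds p(Y | X = x_i)
    return joint, marginal_x, conditional


# Hypothetical counts, made up for illustration (2 values of X, 3 of Y).
counts = [[3, 1, 6],
          [2, 4, 4]]
joint, marginal_x, conditional = probabilities_from_counts(counts)

# Eq. 1.7: summing the joint over j recovers the marginal of X.
print(np.allclose(joint.sum(axis=1), marginal_x))
# Each conditional distribution p(Y | X = x_i) sums to one.
print(np.allclose(conditional.sum(axis=1), 1.0))
# Conditional times marginal gives back the joint,
# because (n_ij / c_i) * (c_i / N) = n_ij / N.
print(np.allclose(conditional * marginal_x[:, None], joint))
```

Each check should print `True` for any count table whose rows are all nonzero, since the identities follow directly from the definitions rather than from the particular numbers chosen.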
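
Returning to the curve-fitting example of Section 1.1, the fitting procedure can be sketched in a few lines. This is a minimal sketch, not the notes' own implementation: the function names, the random seed, the noise standard deviation of 0.3, the test-set size, and the value $\lambda = 10^{-3}$ are all assumptions chosen for illustration. The regularized solve uses the linear system obtained by setting the gradient of Equation 1.4 to zero, $(\Phi^T \Phi + \lambda I)\mathbf{w} = \Phi^T \mathbf{t}$, where $\Phi$ is the matrix with entries $x_n^j$.

```python
import numpy as np


def design_matrix(x, M):
    """Matrix whose columns are x**0, x**1, ..., x**M (the terms of Eq. 1.1)."""
    return np.vander(x, M + 1, increasing=True)


def fit_polynomial(x, t, M, lam=0.0):
    """Coefficients w minimizing Eq. 1.4 (which reduces to Eq. 1.2 when lam == 0)."""
    Phi = design_matrix(x, M)
    if lam == 0.0:
        # Plain least squares; lstsq tolerates an ill-conditioned Phi.
        w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
        return w
    # Zero gradient of Eq. 1.4: (Phi^T Phi + lam * I) w = Phi^T t.
    A = Phi.T @ Phi + lam * np.eye(M + 1)
    return np.linalg.solve(A, Phi.T @ t)


def rms_error(x, t, w):
    """E_RMS = sqrt(2 E(w) / N), with E(w) the sum-of-squares error of Eq. 1.2."""
    y = design_matrix(x, len(w) - 1) @ w
    E = 0.5 * np.sum((y - t) ** 2)
    return np.sqrt(2.0 * E / len(x))


def make_data(n, rng, noise_std=0.3):
    """Noisy samples of sin(2 pi x); the noise level is an assumed value."""
    x = np.linspace(0.0, 1.0, n)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=noise_std, size=n)
    return x, t


rng = np.random.default_rng(0)          # arbitrary seed
x_train, t_train = make_data(10, rng)   # 10 training points, as in the text
x_test, t_test = make_data(100, rng)    # test-set size is an assumption

for M in (0, 1, 3, 9):
    w = fit_polynomial(x_train, t_train, M)
    print(M,
          rms_error(x_train, t_train, w),
          rms_error(x_test, t_test, w),
          np.abs(w).max())

# Regularized fit at M = 9; lam = 1e-3 is an arbitrary illustrative value.
w_reg = fit_polynomial(x_train, t_train, 9, lam=1e-3)
print("regularized M=9:", rms_error(x_test, t_test, w_reg), np.abs(w_reg).max())
```

Each printed row gives $M$, the training RMS error, the test RMS error, and the largest coefficient magnitude. The exact numbers depend on the seed and noise level, but the training error at $M = 9$ should be essentially zero, and the regularized coefficients should be smaller in norm than the unregularized ones.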
