Neural Networks (CMU CS 15-381)

A Simple Problem (Linear Regression)
• We have training data $X = \{x_1^k\}$, $k = 1,\dots,N$, with corresponding outputs $Y = \{y^k\}$, $k = 1,\dots,N$.
• We want to find the parameters that predict the output $Y$ from the data $X$ in a linear fashion:
$$y^k \approx w_o + w_1 x_1^k$$
• Notation: the superscript is the index of the data point in the training set ($k$ denotes the $k$th training data point); the subscript is the coordinate of the data point ($x_1^k$ is coordinate 1 of data point $k$).
• It is convenient to define an additional "fake" attribute for the input data, $x_o = 1$, so that the prediction becomes
$$y^k \approx w_o x_o^k + w_1 x_1^k$$

More Convenient Notation
• Vector of attributes for each training data point: $\mathbf{x}^k = [x_o^k,\dots,x_M^k]$.
• We seek a vector of parameters $\mathbf{w} = [w_o,\dots,w_M]$ such that we have a linear relation between the prediction $Y$ and the attributes $X$:
$$y^k \approx w_o x_o^k + w_1 x_1^k + \dots + w_M x_M^k = \sum_{i=0}^{M} w_i x_i^k = \mathbf{w} \cdot \mathbf{x}^k$$
• By definition, the dot product between the vectors $\mathbf{w}$ and $\mathbf{x}^k$ is $\mathbf{w} \cdot \mathbf{x}^k = \sum_{i=0}^{M} w_i x_i^k$.

Neural Network: Linear Perceptron
• The linear predictor can be drawn as a network: the input units carry the attribute values $x_o, \dots, x_i, \dots, x_M$, each connection to the output unit carries a weight $w_o, \dots, w_i, \dots, w_M$, and the output unit computes the prediction $\mathbf{w} \cdot \mathbf{x} = \sum_{i=0}^{M} w_i x_i$.
• Note: the input unit corresponding to the "fake" attribute $x_o = 1$ is called the bias.

Neural Network Learning
• Problem: adjust the connection weights so that the network generates the correct prediction on the training data.

Linear Regression: Gradient Descent
• We seek the vector of parameters $\mathbf{w} = [w_o,\dots,w_M]$ that minimizes the error between the prediction $Y$ and the data $X$:
$$E(\mathbf{w}) = \sum_{k=1}^{N} \left(y^k - (w_o x_o^k + w_1 x_1^k + \dots + w_M x_M^k)\right)^2 = \sum_{k=1}^{N} \left(y^k - \mathbf{w} \cdot \mathbf{x}^k\right)^2 = \sum_{k=1}^{N} (\delta^k)^2, \qquad \delta^k = y^k - \mathbf{w} \cdot \mathbf{x}^k$$
• $\delta^k$ is the error between the input $\mathbf{x}^k$ and the prediction at data point $k$. Graphically, it is the "vertical" distance between data point $k$ and the prediction calculated using the vector of linear parameters $\mathbf{w}$.

Gradient Descent
• The minimum of $E$ is reached when the derivative with respect to each of the parameters $w_i$ is zero:
$$\frac{\partial E}{\partial w_i} = -2 \sum_{k=1}^{N} \left(y^k - (w_o x_o^k + \dots + w_M x_M^k)\right) x_i^k = -2 \sum_{k=1}^{N} \left(y^k - \mathbf{w} \cdot \mathbf{x}^k\right) x_i^k = -2 \sum_{k=1}^{N} \delta^k x_i^k$$
• Note that the contribution of training data element number $k$ to the overall gradient is $-\delta^k x_i^k$ (up to the constant factor of 2, which can be absorbed into the learning rate).

Gradient Descent Update Rule
• Update rule: move in the direction opposite to the gradient direction:
$$w_i \leftarrow w_i - \alpha \frac{\partial E}{\partial w_i}$$
• On a plot of $E$ against $w_i$: where $\partial E / \partial w_i$ is negative we need to increase $w_i$, and where it is positive we need to decrease $w_i$; the update rule moves $w_i$ in the correct direction in both cases.

Perceptron Training
• Given input training data $\mathbf{x}^k$ with corresponding value $y^k$:
1. Compute the error: $\delta^k \leftarrow y^k - \mathbf{w} \cdot \mathbf{x}^k$
2. Update the NN weights: $w_i \leftarrow w_i + \alpha \, \delta^k x_i^k$
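The following is a minimal NumPy sketch of this training loop (an illustration, not the slides' own code). The data are drawn from the "true" function $y = 0.3 + 0.7\,x_1$ used in the convergence example below, with a little added noise; the learning rate $\alpha$, the noise level, and the number of passes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data from y = 0.3 + 0.7*x1 plus noise.
N = 200
x1 = rng.uniform(0.0, 1.0, size=N)
y = 0.3 + 0.7 * x1 + rng.normal(0.0, 0.05, size=N)

# Each data point is x^k = [x0^k, x1^k] with the "fake" attribute x0^k = 1 (bias).
X = np.column_stack([np.ones(N), x1])

w = np.zeros(2)   # parameters w = [w0, w1]
alpha = 0.1       # learning rate

for _ in range(50):                       # repeated passes over the training data
    for xk, yk in zip(X, y):
        delta_k = yk - np.dot(w, xk)      # 1. compute the error delta^k = y^k - w . x^k
        w = w + alpha * delta_k * xk      # 2. update the weights w_i <- w_i + alpha * delta^k * x_i^k

print(w)  # should end up close to the "true" parameters [0.3, 0.7]
```

Each update uses only the current training point, which is part of why this iterative scheme can be much slower than the direct linear-algebra solution mentioned in the remarks below.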
Linear Perceptron Training
• $\alpha$ is the learning rate.
• $\alpha$ too small: the algorithm may converge slowly and may need a lot of training examples.
• $\alpha$ too large: the algorithm may change $\mathbf{w}$ too quickly and spend a long time oscillating around the minimum.
• Example run: data generated from the "true" function $y = 0.3 + 0.7\,x_1$, i.e. $\mathbf{w} = [0.3\ 0.7]$. [Figure: plots of $w_o$ and $w_1$ against the number of iterations, and the fitted line after 2 iterations (2 training points), 6 iterations (6 training points), and 20 iterations (20 training points); both estimates approach the "true" values $w_o = 0.3$ and $w_1 = 0.7$.]

Perceptrons: Remarks
• The update has many names: delta rule, gradient rule, LMS rule, ...
• The update is guaranteed to converge to the best linear fit (the global minimum of $E$).
• Of course, there are more direct ways of solving the linear regression problem using linear algebra techniques; it boils down to a simple matrix inversion (not shown here).
• In fact, the perceptron training algorithm can be much, much slower than the direct solution.
• So why do we bother with it? The answer is in the next few slides... be patient.

A Simple Classification Problem
• Suppose that we have a single attribute $x_1$ and that the training data fall into two classes (red dots and green dots).
• Given an input value $x_1$, we wish to predict the most likely class (note: this is the same problem as the one we solved with decision trees and nearest neighbors).
• We can convert this into a problem similar to the previous one by defining an output value $y$:
$$y = \begin{cases} 0 & \text{if in the red class} \\ 1 & \text{if in the green class} \end{cases}$$
• The problem is now to learn a mapping between the attribute $x_1$ of the training examples and their corresponding class output $y$.
• What we would like is a piece-wise constant prediction function: $y = 0$ if $x_1 < \theta$ and $y = 1$ if $x_1 \geq \theta$, for some threshold $\theta$. This function is not continuous, so it does not have derivatives.
• What we get from the current linear perceptron model is a continuous linear prediction, $y = \mathbf{w} \cdot \mathbf{x}$ with $\mathbf{w} = [w_o\ w_1]$ and $\mathbf{x} = [1\ x_1]$.
• Possible solution: transform the linear prediction by some function $\sigma$ that turns it into a continuous approximation of a threshold:
$$y = \sigma(\mathbf{w} \cdot \mathbf{x})$$
• The resulting curve is a continuous approximation (a "soft" threshold) of the hard threshold $\theta$, and we can take derivatives of that prediction function (see the sketch at the end of this section).

The Sigmoid Function
$$\sigma(t) = \frac{1}{1 + e^{-t}}$$
• Note: it is not important to remember the exact expression of $\sigma$ (in fact, alternate definitions of $\sigma$ are used). What is important to remember is that:
  – it is smooth and has a derivative $\sigma'$ (whose exact expression is unimportant);
  – it approximates a hard threshold function at $x = 0$.

Generalization to M Attributes
• Two classes are linearly separable if they can be separated by a linear decision boundary, i.e. a hyperplane in the space of the $M$ attributes.
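To make the "soft" threshold concrete, here is a minimal NumPy sketch of $y = \sigma(\mathbf{w} \cdot \mathbf{x})$ for the one-attribute classification problem. The weights are hypothetical, chosen only for illustration: $\mathbf{w} = [-8,\ 4]$ places the soft threshold at $x_1 = -w_o/w_1 = 2$, and a larger $|w_1|$ would make the approximation of the hard threshold sharper.

```python
import numpy as np

def sigmoid(t):
    """Soft threshold: sigma(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical weights for y = sigma(w . x) with x = [1, x1]:
# the prediction crosses 0.5 where w . x = 0, i.e. at x1 = -w0/w1 = 2.0.
w = np.array([-8.0, 4.0])

x1 = np.array([0.0, 1.0, 1.9, 2.1, 3.0, 4.0])
X = np.column_stack([np.ones_like(x1), x1])

soft = sigmoid(X @ w)               # continuous, differentiable prediction in (0, 1)
hard = (X @ w >= 0).astype(float)   # the hard threshold it approximates

print(np.round(soft, 3))  # approximately [0. 0.018 0.401 0.599 0.982 1.]
print(hard)               # [0. 0. 0. 1. 1. 1.]
```

Because $\sigma$ is smooth, the same gradient-descent machinery used for the linear perceptron can be applied to $y = \sigma(\mathbf{w} \cdot \mathbf{x})$, which is exactly why the sigmoid is used in place of the hard threshold.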

