Neural Networks (CMU CS 15-381)

A Simple Problem (Linear Regression)
• We have training data $X = \{x_1^k\}$, $k = 1,\dots,N$, with corresponding outputs $Y = \{y^k\}$, $k = 1,\dots,N$.
• We want to find the parameters that predict the output $Y$ from the data $X$ in a linear fashion:
$$y^k \approx w_o + w_1 x_1^k$$
• Notation: the superscript is the index of the data point in the training set ($k$ denotes the $k$th training data point); the subscript is the coordinate of the data point ($x_1^k$ is coordinate 1 of data point $k$).
• It is convenient to define an additional "fake" attribute for the input data, $x_o = 1$, so that the prediction becomes
$$y^k \approx w_o x_o^k + w_1 x_1^k$$

More Convenient Notation
• Vector of attributes for each training data point: $\mathbf{x}^k = [x_o^k,\dots,x_M^k]$.
• We seek a vector of parameters $\mathbf{w} = [w_o,\dots,w_M]$ such that we have a linear relation between the prediction $Y$ and the attributes $X$:
$$y^k \approx w_o x_o^k + w_1 x_1^k + \dots + w_M x_M^k = \sum_{i=0}^{M} w_i x_i^k = \mathbf{w} \cdot \mathbf{x}^k$$
• By definition, the dot product between the vectors $\mathbf{w}$ and $\mathbf{x}^k$ is $\mathbf{w} \cdot \mathbf{x}^k = \sum_{i=0}^{M} w_i x_i^k$.

Neural Network: Linear Perceptron
• The linear predictor can be drawn as a network: the input units carry the attribute values $x_o, \dots, x_i, \dots, x_M$, each connection to the output unit carries a weight $w_o, \dots, w_i, \dots, w_M$, and the output unit computes the prediction $\mathbf{w} \cdot \mathbf{x} = \sum_{i=0}^{M} w_i x_i$.
• Note: the input unit corresponding to the "fake" attribute $x_o = 1$ is called the bias.

Neural Network Learning
• Problem: adjust the connection weights so that the network generates the correct prediction on the training data.

Linear Regression: Gradient Descent
• We seek the vector of parameters $\mathbf{w} = [w_o,\dots,w_M]$ that minimizes the error between the prediction $Y$ and the data $X$:
$$E(\mathbf{w}) = \sum_{k=1}^{N} \left(y^k - (w_o x_o^k + w_1 x_1^k + \dots + w_M x_M^k)\right)^2 = \sum_{k=1}^{N} \left(y^k - \mathbf{w} \cdot \mathbf{x}^k\right)^2 = \sum_{k=1}^{N} (\delta^k)^2, \qquad \delta^k = y^k - \mathbf{w} \cdot \mathbf{x}^k$$
• $\delta^k$ is the error between the input $\mathbf{x}^k$ and the prediction at data point $k$. Graphically, it is the "vertical" distance between data point $k$ and the prediction calculated using the vector of linear parameters $\mathbf{w}$.

Gradient Descent
• The minimum of $E$ is reached when the derivative with respect to each of the parameters $w_i$ is zero:
$$\frac{\partial E}{\partial w_i} = -2 \sum_{k=1}^{N} \left(y^k - (w_o x_o^k + \dots + w_M x_M^k)\right) x_i^k = -2 \sum_{k=1}^{N} \left(y^k - \mathbf{w} \cdot \mathbf{x}^k\right) x_i^k = -2 \sum_{k=1}^{N} \delta^k x_i^k$$
• Note that the contribution of training data element number $k$ to the overall gradient is $-\delta^k x_i^k$ (up to the constant factor of 2, which can be absorbed into the learning rate).

Gradient Descent Update Rule
• Update rule: move in the direction opposite to the gradient direction:
$$w_i \leftarrow w_i - \alpha \frac{\partial E}{\partial w_i}$$
• On a plot of $E$ against $w_i$: where $\partial E / \partial w_i$ is negative we need to increase $w_i$, and where it is positive we need to decrease $w_i$; the update rule moves $w_i$ in the correct direction in both cases.

Perceptron Training
• Given input training data $\mathbf{x}^k$ with corresponding value $y^k$:
1. Compute the error: $\delta^k \leftarrow y^k - \mathbf{w} \cdot \mathbf{x}^k$
2. Update the NN weights: $w_i \leftarrow w_i + \alpha \, \delta^k x_i^k$
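The following is a minimal NumPy sketch of this training loop (an illustration, not the slides' own code). The data are drawn from the "true" function $y = 0.3 + 0.7\,x_1$ used in the convergence example below, with a little added noise; the learning rate $\alpha$, the noise level, and the number of passes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data from y = 0.3 + 0.7*x1 plus noise.
N = 200
x1 = rng.uniform(0.0, 1.0, size=N)
y = 0.3 + 0.7 * x1 + rng.normal(0.0, 0.05, size=N)

# Each data point is x^k = [x0^k, x1^k] with the "fake" attribute x0^k = 1 (bias).
X = np.column_stack([np.ones(N), x1])

w = np.zeros(2)   # parameters w = [w0, w1]
alpha = 0.1       # learning rate

for _ in range(50):                       # repeated passes over the training data
    for xk, yk in zip(X, y):
        delta_k = yk - np.dot(w, xk)      # 1. compute the error delta^k = y^k - w . x^k
        w = w + alpha * delta_k * xk      # 2. update the weights w_i <- w_i + alpha * delta^k * x_i^k

print(w)  # should end up close to the "true" parameters [0.3, 0.7]
```

Each update uses only the current training point, which is part of why this iterative scheme can be much slower than the direct linear-algebra solution mentioned in the remarks below.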
Linear Perceptron Training
• $\alpha$ is the learning rate.
• $\alpha$ too small: the algorithm may converge slowly and may need a lot of training examples.
• $\alpha$ too large: the algorithm may change $\mathbf{w}$ too quickly and spend a long time oscillating around the minimum.
• Example run: data generated from the "true" function $y = 0.3 + 0.7\,x_1$, i.e. $\mathbf{w} = [0.3\ 0.7]$. [Figure: plots of $w_o$ and $w_1$ against the number of iterations, and the fitted line after 2 iterations (2 training points), 6 iterations (6 training points), and 20 iterations (20 training points); both estimates approach the "true" values $w_o = 0.3$ and $w_1 = 0.7$.]

Perceptrons: Remarks
• The update has many names: delta rule, gradient rule, LMS rule, ...
• The update is guaranteed to converge to the best linear fit (the global minimum of $E$).
• Of course, there are more direct ways of solving the linear regression problem using linear algebra techniques; it boils down to a simple matrix inversion (not shown here).
• In fact, the perceptron training algorithm can be much, much slower than the direct solution.
• So why do we bother with it? The answer is in the next few slides... be patient.

A Simple Classification Problem
• Suppose that we have a single attribute $x_1$ and that the training data fall into two classes (red dots and green dots).
• Given an input value $x_1$, we wish to predict the most likely class (note: this is the same problem as the one we solved with decision trees and nearest neighbors).
• We can convert this into a problem similar to the previous one by defining an output value $y$:
$$y = \begin{cases} 0 & \text{if in the red class} \\ 1 & \text{if in the green class} \end{cases}$$
• The problem is now to learn a mapping between the attribute $x_1$ of the training examples and their corresponding class output $y$.
• What we would like is a piece-wise constant prediction function: $y = 0$ if $x_1 < \theta$ and $y = 1$ if $x_1 \geq \theta$, for some threshold $\theta$. This function is not continuous, so it does not have derivatives.
• What we get from the current linear perceptron model is a continuous linear prediction, $y = \mathbf{w} \cdot \mathbf{x}$ with $\mathbf{w} = [w_o\ w_1]$ and $\mathbf{x} = [1\ x_1]$.
• Possible solution: transform the linear prediction by some function $\sigma$ that turns it into a continuous approximation of a threshold:
$$y = \sigma(\mathbf{w} \cdot \mathbf{x})$$
• The resulting curve is a continuous approximation (a "soft" threshold) of the hard threshold $\theta$, and we can take derivatives of that prediction function (see the sketch at the end of this section).

The Sigmoid Function
$$\sigma(t) = \frac{1}{1 + e^{-t}}$$
• Note: it is not important to remember the exact expression of $\sigma$ (in fact, alternate definitions of $\sigma$ are used). What is important to remember is that:
  – it is smooth and has a derivative $\sigma'$ (whose exact expression is unimportant);
  – it approximates a hard threshold function at $x = 0$.

Generalization to M Attributes
• Two classes are linearly separable if they can be separated by a linear decision boundary, i.e. a hyperplane in the space of the $M$ attributes.
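To make the "soft" threshold concrete, here is a minimal NumPy sketch of $y = \sigma(\mathbf{w} \cdot \mathbf{x})$ for the one-attribute classification problem. The weights are hypothetical, chosen only for illustration: $\mathbf{w} = [-8,\ 4]$ places the soft threshold at $x_1 = -w_o/w_1 = 2$, and a larger $|w_1|$ would make the approximation of the hard threshold sharper.

```python
import numpy as np

def sigmoid(t):
    """Soft threshold: sigma(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

# Hypothetical weights for y = sigma(w . x) with x = [1, x1]:
# the prediction crosses 0.5 where w . x = 0, i.e. at x1 = -w0/w1 = 2.0.
w = np.array([-8.0, 4.0])

x1 = np.array([0.0, 1.0, 1.9, 2.1, 3.0, 4.0])
X = np.column_stack([np.ones_like(x1), x1])

soft = sigmoid(X @ w)               # continuous, differentiable prediction in (0, 1)
hard = (X @ w >= 0).astype(float)   # the hard threshold it approximates

print(np.round(soft, 3))  # approximately [0. 0.018 0.401 0.599 0.982 1.]
print(hard)               # [0. 0. 0. 1. 1. 1.]
```

Because $\sigma$ is smooth, the same gradient-descent machinery used for the linear perceptron can be applied to $y = \sigma(\mathbf{w} \cdot \mathbf{x})$, which is exactly why the sigmoid is used in place of the hard threshold.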

