Stat 231. A.L. Yuille. Fall 2004.
Lecture notes for Stat 231: Pattern Recognition and Machine Learning

Perceptron Rule and Convergence Proof.
Capacity of Perceptrons.
Multi-layer Perceptrons.
Read 5.4, 5.5, 9.6.8 of Duda, Hart, Stork.

Linear Separation

N samples $\{(x_\mu, y_\mu) : \mu = 1, \dots, N\}$, where the labels are $y_\mu \in \{-1, +1\}$.
Can we find a hyperplane $a \cdot x = 0$ in feature space, through the origin, that separates the two types of samples?

Linear Separation

For the two-class case, simplify by replacing all samples $x_\mu$ with $z_\mu = y_\mu x_\mu$. Then find a plane $a \cdot z = 0$ such that $a \cdot z_\mu > 0$ for all $\mu$.
The weight vector $a$ is almost never unique. Determine the weight vector that has the biggest margin $m > 0$, where $a \cdot z_\mu \ge m$ for all $\mu$ with $\|a\| = 1$ (next lecture).
Discriminative: no attempt to model probability distributions. Recall that the decision boundary is a hyperplane if the distributions are Gaussian with identical covariance.

Perceptron Rule

Assume there is a hyperplane separating the two classes. How can we find it?
Single Sample Perceptron Rule.
Order the samples $z_1, \dots, z_N$.
Set $a = 0$. Loop over $j$: if $z_j$ is misclassified ($a \cdot z_j \le 0$), set $a \to a + z_j$. Repeat until all samples are classified correctly.

Perceptron Convergence

Novikov’s Theorem: the single sample Perceptron rule will converge to a solution weight, if one exists.
Proof. Suppose $\hat a$ is a separating weight, so that $\gamma = \min_\mu \hat a \cdot z_\mu > 0$, and let $\beta^2 = \max_\mu \|z_\mu\|^2$ and $\alpha = \beta^2 / \gamma$ (the argument of Duda, Hart, Stork 5.5). Then $\|a - \alpha \hat a\|^2$ decreases by at least $\beta^2$ for each misclassified sample. Initialize the weight at 0. Then the number of weight changes is at most $\|\alpha \hat a\|^2 / \beta^2 = \beta^2 \|\hat a\|^2 / \gamma^2$.
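As a minimal sketch of the single-sample Perceptron rule and of the bound on the number of weight changes, the Python below runs the rule on a small hand-made data set. The function names `perceptron` and `novikoff_bound`, the `max_epochs` guard, the toy samples, and the reference weight `a_hat = (1, 1)` are illustrative assumptions, not part of the lecture. The bound is computed in the form $\beta^2 \|\hat a\|^2 / \gamma^2$ used in Duda, Hart, Stork 5.5.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


def flip(samples, labels):
    """Replace each sample x by z = y * x, so 'correct' means a . z > 0."""
    return [[y * xi for xi in x] for x, y in zip(samples, labels)]


def perceptron(samples, labels, max_epochs=1000):
    """Single-sample perceptron rule; returns (weight, number of updates).

    max_epochs is a safety guard (an assumption of this sketch): the rule
    only terminates on its own when the data are linearly separable.
    """
    z = flip(samples, labels)
    a = [0.0] * len(z[0])              # initialize the weight at 0
    updates = 0
    for _ in range(max_epochs):
        changed = False
        for zj in z:                   # loop over the ordered samples
            if dot(a, zj) <= 0:        # z_j is misclassified
                a = [ai + zi for ai, zi in zip(a, zj)]
                updates += 1
                changed = True
        if not changed:                # every sample classified correctly
            return a, updates
    raise RuntimeError("no separating weight found within max_epochs")


def novikoff_bound(samples, labels, a_hat):
    """Bound beta^2 * |a_hat|^2 / gamma^2 for a known separating a_hat."""
    z = flip(samples, labels)
    gamma = min(dot(a_hat, zj) for zj in z)
    if gamma <= 0:
        raise ValueError("a_hat does not separate the samples")
    beta2 = max(dot(zj, zj) for zj in z)
    return beta2 * dot(a_hat, a_hat) / gamma ** 2


# Toy data in two dimensions (hypothetical), separable through the origin.
samples = [(1.0, -0.8), (0.8, -1.0), (1.0, 1.0)]
labels = [1, -1, 1]

a, updates = perceptron(samples, labels)
print("weight:", a)
print("updates:", updates, "bound:", novikoff_bound(samples, labels, [1.0, 1.0]))
```

The bound depends on which separating weight is supplied, so it is typically loose; the point of the sketch is only that the update count never exceeds it.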
Perceptron Convergence

Proof of claim. If $z_j = y_j x_j$ is misclassified, then $a \cdot z_j \le 0$ and the rule sets $a \to a + z_j$, so
$$\|a + z_j - \alpha \hat a\|^2 = \|a - \alpha \hat a\|^2 + 2 (a - \alpha \hat a) \cdot z_j + \|z_j\|^2 \le \|a - \alpha \hat a\|^2 - 2 \alpha \, \hat a \cdot z_j + \|z_j\|^2 .$$
Using $\hat a \cdot z_j \ge \gamma$ and $\|z_j\|^2 \le \beta^2$, where $\gamma = \min_\mu \hat a \cdot z_\mu > 0$ and $\beta^2 = \max_\mu \|z_\mu\|^2$, and choosing $\alpha = \beta^2 / \gamma$, the right-hand side is at most $\|a - \alpha \hat a\|^2 - 2 \alpha \gamma + \beta^2 = \|a - \alpha \hat a\|^2 - \beta^2$.

Perceptron Capacity

The Perceptron was very influential, and unrealistic claims were made about its abilities (1950’s, early 1960’s).
The model is an idealized model of neurons.
An entire book describing the limited capacity of Perceptrons was published in 1969 (Minsky and Papert). Some classifications, such as exclusive or, cannot be performed by linear separation.
But, from Learning Theory, limited capacity is good.

Generalization and Capacity.

The Perceptron is useful precisely because it has finite capacity and so cannot represent all classifications.
The amount of training data required to ensure generalization needs to be larger than the capacity. Infinite capacity requires infinite data.
A full definition of Perceptron capacity must wait until we introduce the Vapnik-Chervonenkis (VC) dimension.
But the following result (Cover) gives the basic idea.

Perceptron Capacity

Suppose we have $n$ sample points in a $d$ dimensional feature space. Assume that these points are in general position: no subset of $(d+1)$ points lies in a $(d-1)$ dimensional subspace.
Let $f(n,d)$ be the fraction of the $2^n$ dichotomies of the $n$ points which can be expressed by linear separation.
It can be shown (D.H.S) that $f(n,d) = 1$ for $n \le d+1$, and otherwise
$$f(n,d) = \frac{2}{2^n} \sum_{j=0}^{d} \binom{n-1}{j} .$$
There is a critical value $n = 2(d+1)$, at which $f(n,d) = 1/2$. $f(n,d) \approx 1$ for $n \ll 2(d+1)$ and $f(n,d) \approx 0$ for $n \gg 2(d+1)$; the transition is rapid for large $d$.

Capacity and Generalization

Perceptron capacity is $d+1$.
The probability of finding a separating hyperplane by chance alignment of the samples decreases rapidly for $n > 2(d+1)$.
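The fraction of linearly separable dichotomies can be evaluated directly from Cover's formula as stated in Duda, Hart, Stork: 1 for n up to d+1, and (2 / 2^n) times the sum of the binomial coefficients C(n-1, j) for j = 0, ..., d otherwise. The sketch below is only a numerical illustration; the function name `separable_fraction` and the chosen values of n and d are assumptions made here.

```python
from math import comb


def separable_fraction(n, d):
    """Fraction f(n, d) of the 2^n dichotomies of n points in general
    position in d dimensions that a hyperplane can realize."""
    if n <= d + 1:
        return 1.0
    return 2 * sum(comb(n - 1, j) for j in range(d + 1)) / 2 ** n


# Four points in the plane: 14 of the 16 dichotomies are separable.
print(separable_fraction(4, 2))        # 0.875

# Sweep n across the critical value 2(d+1) for a fixed, fairly large d.
d = 50
for ratio in (0.5, 1.0, 1.5, 2.0, 2.5, 4.0):
    n = int(ratio * (d + 1))
    print(f"n = {n:4d}  n/(d+1) = {ratio:3.1f}  f = {separable_fraction(n, d):.4f}")
```

For four points at the corners of a square, the two dichotomies that are excluded are the exclusive-or labelings, which ties the formula back to the limitation noted by Minsky and Papert.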
Multilayer Perceptrons

Multilayer Perceptrons were introduced in the 1980’s to increase capacity. Motivated by biological arguments (dubious).
Key Idea: replace the binary decision rule by a Sigmoid function: $\sigma(z) = 1 / (1 + e^{-z/T})$ (a step function as $T$ tends to 0).
Input unit activities $x_i$.
Hidden units $h_j = \sigma(\sum_i w_{ji} x_i)$.
Output units $o_k = \sigma(\sum_j W_{kj} h_j)$.
Weights $w_{ji}$ connect the input units to the hidden units, and weights $W_{kj}$ connect the hidden units to the output units.

Multilayer Perceptrons

Multilayer perceptrons can represent any function provided there are a sufficient number of hidden units. But the number of hidden units may be enormous.
Also, the ability to represent any function may be bad: a model that can fit anything can memorize the training data instead of generalizing.
It is difficult to analyze multilayer perceptrons. They are like “black boxes”. When they are successful, there is often a simpler, more transparent alternative.
The neuronal plausibility of multilayer perceptrons is unclear.

Multilayer Perceptrons

Train the multilayer perceptron using training data $\{(x_\mu, t_\mu) : \mu = 1, \dots, N\}$, where $t_\mu$ is the target output for input $x_\mu$.
Define an error function for each sample: $E_\mu = \frac{1}{2} \|t_\mu - o(x_\mu)\|^2$.
Minimize the error function for each sample by steepest descent: $w_{ji} \to w_{ji} - \eta \, \partial E_\mu / \partial w_{ji}$ and $W_{kj} \to W_{kj} - \eta \, \partial E_\mu / \partial W_{kj}$, with step size $\eta > 0$.
Backpropagation algorithm (propagation of errors).

Summary

Perceptron and Linear Separability.
Perceptron rule and convergence proof.
Capacity of Perceptrons.
Multi-layer Perceptrons.
Next Lecture: Support Vector Machines for Linear
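The multilayer perceptron slides (sigmoid units, a per-sample squared error, and steepest descent by backpropagation) can be sketched in a few lines of plain Python. Everything specific here is an assumption of the sketch rather than the lecture's own formulation: the function names, the one-hidden-layer architecture with two hidden units, the constant input of 1 standing in for a bias, the starting weights, the step size `eta`, and the exclusive-or training set. Training is not guaranteed to reach a good solution from every starting point.

```python
import math


def sigmoid(z, T=1.0):
    """Sigmoid with temperature T; approaches a step function as T -> 0."""
    return 1.0 / (1.0 + math.exp(-z / T))


def forward(x, w, W, T=1.0):
    """Hidden activities h and outputs o for input x.
    w[j][i]: input i -> hidden j.  W[k][j]: hidden j -> output k."""
    h = [sigmoid(sum(wji * xi for wji, xi in zip(wj, x)), T) for wj in w]
    o = [sigmoid(sum(Wkj * hj for Wkj, hj in zip(Wk, h)), T) for Wk in W]
    return h, o


def sample_error(x, t, w, W, T=1.0):
    """Squared error 0.5 * |t - o(x)|^2 for one sample."""
    _, o = forward(x, w, W, T)
    return 0.5 * sum((tk - ok) ** 2 for tk, ok in zip(t, o))


def gradients(x, t, w, W, T=1.0):
    """Backpropagation: gradients of the sample error w.r.t. w and W."""
    h, o = forward(x, w, W, T)
    # Error signal at each output unit (derivative w.r.t. its net input).
    d_out = [-(tk - ok) * ok * (1.0 - ok) / T for tk, ok in zip(t, o)]
    # Propagate the output errors back through W to the hidden units.
    d_hid = [sum(d_out[k] * W[k][j] for k in range(len(W)))
             * h[j] * (1.0 - h[j]) / T for j in range(len(h))]
    grad_w = [[dj * xi for xi in x] for dj in d_hid]
    grad_W = [[dk * hj for hj in h] for dk in d_out]
    return grad_w, grad_W


def backprop_step(x, t, w, W, eta=0.5, T=1.0):
    """One steepest-descent update on a single sample (modifies w, W)."""
    grad_w, grad_W = gradients(x, t, w, W, T)
    for row, grow in zip(w, grad_w):
        for i, g in enumerate(grow):
            row[i] -= eta * g
    for row, grow in zip(W, grad_W):
        for j, g in enumerate(grow):
            row[j] -= eta * g


# Exclusive-or data; the third input is a constant 1 acting as a bias.
data = [([0.0, 0.0, 1.0], [0.0]), ([0.0, 1.0, 1.0], [1.0]),
        ([1.0, 0.0, 1.0], [1.0]), ([1.0, 1.0, 1.0], [0.0])]
w = [[0.5, -0.4, 0.1], [-0.3, 0.6, -0.2]]   # hypothetical starting weights
W = [[0.7, -0.5]]

print("error before:", sum(sample_error(x, t, w, W) for x, t in data))
for epoch in range(500):
    for x, t in data:
        backprop_step(x, t, w, W)
print("error after: ", sum(sample_error(x, t, w, W) for x, t in data))
```

Keeping `gradients` separate from `backprop_step` makes it easy to compare the backpropagated derivatives against finite differences, which is a common sanity check for this kind of code.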