Artificial Neural Networks in Data Engineering: Overview
Lecture 10

Kansas State University
Department of Computing and Information Sciences
CIS 830: Advanced Topics in Artificial Intelligence
Wednesday, February 9, 2000
William H. Hsu
Department of Computing and Information Sciences, KSU
http://www.cis.ksu.edu/~bhsu
Readings: Chapter 19, Russell and Norvig

Lecture Outline
• Read Sections 4.5-4.9, Mitchell; Chapter 4, Bishop; Rumelhart et al.
• Multi-Layer Networks
  – Nonlinear transfer functions
  – Multi-layer networks of nonlinear units (sigmoid, hyperbolic tangent)
• Backpropagation of Error
  – The backpropagation algorithm
    • Relation to error gradient function for nonlinear units
    • Derivation of training rule for feedforward multi-layer networks
  – Training issues
    • Local optima
    • Overfitting in ANNs
• Hidden-Layer Representations
• Examples: Face Recognition and Text-to-Speech
• Advanced Topics (Brief Survey)
• Next Week: Chapter 5 and Sections 6.1-6.5, Mitchell; Quinlan paper

Multi-Layer Networks of Nonlinear Units
• Nonlinear Units
  – Recall: linear threshold gate activation function sgn(w · x)
  – Nonlinear activation function: generalization of sgn
• Multi-Layer Networks
  – A specific type: Multi-Layer Perceptrons (MLPs)
  – Definition: a multi-layer feedforward network is composed of an input layer, one or more hidden layers, and an output layer
  – “Layers”: counted in weight layers (e.g., 1 hidden layer ⇒ 2-layer network)
  – Only hidden and output layers contain perceptrons (threshold or nonlinear units)
• MLPs in Theory
  – Network (of 2 or more layers) can represent any function (arbitrarily small error)
  – Training even 3-unit multi-layer ANNs is NP-hard (Blum and Rivest, 1992)
• MLPs in Practice
  – Finding or designing effective networks for arbitrary functions is difficult
  – Training is very computation-intensive even when structure is “known”
[Figure: feedforward network with input layer (x1, x2, x3), hidden layer (h1–h4; input-to-hidden weights u_ij, e.g., u_11), and output layer (o1, o2; hidden-to-output weights v_jk, e.g., v_42)]

Nonlinear Activation Functions
• Sigmoid Activation Function
  – Linear threshold gate activation function: sgn(w · x)
  – Nonlinear activation (aka transfer, squashing) function: generalization of sgn; the standard choice is the sigmoid function
  – Unit computes net = Σ_{i=0..n} w_i x_i = w · x (with bias input x_0 = 1 and bias weight w_0) and outputs o = σ(net)
  – Sigmoid: σ(net) = 1 / (1 + e^(−net))
  – Can derive gradient rules to train
    • One sigmoid unit
    • Multi-layer, feedforward networks of sigmoid units (using backpropagation)
• Hyperbolic Tangent Activation Function
  – σ(net) = tanh(net) = sinh(net) / cosh(net) = (e^net − e^(−net)) / (e^net + e^(−net))
[Figure: single unit with inputs x1, ..., xn, weights w1, ..., wn, and bias input x0 = 1 with weight w0]

Backpropagation Algorithm
• Intuitive Idea: Distribute Blame for Error to Previous Layers
• Algorithm Train-by-Backprop (D, r)
  – Each training example is a pair of the form <x, t(x)>, where x is the vector of input values and t(x) is the vector of target output values; r is the learning rate (e.g., 0.05)
  – Initialize all weights w_i to (small) random values
  – UNTIL the termination condition is met, DO
      FOR each <x, t(x)> in D, DO
        Input the instance x to the network and compute the output o(x) = σ(net(x))
        FOR each output unit k, DO
          δ_k ← o_k(x) (1 − o_k(x)) (t_k(x) − o_k(x))
        FOR each hidden unit j, DO
          δ_j ← h_j(x) (1 − h_j(x)) Σ_{k ∈ outputs} v_{j,k} δ_k
        Update each weight w = u_{i,j} (input-to-hidden) or w = v_{j,k} (hidden-to-output):
          w_{start-layer, end-layer} ← w_{start-layer, end-layer} + Δw_{start-layer, end-layer}
          where Δw_{start-layer, end-layer} = r · δ_{end-layer} · a_{start-layer}, and a_{start-layer} is the activation feeding the weight (x_i for u_{i,j}, h_j for v_{j,k})
  – RETURN final u, v
[Figure: the same feedforward network (inputs x1–x3, hidden units h1–h4 via weights u, outputs o1–o2 via weights v)]
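The pseudocode above maps directly onto a few lines of NumPy. The sketch below is a minimal illustration rather than the lecture's own code: the function name train_by_backprop, the fixed epoch count used as the termination condition, and the ±0.05 initialization range are assumptions. It trains a single-hidden-layer sigmoid network using the per-example (stochastic) updates and δ formulas given above.

```python
# Minimal stochastic backpropagation for a one-hidden-layer sigmoid network,
# following the Train-by-Backprop pseudocode above (illustrative sketch).
import numpy as np

def sigmoid(net):
    # sigma(net) = 1 / (1 + e^(-net))
    return 1.0 / (1.0 + np.exp(-net))

def train_by_backprop(D, r=0.05, n_hidden=4, epochs=5000, seed=0):
    """D: list of (x, t) pairs of 1-D numpy arrays; returns weight matrices (u, v)."""
    rng = np.random.default_rng(seed)
    n_in, n_out = len(D[0][0]), len(D[0][1])
    # Initialize all weights to small random values; the extra row holds the bias weight (x0 = 1).
    u = rng.uniform(-0.05, 0.05, size=(n_in + 1, n_hidden))   # input -> hidden
    v = rng.uniform(-0.05, 0.05, size=(n_hidden + 1, n_out))  # hidden -> output
    for _ in range(epochs):                        # termination condition: fixed number of epochs
        for x, t in D:
            x1 = np.append(x, 1.0)                 # append bias input x0 = 1
            h = sigmoid(x1 @ u)                    # hidden activations h_j
            h1 = np.append(h, 1.0)
            o = sigmoid(h1 @ v)                    # output activations o_k
            delta_o = o * (1.0 - o) * (t - o)      # delta_k = o_k (1 - o_k) (t_k - o_k)
            delta_h = h * (1.0 - h) * (v[:-1] @ delta_o)  # delta_j = h_j (1 - h_j) sum_k v_jk delta_k
            v += r * np.outer(h1, delta_o)         # delta-w = r * delta_(end layer) * a_(start layer)
            u += r * np.outer(x1, delta_h)
    return u, v
```

Updating weights one example at a time follows the pseudocode (stochastic gradient descent); batching the updates over D, or adding the momentum term discussed on the next slide, are common variations.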
Backpropagation and Local Optima
• Gradient Descent in Backprop
  – Performed over entire network weight vector
  – Easily generalized to arbitrary directed graphs
  – Property: backprop on feedforward ANNs will find a local (not necessarily global) error minimum
• Backprop in Practice
  – Local optimization often works well (can run multiple times)
  – Often include a weight momentum term α: Δw_{start-layer, end-layer}(n) = r · δ_{end-layer} · a_{start-layer} + α · Δw_{start-layer, end-layer}(n − 1)
  – Minimizes error over the training examples; will it generalize well to subsequent instances?
  – Training often very slow: thousands of iterations over D (epochs)
  – Inference (applying network after training) typically very fast
    • Classification
    • Control

Feedforward ANNs: Representational Power and Bias
• Representational (i.e., Expressive) Power
  – Backprop presented for feedforward ANNs with a single hidden layer (2-layer)
  – 2-layer feedforward ANN
    • Any Boolean function (simulate a 2-layer AND-OR network)
    • Any bounded continuous function (approximate with arbitrarily small error) [Cybenko, 1989; Hornik et al., 1989]
  – Sigmoid functions: set of basis functions; used to compose arbitrary functions
  – 3-layer feedforward ANN: any function (approximate with arbitrarily small error) [Cybenko, 1988]
  – Functions that ANNs are good at acquiring: Network Efficiently Representable Functions (NERFs); how to characterize them? [Russell and Norvig, 1995]
• Inductive Bias of ANNs
  – Hypothesis space: n-dimensional Euclidean space (weight space)
  – Continuous (error function smooth with respect to weight parameters)
  – Preference bias: “smooth interpolation” among positive examples
  – Not well understood yet (known to be computationally hard)
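The claim that a 2-layer feedforward ANN can represent any Boolean function can be checked concretely for XOR, which no single perceptron can represent. The sketch below hand-wires such a network: one hidden unit approximates OR, one approximates AND, and the output unit computes roughly "OR and not AND". The specific weight magnitudes are illustrative choices that saturate the sigmoids, not values from the lecture.

```python
# Hand-wired 2-layer sigmoid network computing XOR, illustrating that a 2-layer
# feedforward ANN can represent any Boolean function (illustrative weights).
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

# Hidden unit 1 approximates OR(x1, x2); hidden unit 2 approximates AND(x1, x2).
U = np.array([[20.0, 20.0],     # weights from x1
              [20.0, 20.0],     # weights from x2
              [-10.0, -30.0]])  # bias weights (x0 = 1)
# Output unit approximates h_OR AND NOT h_AND, i.e., XOR.
V = np.array([20.0, -20.0, -10.0])  # weights from h1, h2, and bias

def xor_net(x1, x2):
    h = sigmoid(np.array([x1, x2, 1.0]) @ U)   # hidden layer
    return sigmoid(np.append(h, 1.0) @ V)      # output layer

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, round(xor_net(a, b), 3))       # ~0, ~1, ~1, ~0
```

Large weight magnitudes drive the sigmoid units toward 0/1 outputs, so the network behaves like the threshold-gate AND-OR construction mentioned above.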
• Hidden Units and Feature Extraction
  – Training procedure: hidden unit representations that minimize error E
  – Sometimes backprop will define new hidden-layer features that are not explicit in the input representation, but which capture properties of the input instances most relevant to learning the target function
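A standard illustration of this point (Mitchell, Section 4.6) is the 8-3-8 identity network: eight one-hot inputs must be reproduced at eight outputs through only three hidden units, so backprop is forced to invent a compact hidden code, typically close to a 3-bit binary encoding. The batch-gradient sketch below is an illustrative assumption rather than the lecture's experiment; the learning rate, iteration count, and initialization are arbitrary, and the exact encoding found varies from run to run (momentum or more iterations may be needed for a clean code).

```python
# 8-3-8 identity network: learn to reproduce eight one-hot inputs through three
# hidden units, then inspect the learned hidden-layer representation.
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

rng = np.random.default_rng(1)
X = np.eye(8)                                   # eight one-hot input vectors
T = X.copy()                                    # identity target: t(x) = x
U = rng.uniform(-0.1, 0.1, size=(9, 3))         # input (+bias) -> hidden
V = rng.uniform(-0.1, 0.1, size=(4, 8))         # hidden (+bias) -> output
Xb = np.hstack([X, np.ones((8, 1))])            # append bias input x0 = 1
r = 0.3
for _ in range(20000):                          # batch gradient descent on squared error
    H = sigmoid(Xb @ U)                         # hidden activations, shape (8, 3)
    Hb = np.hstack([H, np.ones((8, 1))])
    O = sigmoid(Hb @ V)                         # outputs, shape (8, 8)
    delta_o = O * (1 - O) * (T - O)             # output-unit deltas
    delta_h = H * (1 - H) * (delta_o @ V[:-1].T)  # hidden-unit deltas
    V += r * Hb.T @ delta_o
    U += r * Xb.T @ delta_h

# Hidden-layer representation: often close to a distinct 3-bit code per input.
print(np.round(sigmoid(Xb @ U), 2))
```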