Slide 1: Perceptrons
Louis [email protected]
540, section 2
Slides borrowed (with modifications) from Burr Settles

Slide 2: Announcements
- Review session tonight, 4:30-5:30, CS 1325 (right here)
  - Come with questions
  - No lecture prepared
- Midterm tomorrow night, 7:15-9:15, 1240 CS
- HW 3 solution is online. Grading is not done yet.

Slide 3: Neural Networks
Neural networks (NNs) are AI models that try to mimic the brain in the way it stores knowledge and processes information. Also known as:
- Artificial Neural Networks (ANNs)
- Connectionist learning models
  - As opposed to symbolic models, like decision trees
- Parallel Distributed Processing (PDP) models

Slide 4: Neuroscience (1861-present)
Neuroscience is the study of the nervous system, particularly the functions of the brain.
- By the 19th century, it had been established that the brain played a central role in specific cognitive functions.
- Before that, people thought the heart or spleen might be the focus of cognitive activity.
Paul Broca jump-started the field with his studies of speech disorders: he isolated the speech center in the lower left hemisphere of the brain.
- Now called "Broca's Area"

Slide 5: Neuroscience
Special nerve cells called neurons had been theorized about by the late 1800s.
- At the turn of the 20th century, a staining method for actually viewing them was developed by Camillo Golgi.
- Santiago Ramon y Cajal used the staining technique to propose the structure of the nervous system.
- Golgi and Cajal shared the Nobel Prize in 1906, though they had differing views:
  - Golgi thought the brain's functions were carried out in the medium.
  - Cajal theorized about a connectionist "neuronal doctrine."

Slide 6: Neuronal Structure
[figure: diagram of a neuron]

Slide 7: Neuronal Communication
Neurons propagate information by "firing," or sending electrochemical signals along the axon.
- Axons can be 1 to 100 centimeters long!
Synapses connect the axon of one neuron to the dendrites of up to 100,000 other neurons.
- The synapses function as signal amplifiers or repressors.
If enough energy flows into a neuron from all of its synapses/dendrites, then it will fire too, sending a message along its axon to other neurons.

Slide 8: Simulated Neurons
We can create a mathematical approximation to the nature of neuronal communication:
- Represent a "neuron" as a Boolean function.
- Each neuron has an output of either +1 (fire) or 0 (don't fire; sometimes -1 is used).
- Each also has a set of inputs (i.e., other neurons, +1/0), each with an associated weight (i.e., a synapse).
- The neuron computes a weighted sum over all the inputs and compares it to some threshold t.
- If the sum is ≥ t, then output +1 (fire); otherwise 0.

Slide 9: Perceptrons
A perceptron is a simulated neuron that takes the agent's percepts (e.g., a feature vector) as inputs and maps them to the appropriate output value.
[figure: inputs x1…xn with weights w1…wn feeding a threshold unit t with output o]
The output o is the result of some activation function g(in), where in is the weighted sum of the inputs x1…xn. For now, g(in) is a simple threshold or "step" function.

Slide 10: Perceptrons - Inference
Really, the threshold t is just another weight (called the bias):
  (w1·x1) + (w2·x2) + … + (wn·xn) ≥ t
  ⇔ (w1·x1) + (w2·x2) + … + (wn·xn) - t ≥ 0
  ⇔ (w1·x1) + (w2·x2) + … + (wn·xn) + (t · (-1)) ≥ 0
[figure: the same perceptron with an extra input fixed at -1 whose weight is t]
  o(x1, …, xn) = 1 if w1·x1 + w2·x2 + … + wn·xn ≥ t, and 0 otherwise

Slide 11: Methods of Learning
- Perceptron training rule
- Delta rule

Slide 12: Perceptron Learning
A perceptron learns by adjusting its weights in order to minimize the error on the training set.
To start off, consider updating the value of a single weight on a single example x with the perceptron training rule:
  wi ← wi + Δwi, where Δwi = η (true - o) · xi
- Here η is the learning rate, a value in the range [0,1]; true is the target value for the example; and o is the perceptron's output (so (true - o) is the error).
Note: the notation used in the new version of AI: A Modern Approach is really messy and riddled with typos, so this notation will differ from the textbook's.

Slide 13: Using the Perceptron Training Rule
  wi ← wi + Δwi, where Δwi = η (true - o) · xi
Suppose a training example is correctly classified.
- What is the change in weight, Δwi?
What if a training example is incorrectly classified?
- How will the weights change?

Slide 14: Perceptron Training Rule
Proven to converge in a finite number of steps to weights that will correctly classify all training examples, provided the training examples are linearly separable.

Slide 15: Gradient Descent and the Delta Rule
- Works with an unthresholded perceptron:
    o(x1, …, xn) = w1·x1 + w2·x2 + … + wn·xn, i.e., o(x) = w · x
- The delta rule converges toward a best-fit approximation to the target concept even when the training examples are not linearly separable.
Training error, for a given data set, is defined as
  E[w] = ½ Σ_d (true_d - o_d)²
- where E[w] is the sum of squared errors for the weight vector w, and d ranges over the examples in the training set.
- This formulation of error makes a parabolic curve, and so has a global minimum.

Slide 16: Gradient Descent and the Delta Rule
If we have a perceptron with 2 weights, we want to find the pair of weights (i.e., the point in 2D weight space) where E[w] is lowest.
But the weights are continuous values, so how do we know how much to change them?

Slide 17: Gradient Descent and the Delta Rule
Find the gradient (the vector of partial derivatives):
  ∇E[w] = [∂E/∂w0, ∂E/∂w1, …, ∂E/∂wn]
Update the weights:
  wi ← wi + Δwi, where Δwi = -η (∂E/∂wi)
We just need to calculate the partial derivative of the error function:
  ∂E/∂wi = ∂/∂wi ( ½ Σ_d (true_d - o_d)² )
  ∂E/∂wi = Σ_d (true_d - o_d)(-x_i,d)
Putting it all together, this is called the delta rule for training:
  Δwi = η Σ_d (true_d - o_d) · x_i,d
- Often this rule is applied for each example instead of on the entire dataset.
- This makes sense: if (true - o) is positive, the weight should be increased for positive inputs xi, and decreased for negative ones.

Slide 18: On Activation Functions
Houston, we have a problem!
- We're using a simple step function as our activation function g(in).
- This isn't differentiable, so we can't compute g′(in).
- Using the delta rule will not work on a thresholded perceptron.
- To remedy this, we can use a sigmoid.
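The sigmoid fix can be sketched in a few lines. Unlike the step function, σ(z) = 1/(1 + e^(-z)) is smooth everywhere, and its derivative has the convenient closed form σ′(z) = σ(z)(1 - σ(z)), which is exactly what gradient descent needs. This is an illustrative sketch (function names are mine, not from the slides):

```python
import math

def sigmoid(z):
    """Smooth, differentiable replacement for the step activation."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_deriv(z):
    """sigma'(z) = sigma(z) * (1 - sigma(z)); defined everywhere,
    unlike the derivative of the step function, which is undefined at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Near the threshold (z = 0) the sigmoid passes through 0.5 smoothly,
# and its slope there is maximal.
print(sigmoid(0.0))        # 0.5
print(sigmoid_deriv(0.0))  # 0.25
```

For large positive inputs the sigmoid saturates toward 1 (and toward 0 for large negative inputs), so it behaves like a softened version of the step function while remaining usable with the delta rule.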
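As a worked example of the perceptron training rule wi ← wi + η(true - o)·xi, here is a minimal sketch that learns the logical AND function. AND is linearly separable, so the finite-convergence guarantee applies. Variable names, the learning rate of 0.1, and the epoch count are my own illustrative choices, not from the slides:

```python
def predict(weights, x):
    """Step-function perceptron with the bias folded in as weight 0
    on a constant -1 input (the trick from the inference slide)."""
    total = weights[0] * -1 + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1 if total >= 0 else 0

def train(examples, eta=0.1, epochs=50):
    """Perceptron training rule: w_i <- w_i + eta * (true - o) * x_i."""
    weights = [0.0, 0.0, 0.0]  # bias weight t, then w1, w2
    for _ in range(epochs):
        for x, true in examples:
            o = predict(weights, x)
            error = true - o  # 0 when the example is correctly classified
            weights[0] += eta * error * -1  # bias input is the constant -1
            for i, xi in enumerate(x):
                weights[i + 1] += eta * error * xi
    return weights

# Logical AND: linearly separable, so the rule converges.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train(data)
print([predict(w, x) for x, _ in data])  # [0, 0, 0, 1]
```

Note how the rule matches the slides' intuition: correctly classified examples leave the weights untouched (error is 0), while a misclassified example nudges each weight in the direction that reduces its error.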