Berkeley COMPSCI 188 - Lecture 24 -- Perceptrons II

CS 188: Artificial Intelligence, Fall 2008
Lecture 24: Perceptrons II
11/24/2008
Dan Klein – UC Berkeley

Feature Extractors
- A feature extractor maps inputs to feature vectors.
- Many classifiers take feature vectors as inputs.
- Feature vectors are usually very sparse, so use sparse encodings (i.e. only represent the non-zero keys).
- Example (spam email): "Dear Sir. First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …" maps to W=dear: 1, W=sir: 1, W=this: 2, ..., W=wish: 0, ..., MISSPELLED: 2, NAMELESS: 1, ALL_CAPS: 0, NUM_URLS: 0, ...

Some (Vague) Biology
- Very loose inspiration: human neurons.

The Binary Perceptron
- Inputs are feature values.
- Each feature has a weight.
- The sum is the activation.
- If the activation is positive, output 1; if negative, output 0.
(Diagram: features f1, f2, f3 multiplied by weights w1, w2, w3, summed, and compared to 0.)

Example: Spam
- Imagine 4 features: free (number of occurrences of "free"), money (occurrences of "money"), the, and BIAS (always has value 1).
- Weights w: BIAS: -3, free: 4, money: 2, the: 0, ...
- Input "free money" gives features f: BIAS: 1, free: 1, money: 1, the: 0, ...
- Activation w · f = (-3)(1) + (4)(1) + (2)(1) + (0)(0) = 3 > 0, so the output is 1 (spam).

Binary Decision Rule
- In the space of feature vectors, any weight vector defines a hyperplane.
- One side will be class 1; the other will be class -1.
(Plot: in the free/money plane, the weight vector above defines a boundary; the +1 side is labeled SPAM, the -1 side HAM.)

Multiclass Decision Rule
- If we have more than two classes: have a weight vector for each class.
- Calculate an activation for each class.
- The highest activation wins.

Example
- Three classes, with weight vectors:
  - w1: BIAS: -2, win: 4, game: 4, vote: 0, the: 0, ...
  - w2: BIAS: 1, win: 2, game: 0, vote: 4, the: 0, ...
  - w3: BIAS: 2, win: 0, game: 2, vote: 0, the: 0, ...
- Input "win the vote" gives f: BIAS: 1, win: 1, game: 0, vote: 1, the: 1, ...
- Activations: w1 · f = 2, w2 · f = 7, w3 · f = 2, so the second class wins.

The Perceptron Update Rule
- Start with zero weights.
- Pick up training instances one by one and try to classify each.
- If correct, no change!
- If wrong: lower the score of the wrong answer and raise the score of the right answer.

Example
- Training instances "win the vote", "win the election", "win the game", with the weight entries BIAS, win, game, vote, the updated after each mistake. (The per-step weight values are worked on the slide but are not in this text preview.)

Examples: Perceptron, Separable Case
(Figures: the perceptron run on a separable dataset.)

Mistake-Driven Classification
- In naïve Bayes, the parameters come from data statistics, have a causal interpretation, and need one pass through the data.
- For the perceptron, the parameters come from reactions to mistakes, have a discriminative interpretation, and training goes through the data until held-out accuracy maxes out.
(Diagram: Training Data / Held-Out Data / Test Data.)

Properties of Perceptrons
- Separability: some setting of the parameters gets the training set perfectly correct.
- Convergence: if the training set is separable, the perceptron will eventually converge (binary case).
- Mistake bound: the maximum number of mistakes (binary case) is related to the margin, or degree of separability.
(Figures: separable vs. non-separable data.)

Examples: Perceptron, Non-Separable Case
(Figures: the perceptron run on a non-separable dataset.)

Issues with Perceptrons
- Overtraining: test / held-out accuracy usually rises, then falls. Overtraining isn't quite as bad as overfitting, but it is similar.
- Regularization: if the data isn't separable, the weights might thrash around; averaging weight vectors over time can help (averaged perceptron).
- Mediocre generalization: the perceptron finds a "barely" separating solution.

Fixing the Perceptron
- Main problem with the perceptron: the update size τ is uncontrolled; sometimes it updates way too much, sometimes way too little.
- Solution: choose an update size that fixes the current mistake (by 1)...
- ... but choose the minimum change (see the sketch below).

Minimum Correcting Update
- Minimize the change to the weights subject to fixing the current mistake. (The slide's formulas are not in this text preview.)
- The minimizing τ is not τ = 0, or the perceptron would not have made an error, so the minimum is where the correcting constraint holds with equality.
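To make the update rule concrete, here is a minimal Python sketch (not the course's code). It assumes the sparse feature dictionaries used in the examples above; the function names, the `passes` parameter, and the closed form in `min_correcting_tau` are illustrative additions rather than anything shown in the preview.

```python
from collections import defaultdict

def score(weights, features):
    """Sparse dot product: only the non-zero feature keys are stored."""
    return sum(weights[f] * v for f, v in features.items())

def perceptron_update(weights, features, y_true, labels):
    """One step of the multiclass update rule: predict the highest-scoring
    class; on a mistake, lower the wrong class and raise the right class."""
    y_pred = max(labels, key=lambda y: score(weights[y], features))
    if y_pred != y_true:
        for f, v in features.items():
            weights[y_pred][f] -= v   # lower the score of the wrong answer
            weights[y_true][f] += v   # raise the score of the right answer
    return y_pred

def min_correcting_tau(weights, features, y_pred, y_true):
    """Step size tau that just fixes the current mistake by 1 (the
    'minimum correcting update'). The slide's own formula is not in this
    preview; this is the standard closed form, stated as an assumption.
    MIRA, discussed next, additionally caps tau at a constant."""
    f_dot_f = sum(v * v for v in features.values())
    gap = score(weights[y_pred], features) - score(weights[y_true], features)
    return (gap + 1.0) / (2.0 * f_dot_f)

def train(data, labels, passes=5):
    """data: list of (features, label) pairs, with features a sparse dict
    such as {"BIAS": 1, "win": 1, "the": 1, "vote": 1}. Starts from zero
    weights and makes several passes over the data."""
    weights = {y: defaultdict(float) for y in labels}
    for _ in range(passes):
        for features, y_true in data:
            perceptron_update(weights, features, y_true, labels)
    return weights
```

Note that the basic update always steps by the full feature vector; the uncontrolled step size is exactly the problem the "Fixing the Perceptron" slide raises, and `min_correcting_tau` sketches the minimum change it asks for.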
MIRA
- In practice, it's bad to make updates that are too large: the example may be labeled incorrectly.
- Solution: cap the maximum possible value of τ. This gives an algorithm called MIRA.
- MIRA usually converges faster than the perceptron, and usually performs better, especially on noisy data.

Linear Separators
- Which of these linear separators is optimal?

Support Vector Machines
- Maximizing the margin: good according to intuition and theory.
- Only the support vectors matter; other training examples are ignorable.
- Support vector machines (SVMs) find the separator with the maximum margin.
- Basically, SVMs are MIRA where you optimize over all examples at once.
(Figures: MIRA vs. SVM separators.)

Summary
- Naïve Bayes: build classifiers using a model of the training data; smoothing the estimates is important in real systems; classifier confidences are useful, when you can get them.
- Perceptrons / MIRA: make fewer assumptions about the data; mistake-driven learning; multiple passes through the data.

Similarity Functions
- Similarity functions are very important in machine learning.
- Topic for next class: kernels, which are similarity functions with special properties and the basis for a lot of advanced machine learning (e.g. SVMs).

Case-Based Reasoning
- Similarity for classification: case-based reasoning predicts an instance's label using similar instances.
- Nearest-neighbor classification: 1-NN copies the label of the most similar data point; k-NN lets the k nearest neighbors vote (you have to devise a weighting scheme). A small sketch follows at the end of this excerpt.
- Key issue: how to define similarity.
- Trade-off: small k gives relevant neighbors; large k gives smoother functions. Sound familiar?
- [DEMO] http://www.cs.cmu.edu/~zhuxj/courseproject/knndemo/KNN.html

Parametric / Non-Parametric
- Parametric models: a fixed set of parameters; more data means better settings.
- Non-parametric models: the complexity of the classifier increases with the data; better in the limit, often worse in the non-limit.
- (k)NN is non-parametric.
(Figures: truth vs. fits with 2, 10, 100, and 10000 examples.)

Collaborative Filtering
- Ever wonder how online merchants decide what products to recommend to you?
- Simplest idea: recommend the most popular items to everyone. Not entirely crazy! (Why?)
- Can do better if you know something about the customer (e.g. what they've bought).
- Better idea: recommend items that similar customers bought.
- A popular technique: collaborative filtering. Define a similarity function over customers (how?), then look at purchases made by people with high similarity.
- Trade-off: relevance of the comparison set vs. confidence in the predictions. How can this go wrong?
("You are here" figure.)

Nearest-Neighbor Classification
- Nearest neighbor for digits: take ... (the preview text cuts off here).
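Here is a minimal nearest-neighbor sketch for the case-based reasoning slides above, assuming the same sparse feature dicts. Euclidean distance and the unweighted vote are simple placeholder choices (the slides leave the similarity function and weighting scheme open), and the names are illustrative, not from the lecture.

```python
import math
from collections import Counter

def distance(a, b):
    """Euclidean distance between two sparse feature dicts; choosing this
    similarity/distance function is the key design decision the slides flag."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

def knn_classify(query, training_data, k=3):
    """training_data: list of (features, label) pairs.
    k=1 copies the label of the single most similar point (1-NN);
    larger k lets the k nearest neighbors take an unweighted vote."""
    neighbors = sorted(training_data, key=lambda ex: distance(query, ex[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

Small k keeps only the most relevant neighbors; large k smooths the decision, which is the trade-off described above.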

