CS 559: Machine Learning Fundamentals and Applications
10th Set of Notes
Instructor: Philippos Mordohai
Webpage: www.cs.stevens.edu/~mordohai
E-mail: [email protected]
Office: Lieb 215

Project Status
• Next week: be prepared to give a brief status report on your project
• No slides necessary

Outline
• Support Vector Machines (SVM)
  – Additional material not in the 9th set of notes
• Boosting
  – Theory
  – Applications

A Few More Notes on SVMs
• Preprocessing: data scaling
• Overfitting
• Multi-class SVMs

Data Scaling
• Without scaling, attributes with greater numeric ranges may dominate
• Example (shown as a figure in the original slides): the decision of the separating hyperplane depends strongly on the first attribute
• What if the second attribute is more important?
• Fix: linearly scale each attribute to [0, 1], e.g. x' = (x − min)/(max − min), using the minimum and maximum observed on the training data
• With the rescaled points and the new separating hyperplane, the second attribute plays a role (a minimal scaling sketch appears after the SVM notes below)

Overfitting
• In theory, you can easily achieve 100% training accuracy
• This is useless if it does not generalize to novel test data

Cross-Validation
• In practice, split the available data into training and validation sets
• Train on the training data; test on the validation data
• k-fold cross-validation (sketched after the SVM notes below):
  – The data are randomly separated into k groups
  – Each time, k − 1 groups are used for training and one for testing

Cross-Validation and Test Accuracy
• If we select the parameters for which CV accuracy is highest:
  – Does CV accuracy represent future test accuracy?
  – They are slightly different
• With enough parameters we can achieve 100% CV accuracy as well
  – e.g. more parameters than training examples
• But test accuracy may differ
• So split the available labeled data into training, validation, and testing sets
• Use CV on training + validation data; predict the test set with the best parameters found by CV

Multi-Class SVMs
• SVMs can only handle two-class outputs
• What can be done?
• Answer: learn N SVMs (a one-vs-rest sketch follows the SVM notes below)
  – SVM 1 learns "Output == 1" vs. "Output != 1"
  – SVM 2 learns "Output == 2" vs. "Output != 2"
  – …
  – SVM N learns "Output == N" vs. "Output != N"
• To predict the output for a new input, predict with each SVM and pick the one that puts the prediction furthest into the positive region

SVMs for OCR
• MNIST handwriting recognition: 60,000 training examples, 10,000 test examples, 28×28 images (784-D space)
• A linear SVM has around 8.5% test error
• A polynomial SVM has around 1% test error
• Choosing a good mapping φ() for your problem (encoding prior knowledge and getting the right complexity of the function class) improves results
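
The Data Scaling notes above call for min-max scaling each attribute to [0, 1] before training the SVM. The following is a minimal NumPy sketch of that preprocessing step; the function names and the NumPy dependency are my own choices, not from the slides. The key point it illustrates is that the scaling parameters are estimated on the training data and then reused unchanged on the test data.

    import numpy as np

    def minmax_fit(X_train):
        """Per-attribute minimum and range, computed on the training data only."""
        lo = X_train.min(axis=0)
        hi = X_train.max(axis=0)
        rng = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero for constant attributes
        return lo, rng

    def minmax_apply(X, lo, rng):
        """Linearly scale each attribute to roughly [0, 1] using training-set statistics."""
        return (X - lo) / rng

    # Usage: scale training and test data with the *same* parameters.
    # lo, rng   = minmax_fit(X_train)
    # X_train_s = minmax_apply(X_train, lo, rng)
    # X_test_s  = minmax_apply(X_test, lo, rng)   # test values may fall slightly outside [0, 1]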
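
The Cross-Validation notes describe k-fold cross-validation for choosing SVM parameters. Below is a hedged sketch of that procedure using NumPy and scikit-learn's SVC; scikit-learn is an assumed dependency (the slides name no library), and the grid of C values in the usage comment is purely illustrative.

    import numpy as np
    from sklearn.svm import SVC   # assumed dependency; any classifier with fit/predict would do

    def kfold_cv_accuracy(X, y, k=5, C=1.0, seed=0):
        """Average validation accuracy of an RBF-kernel SVM over k folds."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(y))                 # randomly separate the data into k groups
        folds = np.array_split(idx, k)
        accs = []
        for i in range(k):
            test_idx = folds[i]                       # one group held out for testing
            train_idx = np.concatenate([folds[j] for j in range(k) if j != i])  # k-1 groups for training
            clf = SVC(kernel="rbf", C=C).fit(X[train_idx], y[train_idx])
            accs.append(np.mean(clf.predict(X[test_idx]) == y[test_idx]))
        return float(np.mean(accs))

    # Parameter selection: run CV on training + validation data only, keep the test set untouched.
    # best_C = max([0.1, 1.0, 10.0], key=lambda C: kfold_cv_accuracy(X_trval, y_trval, C=C))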
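
The Multi-Class SVMs notes propose training N binary SVMs, one per class, and predicting with the SVM whose output lies furthest in the positive region. A minimal one-vs-rest sketch along those lines is below; it again assumes scikit-learn's SVC and uses its decision_function as the signed distance from the separating hyperplane.

    import numpy as np
    from sklearn.svm import SVC   # assumed dependency; any binary margin classifier with a signed score works

    def train_one_vs_rest(X, y, classes):
        """Train one binary SVM per class: 'Output == k' vs. 'Output != k'."""
        models = {}
        for k in classes:
            yk = np.where(y == k, 1, -1)              # relabel: +1 for class k, -1 for everything else
            models[k] = SVC(kernel="rbf", C=1.0).fit(X, yk)
        return models

    def predict_one_vs_rest(models, X):
        """Predict the class whose SVM puts the point furthest into its positive region."""
        keys = list(models)
        scores = np.column_stack([models[k].decision_function(X) for k in keys])
        return np.array([keys[i] for i in np.argmax(scores, axis=1)])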

Boosting
• Slides based on:
  – Tutorial by Rob Schapire
  – Tutorial by Yoav Freund
  – Slides by Carlos Guestrin
  – Tutorial by Paul Viola
  – Tutorial by Ron Meir
• Not in the DHS book
• Invented independently by Schapire (1989) and Freund (1990), who later joined forces
• Main idea: train a strong classifier by combining weak classifiers
  – Practically useful
  – Theoretically interesting
• Idea: given a weak learner, run it multiple times on (reweighted) training data, then let the learned classifiers vote
• At each iteration t:
  – Weight each training example by how incorrectly it was classified
  – Learn a hypothesis ht
  – Choose a strength αt for this hypothesis
• Final classifier: a weighted combination of the weak learners

Learning from Weighted Data
• Sometimes not all data points are equal
  – Some data points are more equal than others
• Consider a weighted dataset
  – D(i): the weight of the i-th training example (xi, yi)
  – Interpretations:
    • the i-th training example counts as D(i) examples
    • if we were to "resample" the data, we would get more samples of the "heavier" data points
• From now on, in all calculations the i-th training example counts as D(i) "examples"

Definition of Boosting
• Given a training set (x1, y1), …, (xm, ym)
• yi ∈ {−1, +1} is the correct label of instance xi ∈ X
• For t = 1, …, T:
  – construct a distribution Dt on {1, …, m}
  – find a weak hypothesis ("rule of thumb") ht: X → {−1, +1} with small weighted error εt = Σi Dt(i)·[ht(xi) ≠ yi] on Dt
• Output the final hypothesis Hfinal

AdaBoost
• Constructing Dt:
  – D1(i) = 1/m
  – Given Dt and ht: Dt+1(i) = Dt(i)·exp(−αt yi ht(xi)) / Zt, where Zt is a normalization constant chosen so that Dt+1 sums to 1
• Final hypothesis: Hfinal(x) = sign(Σt αt ht(x))

The AdaBoost Algorithm
• The complete algorithm appears as figures in the original slides; a runnable sketch with decision stumps follows at the end of these notes

How to Choose Weights
• The training error of the final classifier is bounded by Πt Zt, where Zt = Σi Dt(i)·exp(−αt yi ht(xi))
• If we minimize Πt Zt, we minimize our training error
  – We can tighten this bound greedily by choosing αt and ht on each iteration to minimize Zt
• For a boolean target function, this is accomplished by choosing [Freund & Schapire '97]:
  αt = ½·ln((1 − εt)/εt)

Weak and Strong Classifiers
• If each classifier is (at least slightly) better than random, i.e. εt < 0.5,
• then AdaBoost will achieve zero training error (exponentially fast in the number of rounds)
• Is it hard to achieve better-than-random training error?

Toy Example
• Rounds 1, 2, 3 and the final hypothesis are illustrated with figures in the original slides

Important Aspects of Boosting
• Exponential loss function
• Choice of weak learners
• Generalization and overfitting
• Multi-class boosting

Exponential Loss Function
• AdaBoost attempts to minimize Σi exp(−yi Σt αt ht(xi))   (*)
• It is really a steepest-descent procedure
  – At each round, αt ht is added to the sum to minimize (*)
• Why this loss function?
  – it is an upper bound on the training (classification) error
  – it is easy to work with
  – it has a connection to logistic regression

Weak Learners
• Stumps: single axis-parallel partition of the space
• Decision trees: hierarchical partition of the space
• Multi-layer perceptrons: general nonlinear function approximators
• Radial basis functions: nonlinear expansion based on kernels

Decision Trees
• Hierarchical and recursive partitioning of the feature space
• A simple model (e.g. a constant) is fit in each region
• Often, the splits are parallel to the axes
• Handling of nominal features and the instability of trees are illustrated with figures in the original slides

Analysis of Training Error
• The training error of the final classifier is bounded by Πt Zt
• For binary classifiers with αt chosen as above, Zt = 2√(εt(1 − εt)), so the error is bounded by Πt 2√(εt(1 − εt))
• If each base classifier is slightly better than random, i.e. there exists γ > 0 such that γt = ½ − εt > γ for all t, then the training error drops exponentially fast in T: Πt Zt ≤ exp(−2 Σt γt²) ≤ exp(−2γ²T)
• AdaBoost is indeed a boosting algorithm in the sense that it can efficiently convert a true weak learning algorithm into a strong learning algorithm
  – Weak learning …
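
The AdaBoost notes above specify the distribution update Dt+1(i) = Dt(i)·exp(−αt yi ht(xi))/Zt, the weight αt = ½·ln((1 − εt)/εt), and the final hypothesis Hfinal(x) = sign(Σt αt ht(x)); the algorithm figures themselves are not reproduced in this preview. The sketch below follows that recipe with decision stumps (single axis-parallel thresholds) as the weak learners, as listed on the Weak Learners slide. The function names and the exhaustive stump search are my own choices and should be read as an illustration, not as the exact algorithm shown in the missing figures.

    import numpy as np

    def fit_stump(X, y, D):
        """Weak learner: the (feature, threshold, polarity) stump with smallest weighted error under D."""
        n, d = X.shape
        best = (np.inf, 0, 0.0, 1)                    # (weighted error, feature index, threshold, polarity)
        for j in range(d):
            for thr in np.unique(X[:, j]):
                for pol in (1, -1):
                    pred = np.where(pol * (X[:, j] - thr) >= 0, 1, -1)
                    err = D[pred != y].sum()          # εt under the current weights
                    if err < best[0]:
                        best = (err, j, thr, pol)
        return best

    def stump_predict(X, j, thr, pol):
        return np.where(pol * (X[:, j] - thr) >= 0, 1, -1)

    def adaboost_train(X, y, T=50):
        """AdaBoost with decision stumps; labels y must be in {-1, +1}."""
        n = len(y)
        D = np.full(n, 1.0 / n)                       # D1(i) = 1/m
        stumps, alphas, epsilons = [], [], []
        for t in range(T):
            eps, j, thr, pol = fit_stump(X, y, D)
            eps = max(eps, 1e-12)                     # guard against a perfect stump (division by zero)
            alpha = 0.5 * np.log((1.0 - eps) / eps)   # αt = ½ ln((1 − εt)/εt)
            h = stump_predict(X, j, thr, pol)
            D = D * np.exp(-alpha * y * h)            # Dt+1(i) ∝ Dt(i) exp(−αt yi ht(xi))
            D /= D.sum()                              # divide by Zt so Dt+1 is a distribution
            stumps.append((j, thr, pol))
            alphas.append(alpha)
            epsilons.append(eps)
        return stumps, np.array(alphas), np.array(epsilons)

    def adaboost_predict(X, stumps, alphas):
        """Final hypothesis: Hfinal(x) = sign(Σt αt ht(x))."""
        score = sum(a * stump_predict(X, j, thr, pol) for a, (j, thr, pol) in zip(alphas, stumps))
        return np.where(score >= 0, 1, -1)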
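
Continuing the sketch above, the Analysis of Training Error notes bound the training error by Πt Zt, which for the αt above equals Πt 2√(εt(1 − εt)) ≤ exp(−2 Σt γt²) with γt = ½ − εt. The short check below reuses the hypothetical adaboost_train / adaboost_predict functions from the previous sketch to compute the empirical training error alongside both bounds, which is an easy way to watch the exponential decrease on real data.

    import numpy as np

    def check_training_error_bound(X, y, T=50):
        """Empirical training error vs. the Πt Zt bound and its exponential upper bound."""
        stumps, alphas, eps = adaboost_train(X, y, T)     # functions from the sketch above
        train_err = np.mean(adaboost_predict(X, stumps, alphas) != y)
        Zt = 2.0 * np.sqrt(eps * (1.0 - eps))             # Zt = 2 sqrt(εt (1 − εt)) for the chosen αt
        gamma = 0.5 - eps                                 # the "edge" γt over random guessing
        bound = np.prod(Zt)                               # training error ≤ Πt Zt
        exp_bound = np.exp(-2.0 * np.sum(gamma ** 2))     # Πt Zt ≤ exp(−2 Σt γt²)
        return float(train_err), float(bound), float(exp_bound)   # expect train_err ≤ bound ≤ exp_bound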

