UB CSE 574 - Introduction to Boosting

Introduction to Boosting
Cynthia Rudin, PACM, Princeton University
Advisors: Ingrid Daubechies and Robert Schapire

Say you have a database of news articles, each labeled +1 or -1: an article is labeled '+1' if its category is "entertainment", and '-1' otherwise. Your goal is: given a new article, find its label. This is not easy: datasets are noisy and the dimension is high.

Examples of statistical learning tasks:
• Optical character recognition (OCR) (post office, banks), object recognition in images
• Bioinformatics (analysis of gene array data for tumor detection, protein classification, etc.)
• Webpage classification (search engines), email filtering, document retrieval
• Semantic classification for speech, automatic .mp3 sorting
• Time-series prediction (regression)
There is a huge number of applications, but all have high-dimensional data.

Examples of classification algorithms:
• SVMs (Support Vector Machines, i.e. large-margin classifiers)
• Neural networks
• Decision trees / decision stumps (CART)
• RBF networks
• Nearest neighbors
• Bayes nets
Which is the best? It depends on the amount and type of data, and on the application! It's a tie between SVMs and boosted decision trees/stumps for general applications. One can always find a problem where a particular algorithm is the best. Boosted convolutional neural nets are the best for OCR (Yann LeCun et al.).

Training data: {(x_i, y_i)}_{i=1..m}, where each (x_i, y_i) is chosen i.i.d. from an unknown probability distribution on X × {-1, 1}. Here X is the "space of all possible articles" and {-1, 1} are the "labels".

Huge question: Given a new random example x, can we predict its correct label with high probability? That is, can we generalize from our training data? (Figure: training points labeled + and - in the space X, plus a new, unlabeled point marked "?".)
Yes!!! That's what the field of statistical learning is all about. The goal of statistical learning is to characterize points from an unknown probability distribution when given a representative sample from that distribution.

How do we construct a classifier? Classifiers divide the space into two pieces for binary classification; multiclass classification can always be reduced to binary.
• Divide the space X into two sections, based on the sign of a function f : X → R.
• The decision boundary is the zero-level set of f, i.e. the set where f(x) = 0.

Overview of Talk
• The Statistical Learning Problem (done)
• Introduction to Boosting and AdaBoost
• AdaBoost as Coordinate Descent
• The Margin Theory and Generalization

Say we have a "weak" learning algorithm:
• A weak learning algorithm produces weak classifiers.
• (Think of a weak classifier as a "rule of thumb".)
Examples of weak classifiers for the "entertainment" application:
h1(article) = +1 if the article contains the term "movie", -1 otherwise
h2(article) = +1 if the article contains the term "actor", -1 otherwise
h3(article) = +1 if the article contains the term "drama", -1 otherwise

Wouldn't it be nice to combine the weak classifiers? Boosting algorithms combine weak classifiers in a meaningful way. Example:
f(article) = sign[ 0.4 h1(article) + 0.3 h2(article) + 0.3 h3(article) ]
So if the article contains the term "movie" and the word "drama", but not the word "actor", the value of f is sign[0.4 - 0.3 + 0.3] = +1, so we label it +1.
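To make the combination concrete, here is a minimal Python sketch (mine, not the slides'): an article is represented simply as a set of words, h1, h2 and h3 are the three rules of thumb above, and the weights 0.4/0.3/0.3 are the ones from the example.

    # Minimal sketch of the combined classifier from the example above.
    # An article is represented here as a set of words.

    def h1(article): return +1 if "movie" in article else -1   # rule of thumb 1
    def h2(article): return +1 if "actor" in article else -1   # rule of thumb 2
    def h3(article): return +1 if "drama" in article else -1   # rule of thumb 3

    def f(article):
        score = 0.4 * h1(article) + 0.3 * h2(article) + 0.3 * h3(article)
        return +1 if score >= 0 else -1

    # An article containing "movie" and "drama" but not "actor":
    print(f({"movie", "drama"}))    # sign(0.4 - 0.3 + 0.3) = +1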
A boosting algorithm takes as input:
- the weak learning algorithm which produces the weak classifiers
- a large training database
and outputs:
- the coefficients of the weak classifiers to make the combined classifier

Two ways to use a boosting algorithm:
• as a way to increase the performance of already "strong" classifiers (e.g. neural networks, decision trees);
• "on their own" with a really basic weak classifier (e.g. decision stumps).

AdaBoost (Freund and Schapire '95)
- Start with a uniform distribution ("weights") over training examples. (The weights tell the weak learning algorithm which examples are important.)
- Request a weak classifier from the weak learning algorithm, h_{j_t} : X → {-1, 1}.
- Increase the weights on the training examples that were misclassified.
- (Repeat.)
At the end, make (carefully!) a linear combination of the weak classifiers obtained at all iterations:
f_final(x) = sign( λ_1 h_1(x) + ... + λ_n h_n(x) )

Define three important things:
• d_t ∈ R^m := the distribution ("weights") over the m training examples at time t, e.g. d_t = [0.25, 0.3, 0.2, 0.25];
• λ_t ∈ R^n := the coefficients of the weak classifiers for the linear combination, f_t(x) = sign( λ_{t,1} h_1(x) + ... + λ_{t,n} h_n(x) );
• M ∈ R^{m×n} := the matrix of hypotheses and data, with one row per data point (i = 1..m) and one column per weak classifier (h_1, ..., h_j, ..., h_n):
  M_ij := +1 if weak classifier h_j classifies point x_i correctly, -1 otherwise; equivalently, M_ij = y_i h_j(x_i).
The columns enumerate every possible weak classifier which can be produced by the weak learning algorithm ("movie", "actor", "drama", ...). The matrix M has too many columns to actually be enumerated. M acts as the only input to AdaBoost: AdaBoost takes M, maintains d_t and λ_t, and outputs λ_final.

AdaBoost (Freund and Schapire '95), in matrix form:
  λ_1 = 0                                                      (initialize the coefficients to 0)
  for t = 1, ..., T:
    d_{t,i} = e^{-(M λ_t)_i} / Σ_{i'=1..m} e^{-(M λ_t)_{i'}}   for all i   (calculate the normalized distribution)
    j_t ∈ argmax_j (d_t^T M)_j                                 (request a weak classifier from the weak learning algorithm)
    r_t = (d_t^T M)_{j_t}
    α_t = (1/2) ln( (1 + r_t) / (1 - r_t) )
    λ_{t+1} = λ_t + α_t e_{j_t}                                (update the linear combination of weak classifiers)
  end
Here r_t is the "edge" or "correlation" of weak classifier j_t:
r_t = (d_t^T M)_{j_t} = Σ_{i=1..m} d_{t,i} y_i h_{j_t}(x_i) = E_{i~d_t}[ y_i h_{j_t}(x_i) ].

AdaBoost as Coordinate Descent
Breiman, Mason et al., Duffy and Helmbold, and others noticed that AdaBoost is a coordinate descent algorithm.
• Coordinate descent is a minimization algorithm like gradient descent, except that we only move along coordinates.
• We cannot calculate the gradient because of the high dimensionality of the space!
• The "coordinates" are the weak classifiers, and the "distance to move in that direction" is the update α_t.
AdaBoost minimizes the following function via coordinate descent:
F(λ) := Σ_{i=1..m} e^{-(M λ)_i}
Choose a direction: j_t ∈ argmax_j (d_t^T M)_j.
Choose a distance to move in that direction: α_t = (1/2) ln( (1 + r_t) / (1 - r_t) ), where r_t = (d_t^T M)_{j_t}, so that λ_{t+1} = λ_t + α_t e_{j_t}.
The function F(λ) := Σ_{i=1..m} e^{-(M λ)_i} is convex.
1) If the data is non-separable by the weak classifiers, the ...
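As a rough illustration of the matrix form above, here is a short Python/NumPy sketch (mine, not the slides'). It mirrors the d_t, λ_t, r_t, α_t updates and assumes the matrix M is small enough to store explicitly, which, as noted above, is not true of the real hypothesis space.

    import numpy as np

    def adaboost(M, T):
        # Matrix-form AdaBoost sketch: M[i, j] = y_i * h_j(x_i), entries in {-1, +1}.
        # Assumes the chosen edge r_t stays strictly below 1 (otherwise α_t blows up).
        m, n = M.shape
        lam = np.zeros(n)                               # λ_1 = 0
        for t in range(T):
            d = np.exp(-(M @ lam))                      # e^{-(M λ_t)_i} for each example i
            d = d / d.sum()                             # d_t: normalized distribution over examples
            edges = d @ M                               # (d_t^T M)_j for every weak classifier j
            j = int(np.argmax(edges))                   # direction j_t with the largest edge
            r = edges[j]                                # edge r_t of the chosen weak classifier
            alpha = 0.5 * np.log((1 + r) / (1 - r))     # step size α_t
            lam[j] += alpha                             # λ_{t+1} = λ_t + α_t e_{j_t}
        return lam                                      # coefficients of the combined classifier

Given a ±1 matrix M, adaboost(M, T) returns the coefficient vector λ; the combined classifier is then sign( Σ_j λ_j h_j(x) ).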

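The coordinate-descent view can also be checked numerically: a single AdaBoost step in the best direction lowers the convex loss F(λ). The 4×3 matrix below is made up purely for illustration, as is the rest of this sketch.

    import numpy as np

    # One coordinate-descent step, checked numerically.
    # M is a made-up example with M[i, j] = y_i * h_j(x_i).
    M = np.array([[+1, -1, +1],
                  [+1, +1, -1],
                  [-1, +1, +1],
                  [+1, +1, +1]])

    def F(lam):
        return np.sum(np.exp(-(M @ lam)))               # exponential loss F(λ)

    lam = np.zeros(3)
    d = np.exp(-(M @ lam)); d = d / d.sum()             # current distribution d_t
    edges = d @ M                                       # edges (d_t^T M)_j
    j = int(np.argmax(edges)); r = edges[j]             # best direction j_t and its edge r_t
    step = np.zeros(3); step[j] = 0.5 * np.log((1 + r) / (1 - r))
    print(F(lam), F(lam + step))                        # loss drops from 4.0 to about 3.46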
