MIT 6.867 - Machine Learning

Machine learning: lecture 9
Tommi S. Jaakkola, MIT CSAIL ([email protected])

• Feature selection
  – motivation, examples
  – information value, greedy selection, regularization
• Combination methods
  – forward/backward fitting
  – boosting

Feature selection

• Suppose we consider only a finite collection of possible basis functions, φ_i(x), i = 1, ..., m, such as the input components φ_i(x) = x_i.
• We try to find a small subset S of basis functions that are sufficient to solve the (regression or) classification problem:

      P(y = 1 | x, w) = g( w_0 + Σ_{i=1}^k w_i φ_{s_i}(x) )

  where the indexes S = {s_1, ..., s_k} identify the selected basis functions.
• There are many ways to find appropriate basis functions:
  – information value
  – greedy selection
  – regularization

Information value

• Let's first try to select the basis functions independently of the classifier, i.e., gauge how "informative" they are in general about the class label.
• Text classification example: x is a document and the basis functions φ_1(x), ..., φ_m(x) are "word indicators":

      φ_i(x) = 1 if document x contains word i, and 0 otherwise

  – each document is represented by a binary vector φ(x) = [0 1 0 ... 0 1]^T of m bits
  – we will derive a score for each feature (bit) based on how much information it contains about the class label

Information value cont'd

• Let's focus on a single feature, e.g., the first one:

      φ_1:  0   1   0  ...  1
      y:   −1  −1   1  ...  1

  To assess how the feature values relate to the labels, we can calculate the frequency of occurrence of different combinations of values: P̂(y), P̂(φ_1), P̂(φ_1, y). For example,

      P̂(φ_1 = 0, y = 1) = (# of docs such that φ_1(x) = 0 and y = 1) / n

• The mutual information score for each feature is given by

      I(φ_1; y) = Σ_{φ_1 ∈ {0,1}} Σ_{y ∈ {−1,1}} P̂(φ_1, y) log_2 [ P̂(φ_1, y) / ( P̂(y) P̂(φ_1) ) ]

  This score is zero if the label is independent of the feature value, and large (but ≤ 1 bit) if they are deterministically related.

Selection by information value

• We rank the features in descending order of their mutual information scores I(φ_i; y).
  – how many features to include?
  – redundant features?
  – coordination among the features?
  – which classifier can make use of these features?
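The mutual-information ranking can be computed directly from the empirical counts above. The following is a minimal NumPy sketch, not lecture code; the function name, the binary document-by-word matrix Phi, and the labels y in {−1, +1} are made-up toy inputs for illustration.

```python
import numpy as np

def mutual_information_scores(Phi, y):
    """Score each binary feature by its empirical mutual information
    with the label, I(phi_i; y), using base-2 logs (bits)."""
    n, m = Phi.shape
    scores = np.zeros(m)
    for i in range(m):
        I = 0.0
        for f in (0, 1):
            for label in (-1, 1):
                p_joint = np.mean((Phi[:, i] == f) & (y == label))
                p_f = np.mean(Phi[:, i] == f)
                p_y = np.mean(y == label)
                if p_joint > 0:          # treat 0 * log 0 as 0
                    I += p_joint * np.log2(p_joint / (p_f * p_y))
        scores[i] = I
    return scores

# Toy example: 6 documents, 3 word-indicator features.
Phi = np.array([[1, 0, 1],
                [1, 0, 0],
                [1, 1, 1],
                [0, 0, 1],
                [0, 1, 0],
                [0, 1, 1]])
y = np.array([1, 1, 1, -1, -1, -1])

scores = mutual_information_scores(Phi, y)
print(scores, np.argsort(-scores))       # scores and descending ranking
```

For this toy data the first feature agrees with the label on every document, so its score reaches the maximum of 1 bit, while the other two score lower.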
Greedy selection

1. Find s_1 and w = [w_0, w_1]^T such that

       P(y = 1 | x, w) = g( w_0 + w_1 φ_{s_1}(x) )

   leads to the best classifier.
2. Find s_2 and w = [w_0, w_1, w_2]^T such that

       P(y = 1 | x, w) = g( w_0 + w_1 φ_{s_1}(x) + w_2 φ_{s_2}(x) )

   gives the best performing classifier.
3. Etc.
   – stopping criterion?
   – over-fitting?
   (a code sketch of this greedy procedure appears after the forward-fitting example below)

Regularization

• We can also consider all of the basis functions at once,

      P(y = 1 | x, w) = g( w_0 + w_1 φ_1(x) + ... + w_m φ_m(x) ),

  and introduce a regularization penalty that tries to set the weights to zero unless the corresponding basis functions are useful:

      J(w; λ) = Σ_{i=1}^n − log P(y_i | x_i, w) + λ Σ_{i=1}^m |w_i|

  In other words, we regularize the 1-norm (not the Euclidean norm) of the weights; w_0 is not penalized. (An example using an off-the-shelf L1 solver is given at the end of these notes.)
• The effect of the regularization penalty depends on its derivative at w ≈ 0.
  [Plot: w²/2 versus |w| on w ∈ [−2, 2].] Unlike w²/2, whose derivative vanishes at 0, |w| keeps slope ±1 near 0, which is what allows this penalty to drive weights exactly to zero.

Combination of methods

• Similarly to feature selection, we can select simple "weak" classification or regression methods and combine them into a single "strong" method.
• Example techniques
  – forward fitting (regression)
  – boosting (classification)

Combination of regression methods

• We want to combine multiple "weak" regression methods into a single "strong" method:

      f(x) = f(x; θ_1) + ... + f(x; θ_m)

• Suppose we are given a family of simple regression methods

      f(x; θ) = w φ_k(x)

  where θ = {k, w} specifies the identity of the basis function as well as the associated weight.
• Forward fitting: sequentially introduce new simple regression methods to reduce the remaining prediction error.

Forward fitting cont'd

Simple family: f(x; θ) = w φ_k(x), θ = {k, w}

• We can fit each new component to reduce the prediction error; in each iteration we solve the same type of estimation problem.

      Step 1:  θ̂_1 ← argmin_θ Σ_{i=1}^n ( y_i − f(x_i; θ) )²

      Step 2:  θ̂_2 ← argmin_θ Σ_{i=1}^n ( y_i − f(x_i; θ̂_1) − f(x_i; θ) )²

  where y_i − f(x_i; θ̂_1) is the error remaining after the first component.
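To make the Step 1 / Step 2 recursion concrete, here is a minimal NumPy sketch of forward fitting over the simple family f(x; θ) = w φ_k(x). It is my own illustration rather than lecture code; Phi is an assumed n × m matrix whose columns hold the basis-function values φ_k(x_i), and the toy data are made up.

```python
import numpy as np

def forward_fit(Phi, y, num_rounds):
    """Greedily add components f(x; theta) = w * phi_k(x), each one fit to
    the residual left by the previously added components."""
    n, m = Phi.shape
    residual = y.astype(float)
    components = []                               # list of theta-hats (k, w)
    for _ in range(num_rounds):
        best = None
        for k in range(m):
            phi = Phi[:, k]
            denom = phi @ phi
            if denom == 0:
                continue
            w = (phi @ residual) / denom          # least-squares weight for this phi_k
            err = np.sum((residual - w * phi) ** 2)
            if best is None or err < best[0]:
                best = (err, k, w)
        _, k, w = best
        components.append((k, w))
        residual = residual - w * Phi[:, k]       # remaining prediction error
    return components

def predict(Phi, components):
    """Evaluate the combined regression f(x) = sum_j w_j * phi_{k_j}(x)."""
    return sum(w * Phi[:, k] for k, w in components)

# Toy data: y depends on the first two basis functions only.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(50, 5))
y = 2.0 * Phi[:, 0] - 1.0 * Phi[:, 1] + 0.1 * rng.normal(size=50)

components = forward_fit(Phi, y, num_rounds=3)
print(components)
print("residual SSE:", np.sum((y - predict(Phi, components)) ** 2))
```

Each round solves exactly the same one-dimensional least-squares problem as the previous one, just against a new residual, which is the point the slides make.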


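Returning to the greedy selection procedure above: a minimal sketch, assuming scikit-learn's LogisticRegression as the classifier fit at each step and plain training accuracy as the (optimistic) notion of "best classifier". The lecture leaves the classifier, the scoring criterion, and the stopping rule open; cross-validated accuracy would be the natural guard against the over-fitting question raised on the slide.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def greedy_select(Phi, y, k):
    """Add one basis function at a time, keeping at each step the feature
    whose addition yields the best-scoring logistic-regression classifier."""
    n, m = Phi.shape
    selected = []                                  # indexes s_1, s_2, ...
    for _ in range(k):
        best = None
        for j in range(m):
            if j in selected:
                continue
            cols = selected + [j]
            clf = LogisticRegression().fit(Phi[:, cols], y)
            acc = clf.score(Phi[:, cols], y)       # training accuracy (CV would be safer)
            if best is None or acc > best[0]:
                best = (acc, j)
        selected.append(best[1])
    return selected

# Toy example: the label is driven by features 2 and 5 of a random binary matrix.
rng = np.random.default_rng(1)
Phi = rng.integers(0, 2, size=(100, 8)).astype(float)
y = np.where(Phi[:, 2] + Phi[:, 5] >= 1, 1, -1)

print(greedy_select(Phi, y, k=3))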

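Finally, the L1-regularized objective J(w; λ) can be minimized with standard solvers. A minimal sketch, assuming scikit-learn's LogisticRegression with penalty="l1" (a tooling choice of mine, not part of the lecture); scikit-learn's C acts roughly as 1/λ, and with the "saga" solver the intercept w_0 is left unpenalized, matching the slide.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: only the first two of ten basis functions carry signal.
rng = np.random.default_rng(2)
Phi = rng.normal(size=(200, 10))
y = np.where(Phi[:, 0] - Phi[:, 1] + 0.3 * rng.normal(size=200) > 0, 1, -1)

for C in (10.0, 1.0, 0.05):                    # C is roughly 1 / lambda
    clf = LogisticRegression(penalty="l1", solver="saga", C=C, max_iter=10000)
    clf.fit(Phi, y)
    nonzero = np.flatnonzero(clf.coef_[0])     # indexes of surviving weights
    print(f"C={C}: non-zero weights at features {nonzero.tolist()}")
```

As C shrinks (λ grows), more coefficients are driven exactly to zero, which is the feature-selection effect the 1-norm penalty is designed to produce.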
