CHAPTER 6  INDIRECT LEARNING

What is indirect learning?
– It does not estimate f(x) directly, usually because f(x) is not explicitly specified.
– It estimates the decision rule by minimizing an empirical risk.
– It includes the support vector machine (SVM) and regularized minimization.

Support vector machine
– Assume $Y \in \{-1, 1\}$.
– The goal is to find a hyperplane $\beta_0 + X^T\beta$ that separates the $Y$'s maximally.
– That is, we wish
  $$Y_i(\beta_0 + X_i^T\beta) > 0$$
  for all $i = 1, \dots, n$.

Perfect separation
– Consider an ideal situation where the $Y$'s can be perfectly separated.
– Maximal separation means making the minimum distance from the points to the separating hyperplane as large as possible.
– It is equivalent to
  $$\max_{\|\beta\|=1} C, \quad \text{subject to } Y_i(\beta_0 + X_i^T\beta) \ge C, \; i = 1, \dots, n.$$
– The dual problem is
  $$\max_\alpha \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j=1}^n \alpha_i\alpha_j Y_iY_j X_i^TX_j, \quad \alpha_i \ge 0.$$

Imperfect separation
– In real data, there is usually no hyperplane separating perfectly (if there is, it is by chance).
– We should allow some violations by introducing slack variables $\xi_i \ge 0$:
  $$\max_{\|\beta\|=1} C, \quad \text{subject to } Y_i(\beta_0 + X_i^T\beta) \ge C(1 - \xi_i), \; i = 1, \dots, n.$$
– $\sum_i \xi_i$ describes the total degree of violation, which should be controlled (like the type I error in a hypothesis test):
  $$\sum_i \xi_i \le \text{a given constant}.$$

Imperfect separation (continued)
– The dual problem is
  $$\max_\alpha \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j=1}^n \alpha_i\alpha_j Y_iY_j X_i^TX_j, \quad 0 \le \alpha_i \le \gamma, \quad \sum_{i=1}^n \alpha_i Y_i = 0.$$
– It is a convex optimization problem.
– It turns out that $\hat\beta = \sum_{\hat\alpha_i > 0} \hat\alpha_i Y_i X_i$, so $\hat\beta$ is determined by the points within or on the boundary of a band around the hyperplane.
– These points are called support vectors.

SVM allowing nonlinear boundary
– A linear boundary may not be practical.
– To allow a nonlinear boundary, assume
  $$f(x) = (h_1(x), \dots, h_m(x))\beta + \beta_0.$$
– The dual problem becomes
  $$\max_\alpha \sum_{i=1}^n \alpha_i - \frac{1}{2}\sum_{i,j=1}^n \alpha_i\alpha_j Y_iY_j K(X_i, X_j), \quad 0 \le \alpha_i \le \gamma, \quad \sum_{i=1}^n \alpha_i Y_i = 0.$$
– Here, $K(x, x') = (h_1(x), \dots, h_m(x))(h_1(x'), \dots, h_m(x'))^T$.
– Moreover,
  $$\hat f(x) = \sum_{i=1}^n \hat\alpha_i Y_i K(x, X_i) + \hat\beta_0.$$
– Thus, we only need to specify the kernel function $K(x, y)$ instead of $(h_1, \dots, h_m)$.

Equivalent form of SVM
– SVM learning is equivalent to minimizing
  $$\sum_{i=1}^n \{1 - Y_i f(X_i)\}_+ + \lambda\|\beta\|^2/2.$$
– Thus, it is a regularized empirical risk minimization (a numerical sketch of this formulation follows the regularized-estimation slide below).
– This formulation is useful for justifying SVM's theoretical properties.
– It gives the Bayes rule.
– Other loss functions are possible.

Example
– Include Example6.2-ex.pdf

Regularized estimation
– It is typically formulated as
  $$\min_{f \in H}\left[\sum_{i=1}^n L(Y_i, f(X_i)) + \lambda J(f)\right].$$
– $J(f)$ penalizes bad (for example, overly rough) $f$ in $H$.
– For example, $J(f) = \int (f''(x))^2\,dx$ gives the cubic spline approximation.
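To make the hinge-loss formulation concrete, here is a minimal numerical sketch (not from the course notes) that fits a linear SVM by subgradient descent on $\sum_i\{1 - Y_i(\beta_0 + X_i^T\beta)\}_+ + \lambda\|\beta\|^2/2$. The simulated data, step size, $\lambda$, and iteration count are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated two-class data: Y in {-1, +1}, X shifted by class (illustrative only).
n, p = 200, 2
Y = rng.choice([-1, 1], size=n)
X = rng.normal(size=(n, p)) + 1.5 * Y[:, None]

def fit_linear_svm(X, Y, lam=1.0, lr=0.01, n_iter=5000):
    """Subgradient descent on sum_i {1 - Y_i(b0 + X_i'beta)}_+ + lam * ||beta||^2 / 2."""
    n, p = X.shape
    beta, b0 = np.zeros(p), 0.0
    for t in range(n_iter):
        margin = Y * (X @ beta + b0)
        viol = margin < 1                                  # points with positive hinge loss
        g_beta = -(Y[viol, None] * X[viol]).sum(axis=0) + lam * beta
        g_b0 = -Y[viol].sum()
        step = lr / np.sqrt(t + 1)                         # decreasing step size
        beta -= step * g_beta
        b0 -= step * g_b0
    return beta, b0

beta, b0 = fit_linear_svm(X, Y)
pred = np.sign(X @ beta + b0)
print("training error:", np.mean(pred != Y))
# Points with Y_i(b0 + X_i'beta) <= 1 play the role of the support vectors.
print("margin (support) points:", int(np.sum(Y * (X @ beta + b0) <= 1)))
```

The same loop with a different loss in place of the hinge gives other regularized empirical-risk classifiers, which is the point of the "other loss functions are possible" remark above.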
General operators
– A general penalty is given by Girosi et al. (1995):
  $$J(f) = \int \frac{|\tilde f(s)|^2}{\tilde G(s)}\,ds,$$
  where $\tilde f(s)$ is the Fourier transform of $f$ and $\tilde G(s)$ is some positive function that falls off to zero as $\|s\| \to \infty$.
– In other words, we penalize the high-frequency components of $f$.
– The solutions have the form
  $$\sum_{k=1}^K \alpha_k \phi_k(x) + \sum_{i=1}^n \theta_i G(x - X_i),$$
  where $\{\phi_k\}$ spans the null space of the $J$-operator and $G$ is the inverse Fourier transform of $\tilde G$.

Reproducing kernel Hilbert space (RKHS)
– $H_K$ is a Hilbert space in which all point evaluations are bounded functionals.
– Define $\langle \eta_t, f\rangle = f(t)$ for any $f$ by the Riesz representation theorem.
– Let $K(t, x) = \eta_t(x)$; then
  $$\langle K(t, \cdot), K(s, \cdot)\rangle = K(s, t).$$
– $K(x, y)$ satisfies the positive definiteness condition
  $$K(x, y) = \sum_{i=1}^\infty \gamma_i \phi_i(x)\phi_i(y),$$
  where $\phi_1, \phi_2, \dots$ are the orthogonal eigenfunctions in $H_K$ with $\langle \phi_j, \phi_j\rangle = 1/\gamma_j$.
– For any $f(x) \in H_K$, $f(x) = \sum_j c_j\phi_j(x)$.
– Regularization with $J(f) = \|f\|_{H_K}^2$ becomes
  $$\sum_{i=1}^n L\Big(Y_i, \sum_j c_j\phi_j(X_i)\Big) + \lambda\sum_j c_j^2/\gamma_j.$$
– The solution is
  $$\hat f(x) = \sum_{i=1}^n \hat\alpha_i K(x, X_i),$$
  where $\hat\alpha$ minimizes
  $$\sum_{i=1}^n L\Big(Y_i, \sum_{j=1}^n \alpha_j K(X_j, X_i)\Big) + \lambda\sum_{i,j=1}^n \alpha_i\alpha_j K(X_i, X_j).$$

CHAPTER 7  AGGREGATED SUPERVISED LEARNING

The goal of aggregation
– Try to take advantage of different classifiers.
– Boost weak learning methods.
– The methods include model averaging, stacking, and boosting.

Stacking/Model average
– Simply aggregate (average) multiple classifiers, as in random forests.
– Alternatively, choose the prediction using the optimal linear combination that minimizes
  $$\sum_{i=1}^n \Big(Y_i - \sum_{k=1}^m \omega_k \hat f_k^{(-i)}(X_i)\Big)^2.$$

Boosting
– It is an iterative procedure that combines the outputs of weak learning methods to produce a powerful committee.
– The final output from the boosting method is
  $$\mathrm{sign}\Big(\sum_{k=1}^m \alpha_k \hat f_k(x)\Big),$$
  where $\hat f_1, \dots, \hat f_m$ are the estimators from $m$ learning methods and $\alpha_1, \dots, \alpha_m$ are their corresponding weights.
– AdaBoost algorithm:
  1. Assign each subject $i$ equal weight $w_i = 1/n$.
  2. For learning method $k = 1$ to $m$,
     (a) apply learning method $k$ to the data using weights $(w_1, \dots, w_n)$ to obtain $\hat f_k$;
     (b) compute the error rate as
         $$\mathrm{err}_k = \frac{\sum_{i=1}^n w_i I(Y_i \ne \hat f_k(X_i))}{\sum_{i=1}^n w_i},$$
         then $\alpha_k = \log[(1 - \mathrm{err}_k)/\mathrm{err}_k]$;
     (c) recalculate each individual weight as proportional to $w_i\exp\{\alpha_k I(Y_i \ne \hat f_k(X_i))\}$ and send the weights to the next classifier.
  3. Finally, output $\mathrm{sign}\big(\sum_{k=1}^m \alpha_k \hat f_k(x)\big)$.
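A minimal sketch of the AdaBoost steps above, in numpy. Decision stumps are used as the weak learner purely for illustration (the slides leave the weak learner generic), and the simulated data and number of rounds are made-up choices.

```python
import numpy as np

def fit_stump(X, Y, w):
    """Weighted decision stump: threshold one coordinate, return a {-1,+1} predictor."""
    n, p = X.shape
    best = (np.inf, None)
    for j in range(p):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] <= thr, 1, -1)
                err = np.sum(w * (pred != Y)) / np.sum(w)
                if err < best[0]:
                    best = (err, (j, thr, sign))
    j, thr, sign = best[1]
    return lambda Z: sign * np.where(Z[:, j] <= thr, 1, -1)

def adaboost(X, Y, m=20):
    n = X.shape[0]
    w = np.full(n, 1.0 / n)                     # step 1: equal weights
    learners, alphas = [], []
    for _ in range(m):                          # step 2: loop over weak learners
        f = fit_stump(X, Y, w)                  # (a) fit with current weights
        miss = f(X) != Y
        err = np.sum(w * miss) / np.sum(w)      # (b) weighted error rate
        err = np.clip(err, 1e-10, 1 - 1e-10)    # guard against err = 0 or 1
        alpha = np.log((1 - err) / err)
        w = w * np.exp(alpha * miss)            # (c) reweight misclassified points
        w = w / np.sum(w)
        learners.append(f); alphas.append(alpha)
    # step 3: weighted vote
    return lambda Z: np.sign(sum(a * f(Z) for a, f in zip(alphas, learners)))

# Illustrative data: two noisy classes.
rng = np.random.default_rng(1)
n = 300
Y = rng.choice([-1, 1], size=n)
X = rng.normal(size=(n, 2)) + Y[:, None]
F = adaboost(X, Y)
print("training error:", np.mean(F(X) != Y))
```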
– The algorithm is equivalent to minimizing the exponential loss $\exp\{-Yf(X)\}$ using a forward stagewise additive model, i.e., at the $k$th stage, minimize
  $$\sum_{i=1}^n \exp\{-Y_i(\hat f_{k-1}(X_i) + \beta g(X_i))\}.$$
– This recursive nature can be incorporated into classification trees to yield "boosting trees".

CHAPTER 8  MODEL SELECTION IN SUPERVISED LEARNING

Model selection
– All learning methods assume $f$ comes from some model.
– The choice of model is important: underfitting versus overfitting.
– It is often reflected in tuning parameters of the learning methods: $k$ in $k$-NN, the bandwidth, the number of basis functions, the tree size, penalty parameters.
– Model selection aims to balance fitting the data and model complexity.

AIC and BIC
– They apply when the loss function is the log-likelihood function and the models are parametric.
– AIC: $-2\,\text{log-lik} + 2 \times (\#\text{ parameters})$;
  BIC: $-2\,\text{log-lik} + (\log n) \times (\#\text{ parameters})$.
– Which should be used, AIC or BIC?

Model complexity
– Not all models have a finite number of parameters.
– A more general measure of model complexity is the VC-dimension.
– Stochastic errors between the empirical risk and the limiting risk can be controlled in terms of the VC-dimension.
– Thus, among a series of models $\Omega_1, \Omega_2, \dots$, we choose the one minimizing
  $$\gamma_n(\hat f_\Omega) + b_n(\Omega).$$
– $\gamma_n(\hat f_\Omega)$ reflects the best approximation using model $\Omega$ (bias).
– $b_n(\Omega)$ is an upper bound controlling the stochastic errors (variability).
– Limitation: the VC-dimension is often not easy to calculate.

Cross-validation
– It is the most commonly used method.
– It is computationally feasible, although intensive.
– The idea is to use one part of the data as training data and the other part as validation data to estimate the prediction error.
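As an illustration of the cross-validation idea, the sketch below uses 5-fold cross-validation to choose $k$ for a simple $k$-NN classifier ($k$-NN being one of the tuning-parameter examples above). The data, the number of folds, and the candidate values of $k$ are all made up for illustration.

```python
import numpy as np

def knn_predict(X_train, Y_train, X_test, k):
    """Plain k-NN classifier with Euclidean distance and majority vote."""
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nn = np.argsort(d, axis=1)[:, :k]
    return np.sign(Y_train[nn].sum(axis=1) + 1e-12)   # ties broken toward +1

def cv_error(X, Y, k, n_folds=5, seed=0):
    """Average misclassification rate over n_folds train/validation splits."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(Y))
    folds = np.array_split(idx, n_folds)
    errs = []
    for f in folds:
        train = np.setdiff1d(idx, f)                  # train on the rest, validate on fold f
        pred = knn_predict(X[train], Y[train], X[f], k)
        errs.append(np.mean(pred != Y[f]))
    return np.mean(errs)

# Illustrative data: two overlapping Gaussian classes.
rng = np.random.default_rng(2)
n = 200
Y = rng.choice([-1, 1], size=n)
X = rng.normal(size=(n, 2)) + 0.8 * Y[:, None]

for k in (1, 3, 5, 9, 15):
    print(f"k = {k:2d}: CV error = {cv_error(X, Y, k):.3f}")
```

The $k$ with the smallest cross-validated error would be selected, balancing fit against model complexity as described above.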

