Bounds for Linear Multi-Task Learning

Andreas Maurer
Adalbertstr. 55
D-80799 München
[email protected]

Abstract. We give dimension-free and data-dependent bounds for linear multi-task learning, where a common linear operator is chosen to preprocess data for a vector of task-specific linear-thresholding classifiers. The complexity penalty of multi-task learning is bounded by a simple expression involving the margins of the task-specific classifiers, the Hilbert-Schmidt norm of the selected preprocessor and the Hilbert-Schmidt norm of the covariance operator for the total mixture of all task distributions, or, alternatively, the Frobenius norm of the total Gramian matrix for the data-dependent version. The results can be compared to state-of-the-art results on linear single-task learning.

1 Introduction

Simultaneous learning of different tasks under some common constraint, often called multi-task learning, has been tested in practice with good results under a variety of different circumstances (see [5], [8], [17], [18]). The technique has been analyzed theoretically and in some generality (see Baxter [6] and Zhang [18]). The purpose of this paper is to improve some of these theoretical results in a special case of practical importance, when input data is represented in a linear, potentially infinite-dimensional space, and the common constraint is a linear preprocessor.

A simple way to understand multi-task learning and its potential advantages is perhaps agnostic learning with an input space $\mathcal{X}$ and a finite set $\mathcal{F}$ of hypotheses $f : \mathcal{X} \to \{0,1\}$. For a hypothesis $f \in \mathcal{F}$ let $\mathrm{er}(f)$ be the expected error and $\widehat{\mathrm{er}}(f)$ the empirical error on a training sample $S$ of size $n$ (drawn iid from the underlying task distribution), respectively. Combining Hoeffding's inequality with a union bound one shows (see e.g. [1]) that with probability greater than $1-\delta$ we have for every $f \in \mathcal{F}$ the error bound
\[
\mathrm{er}(f) \leq \widehat{\mathrm{er}}(f) + \frac{1}{\sqrt{2n}}\sqrt{\ln|\mathcal{F}| + \ln(1/\delta)}. \tag{1}
\]

Suppose now that there are a set $\mathcal{Y}$, a rather large set $\mathcal{G}$ of preprocessors $g : \mathcal{X} \to \mathcal{Y}$, and another set $\mathcal{H}$ of classifiers $h : \mathcal{Y} \to \{0,1\}$ with $|\mathcal{H}| \ll |\mathcal{F}|$. For a cleverly chosen preprocessor $g \in \mathcal{G}$ it will likely be the case that we find some $h \in \mathcal{H}$ such that $h \circ g$ has the same or even a smaller empirical error than the best $f \in \mathcal{F}$. But this will lead to an improvement of the bound above (replacing $|\mathcal{F}|$ by $|\mathcal{H}|$) only if we choose $g$ before seeing the data; otherwise we incur a large estimation penalty for the selection of $g$ (replacing $|\mathcal{F}|$ by $|\mathcal{H} \times \mathcal{G}|$).

The situation is improved if we have a set of $m$ different learning tasks with corresponding task distributions and samples $S_1, \dots, S_m$, each of size $n$ and drawn iid from the corresponding distributions. We now consider solutions $h_1 \circ g, \dots, h_m \circ g$ for the $m$ tasks, where the preprocessing map $g \in \mathcal{G}$ is constrained to be the same for all tasks and only the $h_l \in \mathcal{H}$ specialize to each task $l$ at hand. Again Hoeffding's inequality and a union bound imply that with probability greater than $1-\delta$ we have for all $(h_1, \dots, h_m) \in \mathcal{H}^m$ and every $g \in \mathcal{G}$
\[
\frac{1}{m}\sum_{l=1}^m \mathrm{er}_l(h_l \circ g) \leq \frac{1}{m}\sum_{l=1}^m \widehat{\mathrm{er}}_l(h_l \circ g) + \frac{1}{\sqrt{2n}}\sqrt{\ln|\mathcal{H}| + \frac{\ln|\mathcal{G}| + \ln(1/\delta)}{m}}. \tag{2}
\]
Here $\mathrm{er}_l(f)$ and $\widehat{\mathrm{er}}_l(f)$ denote the expected error in task $l$ and the empirical error on training sample $S_l$, respectively. The left-hand side above is an average of the expected errors, so the guarantee implied by the bound is a little weaker than the usual PAC guarantees (but see Ben-David [7] for bounds on the individual errors). The first term on the right is the average empirical error, which a multi-task learning algorithm seeks to minimize. We can take it as an operational definition of task-relatedness relative to $(\mathcal{H}, \mathcal{G})$ that we are able to obtain a very small value for this term. The remaining term, which bounds the estimation error, now exhibits the advantage of multi-task learning: sharing the preprocessor implies sharing its cost of estimation, and the entropy contribution arising from the selection of $g \in \mathcal{G}$ decreases with the number of learning tasks. Since by assumption $|\mathcal{H}| \ll |\mathcal{F}|$, the estimation error in the multi-task bound (2) can become much smaller than in the single-task case (1) if the number $m$ of tasks becomes large.
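To make this comparison concrete, the following minimal sketch evaluates the estimation-error terms of bounds (1) and (2). The cardinalities, sample size, and confidence parameter are hypothetical, chosen so that $\mathcal{F}$ is comparable to $\mathcal{H} \times \mathcal{G}$ as in the discussion above.

```python
import math

def single_task_term(card_F, n, delta):
    # Estimation term of bound (1): sqrt(ln|F| + ln(1/delta)) / sqrt(2n).
    return math.sqrt(math.log(card_F) + math.log(1 / delta)) / math.sqrt(2 * n)

def multi_task_term(card_H, card_G, n, m, delta):
    # Estimation term of bound (2): the ln|G| cost is shared by m tasks.
    return math.sqrt(
        math.log(card_H) + (math.log(card_G) + math.log(1 / delta)) / m
    ) / math.sqrt(2 * n)

# Hypothetical cardinalities, with F comparable to H x G.
card_H, card_G = 10**2, 10**6
card_F = card_H * card_G
n, delta = 1000, 0.05

print(f"single-task term       : {single_task_term(card_F, n, delta):.4f}")
for m in (1, 10, 100):
    print(f"multi-task term, m={m:>3}: {multi_task_term(card_H, card_G, n, m, delta):.4f}")
```

As $m$ grows, the shared $\ln|\mathcal{G}|$ cost is amortized across tasks and the multi-task term approaches $\sqrt{\ln|\mathcal{H}|}/\sqrt{2n}$, which is the single-task term for the small class $\mathcal{H}$ alone.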
The choice of the preprocessor $g \in \mathcal{G}$ can also be viewed as the selection of the hypothesis space $\mathcal{H} \circ g$. This leads to an alternative formulation of multi-task learning, where the common object is a hypothesis space chosen from a class of hypothesis spaces (in this case $\{\mathcal{H} \circ g : g \in \mathcal{G}\}$), and the classifiers for the individual tasks are all chosen from the selected hypothesis space. Here we prefer the functional formulation of selecting a preprocessor instead of a hypothesis space, because it is more intuitive and sufficient in the situations which we consider.

The arguments leading to (2) can be refined and extended to certain infinite classes to give general bounds for multi-task learning ([6] and [18]). In this paper we concentrate on the case where the input space $\mathcal{X}$ is a subset of the unit ball in a Hilbert space $H$, the class $\mathcal{G}$ of preprocessors is a set $\mathcal{A}$ of bounded symmetric linear operators on $H$, and the class $\mathcal{H}$ is the set of classifiers $h_v$ obtained by $0$-thresholding linear functionals $v$ in $H$ with $\|v\| \leq B$, that is
\[
h_v(x) = \mathrm{sign}(\langle x, v\rangle) \quad \text{and} \quad (h_v \circ T)(x) = \mathrm{sign}(\langle Tx, v\rangle), \qquad x \in H,\ T \in \mathcal{A},\ \|v\| \leq B.
\]
The learner now searches for a multi-classifier $h_{\mathbf{v}} \circ T = (h_{v^1} \circ T, \dots, h_{v^m} \circ T)$, where the preprocessing operator $T \in \mathcal{A}$ is the same for all tasks and only the vectors $v^l$ specialize to each task $l$ at hand. The desired multi-classifier $h_{\mathbf{v}} \circ T$ should have a small value of the average error
\[
\mathrm{er}(h_{\mathbf{v}} \circ T) = \frac{1}{m}\sum_{l=1}^m \mathrm{er}_l(h_{v^l} \circ T) = \frac{1}{m}\sum_{l=1}^m \Pr\left\{\mathrm{sign}\langle T X_l, v^l\rangle \neq Y_l\right\},
\]
where $X_l$ and $Y_l$ are the random variables modelling input values and labels for the $l$-th task. To guide this search we look for bounds on $\mathrm{er}(h_{\mathbf{v}} \circ T)$ in terms of the total observed data for all tasks, valid uniformly for all $\mathbf{v} = (v^1, \dots, v^m)$ with $\|v^l\| \leq B$ and all $T \in \mathcal{A}$. We will prove the following:

Theorem 1. Let $\delta \in (0,1)$. With probability greater than $1-\delta$ it holds for all $\mathbf{v} = (v^1, \dots, v^m) \in H^m$ with $\|v^l\| \leq 1$, all bounded symmetric operators $T$ on $H$ with $\|T\|_{HS} \geq 1$, and all $\gamma \in (0,1)$ that
\[
\mathrm{er}(h_{\mathbf{v}} \circ T) \leq \widehat{\mathrm{er}}_\gamma(\mathbf{v} \circ T) + \frac{8\,\|T\|_{HS}}{\gamma\sqrt{n}}\sqrt{\|C\|_{HS} + \frac{1}{m}} + \sqrt{\frac{\ln(4/\delta)}{2nm}},
\]
where $C$ is the covariance operator of the total mixture of all task-input distributions. Here $\widehat{\mathrm{er}}_\gamma(\mathbf{v} \circ T)$ is a margin-based empirical error estimate, bounded by the relative number of examples $(X_i^l, Y_i^l)$ in the total training sample for all tasks $l$ where $Y_i^l\langle T X_i^l, v^l\rangle < \gamma$ (see section
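To clarify how the quantities in Theorem 1 would be computed from data, the following finite-dimensional sketch evaluates the right-hand side of the bound. It is an illustration under stated assumptions, not the paper's method: the ambient space is $\mathbb{R}^d$ rather than an infinite-dimensional $H$, Hilbert-Schmidt norms become Frobenius norms, and the unknown covariance operator $C$ is replaced by the empirical (uncentered) covariance of the pooled inputs, whereas the paper's data-dependent version instead uses the Frobenius norm of the total Gramian matrix. The function name `theorem1_bound` is hypothetical.

```python
import numpy as np

def theorem1_bound(X, Y, V, T, gamma, delta):
    """Evaluate the right-hand side of Theorem 1 (finite-dimensional sketch).

    X     : list of m arrays of shape (n, d), task inputs in the unit ball
    Y     : list of m arrays of shape (n,), labels in {-1, +1}
    V     : array of shape (m, d), task vectors with ||v^l|| <= 1
    T     : symmetric (d, d) preprocessing matrix with ||T||_HS >= 1
    gamma : margin parameter in (0, 1)
    delta : confidence parameter in (0, 1)
    """
    m, n = len(X), X[0].shape[0]
    # Margin-based empirical error: fraction of all nm examples whose
    # functional margin Y <T x, v^l> falls below gamma.
    margins = np.concatenate([Y[l] * (X[l] @ T @ V[l]) for l in range(m)])
    emp_err = float(np.mean(margins < gamma))
    # Hilbert-Schmidt norms reduce to Frobenius norms in finite dimension.
    T_hs = np.linalg.norm(T, "fro")
    # Empirical (uncentered) covariance of the total mixture of all inputs
    # -- an assumption standing in for the true covariance operator C.
    Z = np.vstack(X)
    C_hs = np.linalg.norm(Z.T @ Z / (n * m), "fro")
    complexity = 8 * T_hs / (gamma * np.sqrt(n)) * np.sqrt(C_hs + 1 / m)
    confidence = np.sqrt(np.log(4 / delta) / (2 * n * m))
    return emp_err + complexity + confidence
```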
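A minimal usage sketch with synthetic data, continuing the code above (all parameter choices hypothetical):

```python
rng = np.random.default_rng(0)
d, m, n = 20, 5, 500
# Gaussian inputs rescaled into the unit ball of R^d.
X = [rng.normal(size=(n, d)) for _ in range(m)]
X = [x / np.maximum(1.0, np.linalg.norm(x, axis=1, keepdims=True)) for x in X]
# Labels from a random unit hyperplane per task; the same vectors serve as v^l.
V = rng.normal(size=(m, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)
Y = [np.sign(X[l] @ V[l]) for l in range(m)]
T = np.eye(d)  # identity preprocessor; ||T||_HS = sqrt(d) >= 1
print(theorem1_bound(X, Y, V, T, gamma=0.1, delta=0.05))
```

With the identity preprocessor the complexity term scales with $\sqrt{d}$; a low-rank $T$ adapted to the tasks has a much smaller Hilbert-Schmidt norm, which is one way to read the advantage promised by the bound.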