Berkeley COMPSCI 294 - Bounds for Linear Multi-Task Learning

Bounds for Linear Multi-Task Learning

Andreas Maurer
Adalbertstr. 55
D-80799 München
[email protected]

Abstract. We give dimension-free and data-dependent bounds for linear multi-task learning, where a common linear operator is chosen to preprocess data for a vector of task-specific linear-thresholding classifiers. The complexity penalty of multi-task learning is bounded by a simple expression involving the margins of the task-specific classifiers, the Hilbert-Schmidt norm of the selected preprocessor and the Hilbert-Schmidt norm of the covariance operator for the total mixture of all task distributions, or, alternatively, the Frobenius norm of the total Gramian matrix for the data-dependent version. The results can be compared to state-of-the-art results on linear single-task learning.

1 Introduction

Simultaneous learning of different tasks under some common constraint, often called multi-task learning, has been tested in practice with good results under a variety of different circumstances (see [5], [8], [17], [18]). The technique has been analyzed theoretically and in some generality (see Baxter [6] and Zhang [18]). The purpose of this paper is to improve some of these theoretical results in a special case of practical importance, when input data is represented in a linear, potentially infinite dimensional space and the common constraint is a linear preprocessor.

A simple way to understand multi-task learning and its potential advantages is perhaps agnostic learning with an input space X and a finite set F of hypotheses f : X → {0, 1}. For a hypothesis f ∈ F let er(f) be the expected error and êr(f) the empirical error on a training sample S of size n (drawn iid from the underlying task distribution), respectively. Combining Hoeffding's inequality with a union bound one shows (see e.g. [1]) that with probability greater than 1 − δ we have for every f ∈ F the error bound

er(f) \le \widehat{er}(f) + \frac{1}{\sqrt{2n}} \sqrt{\ln|F| + \ln(1/\delta)}.   (1)

Suppose now that there are a set Y, a rather large set G of preprocessors g : X → Y, and another set H of classifiers h : Y → {0, 1} with |H| ≪ |F|. For a cleverly chosen preprocessor g ∈ G it will likely be the case that we find some h ∈ H such that h ∘ g has the same or even a smaller empirical error than the best f ∈ F. But this will lead to an improvement of the bound above (replacing |F| by |H|) only if we choose g before seeing the data; otherwise we incur a large estimation penalty for the selection of g (replacing |F| by |H × G|).

The situation is improved if we have a set of m different learning tasks with corresponding task distributions and samples S_1, ..., S_m, each of size n and drawn iid from the corresponding distributions. We now consider solutions h_1 ∘ g, ..., h_m ∘ g for each of the m tasks, where the preprocessing map g ∈ G is constrained to be the same for all tasks and only the h_l ∈ H specialize to each task l at hand. Again Hoeffding's inequality and a union bound imply that with probability greater than 1 − δ we have for all (h_1, ..., h_m) ∈ H^m and every g ∈ G

\frac{1}{m}\sum_{l=1}^{m} er_l(h_l \circ g) \le \frac{1}{m}\sum_{l=1}^{m} \widehat{er}_l(h_l \circ g) + \frac{1}{\sqrt{2n}} \sqrt{\ln|H| + \frac{\ln|G| + \ln(1/\delta)}{m}}.   (2)

Here er_l(f) and êr_l(f) denote the expected error in task l and the empirical error on training sample S_l, respectively. The left hand side above is an average of the expected errors, so the guarantee implied by the bound is a little weaker than the usual PAC guarantees (but see Ben-David [7] for bounds on the individual errors). The first term on the right is the average empirical error, which a multi-task learning algorithm seeks to minimize.
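To make the comparison between the bounds (1) and (2) concrete, here is a minimal numerical sketch. It is not from the paper; the cardinalities |H| and |G|, the sample size n and the confidence parameter δ are hypothetical values chosen only for illustration. It evaluates the estimation terms of both bounds, with |F| in (1) replaced by |H × G| to model choosing the preprocessor after seeing the data.

import math

def single_task_term(card_F, n, delta):
    # Estimation term of bound (1): sqrt(ln|F| + ln(1/delta)) / sqrt(2n).
    return math.sqrt(math.log(card_F) + math.log(1 / delta)) / math.sqrt(2 * n)

def multi_task_term(card_H, card_G, n, m, delta):
    # Estimation term of bound (2): the contributions of ln|G| and ln(1/delta)
    # are shared by the m tasks and therefore divided by m.
    return math.sqrt(math.log(card_H) + (math.log(card_G) + math.log(1 / delta)) / m) / math.sqrt(2 * n)

# Hypothetical cardinalities, sample size and confidence level.
card_H, card_G, n, delta = 10**3, 10**9, 1000, 0.01

# Selecting h and g per task after seeing the data: |F| is replaced by |H x G|.
print("single-task:", single_task_term(card_H * card_G, n, delta))

# Sharing the preprocessor g across m tasks.
for m in (1, 10, 100, 1000):
    print("multi-task, m =", m, ":", multi_task_term(card_H, card_G, n, m, delta))

As m grows, the shared contribution (ln|G| + ln(1/δ))/m fades and the multi-task estimation term approaches the cost of selecting from H alone.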
We can take it as an operational definition of task-relatedness relative to (H, G) that we are able to obtain a very small value for this empirical error term. The remaining term, which bounds the estimation error, now exhibits the advantage of multi-task learning: sharing the preprocessor implies sharing its cost of estimation, and the entropy contribution arising from the selection of g ∈ G decreases with the number of learning tasks. Since by assumption |H| ≪ |F|, the estimation error in the multi-task bound (2) can become much smaller than in the single-task case (1) if the number m of tasks becomes large.

The choice of the preprocessor g ∈ G can also be viewed as the selection of the hypothesis space H ∘ g. This leads to an alternative formulation of multi-task learning, where the common object is a hypothesis space chosen from a class of hypothesis spaces (in this case {H ∘ g : g ∈ G}), and the classifiers for the individual tasks are all chosen from the selected hypothesis space. Here we prefer the functional formulation of selecting a preprocessor instead of a hypothesis space, because it is more intuitive and sufficient in the situations which we consider.

The arguments leading to (2) can be refined and extended to certain infinite classes to give general bounds for multi-task learning ([6] and [18]). In this paper we concentrate on the case where the input space X is a subset of the unit ball in a Hilbert space H, the class G of preprocessors is a set A of bounded symmetric linear operators on H, and the class H is the set of classifiers h_v obtained by 0-thresholding linear functionals v in H with ‖v‖ ≤ B, that is

h_v(x) = \operatorname{sign}(\langle x, v\rangle) \quad\text{and}\quad (h_v \circ T)(x) = \operatorname{sign}(\langle Tx, v\rangle), \qquad x \in H,\ T \in G,\ \|v\| \le B.

The learner now searches for a multi-classifier h_v ∘ T = (h_{v_1} ∘ T, ..., h_{v_m} ∘ T), where the preprocessing operator T ∈ A is the same for all tasks and only the vectors v_l specialize to each task l at hand. The desired multi-classifier h_v ∘ T should have a small value of the average error

er(h_v \circ T) = \frac{1}{m}\sum_{l=1}^{m} er_l(h_{v_l} \circ T) = \frac{1}{m}\sum_{l=1}^{m} \Pr\{\operatorname{sign}\langle T X_l, v_l\rangle \ne Y_l\},

where X_l and Y_l are the random variables modelling input values and labels for the l-th task. To guide this search we look for bounds on er(h_v ∘ T) in terms of the total observed data for all tasks, valid uniformly for all v = (v_1, ..., v_m) with ‖v_l‖ ≤ B and all T ∈ A. We will prove the following:

Theorem 1. Let δ ∈ (0, 1). With probability greater than 1 − δ it holds for all v = (v_1, ..., v_m) ∈ H^m with ‖v_l‖ ≤ 1, all bounded symmetric operators T on H with ‖T‖_HS ≥ 1, and all γ ∈ (0, 1) that

er(h_v \circ T) \le \widehat{er}_\gamma(v \circ T) + \frac{8\,\|T\|_{HS}}{\gamma\sqrt{n}} \sqrt{\|C\|_{HS} + \frac{1}{m}} + \sqrt{\frac{\ln(4/\delta)}{2nm}}.

Here êr_γ(v ∘ T) is a margin-based empirical error estimate, bounded by the relative number of examples (X_l^i, Y_l^i) in the total training sample for all tasks l for which Y_l^i ⟨T X_l^i, v_l⟩ < γ (see Section …
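As a concrete illustration of the quantities appearing in Theorem 1, here is a minimal NumPy sketch. It is not code from the paper; the data layout, the random inputs, the identity preprocessor and the margin value are assumptions made only for illustration. It computes the fraction of margin violations, which upper-bounds the margin-based empirical error êr_γ(v ∘ T), together with finite-dimensional surrogates for ‖T‖_HS and ‖C‖_HS, estimating C by the empirical (uncentered) covariance of the inputs pooled over all tasks, i.e. the total mixture.

import numpy as np

def margin_violation_rate(X, Y, V, T, gamma):
    """Fraction of examples over all tasks with Y * <T x, v_l> < gamma.

    X: (m, n, d) inputs for m tasks with n examples each,
    Y: (m, n) labels in {-1, +1},
    V: (m, d) task-specific weight vectors v_l,
    T: (d, d) shared symmetric preprocessing operator,
    gamma: margin parameter in (0, 1).
    """
    # x^T T v_l equals <T x, v_l> because T is symmetric.
    margins = np.einsum('mnd,de,me->mn', X, T, V)
    return float(np.mean(Y * margins < gamma))

def complexity_terms(X, T):
    """Finite-dimensional surrogates for ||T||_HS and ||C||_HS in Theorem 1."""
    d = X.shape[-1]
    X_all = X.reshape(-1, d)              # pool all tasks: the total mixture
    C = X_all.T @ X_all / X_all.shape[0]  # empirical (uncentered) covariance operator
    return np.linalg.norm(T, 'fro'), np.linalg.norm(C, 'fro')

# Hypothetical data: m tasks with n examples each in a d-dimensional space,
# inputs scaled onto the unit sphere so that they lie in the unit ball.
rng = np.random.default_rng(0)
m, n, d = 5, 200, 20
X = rng.normal(size=(m, n, d))
X /= np.linalg.norm(X, axis=-1, keepdims=True)
Y = np.where(rng.random((m, n)) < 0.5, -1.0, 1.0)   # random +/-1 labels
T = np.eye(d)                                       # identity preprocessor as a baseline
V = rng.normal(size=(m, d))
V /= np.linalg.norm(V, axis=-1, keepdims=True)      # enforce ||v_l|| <= 1

print("margin violations:", margin_violation_rate(X, Y, V, T, gamma=0.1))
print("||T||_HS, ||C||_HS:", complexity_terms(X, T))

Plugging these quantities, together with n, m, γ and δ, into the right hand side of Theorem 1 gives the risk guarantee; a learner guided by the bound would choose T and the v_l to trade the empirical margin term against the complexity term involving ‖T‖_HS and ‖C‖_HS.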

