Berkeley COMPSCI 294 - Bounds for Linear Multi-Task Learning


Bounds for Linear Multi-Task Learning

Andreas Maurer
Adalbertstr. 55, D-80799 München
[email protected]

Abstract. We give dimension-free and data-dependent bounds for linear multi-task learning where a common linear operator is chosen to preprocess data for a vector of task-specific linear-thresholding classifiers. The complexity penalty of multi-task learning is bounded by a simple expression involving the margins of the task-specific classifiers, the Hilbert-Schmidt norm of the selected preprocessor and the Hilbert-Schmidt norm of the covariance operator for the total mixture of all task distributions, or, alternatively, the Frobenius norm of the total Gramian matrix for the data-dependent version. The results can be compared to state-of-the-art results on linear single-task learning.

1 Introduction

Simultaneous learning of different tasks under some common constraint, often called multi-task learning, has been tested in practice with good results under a variety of different circumstances (see [4], [7], [14], [15]). The technique has been analyzed theoretically and in some generality (see Baxter [5] and Zhang [15]). The purpose of this paper is to improve and clarify some of these theoretical results in the case when the input data is represented in a linear, potentially infinite-dimensional space, and the common constraint is a linear preprocessor.

The simplest conceptual model to understand multi-task learning and its potential advantages is perhaps agnostic learning with an input space $X$ and a finite set $F$ of hypotheses $f : X \to \{0,1\}$. For a hypothesis $f \in F$ let $\mathrm{er}(f)$ and $\hat{\mathrm{er}}(f)$ be the expected error and the empirical error on a training sample $S$ of size $n$ (drawn iid from the underlying task distribution), respectively. Combining Hoeffding's inequality with a union bound one shows (see e.g. [1]) that with probability greater than $1-\delta$ we have for every $f \in F$ the error bound
\[
\mathrm{er}(f) \le \hat{\mathrm{er}}(f) + \frac{1}{\sqrt{2n}}\sqrt{\ln|F| + \ln(1/\delta)}. \tag{1}
\]

Suppose now that $F$ factors in the following sense: there is a set $Y$, a set $G$ of functions $g : X \to Y$ and a set $H$ of functions $h : Y \to \{0,1\}$ such that $F = H \circ G$ and $|H| \ll |F|$. Imagine that we have a set of $m$ different learning tasks with corresponding task distributions and samples $S_1, \dots, S_m$, each of size $n$ and drawn iid from the corresponding distribution. We now seek a multiple solution $h_1 \circ g, \dots, h_m \circ g$ for the $m$ tasks, where the preprocessing map $g \in G$ is constrained to be the same for all tasks and only the $h_l \in H$ specialize to the task $l$ at hand. Again Hoeffding's inequality and a union bound imply that with probability greater than $1-\delta$ we have for every possible multiple solution $(h_1 \circ g, \dots, h_m \circ g)$
\[
\frac{1}{m}\sum_{l=1}^{m}\mathrm{er}_l(h_l \circ g) \le \frac{1}{m}\sum_{l=1}^{m}\hat{\mathrm{er}}_l(h_l \circ g) + \frac{1}{\sqrt{2n}}\sqrt{\ln|H| + \frac{\ln|G| + \ln(1/\delta)}{m}}. \tag{2}
\]
Here $\mathrm{er}_l(f)$ and $\hat{\mathrm{er}}_l(f)$ denote the expected error in task $l$ and the empirical error on the training sample $S_l$, respectively. The left-hand side above is an average of the expected errors, so the guarantee implied by the bound is a little weaker than the usual PAC results (but see Ben-David [6] for bounds on the individual errors). The first term on the right is the average empirical error, which a multi-task learning algorithm seeks to minimize. We can take it as an operational definition of task-relatedness relative to $(H, G)$ that we are able to obtain a very small value for this term. The remaining term, which bounds the estimation error, now exhibits the advantage of multi-task learning: the contribution coming from estimating the common preprocessor $g \in G$ decreases with the number of learning tasks. Since by assumption $|H| \ll |F|$, the estimation error in the multi-task bound (2) can become much smaller than in the single-task case (1) if the number $m$ of tasks becomes large. This behavior can be intuitively explained by the fact that much more data is being considered in the selection of the common preprocessor $g \in G$.
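As a purely illustrative aside (not part of the original argument), the following sketch evaluates the estimation-error terms of bounds (1) and (2) for some arbitrarily chosen cardinalities $|F| = |H|\,|G|$, sample size $n$ and confidence $\delta$, to show how the shared $\ln|G|$ contribution is amortized over the $m$ tasks:

```python
import math

def single_task_term(card_F, n, delta):
    """Estimation-error term of bound (1): (1/sqrt(2n)) * sqrt(ln|F| + ln(1/delta))."""
    return math.sqrt(math.log(card_F) + math.log(1 / delta)) / math.sqrt(2 * n)

def multi_task_term(card_H, card_G, n, m, delta):
    """Estimation-error term of bound (2):
    (1/sqrt(2n)) * sqrt(ln|H| + (ln|G| + ln(1/delta)) / m)."""
    inner = math.log(card_H) + (math.log(card_G) + math.log(1 / delta)) / m
    return math.sqrt(inner) / math.sqrt(2 * n)

# Illustrative (hypothetical) values: F factors as H o G with |H| << |F| = |H| * |G|.
card_H, card_G = 10, 10**6
card_F = card_H * card_G
n, delta = 1000, 0.05

print("single task:", single_task_term(card_F, n, delta))
for m in (1, 10, 100):
    # The contribution of estimating the shared preprocessor g decays like 1/m.
    print(f"multi-task, m={m}:", multi_task_term(card_H, card_G, n, m, delta))
```

With these hypothetical numbers the multi-task term approaches $\sqrt{\ln|H|/(2n)}$ as $m$ grows, while the single-task term remains at the larger value determined by $\ln|F| = \ln|H| + \ln|G|$.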
Since by assumption jHj  jFj, the estimationerror in the multi-task bound (2) can become much smaller than in the singletask case (1) if the number m of tasks becomes large. This behavior can be in-tuitively explained by the fact that much more data is being considered in theselection of the c ommon preprocessor g 2 G.The choice of the preprocessor g 2 G can also be viewed as the selection ofthe hypothesis space H  g. This leads to an alternative formulation of multi-task learning, where the common obje ct is a hypothesis space chosen from aclass of hypothesis spaces (in this case fH  g : g 2 Gg), and the classi…ers forthe individual tasks are all chosen from the selected hypothesis space. The twoformulations are equivalent: If F is a class of hypothesis spaces F of functionsf : X ! f0; 1g de…n e Y = F  X andG = fx 2 X 7!(F; x) 2 F  X : F 2Fgand H = f(F; x) 2 F  X 7! f (x) : f 2 Fg. Then ch oosing F from F and f fromF is equivalent to choosing g from G and h from H and using the hypothesisf = h  g. Here we prefer the formulation of selecting a preprocessor insteadof a hypothesis space, because it is more intuitive in the situations which weconsider.Using covering numbers the arguments leading to (2) can be extended tocertain in…nite classes to give general bounds for multi-task learning ([5] and[15]). In this paper we concentrate on the c ase where the input space X is asubset of the unit ball in a Hilbert space H, the class G of preprocessors is a setof linear operators on H, and the class H is the set of classi…ers hvobtained by0-thresholding linear functionals v in H with kvk  B. This contains the case3considered by Zhang ([15]) where G = Pd, the set of orthogonal projections inH with d-dimensional range. We will prove the following :Theorem 1. Let  2 (0; 1). With probability greater than 1   it holds for allv1; :::; vm2 H with kvlk  1 and all bounded symmetric operators T on H withkT kHS 1, and for all  2 (0; 1) that1mmXl=1er (hvl T ) 1mmXl=1eˆr(vl T ) +8 kT kHSpnrkCkHS+1m+sln8 2nm:Here eˆr(vl T ) is an empirical margin error, bounded by the relative num-ber of examplesXli; Yliin the training sample for task l, when YliT Xli; vl< .The quantity kT kHSis the Hilbert-Schmidt norm the operator T de…ned forsymmetric T bykT kHS=X2i1=2;where iis the sequence of eigenvalues of T (counting multiplicities). C is thetotal covariance operator corresponding to the mixture of all the task-input-distributions in H. The above Theorem is the simplest, but not the tightest ormost general form of our main results. For example the factor 8 on the righthand side can be decreased to be arbitrarily close to 2, thereby incurring only alogarithmic penalty in the last term.The bound in the above theorem is dimension free, it does not require thedata distribution in H to