CS281B/Stat241B (Spring 2008)  Statistical Learning Theory  Lecture: 1
Overview plus probabilistic formulations of prediction problems
Lecturer: Peter Bartlett  Scribe: Peter Bartlett

1  Organizational issues

The course web page is at http://www.cs.berkeley.edu/~bartlett/courses/281b-sp08/. See the web page for details of office hours, the syllabus, assignments, readings, lecture notes, and announcements.

1.1  Assignments

There will be roughly five homework assignments, approximately one every two weeks. The first has been posted on the web site. It is due at the lecture on Thursday, January 31. You will also need to act as scribe for a small number of lectures, preparing a LaTeX version of lecture notes. There is a template on the web site, along with the LaTeX file of the lecture notes for this lecture. (Please email the GSI, David, to choose the lecture that you'd like to prepare lecture notes for.) Also, there will be a final project, in an area related to the topics of the course.

2  Overview

The course will focus on the theoretical analysis of prediction methods.

1. Probabilistic formulation of prediction problems
2. Algorithms:
   (a) Kernel methods
   (b) Boosting algorithms
3. Risk bounds
4. Game theoretic formulation of prediction problems
5. Model selection

3  Probabilistic Formulations of Prediction Problems

In a prediction problem, we wish to predict an outcome y from some set Y of possible outcomes, on the basis of some observation x from a feature space X. Some examples:

  x                                                      y
  phylogenetic profile of a gene                         gene function
    (i.e., relationship to genomes of other species)
  gene expression levels of a tissue sample              patient disease state
  image of a signature on a check                        identity of the writer
  email message                                          spam or ham

For such problems, we might have access to a data set of n pairs, (x_1, y_1), ..., (x_n, y_n), and we would like to use the data to produce a function f : X → Y so that, for subsequent (x, y) pairs, f(x) is a good prediction of y.

To define the notion of a 'good prediction,' we can define a loss function ℓ : Y × Y → R, so that ℓ(ŷ, y) quantifies the cost of predicting ŷ when the true outcome is y. Then the aim is to ensure that ℓ(f(x), y) is small. For instance, in pattern classification problems, the aim is to classify an x into one of a finite number of classes (that is, the label space Y is finite). If all mistakes are equally bad, we could define

\[
\ell(\hat{y}, y) = \mathbf{1}[\hat{y} \neq y] =
\begin{cases}
1 & \text{if } \hat{y} \neq y,\\
0 & \text{otherwise.}
\end{cases}
\]

As another example, in a regression problem, with Y = R, we might choose the quadratic loss function, ℓ(ŷ, y) = (ŷ − y)^2.

We can formulate such problems using probabilistic assumptions: we assume that there is a probability distribution P on X × Y, and that the pairs (X_1, Y_1), ..., (X_n, Y_n), (X, Y) are chosen independently according to P. The aim is to choose f so that the risk of f,

\[
R(f) = E\,\ell(f(X), Y),
\]

is small. For instance, in the pattern classification example, this is the misclassification probability,

\[
R(f) = E\,\mathbf{1}[f(X) \neq Y] = \Pr(f(X) \neq Y).
\]

Some things to notice:

1. We are using capital letters to denote random variables.
2. The distribution P can be viewed as modelling both the relative frequency of different features or covariates X, together with the conditional distribution of the outcome Y given X.
3. The assumption that the data is i.i.d. is a strong one.
4. The function x ↦ f_n(x) = f_n(x; X_1, Y_1, ..., X_n, Y_n) is random, since it depends on the random (X_i, Y_i). Thus, the risk

\[
R(f_n) = E\left[\ell(f_n(X), Y) \mid X_1, Y_1, \ldots, X_n, Y_n\right]
       = E\left[\ell(f_n(X; X_1, Y_1, \ldots, X_n, Y_n), Y) \mid X_1, Y_1, \ldots, X_n, Y_n\right]
\]

   is a random variable. We might aim for E R(f_n) small, or R(f_n) small with high probability (over the training data).

We might choose f_n from some class F of functions, for instance, by choosing the structure and parameters of a decision tree, or by choosing the parameters of a neural net or a kernel machine.

There are several questions that we are interested in:

1. Can we design algorithms for which f_n is close to the best that we could hope for, given that it was chosen from F? (That is, is R(f_n) − inf_{f ∈ F} R(f) small?)
2. How does the performance of f_n depend on n? On other parameters of the problem?
3. Can we ensure that R(f_n) approaches the best possible performance (that is, the infimum over all f of R(f))?
4. What do we need to assume about P? About F?

In this course, we are concerned with results that apply to large classes of distributions P, such as the set of all joint distributions on X × Y. In contrast to parametric problems, we will not (often) assume that P comes from a small (e.g., finite-dimensional) space, P ∈ {P_θ : θ ∈ Θ}.

Several key issues arise in designing a prediction method for these problems:

Approximation: How good is the best f in the class F that we are using? That is, how close to inf_f R(f) is inf_{f ∈ F} R(f)?

Estimation: Since we only have access to the distribution P through observing a finite data set, how close is our performance to that of the best f in F?

Computation: We need to use the data to choose f_n, typically through solving some kind of optimization problem. How can we do that efficiently?

In this course, we will not spend much time on the approximation properties, beyond observing some universality results (that particular classes can achieve zero approximation error). We will focus on the estimation issue. We will take the approach that efficiency of computation is a constraint. Indeed, the methods that we spend most of our time studying involve convex optimization problems. (For example, kernel methods involve solving a quadratic program, and boosting algorithms involve minimizing a convex criterion in a convex set.)
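To make the formulation above concrete, here is a minimal numerical sketch in Python (not part of the original lecture notes). It assumes a synthetic one-dimensional distribution P, the 0-1 loss, and a hypothetical class F of threshold classifiers on the real line; f_n is chosen by minimizing the empirical risk on the training sample, and the risk R(f_n) = Pr(f_n(X) ≠ Y) is then estimated on fresh data drawn from the same P.

# A minimal sketch of the setup above (not from the lecture): a synthetic
# joint distribution P on X x Y with Y in {+1, -1}, an i.i.d. training
# sample, a simple class F of threshold classifiers, and an estimate of
# the risk R(f_n) on fresh data drawn from the same P.

import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    """Draw n i.i.d. pairs (X, Y) from the synthetic distribution P."""
    x = rng.uniform(-1.0, 1.0, size=n)
    eta = 1.0 / (1.0 + np.exp(-4.0 * x))       # P(Y = 1 | X = x)
    y = np.where(rng.uniform(size=n) < eta, 1, -1)
    return x, y

def zero_one_loss(y_pred, y):
    """The 0-1 loss: 1 if the prediction is wrong, 0 otherwise."""
    return (y_pred != y).astype(float)

def threshold_classifier(t):
    """A member of the class F: predict +1 exactly when x >= t."""
    return lambda x: np.where(x >= t, 1, -1)

# Training data: n i.i.d. pairs from P.
x_train, y_train = sample(200)

# Choose f_n from F by minimizing the empirical risk on the training data.
candidates = np.linspace(-1.0, 1.0, 201)
emp_risks = [zero_one_loss(threshold_classifier(t)(x_train), y_train).mean()
             for t in candidates]
f_n = threshold_classifier(candidates[int(np.argmin(emp_risks))])

# Estimate the risk R(f_n) = Pr(f_n(X) != Y) using a fresh sample from P.
x_test, y_test = sample(100000)
print("estimated risk of f_n:", zero_one_loss(f_n(x_test), y_test).mean())

The gap between the empirical risk of f_n on the training sample and its estimated risk on fresh data is one face of the estimation issue discussed above.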
4  The Probabilistic Formulation of Pattern Classification Problems

Assume, for simplicity, that Y = {±1}. (We'll consider extensions of the results of this lecture to the multiclass case in a homework problem.) Let's fix some notation: we'll represent the joint distribution P on X × Y as the pair (µ, η), where µ is the marginal distribution on X and η is the conditional probability that Y = 1 given X,

\[
\eta(x) = P(Y = 1 \mid X = x).
\]

If we knew η, we could use it to find a decision function that minimized risk. To see this, notice that we can write the expected loss as an expectation of a conditional expectation,

\[
\begin{aligned}
R(f) &= E\,\ell(f(X), Y)\\
     &= E\,E\left[\ell(f(X), Y) \mid X\right]\\
     &= E\bigl(\ell(f(X), 1)\,P(Y = 1 \mid X) + \ell(f(X), -1)\,P(Y = -1 \mid X)\bigr)\\
     &= E\bigl(\mathbf{1}[f(X) \neq 1]\,\eta(X) + \mathbf{1}[f(X) \neq -1]\,(1 - \eta(X))\bigr),
\end{aligned}
\]

where the final equality specializes to the 0-1 loss.
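As a small illustration of that last point (again a sketch, not from the lecture): by the display above, for each x the conditional risk is 1 − η(X) when f(X) = 1 and η(X) when f(X) = −1, so predicting 1 exactly when η(x) ≥ 1/2 minimizes the conditional risk at every x. The code below assumes the same synthetic distribution as the earlier sketch and compares the estimated risk of that η-based rule with an arbitrary competing rule.

# A sketch (not from the lecture) of how knowledge of eta yields a good
# decision function: predicting +1 costs 1 - eta(x) and predicting -1
# costs eta(x), so predicting +1 exactly when eta(x) >= 1/2 minimizes the
# conditional risk at every x. The distribution below is the same synthetic
# example used in the earlier sketch.

import numpy as np

rng = np.random.default_rng(1)

def eta(x):
    """P(Y = 1 | X = x) for the synthetic distribution."""
    return 1.0 / (1.0 + np.exp(-4.0 * x))

def sample(n):
    """Draw n i.i.d. pairs (X, Y), with X uniform on [-1, 1]."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.where(rng.uniform(size=n) < eta(x), 1, -1)
    return x, y

def f_eta(x):
    """Predict +1 whenever eta(x) >= 1/2."""
    return np.where(eta(x) >= 0.5, 1, -1)

def f_other(x):
    """An arbitrary competing rule: a shifted threshold."""
    return np.where(x >= 0.3, 1, -1)

x, y = sample(200000)
for name, f in [("eta-based rule", f_eta), ("shifted threshold", f_other)]:
    print(name, "estimated risk:", float(np.mean(f(x) != y)))

On this distribution, the η-based rule should come out with the smaller estimated risk, in line with the claim that knowing η is enough to find a risk-minimizing decision function.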

