Bayesian Interpretations of RegularizationCharlie Frogner9.520 Class 20April 21, 2010C. Frogner Bayesian Interpretations of RegularizationThe PlanRegularized least squares maps {(xi, yi)}ni=1to a function thatminimizes the regularized loss:fS= arg minf ∈H12nXi=1(yi− f (xi))2+λ2kf k2HCan we interpret RLS from a probabilistic point of view?C. Frogner Bayesian Interpretations of RegularizationSome notationS = {(xi, yi)}ni=1is the set of observed input/output pairs inRd× R (the training set).X and Y denote the matrices [x1, . . . , xn]T∈ Rn×dand[y1, . . . , yn]T∈ Rn, respectively.θ is a vector of parameters in Rp.p(Y |X , θ) is the joint distribution over outputs Y giveninputs X and the parameters.C. Frogner Bayesian Interpretations of RegularizationWhere do probabilities show up?12nXi=1V (yi, f (xi)) +λ2kf k2Hbecomesp(Y |f , X ) · p(f )Likelihood, a.k.a. noise model: p(Y |f , X ).Gaussian: yi∼ Nf∗(xi), σ2iPoisson: yi∼ Pois (f∗(xi))Prior: p(f ).C. Frogner Bayesian Interpretations of RegularizationWhere do probabilities show up?12nXi=1V (yi, f (xi)) +λ2kf k2Hbecomesp(Y |f , X ) · p(f )Likelihood, a.k.a. noise model: p(Y |f , X ).Gaussian: yi∼ Nf∗(xi), σ2iPoisson: yi∼ Pois (f∗(xi))Prior: p(f ).C. Frogner Bayesian Interpretations of RegularizationEstimationThe estimation problem:Given data {(xi, yi)}Ni=1and model p(Y |f , X ), p(f ).Find a good f to explain data.C. Frogner Bayesian Interpretations of RegularizationThe PlanMaximum likelihood estimation for ERMMAP estimation for linear RLSMAP estimation for kernel RLSTransductive modelInfinite dimensions get more complicatedC. Frogner Bayesian Interpretations of RegularizationMaximum likelihood estimationGiven data {(xi, yi)}Ni=1and model p(Y |f , X ), p(f ).A good f is one that maximizes p(Y |f , X ).C. Frogner Bayesian Interpretations of RegularizationMaximum likelihood and least squaresFor least squares, noise model is:yi|f , xi∼ Nf (xi), σ2a.k.a.Y |f , X ∼ Nf (X ), σ2ISop(Y |f , X ) =1(2πσ2)N/2exp(−NXi=11σ2(yi− f (xi))2)C. Frogner Bayesian Interpretations of RegularizationMaximum likelihood and least squaresFor least squares, noise model is:yi|f , xi∼ Nf (xi), σ2a.k.a.Y |f , X ∼ Nf (X ), σ2ISop(Y |f , X ) =1(2πσ2)N/2exp(−NXi=11σ2(yi− f (xi))2)C. Frogner Bayesian Interpretations of RegularizationMaximum likelihood and least squaresMaximum likelihood: maximizep(Y |f , X ) =1(2πσ2)N/2exp(−NXi=11σ2(yi− f (xi)))2)Empirical risk minimization: minimizeNXi=1(yi− f (xi))2C. Frogner Bayesian Interpretations of Regularization...NXi=1(yi− f (xi))2C. Frogner Bayesian Interpretations of Regularization...e−NPi=11σ2(yi−f (xi))2C. Frogner Bayesian Interpretations of RegularizationWhat about regularization?RLS:arg minf12nXi=1(yi− f (xi))2+λ2kf k2HIs there a model of Y and f that yields RLS?Yes.e−12σ2εnPi=1(yi−f (xi))2−λ2kf k2Hp(Y |f , X ) · p(f )C. Frogner Bayesian Interpretations of RegularizationWhat about regularization?RLS:arg minf12nXi=1(yi− f (xi))2+λ2kf k2HIs there a model of Y and f that yields RLS?Yes.e−12σ2εnPi=1(yi−f (xi))2−λ2kf k2Hp(Y |f , X ) · p(f )C. Frogner Bayesian Interpretations of RegularizationWhat about regularization?RLS:arg minf12nXi=1(yi− f (xi))2+λ2kf k2HIs there a model of Y and f that yields RLS?Yes.e−12σ2εnPi=1(yi−f (xi))2· e−λ2kf k2Hp(Y |f , X ) · p(f )C. Frogner Bayesian Interpretations of RegularizationWhat about regularization?RLS:arg minf12nXi=1(yi− f (xi))2+λ2kf k2HIs there a model of Y and f that yields RLS?Yes.e−12σ2εnPi=1(yi−f (xi))2· e−λ2kf k2Hp(Y |f , X ) · p(f )C. Frogner Bayesian Interpretations of RegularizationPosterior function estimatesGiven data {(xi, yi)}Ni=1and model p(Y |f , X ), p(f ).Find a good f to explain data.(If we can get p(f |Y , X ))Bayes least squares estimate:ˆfBLS= E(f |X ,Y )[f ]i.e. the mean of the posterior.MAP estimate:ˆfMAP(Y |X ) = arg maxfp(f |X , Y )i.e. a mode of the posterior.C. Frogner Bayesian Interpretations of RegularizationPosterior function estimatesGiven data {(xi, yi)}Ni=1and model p(Y |f , X ), p(f ).Find a good f to explain data.(If we can get p(f |Y , X ))Bayes least squares estimate:ˆfBLS= E(f |X ,Y )[f ]i.e. the mean of the posterior.MAP estimate:ˆfMAP(Y |X ) = arg maxfp(f |X , Y )i.e. a mode of the posterior.C. Frogner Bayesian Interpretations of RegularizationPosterior function estimatesGiven data {(xi, yi)}Ni=1and model p(Y |f , X ), p(f ).Find a good f to explain data.(If we can get p(f |Y , X ))Bayes least squares estimate:ˆfBLS= E(f |X ,Y )[f ]i.e. the mean of the posterior.MAP estimate:ˆfMAP(Y |X ) = arg maxfp(f |X , Y )i.e. a mode of the posterior.C. Frogner Bayesian Interpretations of RegularizationA posterior on functions?How to find p(f |Y , X )?Bayes’ rule:p(f |X , Y ) =p(Y |X , f ) · p(f )p(Y |X )=p(Y |X , f ) · p(f )Rp(Y |X , f )dfWhen is this well-defined?C. Frogner Bayesian Interpretations of RegularizationA posterior on functions?How to find p(f |Y , X )?Bayes’ rule:p(f |X , Y ) =p(Y |X , f ) · p(f )p(Y |X )=p(Y |X , f ) · p(f )Rp(Y |X , f )dfWhen is this well-defined?C. Frogner Bayesian Interpretations of RegularizationA posterior on functions?Functions vs. parameters:H∼=RpRepresent functions in H by their coordinates w.r.t. a basis:f ∈ H ↔ θ ∈ RpAssume (for the moment): p < ∞C. Frogner Bayesian Interpretations of RegularizationA posterior on functions?Functions vs. parameters:H∼=RpRepresent functions in H by their coordinates w.r.t. a basis:f ∈ H ↔ θ ∈ RpAssume (for the moment): p < ∞C. Frogner Bayesian Interpretations of RegularizationPosterior for linear RLSLinear function:f (x) = hx, θiNoise model:Y |X , θ ∼ NX θ, σ2εIAdd a prior:θ ∼ N (0, Λ)C. Frogner Bayesian Interpretations of RegularizationPosterior for linear RLSModel:Y |X , θ ∼ NX θ, σ2εI, θ ∼ N (0, Λ)Joint over Y and θ:Yθ∼ N00,X ΛXT+ σ2εI X ΛΛXTΛCondition on Y .C. Frogner Bayesian Interpretations of RegularizationPosterior for linear RLSPosterior:θ|X , Y ∼ Nµθ|X ,Y, Σθ|X ,Ywhereµθ|X ,Y= ΛXT(X ΛXT+ σ2εI)−1YΣθ|X ,Y= Λ −ΛXT(X ΛXT+ σ2εI)−1X ΛThis is Gaussian, soˆθMAP(Y |X ) =ˆθBLS(Y |X ) = ΛXT(X ΛXT+ σ2εI)−1YC. Frogner Bayesian Interpretations of RegularizationPosterior for linear RLSPosterior:θ|X , Y ∼ Nµθ|X ,Y, Σθ|X ,Ywhereµθ|X ,Y= ΛXT(X ΛXT+
View Full Document