Gaussians, Linear Regression, Bias-Variance Tradeoff
Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University
January 23rd, 2006

Announcements
- Recitations stay on Thursdays, 5-6:30pm, in Wean 5409
- Special Matlab recitation: Wed., Jan. 25, 5:00-7:00pm, in NSH 3305
- First homework: programming part and analytic part
  - Remember the collaboration policy: you can discuss questions, but you need to write your own solutions and code
  - Out later today
  - Due Mon., Feb. 6th, at the beginning of class
  - Start early!

Maximum Likelihood Estimation
- Data: observed set D of α_H heads and α_T tails
- Hypothesis: binomial distribution
- Learning θ is an optimization problem: what's the objective function?
- MLE: choose the θ that maximizes the probability of the observed data:
  θ_MLE = arg max_θ P(D | θ) = α_H / (α_H + α_T)

Bayesian Learning for Thumbtack
- The likelihood function is simply binomial: P(D | θ) ∝ θ^α_H (1 − θ)^α_T
- What about the prior?
  - Represents expert knowledge
  - Should give a simple posterior form
- Conjugate priors: closed-form representation of the posterior; for the binomial, the conjugate prior is the Beta distribution

Posterior distribution
- Prior: Beta(β_H, β_T)
- Data: α_H heads and α_T tails
- Posterior distribution: P(θ | D) = Beta(β_H + α_H, β_T + α_T)

MAP: Maximum a posteriori approximation
- As more data is observed, the Beta posterior becomes more certain (more sharply peaked)
- MAP: use the most likely parameter, θ_MAP = arg max_θ P(θ | D)

What about continuous variables?
- Billionaire says: If I am measuring a continuous variable, what can you do for me?
- You say: Let me tell you about Gaussians…

Some properties of Gaussians
- Affine transformation (multiplying by a scalar and adding a constant):
  X ~ N(µ, σ²), Y = aX + b → Y ~ N(aµ + b, a²σ²)
- Sum of independent Gaussians:
  X ~ N(µ_X, σ²_X), Y ~ N(µ_Y, σ²_Y), Z = X + Y → Z ~ N(µ_X + µ_Y, σ²_X + σ²_Y)

Learning a Gaussian
- Collect a bunch of data: hopefully i.i.d. samples, e.g., exam scores
- Learn the parameters: mean µ and variance σ²

MLE for Gaussian
- Probability of the i.i.d. samples x_1, …, x_N:
  P(D | µ, σ) = ∏_i (1 / (σ√(2π))) exp(−(x_i − µ)² / (2σ²))
- Log-likelihood of the data:
  ln P(D | µ, σ) = −N ln(σ√(2π)) − Σ_i (x_i − µ)² / (2σ²)

Your second learning algorithm: MLE for mean of a Gaussian
- What's the MLE for the mean? Set the derivative of the log-likelihood to zero:
  µ_MLE = (1/N) Σ_i x_i

MLE for variance
- Again, set the derivative to zero:
  σ²_MLE = (1/N) Σ_i (x_i − µ_MLE)²

Learning Gaussian parameters
- MLE: µ_MLE = (1/N) Σ_i x_i and σ²_MLE = (1/N) Σ_i (x_i − µ_MLE)²
- BTW, the MLE for the variance of a Gaussian is biased: the expected result of the estimation is not the true parameter!
- Unbiased variance estimator: σ²_unbiased = (1/(N−1)) Σ_i (x_i − µ_MLE)²

Bayesian learning of Gaussian parameters
- Conjugate priors
  - Mean: Gaussian prior
  - Variance: Wishart distribution
- Prior for the mean: µ ~ N(η, τ²)

MAP for mean of Gaussian
- With the Gaussian prior µ ~ N(η, τ²) and known variance σ², the posterior over µ is again a Gaussian, and the MAP estimate is
  µ_MAP = (τ² Σ_i x_i + σ² η) / (N τ² + σ²)
- As N grows, the data dominate the prior and µ_MAP approaches µ_MLE
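To make these estimators concrete, here is a minimal NumPy sketch (not part of the original slides): it computes the MLE for the mean, the biased MLE and the unbiased estimator for the variance, and the MAP estimate of the mean under an assumed Gaussian prior N(η, τ²). The true parameters, prior hyperparameters, and sample size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative "exam score" data: N i.i.d. samples from a Gaussian whose
# true parameters are assumptions made up for this sketch.
true_mu, true_sigma2, N = 75.0, 100.0, 30
x = rng.normal(true_mu, np.sqrt(true_sigma2), size=N)

# MLE for the mean: the sample average.
mu_mle = x.mean()

# MLE for the variance: average squared deviation (divides by N, biased).
var_mle = np.mean((x - mu_mle) ** 2)

# Unbiased variance estimator: divides by N - 1 instead of N.
var_unbiased = np.sum((x - mu_mle) ** 2) / (N - 1)

# MAP for the mean under the Gaussian prior mu ~ N(eta, tau2), treating the
# variance as known (here the MLE variance is plugged in as sigma^2).
eta, tau2 = 70.0, 25.0
mu_map = (tau2 * x.sum() + var_mle * eta) / (N * tau2 + var_mle)

print(f"MLE mean:          {mu_mle:.2f}")
print(f"MLE variance:      {var_mle:.2f} (biased)")
print(f"Unbiased variance: {var_unbiased:.2f}")
print(f"MAP mean:          {mu_map:.2f}")
```

With only 30 samples the MAP estimate is pulled toward the prior mean; as N grows it converges to the MLE, as noted above.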
Prediction of continuous variables
- Billionaire says: Wait, that's not what I meant!
- You say: Chill out, dude.
- He says: I want to predict a continuous variable from continuous inputs: I want to predict salaries from GPA.
- You say: I can regress that…

The regression problem
- Instances: <x_j, t_j>
- Learn: a mapping from x to t(x)
- Hypothesis space:
  - Given: basis functions h_1, …, h_k
  - Find: coefficients w = {w_1, …, w_k} so that t(x) ≈ Σ_i w_i h_i(x)
- Why is this called linear regression? The model is linear in the parameters w
- Precisely: minimize the residual error, Σ_j (t_j − Σ_i w_i h_i(x_j))²

The regression problem in matrix notation
- Stack the N measurements into the N×1 vector t, the weights into the k×1 vector w, and the basis functions evaluated at the inputs into the N×k matrix H, with H_ji = h_i(x_j); rows index the N training examples (the slide's "N sensors"), columns index the k basis functions
- The residual error is then ||t − Hw||²

Regression solution = simple matrix operations
- w = (HᵀH)⁻¹ Hᵀ t
- where HᵀH is a k×k matrix (for k basis functions) and Hᵀt is a k×1 vector

But, why?
- Billionaire (again) says: Why sum-squared error???
- You say: Gaussians, Dr. Gateson, Gaussians…
- Model: the prediction is a linear function plus Gaussian noise:
  t = Σ_i w_i h_i(x) + ε, with ε ~ N(0, σ²)
- Learn w using MLE

Maximizing log-likelihood
- Maximize:
  ln P(D | w, σ) = −N ln(σ√(2π)) − Σ_j (t_j − Σ_i w_i h_i(x_j))² / (2σ²)
- Maximizing over w is exactly minimizing the sum of squared residuals:
  Least-squares linear regression is MLE for Gaussians!!!

Bias-Variance tradeoff – Intuition
- Model too "simple" → does not fit the data well: a biased solution
- Model too complex → small changes to the data change the solution a lot: a high-variance solution

(Squared) Bias of learner
- Suppose you are given a dataset D with m samples from some distribution
- You learn a function h(x) from the data D
- If you sample a different dataset, you will learn a different h(x)
- Expected hypothesis: E_D[h(x)]
- Bias: the difference between what you expect to learn and the truth, E_D[h(x)] − g(x)
  - Measures how well you can expect to represent the true solution
  - Decreases with more complex models

Variance of learner
- Same setup: you learn h(x) from a dataset D of m samples, and a different dataset would give a different h(x)
- Variance: the difference between what you expect to learn and what you learn from a particular dataset, E_D[(h(x) − E_D[h(x)])²]
  - Measures how sensitive the learner is to the specific dataset
  - Decreases with simpler models

Bias-Variance Tradeoff
- The choice of hypothesis class introduces a learning bias
- More complex class → less bias
- More complex class → more variance

Bias–Variance decomposition of error
- Consider a simple regression problem f: X → T with
  t = f(x) = g(x) + ε, where g(x) is deterministic and ε ~ N(0, σ²) is noise
- Collect some data and learn a function h(x)
- What are the sources of prediction error?

Sources of error 1 – noise
- What if we have a perfect learner and infinite data?
- Then our learned solution h(x) satisfies h(x) = g(x)
- We still have a remaining, unavoidable error of σ² due to the noise ε

Sources of error 2 – Finite data
- What if we have an imperfect learner, or only m training examples?
- What is our expected squared error per example?
- The expectation is taken over random training sets D of size m, drawn from the distribution P(X, T)

Bias-Variance Decomposition of Error
- Assume the target function is t = f(x) = g(x) + ε
- Then the expected squared error over fixed-size training sets D drawn from P(X, T) can be expressed as the sum of three components:
  E[(t − h(x))²] = σ² + (E_D[h(x)] − g(x))² + E_D[(h(x) − E_D[h(x)])²]
- where the three terms are the unavoidable noise, the squared bias of the learner, and the variance of the learner
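The two main results above, the closed-form least-squares solution w = (HᵀH)⁻¹Hᵀt and the bias-variance decomposition, can be checked empirically. The NumPy sketch below (not from the original slides) fits polynomial bases of different complexity to many training sets drawn from an assumed target g(x) = sin(2πx) plus Gaussian noise, then estimates each learner's squared bias and variance; the target, noise level, sample size, and degrees are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def g(x):
    # Assumed deterministic part of the target; noise eps ~ N(0, sigma^2) is added on top.
    return np.sin(2 * np.pi * x)

def basis(x, degree):
    # Polynomial basis functions h_i(x) = x^i, stacked into the N x k matrix H.
    return np.vander(x, degree + 1, increasing=True)

def fit_least_squares(x, t, degree):
    # Closed-form least-squares / MLE solution: w = (H^T H)^{-1} H^T t.
    H = basis(x, degree)
    return np.linalg.solve(H.T @ H, H.T @ t)

sigma = 0.3        # noise standard deviation
m = 20             # training-set size
n_datasets = 500   # number of training sets D to average over
x_test = np.linspace(0, 1, 50)

for degree in (1, 3, 5):
    preds = np.empty((n_datasets, x_test.size))
    for d in range(n_datasets):
        x_train = rng.uniform(0, 1, size=m)
        t_train = g(x_train) + rng.normal(0, sigma, size=m)
        w = fit_least_squares(x_train, t_train, degree)
        preds[d] = basis(x_test, degree) @ w

    expected_h = preds.mean(axis=0)                   # E_D[h(x)]
    bias2 = np.mean((expected_h - g(x_test)) ** 2)    # squared bias, averaged over x
    variance = np.mean(preds.var(axis=0))             # learner variance, averaged over x
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {variance:.4f}")
```

The degree-1 fit should show high bias and low variance, while the higher-degree fits trade bias for variance; adding the noise term σ² = 0.09 to bias² + variance approximates the expected squared prediction error from the decomposition above.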
What you need to know
- Basics, Gaussians: Koller & Friedman 1.1, 1.2 – handed out in class
- Bias-Variance tradeoff: Bishop, chapters 9.1, 9.2