Machine Learning – 10-701/15-781
Gaussians, Linear Regression, Bias-Variance Tradeoff
Carlos Guestrin, Carnegie Mellon University
September 12th, 2007
©Carlos Guestrin 2005-2007
Readings listed on class website

What about continuous variables?
- Billionaire says: If I am measuring a continuous variable, what can you do for me?
- You say: Let me tell you about Gaussians…

Some properties of Gaussians
- Affine transformation (multiplying by a scalar and adding a constant):
  X ~ N(µ, σ²), Y = aX + b  →  Y ~ N(aµ + b, a²σ²)
- Sum of independent Gaussians:
  X ~ N(µ_X, σ²_X), Y ~ N(µ_Y, σ²_Y), Z = X + Y  →  Z ~ N(µ_X + µ_Y, σ²_X + σ²_Y)

Learning a Gaussian
- Collect a bunch of data, hopefully i.i.d. samples, e.g., exam scores
- Learn the parameters: mean and variance

MLE for Gaussian
- Probability of i.i.d. samples D = {x_1, …, x_N}:
  P(D | µ, σ) = (1 / (σ√(2π)))^N ∏_i exp(−(x_i − µ)² / (2σ²))
- Log-likelihood of data:
  ln P(D | µ, σ) = −N ln(σ√(2π)) − (1 / (2σ²)) Σ_i (x_i − µ)²

Your second learning algorithm: MLE for mean of a Gaussian
- Set the derivative of the log-likelihood with respect to µ to zero:
  µ_MLE = (1/N) Σ_i x_i

MLE for variance
- Again, set the derivative to zero:
  σ²_MLE = (1/N) Σ_i (x_i − µ_MLE)²

Learning Gaussian parameters
- MLE: the estimates above
- BTW, the MLE for the variance of a Gaussian is biased: the expected result of estimation is not the true parameter!
- Unbiased variance estimator:
  σ²_unbiased = (1/(N−1)) Σ_i (x_i − µ_MLE)²

Bayesian learning of Gaussian parameters
- Conjugate priors: for the mean, a Gaussian prior; for the variance, a Wishart distribution
- Prior for the mean: µ ~ N(η, λ²)

MAP for mean of Gaussian
- With the Gaussian prior above, the MAP estimate is a precision-weighted average of the prior mean η and the sample mean

Prediction of continuous variables
- Billionaire says: Wait, that's not what I meant!
- You say: Chill out, dude.
- He says: I want to predict a continuous variable from continuous inputs: I want to predict salaries from GPA.
- You say: I can regress that…

The regression problem
- Instances: <x_j, t_j>
- Learn: mapping from x to t(x)
- Hypothesis space: given basis functions h_1, …, h_k, find coefficients w = {w_1, …, w_k} so that t(x) ≈ Σ_i w_i h_i(x)
- Why is this called linear regression?
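Before moving on to regression, the Gaussian MLE recipe above (sample mean, the biased 1/N variance, and the unbiased 1/(N−1) correction) can be sketched in a few lines; the helper name and the exam-score numbers are illustrative, not from the lecture:

```python
import numpy as np

def gaussian_mle(samples):
    """MLE for a Gaussian from i.i.d. samples, plus the unbiased variance."""
    x = np.asarray(samples, dtype=float)
    n = x.size
    mu = x.sum() / n                                 # mu_MLE = (1/N) sum_i x_i
    var_mle = ((x - mu) ** 2).sum() / n              # biased: divides by N
    var_unbiased = ((x - mu) ** 2).sum() / (n - 1)   # divides by N - 1
    return mu, var_mle, var_unbiased

# e.g., exam scores
mu, v_mle, v_unb = gaussian_mle([82.0, 91.0, 77.0, 88.0, 95.0])
```

Note that var_mle is always smaller than var_unbiased, which is exactly the direction of the bias: the MLE underestimates the true variance on average.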
- The model is linear in the parameters w (the basis functions h_i(x) may be nonlinear in x)
- Precisely, minimize the residual squared error:
  w* = argmin_w Σ_j (t_j − Σ_i w_i h_i(x_j))²

The regression problem in matrix notation
- Stack the N sensor measurements into an N×1 vector t, the K basis functions evaluated at the N inputs into an N×K matrix H, and the weights into a K×1 vector w:
  w* = argmin_w (Hw − t)ᵀ (Hw − t)

Regression solution = simple matrix operations
- w* = (HᵀH)⁻¹ Hᵀ t
  where HᵀH is a k×k matrix (for k basis functions) and Hᵀt is a k×1 vector

But, why?
- Billionaire (again) says: Why sum squared error???
- You say: Gaussians, Dr. Gateson, Gaussians…
- Model: prediction is a linear function plus Gaussian noise:
  t = Σ_i w_i h_i(x) + ε,  ε ~ N(0, σ²)
- Learn w using MLE

Maximizing log-likelihood
- Maximize: ln P(D | w, σ) = −(1/(2σ²)) Σ_j (t_j − Σ_i w_i h_i(x_j))² + const
- Least-squares linear regression is MLE for Gaussians!!!

Applications Corner 1
- Predict stock value over time from past values and other relevant variables, e.g., weather, demand, etc.

Applications Corner 2
- Measure temperatures at some locations, predict temperatures throughout the environment
  [Figure: lab floor plan with numbered sensor locations; Guestrin et al. '04]

Applications Corner 3
- Predict when a sensor will fail based on several variables: age, chemical exposure, number of hours used, …

Announcements 1
- Readings associated with each class: see the course website for specific sections, extra links, and further details; visit the website frequently
- Recitations: Thursdays, 5:00-6:20 in Wean Hall 5409
- Special recitation on Matlab: Sept. 18 (Tue.), 4:30-5:50pm, NSH 3002
- Carlos away on Monday Sept. 17th; Prof. Eric Xing will teach the lecture

Announcement 2
- First homework out later today! Download from the course website! Start early!!!
  :) Due Oct 3rd
- To expedite grading: there are 4 questions, so please hand in 4 separately stapled parts, one for each question

Bias-Variance tradeoff – Intuition
- Model too "simple": does not fit the data well, a biased solution
- Model too complex: small changes to the data change the solution a lot, a high-variance solution

(Squared) Bias of learner
- Given a dataset D with m samples, learn a function h(x)
- If you sample a different dataset, you will learn a different h(x)
- Expected hypothesis: E_D[h(x)]
- Bias: difference between what you expect to learn and the truth
- Measures how well you expect to represent the true solution; decreases with a more complex model

Variance of learner
- Given a dataset D with m samples, you learn a function h(x)
- If you sample a different dataset, you will learn a different h(x)
- Variance: difference between what you expect to learn and what you learn from a particular dataset
- Measures how sensitive the learner is to the specific dataset; decreases with a simpler model

Bias-Variance Tradeoff
- Choice of hypothesis class introduces learning bias
- More complex class: less bias, but more variance

Bias-Variance decomposition of error
- Consider a simple regression problem f: X → T with
  t = f(x) = g(x) + ε,  g deterministic, noise ε ~ N(0, σ²)
- Collect some data, and learn a function h(x)
- What are the sources of prediction error?

Sources of error 1 – noise
- What if we have a perfect learner and infinite data?
- If our learned solution h(x) satisfies h(x) = g(x), we still have a remaining, unavoidable error of σ² due to the noise ε

Sources of error 2 – Finite data
- What if we have an imperfect learner, or only m training examples?
- What is our expected squared error per example?
- Expectation taken over random training sets D of size m, drawn from the distribution P(X,T)

Bias-Variance Decomposition of Error
- Assume the target function t = f(x) = g(x) + ε
- Then the expected squared error over fixed-size training sets D drawn from P(X,T) can be expressed as the sum of three components:
  expected error = σ² + bias² + variance
- where:
  bias² = E_x[(g(x) − E_D[h(x)])²]
  variance = E_x[E_D[(h(x) − E_D[h(x)])²]]
- See Bishop, Chapter 3

What you need to know
- Gaussian estimation: MLE, Bayesian learning, MAP
- Regression: basis functions = features; optimizing sum squared error