Gaussians, Linear Regression, Bias-Variance Tradeoff
Machine Learning – 10701/15781
Carlos Guestrin, Carnegie Mellon University
January 22nd, 2007
Readings listed on the class website.

Maximum Likelihood Estimation
- Data: observed set D of α_H heads and α_T tails
- Hypothesis: binomial distribution
- Learning θ is an optimization problem. What's the objective function?
- MLE: choose the θ that maximizes the probability of the observed data:
  θ_MLE = argmax_θ P(D | θ)

Bayesian Learning for Thumbtack
- The likelihood function is simply binomial: P(D | θ) = θ^α_H (1 − θ)^α_T
- What about the prior?
  - Represents expert knowledge
  - Want a simple posterior form
- Conjugate priors: closed-form representation of the posterior
- For the binomial, the conjugate prior is the Beta distribution

Posterior distribution
- Prior: Beta(β_H, β_T)
- Data: α_H heads and α_T tails
- Posterior distribution: P(θ | D) = Beta(β_H + α_H, β_T + α_T)

MAP: Maximum a posteriori approximation
- As more data is observed, the Beta becomes more certain
- MAP: use the most likely parameter:
  θ_MAP = argmax_θ P(θ | D)

What about continuous variables?
- Billionaire says: If I am measuring a continuous variable, what can you do for me?
- You say: Let me tell you about Gaussians…

Some properties of Gaussians
- Affine transformation (multiplying by a scalar and adding a constant):
  X ~ N(µ, σ²) and Y = aX + b ⇒ Y ~ N(aµ + b, a²σ²)
- Sum of independent Gaussians:
  X ~ N(µ_X, σ²_X), Y ~ N(µ_Y, σ²_Y), Z = X + Y ⇒ Z ~ N(µ_X + µ_Y, σ²_X + σ²_Y)

Learning a Gaussian
- Collect a bunch of data: hopefully i.i.d. samples, e.g., exam scores
- Learn the parameters: mean and variance

MLE for Gaussian
- Probability of i.i.d. samples D = {x_1, …, x_N}:
  P(D | µ, σ) = ∏_i 1/(σ√(2π)) exp(−(x_i − µ)² / 2σ²)
- Log-likelihood of the data:
  ln P(D | µ, σ) = −N ln(σ√(2π)) − (1/2σ²) Σ_i (x_i − µ)²

Your second learning algorithm: MLE for the mean of a Gaussian
- What's the MLE for the mean? Set the derivative to zero:
  µ_MLE = (1/N) Σ_i x_i

MLE for variance
- Again, set the derivative to zero:
  σ²_MLE = (1/N) Σ_i (x_i − µ_MLE)²

Learning Gaussian parameters
- MLE: µ_MLE and σ²_MLE as above
- BTW, the MLE for the variance of a Gaussian is biased: the expected result of the estimation is not the true parameter!
- Unbiased variance estimator:
  σ²_unbiased = (1/(N − 1)) Σ_i (x_i − µ_MLE)²

Bayesian learning of Gaussian parameters
- Conjugate priors:
  - Mean: Gaussian prior
  - Variance: Wishart distribution
- Prior for the mean: a Gaussian

MAP for mean of Gaussian
- With a Gaussian prior, the MAP estimate of the mean is a weighted average of the prior mean and the sample mean; as N grows, it converges to the MLE.

Prediction of continuous variables
- Billionaire says: Wait, that's not what I meant!
- You say: Chill out, dude.
- He says: I want to predict a continuous variable from continuous inputs: I want to predict salaries from GPA.
- You say: I can regress that…

The regression problem
- Instances: ⟨x_j, t_j⟩
- Learn: a mapping from x to t(x)
- Hypothesis space:
  - Given basis functions {h_1, …, h_k}
  - Find coefficients w = {w_1, …, w_k} so that t(x) ≈ Σ_i w_i h_i(x)
- Why is this called linear regression??? The model is linear in the parameters.
- Precisely, minimize the residual squared error:
  w* = argmin_w Σ_j (t_j − Σ_i w_i h_i(x_j))²

The regression problem in matrix notation
- t = Hw, where t is the N×1 vector of measurements (N sensors), H is the N×K matrix with entry (j, i) equal to h_i(x_j) (K basis functions), and w is the K×1 vector of weights.

Regression solution = simple matrix operations
- w* = (HᵀH)⁻¹ Hᵀ t, where HᵀH is a k×k matrix for k basis functions and Hᵀ t is a k×1 vector.
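To make the closed-form solution concrete, here is a minimal NumPy sketch of w* = (HᵀH)⁻¹ Hᵀ t. The polynomial basis h_i(x) = x^i, the toy data, and the function names are illustrative choices, not part of the lecture.

```python
import numpy as np

def design_matrix(x, k):
    """Build the N x k matrix H with H[j, i] = h_i(x_j), using h_i(x) = x**i."""
    return np.vander(x, k, increasing=True)

def fit_least_squares(x, t, k):
    """Solve the normal equations (H^T H) w = H^T t from the slide."""
    H = design_matrix(x, k)
    # np.linalg.solve is preferable to forming the explicit inverse of H^T H.
    return np.linalg.solve(H.T @ H, H.T @ t)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)  # noisy targets
w = fit_least_squares(x, t, k=4)
print("learned weights:", w)
```

In practice np.linalg.lstsq(H, t, rcond=None) is the numerically safer route, since HᵀH can be ill-conditioned; the version above simply mirrors the slide's formula.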
Billionaire (again) says: Why sum squared error???
- You say: Gaussians, Dr. Gateson, Gaussians…
- Model: the prediction is a linear function plus Gaussian noise:
  t(x) = Σ_i w_i h_i(x) + ε
- Learn w using MLE. But why?

Maximizing log-likelihood
- Maximize:
  ln P(D | w, σ) = −N ln(σ√(2π)) − (1/2σ²) Σ_j (t_j − Σ_i w_i h_i(x_j))²
- Maximizing over w is equivalent to minimizing the sum of squared residuals.
- Least-squares linear regression is MLE for Gaussians!!!

Applications Corner 1
- Predict stock value over time from:
  - past values
  - other relevant variables, e.g., weather, demand, etc.

Applications Corner 2
- Measure temperatures at some locations
- Predict temperatures throughout the environment
- [Figure: building floor plan with 54 numbered sensor locations in the server room, lab, kitchen, conference room, and offices; from Guestrin et al. '04]

Applications Corner 3
- Predict when a sensor will fail, based on several variables: age, chemical exposure, number of hours used, …

Announcements
- Readings are associated with each class: see the course website for specific sections, extra links, and further details; visit the website frequently
- Recitations: Thursdays, 5:30-6:50pm in Wean Hall 5409
- Special recitation on Matlab: Wed. Jan. 24, 5:30-6:50pm, NSH 1305

Bias-Variance tradeoff – Intuition
- Model too "simple": does not fit the data well; a biased solution
- Model too complex: small changes to the data change the solution a lot; a high-variance solution

(Squared) Bias of learner
- Given a dataset D with m samples, learn a function h(x)
- If you sample a different dataset, you will learn a different h(x)
- Expected hypothesis: E_D[h(x)]
- Bias: the difference between what you expect to learn and the truth
  - Measures how well you expect to represent the true solution
  - Decreases with a more complex model

Variance of learner
- Given a dataset D with m samples, you learn a function h(x)
- If you sample a different dataset, you will learn a different h(x)
- Variance: the difference between what you expect to learn and what you learn from a particular dataset
  - Measures how sensitive the learner is to the specific dataset
  - Decreases with a simpler model

Bias-Variance Tradeoff
- The choice of hypothesis class introduces a learning bias
- More complex class: less bias
- More complex class: more variance

Bias-Variance decomposition of error
- Consider a simple regression problem f: X → T with
  t = f(x) = g(x) + ε, where g(x) is deterministic and ε ~ N(0, σ²) is noise
- Collect some data and learn a function h(x)
- What are the sources of prediction error?

Sources of error 1 – noise
- What if we have a perfect learner and infinite data?
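As a concrete illustration of the bias and variance definitions above, here is a small Monte-Carlo sketch; it is my own construction, not from the lecture, and the true function g, the noise level, and the polynomial degrees are illustrative assumptions. It resamples training sets from t = g(x) + ε, refits a polynomial h(x) for each degree, and estimates the squared bias and the variance of h(x) averaged over test points.

```python
import numpy as np

rng = np.random.default_rng(0)
g = lambda x: np.sin(2 * np.pi * x)   # deterministic part g(x)
sigma = 0.2                           # standard deviation of the noise ε
x_train = np.linspace(0, 1, 15)
x_test = np.linspace(0, 1, 100)

for degree in (1, 3, 7):
    preds = []
    for _ in range(500):              # 500 resampled datasets D
        t = g(x_train) + sigma * rng.standard_normal(x_train.size)
        coeffs = np.polyfit(x_train, t, degree)   # least-squares polynomial fit
        preds.append(np.polyval(coeffs, x_test))
    preds = np.asarray(preds)         # shape (500, len(x_test))
    expected_h = preds.mean(axis=0)   # E_D[h(x)]
    bias2 = np.mean((expected_h - g(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"degree {degree}: bias^2 = {bias2:.4f}, variance = {variance:.4f}")
```

The degree-1 fit should come out with high squared bias and low variance, and the degree-7 fit the reverse, matching the tradeoff described in the slides above.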