10-601 Review
Tom Mitchell and Aarti Singh
Machine Learning 10-601, Dec 8, 2011

Machine Learning Algorithm
Goal: learn a rule Z → f(Z) that optimizes some objective, loss(f(Z)).
Z can be X or (X, Y), modeled as a random variable, and we optimize E_Z[loss(f(Z))].
Training data D = {Z_i}_{i=1}^n is fed to a learning algorithm, which outputs a rule f̂_n.
Why do we need training data?

Modeling Distributions
Parametric: P_θ(Z)
• Gaussian – continuous random variables (θ = µ, σ)
• Bernoulli – binary/boolean random variable (θ)
• Binomial – sum of binary/boolean random variables (n, θ)
• Multinomial – sum of k-ary random variables (n, θ_1, θ_2, …, θ_k)
• Beta, Dirichlet (conjugate priors for binomial, multinomial), Poisson, …
If θ is a random variable, P_θ(Z) = P(Z|θ) is the likelihood.
Bayes Rule: P(θ|Z) = P(Z|θ) P(θ) / P(Z) is the posterior.

Modeling Distributions
Conditional independence assumptions for joint distributions:
• Markov Models: p(X) = ∏_{i=1}^n p(X_i | X_{i−1})
• Hidden Markov Models: p(X, Z) = ∏_{i=1}^n p(X_i | Z_i) ∏_{i=1}^n p(Z_i | Z_{i−1})
• Bayes Nets / Graphical models: p(X) = ∏_{i=1}^n p(X_i | pa(X_i))

Machine Learning Problems
Broad categories:
• Unsupervised learning – density estimation, clustering, dimensionality reduction
• Supervised learning – classification, regression
• Semi-supervised learning
• Active learning
• Many more …

Unsupervised & Supervised Learning
Unsupervised Learning – learning without a teacher: documents go into the learning algorithm, which outputs a model for the word distribution or a clustering of similar documents.
Supervised Learning – learning with a teacher: documents and their topics go into the learning algorithm, which outputs a mapping between documents and topics.

Semi-supervised & Active Learning
Semi-Supervised Learning – randomly labeled examples: the learning algorithm sees labeled (document, topic) pairs plus unlabeled documents, and outputs a mapping between documents and topics.
Active Learning – selectively labeled examples: the learner chooses which documents get labeled.

Unsupervised Learning
Density estimation: parametric (MLE, MAP); nonparametric (histogram, kernel)
Dimensionality reduction: feature selection; Principal Component Analysis (PCA); Laplacian Eigenmaps
Clustering: Gaussian mixture models; k-means; spectral

Supervised Learning
Regression (continuous labels, mean square error):
• Optimal estimation rule: f*(X) = E[Y|X], which is the MLE under P(Y|X) = N(f*(X), σ²)
• Linear Regression: f(X) = Xw, X = [x_1, x_2, …, x_d]
• Polynomial Regression: X = [x_1², x_1 x_2, x_2², …]
• Basis Regression: X = [φ_1(x), φ_2(x), …, φ_d(x)]
• Regularized versions (MAP)
• Neural Networks: f(X) = nonlinear combination of multiple logistic units
• Kernel (locally-weighted) regression – weighted mean square error

Supervised Learning
Classification (discrete labels, probability of error):
• Bayes optimal classification rule: f*(X) = arg max_Y P(Y|X)
• Plug in the MLE or MAP estimate of the distribution model
• Naïve Bayes, Decision Trees, Logistic Regression, k-nearest neighbor, SVM, Boosting
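As a concrete illustration of the plug-in approach above (not part of the lecture slides; the function names and toy data below are my own), here is a minimal sketch of a Gaussian Naïve Bayes classifier: estimate class priors and per-feature Gaussian parameters by MLE, then label a point with arg max_Y P(Y|X).

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """MLE of class priors and per-feature Gaussian means/variances."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = {
            "prior": len(Xc) / len(X),        # MLE of P(Y = c)
            "mean": Xc.mean(axis=0),          # MLE of per-feature means
            "var": Xc.var(axis=0) + 1e-9,     # MLE of variances (jitter for stability)
        }
    return params

def predict(params, x):
    """Plug-in Bayes rule: argmax_c log P(c) + sum_j log N(x_j; mu_jc, var_jc)."""
    scores = {}
    for c, p in params.items():
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * p["var"]) + (x - p["mean"]) ** 2 / p["var"])
        scores[c] = np.log(p["prior"]) + log_lik
    return max(scores, key=scores.get)

# Toy 2-class data: class 0 centered at (0, 0), class 1 at (2, 2)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
params = fit_gaussian_nb(X, y)
print(predict(params, np.array([1.8, 2.1])))   # expected: 1
```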
Comparison Chart for Classification
Classifiers covered before the Midterm (Decision Trees, K-NN, Gaussian Naïve Bayes, Logistic regression, Neural Networks) and after the Midterm (HMM, Bayes Net, SVM, Boosting), compared along: generative/discriminative, loss functions, decision boundary, output, algorithm, model, complexity, relation to others, parametric/nonparametric.

Hidden Markov Models
• time-series/sequential modeling
• representation, parameters
• evaluate probability of an output sequence
• decode hidden states
• learning parameters

Neural Networks
• nonlinear classifier
• layers of multiple logistic units
• training – backpropagation
• local minima

Dimensionality reduction
• feature selection
• PCA – linear, directions of max variance, SVD
• Laplacian Eigenmaps – nonlinear

Clustering
• k-means – isotropic, convex clusters
• spectral – connectivity based

Nonparametric methods
• histogram, kernel density estimation
• kernel regression
• k-NN classifier

Support Vector Machines
• hard-margin, soft-margin
• support vectors
• dual formulation, kernel trick

Boosting
• weak base classifiers trained on re-weighted data
• AdaBoost algorithm, exponential loss

Four Fundamentals for ML
1. Learning is an optimization problem
   – many algorithms are best understood as optimization algorithms
   – what objective do they optimize, and how? Local minima?
   – gradient descent/ascent as a general fallback approach
2. Learning is a parameter estimation problem
   – the more training data, the more accurate the statistical estimates
   – MLE, MAP, M(Conditional)LE, …
   – to measure accuracy of the learned model, we must use test (not train) data
3. Error arises from three sources
   – unavoidable error, bias, variance
   – PAC learning theory: probabilistic bound on overfitting, error_true − error_train

Bias and Variance of Estimators
Given some estimator Y for some parameter θ, we note Y is a random variable (why?)
• the bias of estimator Y: E[Y] − θ
• the variance of estimator Y: E[(Y − E[Y])²]
Consider when
• θ is the probability of "heads" for my coin
• Y = proportion of heads observed from 3 flips
Consider when
• θ is the vector of correct parameters for the learner
• Y = parameters output by the learning algorithm
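The coin example above can be checked numerically. The sketch below is my own illustration (not lecture code): it simulates Y = proportion of heads in 3 flips for an assumed true θ = 0.7, and compares the empirical bias E[Y] − θ and variance E[(Y − E[Y])²] to the exact values 0 and θ(1 − θ)/3.

```python
import numpy as np

theta = 0.7          # assumed true probability of heads
n_flips = 3          # estimator Y = proportion of heads in 3 flips
n_trials = 200_000   # Monte Carlo repetitions

rng = np.random.default_rng(0)
# Each row is one experiment of 3 flips; Y is the fraction of heads in that experiment
Y = rng.binomial(1, theta, size=(n_trials, n_flips)).mean(axis=1)

bias = Y.mean() - theta    # ~0: the sample proportion is an unbiased estimator of theta
variance = Y.var()         # ~theta * (1 - theta) / 3 = 0.07
print(f"empirical bias     = {bias:+.4f}   (exact: 0)")
print(f"empirical variance = {variance:.4f}   (exact: {theta * (1 - theta) / n_flips:.4f})")
```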
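One more worked sketch, tied to the dimensionality-reduction review above (again my own illustrative code and toy data, not from the lecture): PCA computed via the SVD of centered data, keeping the directions of maximum variance.

```python
import numpy as np

def pca_svd(X, k):
    """Project X onto its top-k principal components (directions of max variance) via SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are principal directions; singular values give the variance ordering
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:k]                            # top-k directions of maximum variance
    explained_var = (S[:k] ** 2) / (len(X) - 1)    # variance captured by each direction
    return X_centered @ components.T, components, explained_var

# Toy data: 3-D points that mostly vary along one direction
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 1))
X = np.hstack([Z, 0.5 * Z, 0.1 * rng.normal(size=(200, 1))]) + rng.normal(scale=0.05, size=(200, 3))
proj, comps, var = pca_svd(X, k=1)
print("first principal direction:", np.round(comps[0], 2))
print("variance explained:", np.round(var[0], 3))
```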