**Unformatted text preview:**

Machine learning: lecture 12Tommi S. JaakkolaMIT [email protected]• Complexity and model selection– structural risk minimization• Complexity, compression, and model selection– description length– minimum description length principleTommi Jaakkola, MIT CSAIL 2VC-dimension: reviewShattering: A set of classifiers F (e.g., linear classifiers)is said to shatter n points x1, . . . , xnif for any possibleconfiguration of labels y1, . . . , ynwe can find h ∈ F thatreproduces those labels.VC-dimension: The VC-dimension of a set of classifie rs F isthe largest number of points that F can shatter (maximizedover the choice of the n points).Learning: We don’t expect to learn anything until we havemore than dV Ctraining examples and labels (this statementwill be refined later on).Tommi Jaakkola, MIT CSAIL 3The number of labelings0 50 100 150 200020406080100120140160# of training exampleslog2(# of labelings)n ≤ dV C: # of labelings = 2nn > dV C: # of labelings ≤"endV C#dV CTommi Jaakkola, MIT CSAIL 4Learning and VC-dimension• By essentially replacing log M in the finite case with the logof the number of possible labelings by the set of classifiersover n (really 2n) points, we get an analogous result:Theorem: With probability at least 1 − δ over the choice ofthe training set, for all h ∈ FE(h) ≤ˆEn(h) + "(n, dV C, δ)where"(n, dV C, δ) =!dV C(log(2n/dV C) + 1) + log(1/(4δ))nTommi Jaakkola, MIT CSAIL 5Model selection• We try to find the model with the best balance of complexityand fit to the training data• Ideally, we would select a model from a nested sequence ofmodels of increasing complexity (VC-dimension)Model 1, F1VC-dim = d1Model 2, F2VC-dim = d2Model 3, F3VC-dim = d3where F1⊆ F2⊆ F3⊆ . . .• Model selection criterion: find the model (set of classifiers)that achieves the lowest upper bound on the expected loss(generalization error):Expec ted error ≤ Training error + Complexity penaltyTommi Jaakkola, MIT CSAIL 6Structural risk minimization• We choose the model class Fithat minimizes the upperbound on the expected error:E(ˆhi) ≤ˆEn(ˆhi) +!di(log(2n/di) + 1) + log(1/(4δ))nwhereˆhiis the classifier from Fithat minimizes the trainingerror.0 10 20 30 40 5000.10.20.30.40.50.60.70.80.91VC dimensionBound Complexity penalty Training error Tommi Jaakkola, MIT CSAIL 7Example• Models of increasing complexityModel 1 K(x1, x2) = (1 + (xT1x2))Model 2 K(x1, x2) = (1 + (xT1x2))2Model 3 K(x1, x2) = (1 + (xT1x2))3. . . . . .• These are nested, i.e.,F1⊆ F2⊆ F3⊆ . . .where Fkrefers to the set of possible decision boundariesthat the model k can represent.Tommi Jaakkola, MIT CSAIL 8Structural risk minimization: example!1.5 !1 !0.5 0 0.5 1 1.5 2!1!0.500.511.52!1.5 !1 !0.5 0 0.5 1 1.5 2!1!0.500.511.52linear 2ndorder polynomial!1.5 !1 !0.5 0 0.5 1 1.5 2!1!0.500.511.52!1.5 !1 !0.5 0 0.5 1 1.5 2!1!0.500.511.524thorder polynomial 8thorder polynomialTommi Jaakkola, MIT CSAIL 9Structural risk minimization: example cont’d• Number of training examples n = 50, confidence parameterδ = 0.05.Model dV CEmpirical fit "(n, dV C, δ)1storder 3 0.06 0.55012ndorder 6 0.06 0.69994thorder 15 0.04 0.94948thorder 45 0.02 1.2849• Structural risk minimization would select the simplest (linear)model in this case.Tommi Jaakkola, MIT CSAIL 10Complexity and margin• The number of p oss ible labelings of points with large margincan be dramatically less than the (basic) VC-dimension wouldimplyxoooooooooxxxxxxx• The set of separating hyperplaces which attain margin γor better for examples within a sphere of radius R hasVC-dimension bounded by dV C(γ) ≤ R2/γ2Tommi Jaakkola, MIT CSAIL 11Topics• Complexity and model selection– structural risk minimization• Complexity, compression, and model selection– description length– minimum description length principleTommi Jaakkola, MIT CSAIL 12Data compression and model selection• We can alternatively view model selection as a problem offinding the best way of communicating the available data• Compression and learning:?x2. . . xnx1x2. . . xn. . .y1y2yn? ? . . .x1Tommi Jaakkola, MIT CSAIL 13Data compression and model selection• We can alternatively view model selection as a problem offinding the best way of communicating the available data• Compression and learning:h(x;ˆθ)x2. . . xnx1x2. . . xn. . .y1y2yn? ? . . . ?x1Tommi Jaakkola, MIT CSAIL 14Data compression and model selection• We can alternatively view model selection as a problem offinding the best way of communicating the available data• Compression and learning:ˆθx2. . . xnx1x2. . . xn. . .y1y2yn? ? . . . ?h(x;ˆθ)x1Tommi Jaakkola, MIT CSAIL 15Data compression and model selection• We can alternatively view model selection as a problem offinding the best way of communicating the available data• Compression and learning:ˆθx2. . . xnx1x2. . . xn. . .y1y2yn? ? . . . ?h(x;ˆθ) h(x;ˆθ)x1Tommi Jaakkola, MIT CSAIL 16Data compression and model selection• We can alternatively view model selection as a problem offinding the best way of communicating the available data• Compression and learning:errorsx2. . . xnx1x2. . . xn. . .y1y2yn? ? . . . ?h(x;ˆθ) h(x;ˆθ)ˆθx1Tommi Jaakkola, MIT CSAIL 17Data compression and model selection• We can alternatively view model selection as a problem offinding the best way of communicating the available data• Compression and learning:errorsx2. . . xnx1x2. . . xn. . .y1y2yn? ? . . . ?h(x;ˆθ) h(x;ˆθ)ˆθx1The receiver already knows– input examples, models we considerNeed to communicate– model class, parameter estimates, prediction errorsTommi Jaakkola, MIT CSAIL 18Compression and sequential estimation• We don’t have to communicate any real valued parametersif we setup the learning problem sequentially?x2. . . xnx1x2. . . xn. . .y1y2yn? ? . . .x1Tommi Jaakkola, MIT CSAIL 19Compression and sequential estimation• We don’t have to communicate any real valued parametersif we setup the learning problem sequentiallyθ0: default parameter valuesx2. . . xnx1x2. . . xn. . .y1y2yn? ? . . . ?h(x; θ0)x1Tommi Jaakkola, MIT CSAIL 20Compression and sequential estimation• We don’t have to communicate any real valued parametersif we setup the learning problem sequentiallyˆθ1: based on θ0and (x1, y1)x2. . . xnx1x2. . . xn. . .y1y2yn? ? . . . ?θ0: default parameter valuesh(x;ˆθ1)x1Tommi Jaakkola, MIT CSAIL 21Compression and sequential estimation• We don’t have to communicate any real

View Full Document