Slide 1: Introduction to Predictive Learning
Electrical and Computer Engineering
LECTURE SET 4: Statistical Learning Theory

Slide 2: OUTLINE of Set 4
• Objectives and Overview
• Inductive Learning Problem Setting
• Keep-It-Direct Principle
• Analysis of ERM
• VC-dimension
• Generalization Bounds
• Structural Risk Minimization (SRM)
• Summary and Discussion

Slide 3: Objectives
Problems with philosophical approaches:
- lack of quantitative description/characterization of ideas
- no real predictive power (unlike the Natural Sciences)
- no agreement on basic definitions/concepts (unlike the Natural Sciences)
Goal: to introduce Predictive Learning as a scientific discipline

Slide 4: Characteristics of Scientific Theory
• Problem setting
• Solution approach
• Math proofs (technical analysis)
• Constructive methods
• Applications
Note: the problem setting and the solution approach are independent of each other.

Slide 5: History and Overview
• SLT, aka VC-theory (Vapnik-Chervonenkis)
• Theory for estimating dependencies from finite samples (predictive learning setting)
• Based on the risk minimization approach
• All main results originally developed in the 1970s for classification (pattern recognition), but remained largely unknown (why?)
• Recent renewed interest due to the practical success of Support Vector Machines (SVM)

Slide 6: History and Overview (cont'd)
MAIN CONCEPTUAL CONTRIBUTIONS
• Distinction between problem setting, inductive principle, and learning algorithms
• Direct approach to estimation with finite data (KID principle)
• Math analysis of ERM (standard inductive setting)
• Two factors responsible for generalization:
  - empirical risk (fitting error)
  - complexity (capacity) of the approximating functions

Slide 7: Importance of VC-theory
• Math results addressing the main question: under what general conditions does the ERM approach lead to (good) generalization?
• New approach to induction: predictive vs generative modeling (as in classical statistics)
• Connection to philosophy of science:
  - VC-theory was developed for binary classification (pattern recognition) ~ the simplest generalization problem
  - natural sciences: from observations to scientific law
  VC-theoretical results can be interpreted using general philosophical principles of induction, and vice versa.

Slide 8: Inductive Learning Setting
• The learning machine observes samples (x, y) and returns an estimated response ŷ = f(x, w)
• Two modes of inference: identification vs imitation
• Risk: R(w) = ∫ Loss(y, f(x, w)) dP(x, y) → min

Slide 9: The Problem of Inductive Learning
Given: finite training samples Z = {(x_i, y_i), i = 1, 2, …, n}, choose from a given set of functions f(x, w) the one that best approximates the true output (in the sense of risk minimization).
Concepts and terminology:
• approximating functions f(x, w)
• (non-negative) loss function L(f(x, w), y)
• expected risk functional R(w)
Goal: find the function f(x, w0) minimizing R(w) when the joint distribution P(x, y) is unknown.

Slide 10: Empirical Risk Minimization
• ERM principle in model-based learning:
  - model parameterization: f(x, w)
  - loss function: L(f(x, w), y)
  - estimate the risk from data: R_emp(w) = (1/n) Σ_{i=1,…,n} L(f(x_i, w), y_i)
  - choose w* that minimizes R_emp
• Statistical Learning Theory developed from the theoretical analysis of the ERM principle under finite-sample settings
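To make the ERM computation concrete, here is a minimal sketch (illustrative only, not from the slides): it assumes squared loss and a linear-in-parameters model f(x, w) = w0 + w1·x, for which minimizing R_emp(w) reduces to ordinary least squares.

```python
# Minimal ERM sketch: minimize R_emp(w) = (1/n) * sum_i L(f(x_i, w), y_i)
# for squared loss L and the hypothetical model f(x, w) = w0 + w1*x.
import numpy as np

rng = np.random.default_rng(0)
n = 30
x = rng.uniform(0.0, 1.0, n)
y = 2.0 * x + 0.5 + rng.normal(0.0, 0.1, n)   # assumed "true output" plus noise

# Design matrix: columns correspond to the parameters (w0, w1)
X = np.column_stack([np.ones(n), x])

# With squared loss, the empirical risk minimizer w* is the least-squares solution
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)

r_emp = np.mean((X @ w_star - y) ** 2)        # empirical risk at w*
print(f"w* = {w_star}, R_emp(w*) = {r_emp:.4f}")
```

Note that R_emp(w*) measures only the fitting error; how well w* generalizes, i.e., the expected risk R(w*), is exactly the question the theory addresses.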
Slide 11: Probabilistic Modeling vs ERM

Slide 12: Probabilistic Modeling vs ERM: Example
[Figure: two-class data in the (x1, x2) plane]
Known class distribution → optimal decision boundary

Slide 13: Probabilistic Approach
Estimate the parameters of the Gaussian class distributions and plug them into the quadratic decision boundary.
[Figure: estimated quadratic decision boundary in the (x1, x2) plane]

Slide 14: ERM Approach
Quadratic and linear decision boundaries estimated via minimization of squared loss.
[Figure: quadratic and linear decision boundaries in the (x1, x2) plane]

Slide 15: Estimation of multivariate functions
• Is it possible to estimate a function from finite data?
• Simplified problem: estimation of an unknown continuous function from noise-free samples
• Many results from function approximation theory:
  - to estimate a d-dimensional function accurately, one needs on the order of n^d data points
  - for example, if 3 points are needed to estimate a 2nd-order polynomial for d = 1, then 3^10 points are needed to estimate a 2nd-order polynomial in 10-dimensional space
  - similar results hold in signal processing
• There are never enough data points to estimate multivariate functions in most practical applications (image recognition, genomics, etc.)
• For multivariate function estimation, the number of free parameters increases exponentially with the problem dimensionality (the Curse of Dimensionality)

Slide 16: Properties of high-dimensional data
• Sparse data looks like a porcupine: the volume of a unit sphere inscribed in a d-dimensional cube gets smaller even as the volume of the d-cube gets exponentially larger!
• A point is closer to an edge than to another point
• Pairwise distances between points are nearly the same (see the numerical sketch below)
→ The intuition behind kernel (local) methods no longer holds.
• How is generalization possible, in spite of the curse of dimensionality?
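The geometric claims above can be checked numerically. The following small sketch (illustrative only; the sample size, the dimensions tried, and the cube [-1, 1]^d are arbitrary choices) compares the volume of the inscribed unit ball with the cube volume and measures how tightly pairwise distances concentrate as d grows:

```python
# Two high-dimensional effects: (1) the unit ball inscribed in [-1, 1]^d
# occupies a vanishing fraction of the cube; (2) pairwise distances
# between random points concentrate (their relative spread shrinks with d).
import numpy as np
from math import gamma, pi

def ball_volume(d, r=1.0):
    # Closed-form volume of a d-dimensional ball of radius r
    return pi ** (d / 2) / gamma(d / 2 + 1) * r ** d

rng = np.random.default_rng(0)
for d in (2, 10, 50):
    cube_volume = 2.0 ** d                       # volume of [-1, 1]^d
    pts = rng.uniform(-1.0, 1.0, size=(100, d))  # 100 random points in the cube
    dist = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1))
    pairwise = dist[np.triu_indices(100, k=1)]   # distinct pairs only
    print(f"d={d:3d}  ball/cube volume = {ball_volume(d) / cube_volume:.2e}  "
          f"distance spread (std/mean) = {pairwise.std() / pairwise.mean():.3f}")
```

As d increases, the volume ratio collapses toward zero and the relative spread of distances shrinks, which is why nearest-neighbor and other local methods lose their discriminative power in high dimensions.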

