Machine Learning                                                        Srihari

Learning as Optimization
Sargur Srihari
[email protected]

Topics in Learning as Optimization
• Evaluation of Learned Model
• Empirical Risk and Overfitting
  – Bias vs. Variance Trade-off
  – Design and Evaluation of Learning Procedures
  – Goodness of Fit
  – PAC bounds
• Discriminative versus Generative Training
• Learning Tasks
  – Model Constraints
  – Data Observability
  – Taxonomy of Learning Tasks

Optimization Viewpoint
• We have:
  1. Hypothesis Space
     • A set of candidate models, e.g., the set of all BNs or MNs over a given set of variables
  2. Objective Function
     • A criterion for quantifying preference over models
• Learning Task
  – Find a high-scoring model within the model class
• Learning as optimization is the predominant approach

Criteria for Optimization
• Numerical criteria (loss functions) that we optimize:
  1. Density Estimation
     • Relative entropy, or K-L divergence: the expected value of the log-difference
         $D(P^* \| P) = E_{\xi \sim P^*}\!\left[ \log \frac{P^*(\xi)}{P(\xi)} \right]$
     • Minimizing it is equivalent to minimizing the empirical risk over the instances in D,
       called the empirical log-loss ($\mathcal{M}$ denotes the model):
         $-\frac{1}{M} \sum_{m=1}^{M} \log P(\xi[m] : \mathcal{M})$
  2. Classification
     • Classification error (0/1 loss): $E_{(x,y) \sim P^*}\left[ I\{h_P(x) \neq y\} \right]$
     • Hamming loss
     • Conditional log-likelihood: $E_{(x,y) \sim P^*}\left[ \log P(y \mid x) \right]$
• Thus we can view learning as optimization

Why Discuss Optimization?
• Different choices of objective function have ramifications for the results of learning procedures
• It also has implications for later discussions of learning
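The density-estimation criteria above can be made concrete for a discrete distribution. The following is an illustrative sketch (the coin example and all function names are our own, not from the slides): it computes the empirical log-loss and the K-L divergence, and checks that the empirical distribution of a data set achieves a lower log-loss on that data set than a competing model.

```python
import math
from collections import Counter

def empirical_log_loss(samples, model):
    """Empirical log-loss: -(1/M) * sum_m log P(xi[m]) under `model`,
    a dict mapping each outcome to its probability."""
    M = len(samples)
    return -sum(math.log(model[x]) for x in samples) / M

def kl_divergence(p_star, p):
    """D(P* || P) = sum_x P*(x) log(P*(x)/P(x)) for discrete distributions."""
    return sum(p_star[x] * math.log(p_star[x] / p[x]) for x in p_star)

# Toy example: a biased coin with true distribution P*.
p_star = {"H": 0.7, "T": 0.3}
samples = ["H"] * 7 + ["T"] * 3        # a data set that happens to match P* exactly

# Empirical distribution: fraction of samples taking each value.
M = len(samples)
p_hat = {x: c / M for x, c in Counter(samples).items()}

# The empirical distribution maximizes the likelihood of the data set,
# i.e., it minimizes the empirical log-loss; any other model scores worse.
loss_hat = empirical_log_loss(samples, p_hat)
loss_other = empirical_log_loss(samples, {"H": 0.5, "T": 0.5})
assert loss_hat < loss_other
```

The gap `loss_other - loss_hat` is exactly $D(\hat{P}_D \| P)$ for the competing model, which is why the empirical distribution is the log-loss minimizer on the training data.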
Empirical Distribution & Log-loss
• We have an unknown distribution P*
• The empirical distribution, where ξ[1], ξ[2], … is a sequence of iid samples from P*:
    $\hat{P}_D(A) = \frac{1}{M} \sum_m I\{\xi[m] \in A\}$
  – i.e., the probability of an event A is the fraction of samples that satisfy A
• For a sufficiently large training set, P̂_D will be quite close to P*:
    $\lim_{M \to \infty} \hat{P}_{D_M}(A) = P^*(A)$
• Consider the empirical log-loss
  – It can be shown that the distribution that maximizes the likelihood of the data set D is the empirical distribution

Data Requirements
• How many data samples are needed?
• Consider two cases:
  1. 100 binary random variables
  2. A Bayesian network in which a node has k parents
• In both we will see that the hypothesis space is too large

Data Requirement with Binary Variables
• Consider 100 binary variables
  – There are 2^100 possible joint assignments
• Suppose D has 1,000 instances (most likely all distinct)
  – The empirical distribution assigns probability 0.001 to each observed assignment
  – and 0 to the remaining 2^100 − 1,000 assignments
• The example is extreme, but the phenomenon is general

Data Requirement with a Bayesian Network
• M* is a Bayesian network with a variable "Fever"
  – "Fever" has many parents (diseases) X1, …, Xk
  – In a table CPD, the number of parameters grows exponentially with the number of parents k
  – D is unlikely to contain cases representing all parent instantiations
    • i.e., all combinations of the diseases X1, …, Xk
  – This results in very poor CPD estimates
• Data requirement is exponential in BN connectivity:
  – The number of samples needed grows linearly with the number of parameters
  – The number of parameters grows exponentially with network connectivity

Problem with Using Empirical Risk
• The learned model tends to overfit the training data
• But our goal is to answer queries about unseen data, not the data samples we have already seen
  – New patients in the disease example
  – New unsegmented images
• We need to generalize well
• This leads to a trade-off: do not overfit, but generalize well
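The binary-variable example can be reproduced directly. This sketch (the uniform sampling distribution and the fixed seed are illustrative assumptions) draws 1,000 assignments of 100 binary variables; since there are 2^100 possible assignments, every sample is distinct, so the empirical distribution memorizes the data rather than learning P*.

```python
import random
from collections import Counter

random.seed(0)
n_vars, n_samples = 100, 1000

# Draw 1,000 joint assignments of 100 binary variables.  With 2**100
# possible assignments, the chance of drawing the same one twice is
# astronomically small, so all samples are distinct.
data = [tuple(random.randint(0, 1) for _ in range(n_vars))
        for _ in range(n_samples)]
assert len(set(data)) == n_samples

# The maximum-likelihood (empirical) distribution assigns probability
# 1/1000 = 0.001 to each observed assignment and 0 to the remaining
# 2**100 - 1000 assignments.
counts = Counter(data)
p_hat = {x: c / n_samples for x, c in counts.items()}
assert all(abs(p - 0.001) < 1e-12 for p in p_hat.values())

# Any unseen assignment gets probability 0 under p_hat:
unseen_mass = 1.0 - sum(p_hat.values())
print(round(unseen_mass, 12))
```

The learned model is useless for queries about new instances, which is exactly the overfitting problem described above.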
Overfitting and Bias-Variance Trade-off
• If the hypothesis space is very limited (few bins), it cannot represent the true P*
  – Even with unlimited data
  – This error is called bias
• If the hypothesis space is large (expressive)
  – Examples: the binary data set and the disease BN
  – If the data set is small, even random fluctuations in the data have an effect
  – This results in large variance over different data sets

Implications for Model Structure
• Do not allow too rich a class of models
• With limited data:
  – Error due to variance > error due to bias
  – Restrict to models too simple to correctly encode P*
  – The ability to estimate parameters reliably compensates for the error from incorrect structural assumptions
• Restricting models → learn only the important edges
  – Distance to P* is better
• Combining approaches:
  – A hard constraint over the model class
  – An optimization objective that leads us away from over-fitting

Generalization Performance
• We have not yet discussed:
  – How well a model generalizes (performance on unseen data sets)
  – Design of the hypothesis space to reduce over-fitting
• Next:
  1. Basic experimental protocol in the design/evaluation of a learning procedure
  2. Theoretical framework for choosing appropriate complexity

Relative and Absolute Learning
1. Relative
  – Compare two or more alternatives
    • Different hypothesis spaces
    • Different training objectives
  – Methods: holdout and cross-validation
    • Training, validation, and test sets
2. Absolute
  – Whether the model captures the true distribution
  – Method: goodness of fit
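The "relative" evaluation protocol can be sketched with a minimal holdout experiment. Everything concrete here is an illustrative assumption, not from the slides: a three-valued variable, two training objectives distinguished by an add-α smoothing parameter (α = 0 is the raw maximum-likelihood estimate), and a validation-set log-loss used to pick between them.

```python
import math
import random

random.seed(1)

# Hypothetical true distribution P* over a three-valued variable.
p_star = {"a": 0.6, "b": 0.3, "c": 0.1}
outcomes, weights = zip(*p_star.items())
data = random.choices(outcomes, weights=weights, k=200)

# Holdout protocol: split the data into training and validation sets.
train, valid = data[:150], data[150:]

def fit(samples, alpha):
    """Categorical estimate with add-alpha smoothing.
    alpha = 0 gives the raw empirical (maximum-likelihood) estimate.
    (With 150 training samples every outcome is almost surely observed,
    so no zero-probability issue arises in this toy run.)"""
    n = len(samples)
    return {x: (samples.count(x) + alpha) / (n + alpha * len(outcomes))
            for x in outcomes}

def holdout_log_loss(model, samples):
    """Average negative log-likelihood of held-out samples under the model."""
    return -sum(math.log(model[x]) for x in samples) / len(samples)

# Relative evaluation: compare two training objectives (alpha values)
# on the same validation set and keep the better one.
scores = {alpha: holdout_log_loss(fit(train, alpha), valid)
          for alpha in (0.0, 1.0)}
best_alpha = min(scores, key=scores.get)
print(best_alpha, round(scores[best_alpha], 4))
```

The same structure extends to cross-validation by rotating which slice of the data serves as the validation set and averaging the held-out losses.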