Machine Learning                                                        Srihari

Learning as Optimization
Sargur Srihari
[email protected]

Topics in Learning as Optimization
• Evaluation of Learned Model
• Empirical Risk and Overfitting
  – Bias vs. Variance Trade-off
  – Design and Evaluation of Learning Procedures
  – Goodness of Fit
  – PAC bounds
• Discriminative versus Generative Training
• Learning Tasks
  – Model Constraints
  – Data Observability
  – Taxonomy of Learning Tasks

Optimization Viewpoint
• We have:
  1. Hypothesis Space
     • A set of candidate models, e.g., the set of all BNs or MNs over a given set of variables
  2. Objective Function
     • A criterion for quantifying preference over models
• Learning Task
  – Find a high-scoring model within the model class
• Learning as optimization is the predominant approach

Criteria for Optimization
• Numerical criteria (loss functions) that we optimize:
  1. Density Estimation
     • Relative entropy, or K-L divergence: the expected value of the log-difference
         $D(P^* \| P) = E_{\xi \sim P^*}\!\left[ \log \frac{P^*(\xi)}{P(\xi)} \right]$
     • Minimizing it is equivalent to minimizing the empirical risk over the instances in D,
       called the empirical log-loss ($\mathcal{M}$ denotes the model):
         $-\frac{1}{M} \sum_{m=1}^{M} \log P(\xi[m] : \mathcal{M})$
  2. Classification
     • Classification error (0/1 loss): $E_{(x,y) \sim P^*}\left[ I\{h_P(x) \neq y\} \right]$
     • Hamming loss
     • Conditional log-likelihood: $E_{(x,y) \sim P^*}\left[ \log P(y \mid x) \right]$
• Thus we can view learning as optimization

Why Discuss Optimization?
• Different choices of objective function have ramifications for the results of learning procedures
• It also has implications for later discussions of learning
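The density-estimation criteria above can be made concrete for a discrete distribution. The following is an illustrative sketch (the coin example and all function names are our own, not from the slides): it computes the empirical log-loss and the K-L divergence, and checks that the empirical distribution of a data set achieves a lower log-loss on that data set than a competing model.

```python
import math
from collections import Counter

def empirical_log_loss(samples, model):
    """Empirical log-loss: -(1/M) * sum_m log P(xi[m]) under `model`,
    a dict mapping each outcome to its probability."""
    M = len(samples)
    return -sum(math.log(model[x]) for x in samples) / M

def kl_divergence(p_star, p):
    """D(P* || P) = sum_x P*(x) log(P*(x)/P(x)) for discrete distributions."""
    return sum(p_star[x] * math.log(p_star[x] / p[x]) for x in p_star)

# Toy example: a biased coin with true distribution P*.
p_star = {"H": 0.7, "T": 0.3}
samples = ["H"] * 7 + ["T"] * 3        # a data set that happens to match P* exactly

# Empirical distribution: fraction of samples taking each value.
M = len(samples)
p_hat = {x: c / M for x, c in Counter(samples).items()}

# The empirical distribution maximizes the likelihood of the data set,
# i.e., it minimizes the empirical log-loss; any other model scores worse.
loss_hat = empirical_log_loss(samples, p_hat)
loss_other = empirical_log_loss(samples, {"H": 0.5, "T": 0.5})
assert loss_hat < loss_other
```

The gap `loss_other - loss_hat` is exactly $D(\hat{P}_D \| P)$ for the competing model, which is why the empirical distribution is the log-loss minimizer on the training data.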
Empirical Distribution & Log-loss
• We have an unknown distribution P*
• The empirical distribution, where ξ[1], ξ[2], … is a sequence of iid samples from P*:
    $\hat{P}_D(A) = \frac{1}{M} \sum_m I\{\xi[m] \in A\}$
  – i.e., the probability of an event A is the fraction of samples that satisfy A
• For a sufficiently large training set, P̂_D will be quite close to P*:
    $\lim_{M \to \infty} \hat{P}_{D_M}(A) = P^*(A)$
• Consider the empirical log-loss
  – It can be shown that the distribution that maximizes the likelihood of the data set D is the empirical distribution

Data Requirements
• How many data samples are needed?
• Consider two cases:
  1. 100 binary random variables
  2. A Bayesian network in which a node has k parents
• In both we will see that the hypothesis space is too large

Data Requirement with Binary Variables
• Consider 100 binary variables
  – There are 2^100 possible joint assignments
• Suppose D has 1,000 instances (most likely all distinct)
  – The empirical distribution assigns probability 0.001 to each observed assignment
  – and 0 to the remaining 2^100 − 1,000 assignments
• The example is extreme, but the phenomenon is general

Data Requirement with a Bayesian Network
• M* is a Bayesian network with a variable "Fever"
  – "Fever" has many parents (diseases) X1, …, Xk
  – In a table CPD, the number of parameters grows exponentially with the number of parents k
  – D is unlikely to contain cases representing all parent instantiations
    • i.e., all combinations of the diseases X1, …, Xk
  – This results in very poor CPD estimates
• Data requirement is exponential in BN connectivity:
  – The number of samples needed grows linearly with the number of parameters
  – The number of parameters grows exponentially with network connectivity

Problem with Using Empirical Risk
• The learned model tends to overfit the training data
• But our goal is to answer queries about unseen data, not the data samples we have already seen
  – New patients in the disease example
  – New unsegmented images
• We need to generalize well
• This leads to a trade-off: do not overfit, but generalize well
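The binary-variable example can be reproduced directly. This sketch (the uniform sampling distribution and the fixed seed are illustrative assumptions) draws 1,000 assignments of 100 binary variables; since there are 2^100 possible assignments, every sample is distinct, so the empirical distribution memorizes the data rather than learning P*.

```python
import random
from collections import Counter

random.seed(0)
n_vars, n_samples = 100, 1000

# Draw 1,000 joint assignments of 100 binary variables.  With 2**100
# possible assignments, the chance of drawing the same one twice is
# astronomically small, so all samples are distinct.
data = [tuple(random.randint(0, 1) for _ in range(n_vars))
        for _ in range(n_samples)]
assert len(set(data)) == n_samples

# The maximum-likelihood (empirical) distribution assigns probability
# 1/1000 = 0.001 to each observed assignment and 0 to the remaining
# 2**100 - 1000 assignments.
counts = Counter(data)
p_hat = {x: c / n_samples for x, c in counts.items()}
assert all(abs(p - 0.001) < 1e-12 for p in p_hat.values())

# Any unseen assignment gets probability 0 under p_hat:
unseen_mass = 1.0 - sum(p_hat.values())
print(round(unseen_mass, 12))
```

The learned model is useless for queries about new instances, which is exactly the overfitting problem described above.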
Overfitting and Bias-Variance Trade-off
• If the hypothesis space is very limited (few bins), it cannot represent the true P*
  – Even with unlimited data
  – This error is called bias
• If the hypothesis space is large (expressive)
  – Examples: the binary data set and the disease BN
  – If the data set is small, even random fluctuations in the data have an effect
  – This results in large variance over different data sets

Implications for Model Structure
• Do not allow too rich a class of models
• With limited data:
  – Error due to variance > error due to bias
  – Restrict to models too simple to correctly encode P*
  – The ability to estimate parameters reliably compensates for the error from incorrect structural assumptions
• Restricting models → learn only the important edges
  – Distance to P* is better
• Combining approaches:
  – A hard constraint over the model class
  – An optimization objective that leads us away from over-fitting

Generalization Performance
• We have not yet discussed:
  – How well a model generalizes (performance on unseen data sets)
  – Design of the hypothesis space to reduce over-fitting
• Next:
  1. Basic experimental protocol in the design/evaluation of a learning procedure
  2. Theoretical framework for choosing appropriate complexity

Relative and Absolute Learning
1. Relative
  – Compare two or more alternatives
    • Different hypothesis spaces
    • Different training objectives
  – Methods: holdout and cross-validation
    • Training, validation, and test sets
2. Absolute
  – Whether the model captures the true distribution
  – Method: goodness of fit
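The "relative" evaluation protocol can be sketched with a minimal holdout experiment. Everything concrete here is an illustrative assumption, not from the slides: a three-valued variable, two training objectives distinguished by an add-α smoothing parameter (α = 0 is the raw maximum-likelihood estimate), and a validation-set log-loss used to pick between them.

```python
import math
import random

random.seed(1)

# Hypothetical true distribution P* over a three-valued variable.
p_star = {"a": 0.6, "b": 0.3, "c": 0.1}
outcomes, weights = zip(*p_star.items())
data = random.choices(outcomes, weights=weights, k=200)

# Holdout protocol: split the data into training and validation sets.
train, valid = data[:150], data[150:]

def fit(samples, alpha):
    """Categorical estimate with add-alpha smoothing.
    alpha = 0 gives the raw empirical (maximum-likelihood) estimate.
    (With 150 training samples every outcome is almost surely observed,
    so no zero-probability issue arises in this toy run.)"""
    n = len(samples)
    return {x: (samples.count(x) + alpha) / (n + alpha * len(outcomes))
            for x in outcomes}

def holdout_log_loss(model, samples):
    """Average negative log-likelihood of held-out samples under the model."""
    return -sum(math.log(model[x]) for x in samples) / len(samples)

# Relative evaluation: compare two training objectives (alpha values)
# on the same validation set and keep the better one.
scores = {alpha: holdout_log_loss(fit(train, alpha), valid)
          for alpha in (0.0, 1.0)}
best_alpha = min(scores, key=scores.get)
print(best_alpha, round(scores[best_alpha], 4))
```

The same structure extends to cross-validation by rotating which slice of the data serves as the validation set and averaging the held-out losses.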