Introduction to Modeling
6.872/HST950

Why Build Models?
• To predict (identify) something
  • Diagnosis
  • Best therapy
  • Prognosis
  • Cost
• To understand something
  • Structure of the model may correspond to structure of reality

Where Do Models Come From?
• Pure induction from data
  • Even so, need some "space" of models to explore
• A-priori knowledge, expressed in
  • Structure of the space of models
  • Adjustments to observed data
• Maximum A-posteriori Probability (MAP)
• Maximum Likelihood (ML)
  • Assumes uniform priors over all hypotheses in the space

An Example (Russell & Norvig)
• Surprise Candy Corp. makes two flavors of candy: cherry and lime
• Both flavors come in the same opaque wrapper
• Candy is sold in large bags, which have one of the following distributions of flavors but are visually indistinguishable:
  • h1: 100% cherry
  • h2: 75% cherry, 25% lime
  • h3: 50% cherry, 50% lime
  • h4: 25% cherry, 75% lime
  • h5: 100% lime
• Relative prevalence of these types of bags is (.1, .2, .4, .2, .1)
• As we eat our way through a bag of candy, predict the flavor of the next piece; the prediction is actually a probability distribution

Bayesian Learning
• Calculate the probability of each hypothesis given the data d:
  P(hi | d) = α P(d | hi) P(hi)
• To predict the probability distribution over an unknown quantity X:
  P(X | d) = Σi P(X | hi) P(hi | d)
• If the observations d = d1, ..., dk are independent, then
  P(d | hi) = Πj P(dj | hi)
• E.g., suppose the first 10 candies we taste are all lime

Learning Hypotheses and Predicting from Them
• [Figure: (a) posterior probability of each hypothesis h1–h5 after k lime candies; (b) probability that the next candy is lime. Image by MIT OpenCourseWare.]
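The Bayesian update above can be sketched directly from the numbers on these slides (the priors (.1, .2, .4, .2, .1) and the five flavor distributions); the function names are my own:

```python
# Posterior update for the Surprise Candy example, using the slide's numbers.
# p_lime[i] = P(lime | h_{i+1}); priors[i] = P(h_{i+1}).
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]
priors = [0.1, 0.2, 0.4, 0.2, 0.1]

def posteriors_after_limes(k):
    """P(hi | d) after observing k lime candies in a row (Bayes' rule)."""
    unnorm = [p * (q ** k) for p, q in zip(priors, p_lime)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def prob_next_lime(k):
    """Full Bayesian prediction: P(next = lime | d) = sum_i P(lime | hi) P(hi | d)."""
    return sum(q * p for q, p in zip(p_lime, posteriors_after_limes(k)))
```

For example, after 3 limes h5 has the largest posterior (about 0.42), yet the Bayesian prediction that the next candy is lime is about 0.80: the gap between predicting from the single most probable hypothesis and averaging over all of them.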
MAP Prediction: Predict Just from the Most Probable Hypothesis
• After 3 limes, h5 is the most probable hypothesis, hence we predict lime
• Even though, by (b), the next candy is only 80% probable to be lime
• [Figure: two plots against the number of samples in d, from 0 to 10: (a) posterior probability of each hypothesis, P(h1 | d) through P(h5 | d); (b) probability that the next candy is lime]

Observations
• The Bayesian approach asks for prior probabilities on hypotheses!
• Natural way to encode bias against complex hypotheses: make their prior probability very low
• Choosing hMAP to maximize
  P(d | h) P(h)
  is equivalent to minimizing
  -log2 P(d | h) - log2 P(h)
  and since entropy is a measure of information, these two terms are
  • # of bits needed to describe the data given the hypothesis
  • # of bits needed to specify the hypothesis
• Thus, MAP learning chooses the hypothesis that maximizes compression of the data: the Minimum Description Length (MDL) principle
• Regularization is similar to the 2nd term: a penalty for complexity
• Assuming uniform priors on hypotheses makes MAP yield hML, the maximum likelihood hypothesis, which maximizes P(d | h)

Learning More Complex Hypotheses
• Input:
  • A set of cases, each of which includes numerous features: categorical labels, ordinals, continuous values
    • these correspond to the independent variables
• Output: for each case, a result, prediction, classification, etc., corresponding to the dependent variable
  • In regression problems, a continuous output
    • a designated feature the model tries to predict
  • In classification problems, a discrete output
    • the category to which the case is assigned
• Task: learn a function f(input) = output that minimizes some measure of error

Linear Regression
• General form of the function:
  y = β0 + β1 x1 + β2 x2 + ... + βn xn
• For each case j, the model predicts ŷj = β0 + Σi βi xij
• Find the βi to minimize some function of the errors (yj - ŷj) over all cases j
  • e.g., mean squared error: (1/m) Σj (yj - ŷj)²

Logistic Regression
• Logistic function:
  p = 1 / (1 + e^-(β0 + Σi βi xi))
• E.g., how risk factors contribute to the probability of death
• The βi are the log odds ratios
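A minimal sketch of the two regression forms just described, with illustrative data and hypothetical function names (the closed-form least-squares fit shown applies to the single-feature case):

```python
# Least-squares linear fit (one feature) and the logistic function.
# The data below is illustrative, not from the slides.
from math import exp

def fit_simple_linear(xs, ys):
    """Find (b0, b1) minimizing mean squared error of y ~ b0 + b1*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    return my - b1 * mx, b1

def logistic(b0, betas, xs):
    """p = 1 / (1 + e^-(b0 + sum_i beta_i * x_i)); the betas are log odds ratios."""
    z = b0 + sum(b * x for b, x in zip(betas, xs))
    return 1.0 / (1.0 + exp(-z))

b0, b1 = fit_simple_linear([1, 2, 3, 4], [3.1, 5.0, 6.9, 9.1])  # roughly y = 1 + 2x
```

With more than one feature the same MSE criterion is minimized, but the closed form involves solving the normal equations rather than the simple slope formula above.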
More Sophisticated Models
• Nearest Neighbor Methods
• Classification Trees
• Artificial Neural Nets
• Support Vector Machines
• Bayes Networks (much on this, later)
• Rough Sets, Fuzzy Sets, etc. (see 6.873/HST951 or other ML classes)

How?
• Given: a pile of training data, all cases labeled with the gold-standard outcome
• Learn the "best" model
• Gather new test data, also all labeled with outcomes
• Test the performance of the model on the new test data
• Simple, no?

Simplest Example
• Relationship between a diagnostic conclusion and a diagnostic test:

                    Test Positive     Test Negative
  Disease Present   True Positive     False Negative   TP+FN
  Disease Absent    False Positive    True Negative    FP+TN
                    TP+FP             FN+TN

Definitions
• Sensitivity (true positive rate): TP/(TP+FN)
  • False negative rate: 1 - Sensitivity = FN/(TP+FN)
• Specificity (true negative rate): TN/(FP+TN)
  • False positive rate: 1 - Specificity = FP/(FP+TN)
• Positive Predictive Value (PPV): TP/(TP+FP)
• Negative Predictive Value (NPV): TN/(FN+TN)

Test Thresholds
• [Figures: overlapping distributions of test values for the diseased (+) and non-diseased (-) populations, with a decision threshold T; the overlap regions are the FP and FN. A wonderful test has almost no overlap.]
• Changing the test threshold T trades off sensitivity against specificity

Receiver Operator Characteristic (ROC) Curve
• [Figure: TPR (sensitivity) on the y-axis vs. FPR (1 - specificity) on the x-axis, each from 0 to 1, traced out as the threshold T varies]
• What makes a better test? [Figure: a diagonal ROC curve is worthless, a moderately bowed curve is OK, and a curve hugging the upper-left corner is superb]

Need to Explore Many Models
• Remember:
  • training set => model
  • model + test set => measure of performance
• But:
  • How do we choose the best family of models?
  • How do we choose the important features?
  • Models may have structural parameters
    • Number of hidden units in an ANN
    • Max number of parents in a Bayes Net
  • Parameters (like the betas in LR), and meta-parameters
• Not legitimate to "try all" and report the best!

The Lady Tasting Tea
• R.A. Fisher & the Lady
• Muriel Bristol claimed she preferred tea added to milk rather than milk added to tea
• Fisher was skeptical that she could distinguish the two
• Possible resolutions:
  • Reason about the chemistry of tea and milk
    • Milk first: a little tea interacts with a lot of milk
    • Tea first: vice versa
  • Perform a "clinical trial"
    • Ask her to determine the order for a series of test cups
    • Calculate the probability that her answers could have occurred by chance guessing; if that probability is small, she "wins"
    • ... Fisher's Exact Test
• Significance testing
  • Reject the null hypothesis (that it happened by chance) if its probability is < 0.1, 0.05, 0.01, 0.001, ..., 0.000001, ..., ????

How to Deal with Multiple Testing
• Suppose Ms. Bristol had tried this test 100 times, and passed once. Would you be convinced of her ability to
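The chance calculation and the multiple-testing trap can both be sketched numerically. I assume the classic design attributed to Fisher, 8 cups with 4 of each preparation, of which the lady must identify the 4 milk-first cups; the slides only say "a series of test cups":

```python
# Fisher's tea test: probability of a perfect score by pure guessing, and the
# probability of at least one chance "pass" across many repeated attempts.
# The 8-cup / 4-milk-first design is an assumption, not stated on the slides.
from math import comb

def p_perfect_by_chance(n_milk=4, n_tea=4):
    """A guesser picks which n_milk of the cups are milk-first: 1 / C(n, n_milk)."""
    return 1 / comb(n_milk + n_tea, n_milk)

def p_any_pass(n_trials, p_single):
    """P(at least one pass in n_trials independent tries by chance alone)."""
    return 1 - (1 - p_single) ** n_trials
```

A single perfect score has chance probability 1/70, about 0.014, below the usual 0.05 threshold; but over 100 independent attempts the chance of at least one such pass is about 0.76, which is why one pass out of 100 tries should not be convincing.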