CMU CS 15381 - Learning Conclusion: Cross-Validation

Learning Conclusion: Cross-Validation
Bayes Nets Intro: Representing and Reasoning about Uncertainty

Final Considerations: Avoiding Overfitting
• We have a choice of different techniques: decision trees, neural networks, nearest neighbors, Bayes classifier, …
• For each we have different levels of complexity:
– Depth of trees
– Number of layers and hidden units
– Number of neighbors in K-NN
– …
• How do we choose the right one?
• Overfitting: a complex enough model (e.g., enough units in a neural network, a large enough tree, …) will always be able to fit the training data well.

Example
• Construct a predictor of y from x given this training data.
[Figure: the same training data fit with linear, quadratic, and piecewise-linear models]
• Which model is best for predicting y from x?
• We want the model that generates the best predictions on future data, not necessarily the one with the lowest error on the training data.

Using a Test Set
1. Use a portion (e.g., 30%) of the data as test data.
2. Fit a model to the remaining training data.
3. Evaluate the error on the test data.
[Figure: the three models evaluated on the test set, with errors 2.4, 2.2, and 0.9]

Using a Test Set:
+ Simple
- Wastes a large % of the data
- May get lucky with one particular subset of the data

"Leave One Out" Cross-Validation
• For k = 1 to R:
– Train on all the data, leaving out (xk, yk)
– Evaluate the error on (xk, yk)
• Report the average error after trying all the data points.
[Figure: leave-one-out errors for the three models: 2.12, 0.962, and 3.33]
Note: numerical examples in this and subsequent slides are from A. Moore.

"Leave One Out" Cross-Validation:
+ Does not waste data
+ Averages over a large number of trials
- Expensive

K-Fold Cross-Validation
• Randomly divide the data set into K subsets.
• For each subset S:
– Train on the data not in S
– Test on the data in S
• Return the average error over the K subsets.
[Figure: K = 3, each color corresponds to a subset; errors 2.05, 1.11, and 2.93]

Cross-Validation Summary
• Test Set: + simple/efficient; - wastes a lot of data, poor predictor of future performance
• Leave One Out: + does not waste data; - inefficient
• K-Fold: - wastes 1/K of the data (but only 1/K!); - K times slower than Test Set (but only K times!)

Classification Problems
• Exactly the same approaches apply for cross-validation, except that the error is the number of data points that are misclassified.
[Figure: a two-class data set with regions y = 1 and y = 0]

Example: Training a Neural Net
• Train neural nets with different numbers of hidden units (more and more complex NNs).
• For each NN, evaluate the error using K-fold cross-validation.
• Choose the one with the minimum cross-validation error.
[Figure: cross-validation error vs. number of hidden units, with the minimum marked]

Summary (R&N Chapter 20)
• Learning algorithms:
– Naïve Bayes
– Decision trees
– Nearest neighbors
– Neural networks
• Validation:
– The error on the training set should never be used directly to evaluate a learning algorithm on a data set.
– Validation on a test set.
– Cross-validation to avoid wasting data: leave one out; K-fold (code sketches of these procedures follow below).
– Used for: finding the best configuration of a learned model (complexity of a neural network, K in K-NN, etc.) and deciding between different learning algorithms (neural networks, nearest neighbors, decision trees, …).
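The test-set procedure above is easy to sketch in code. The following is a minimal illustration on synthetic data; the data-generating function, the 70/30 split sizes, and the polynomial models are illustrative assumptions, not the slide's example, so the error values will differ from the 2.4/2.2/0.9 in the figure.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 40)                     # hypothetical inputs
y = np.sin(x) + 0.3 * rng.normal(size=x.size)      # hypothetical targets

# 1. Use a portion (here ~30%) of the data as test data.
order = rng.permutation(x.size)
test, train = order[:12], order[12:]

for degree, name in [(1, "linear"), (2, "quadratic")]:
    # 2. Fit the model to the remaining training data.
    coeffs = np.polyfit(x[train], y[train], degree)
    # 3. Evaluate the error on the test data.
    mse = np.mean((np.polyval(coeffs, x[test]) - y[test]) ** 2)
    print(f"{name}: test error = {mse:.3f}")
```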
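Leave-one-out cross-validation, as described above, trains R times and tests each time on the single held-out point. A minimal sketch under the same hypothetical polynomial setup:

```python
import numpy as np

def loocv_error(x, y, degree):
    """Average squared error over R fits, each leaving out one (x_k, y_k)."""
    errors = []
    for k in range(x.size):
        keep = np.arange(x.size) != k              # all the data except point k
        coeffs = np.polyfit(x[keep], y[keep], degree)
        pred = np.polyval(coeffs, x[k])            # evaluate on the held-out point
        errors.append((pred - y[k]) ** 2)
    return np.mean(errors)                         # report the average error

# e.g. loocv_error(x, y, 1) vs. loocv_error(x, y, 2) compares linear and
# quadratic fits on data arrays x, y such as those in the sketch above.
```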
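K-fold cross-validation can be sketched the same way; the only changes are the random partition into K subsets and the per-fold averaging. Again a minimal illustration under the same assumptions; a comment notes the change needed for classification problems.

```python
import numpy as np

def kfold_error(x, y, degree, K=3, seed=0):
    """Average test error over K randomly chosen folds."""
    # Randomly divide the data set into K subsets.
    folds = np.array_split(np.random.default_rng(seed).permutation(x.size), K)
    errors = []
    for fold in folds:
        keep = np.ones(x.size, dtype=bool)
        keep[fold] = False                          # train on the data not in S
        coeffs = np.polyfit(x[keep], y[keep], degree)
        pred = np.polyval(coeffs, x[fold])          # test on the data in S
        errors.append(np.mean((pred - y[fold]) ** 2))
        # For classification, use the fraction of misclassified points here
        # instead of the squared error.
    return np.mean(errors)                          # average over the K subsets
```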
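The neural-net example above is a model-selection loop. A sketch of that loop, assuming scikit-learn is available; the candidate hidden-unit counts and the helper name pick_hidden_units are illustrative choices, not part of the lecture.

```python
# Model selection by K-fold cross-validation: try increasingly complex
# nets and keep the hidden-unit count with the lowest CV error.
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

def pick_hidden_units(X, y, candidates=(2, 4, 8, 16, 32), K=5):
    cv_error = {}
    for h in candidates:
        net = MLPClassifier(hidden_layer_sizes=(h,), max_iter=2000,
                            random_state=0)
        scores = cross_val_score(net, X, y, cv=K)   # K-fold accuracy
        cv_error[h] = 1.0 - scores.mean()           # misclassification rate
    return min(cv_error, key=cv_error.get)          # minimum CV error wins
```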
Bayes Nets: Representing and Reasoning about Uncertainty

Bayes Nets
• Material covered in Russell & Norvig, Chapter 14.
• Not covered in the lectures: networks with continuous variables.
• Not covered in the chapter: d-separation.

Reasoning with Uncertainty
• Most real-world problems deal with uncertain information:
– Diagnosis: likely disease given observed symptoms
– Equipment repair: likely component failure given sensor readings
– Help desk: likely operation based on past operations
• We saw how to use probability to represent uncertainty and to perform queries such as inference:
– Diagnosis: Prob(disease | observed symptoms)
– Equipment repair: Prob(component | sensor readings)
– Help desk: Prob(likely operation | past operations)
• We saw that representing probability distributions can be inefficient (or intractable) for large problems.
• Today: Bayes nets provide a powerful tool for making reasoning with uncertainty manageable, by taking advantage of independence relations between variables.
• For example: knowing that the hand brake is operational does not help diagnose why the engine does not start!
• We'll start by reviewing our key probability tools.

Probability Reminder
• Conditional probability for two events A and B:
P(A|B) = P(A,B) / P(B)
• Chain rule:
P(A,B) = P(A|B) P(B)

Probability Reminder
• Conditional probability for two variables X and Y, for any values x, y:
P(X=x | Y=y) = P(X=x, Y=y) / P(Y=y)
• Chain rule:
P(X=x, Y=y) = P(X=x | Y=y) P(Y=y)

The Joint Distribution
• Joint distribution = the collection of all the probabilities P(X=x, Y=y, Z=z, …) for all possible combinations of values.
• For m binary variables, its size is 2^m.

X   Y   Z   Prob
F   F   F   0.08
F   F   T   0.07
F   T   F   0.15
F   T   T   0.10
T   F   F   0.08
T   F   T   0.20
T   T   F   0.22
T   T   T   0.10

• Any query can be computed from the joint distribution:
– Marginal distributions, e.g., P(X=True), P(X=False) or P(Y=True), P(Y=False)
– Conditional distributions, e.g., P(X=True | Y=True) = P(X=True, Y=True) / P(Y=True)
• In general: P(E1 | E2) = P(E1, E2) / P(E2), where P(E2) = Σ P(joint entries) over the entries that match E2, and E1 and E2 are assignments of values to subsets of the variables.
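As a concrete check of the queries above, the joint table can be stored as a dictionary and any marginal or conditional probability computed by summing the matching entries. A minimal sketch using the slide's numbers; the helper name prob and the dictionary encoding are illustrative choices.

```python
# The slide's joint distribution, keyed by (X, Y, Z) truth values.
joint = {
    (False, False, False): 0.08, (False, False, True): 0.07,
    (False, True,  False): 0.15, (False, True,  True): 0.10,
    (True,  False, False): 0.08, (True,  False, True): 0.20,
    (True,  True,  False): 0.22, (True,  True,  True): 0.10,
}

def prob(evidence):
    """P(E) = sum of the joint entries that match the partial assignment E."""
    return sum(p for (x, y, z), p in joint.items()
               if all(evidence.get(name, value) == value
                      for name, value in (("X", x), ("Y", y), ("Z", z))))

p_y  = prob({"Y": True})                      # marginal:     0.57
p_xy = prob({"X": True, "Y": True})           # joint:        0.32
print(p_xy / p_y)                             # P(X=T | Y=T) = 0.32 / 0.57 ≈ 0.561
# Chain-rule check: P(X=T, Y=T) equals P(X=T | Y=T) * P(Y=T).
```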

