Artificial Intelligence: Representation and Problem Solving
15-381, May 8, 2007
Review Session for Final
Michael S. Lewicki, Carnegie Mellon

Essential things to know (from 2nd half)
•Probability and Uncertainty
•Bayes nets
•Decision Trees
•Probabilistic Learning
•Cross Validation
•Clustering and k-Nearest Neighbors
•Neural Nets
•MDPs
•HMMs
•Reinforcement Learning

Probability and Uncertainty
•sum rule (marginalization):
  P(X = x_i) = \sum_{j=1}^{N} P(X = x_i, Y = y_j)
  shorthand notation: p(x) = \sum_y p(x, y), or p(x) = \int dy\, p(x, y) for continuous variables
•product rule: p(x, y) = p(x|y) p(y)
•Bayes rule:
  p(y|x) = \frac{p(x|y) p(y)}{p(x)} = \frac{p(x|y) p(y)}{\sum_{y'} p(x|y') p(y')}
•independence: p(x, y) = p(x) p(y) iff x and y are independent; for mutually independent variables,
  p(x_1, x_2, \ldots, x_N) = \prod_{n=1}^{N} p(x_n)

Bayes Nets
•A Bayes net is a directed graph of variables with the property that a variable is conditionally independent of its non-descendants given its parents:
  P(X | Parents(X), Y, Z, W) = P(X | Parents(X))
•Associated with each variable is the Conditional Probability Table (CPT) of P(X | Parents(X)).
•Any entry of the joint probability distribution can be computed from the Bayes net by looking only at the CPTs:
  P(x_1, \ldots, x_n) \equiv P(X_1 = x_1 \wedge \ldots \wedge X_n = x_n) = \prod_{i=1}^{n} P(x_i | parents(X_i))

Bayes Nets: inference
•Inference: given observed values for a subset of variables (the "evidence" E2), compute the probability distribution of another set of variables E1 (the "query"), symbolically denoted by P(E1 | E2).
•In principle, any inference can be computed from a Bayes net by summing up the appropriate entries of the joint distribution.
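The sum, product, and Bayes rules above can be checked numerically. A minimal sketch, using an invented 2×2 joint distribution (the variable values and probabilities below are hypothetical, not from the slides):

```python
# Hypothetical joint distribution p(x, y) over two binary variables,
# used to illustrate the sum rule, product rule, and Bayes rule.
joint = {
    ("x0", "y0"): 0.30, ("x0", "y1"): 0.20,
    ("x1", "y0"): 0.10, ("x1", "y1"): 0.40,
}

def marginal_x(joint, x):
    # sum rule: p(x) = sum_y p(x, y)
    return sum(p for (xi, _), p in joint.items() if xi == x)

def posterior_y_given_x(joint, x):
    # Bayes rule: p(y|x) = p(x|y) p(y) / p(x); by the product rule the
    # numerator p(x|y) p(y) is just the joint entry p(x, y).
    px = marginal_x(joint, x)
    return {yi: p / px for (xi, yi), p in joint.items() if xi == x}

print(marginal_x(joint, "x0"))           # 0.5
print(posterior_y_given_x(joint, "x0"))  # p(y|x0): 0.6 for y0, 0.4 for y1
```

The posterior is properly normalized because the denominator expands as \sum_{y'} p(x|y') p(y'), exactly the form on the slide.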
Bayes Nets: complexity of inference
•Inference in general requires time exponential in the number of variables, because of the number of terms in the sums.
•The sums can be simplified by cleverly grouping the terms that depend on only one (or a small number of) variables.
•This can be implemented in polynomial time if the graph is a "polytree" (i.e., the undirected version of the graph is a tree).
•Expected:
-Understand how probabilities can be expressed as sums of products
-How to group terms advantageously, and do so on simple examples

Statistical dependences in belief nets
[Figure: two example belief-net graphs over the nodes a, b, c, d, e, f]
•Is an independence statement, e.g. a ⊥ b | c, implied by a graph?
•If two nodes are conditionally independent given another variable, e.g. a ⊥ b | c, then p(a|b,c) = p(a|c).
•Note: There is a technique called "d-separation" for determining whether two nodes (or sets of nodes) are independent. It was covered in previous years and appears in previous exams, but the book does not cover it and we did not cover it in lecture, so you are not responsible for this material.
•What you should understand: how information can propagate in belief networks to affect the beliefs about certain nodes.
•Is a independent of b given c? No. The inferred value of f depends on b. e is also a factor in determining d, and d in turn influences a. So the value of b can influence the value of a, and they are not independent.
•Is a independent of b given e? Yes. A fixed e completely determines the probability of b via p(b|e).

Types of learning
[Figure: three diagrams. Supervised: the world (or data) feeds a model with parameters {θ_1, ..., θ_n}, together with desired outputs {y_1, ..., y_n}. Unsupervised: the world (or data) feeds the model alone. Reinforcement: the model's output is evaluated by a reinforcement signal.]
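The factored joint from the Bayes-net slides, P(x_1, ..., x_n) = \prod_i P(x_i | parents(X_i)), can be read off the CPTs directly. A sketch using a hypothetical three-node chain (the node names Cloudy, Rain, WetGrass and all CPT numbers are invented for illustration):

```python
# Hypothetical net: Cloudy -> Rain -> WetGrass, with invented CPTs.
parents = {"Cloudy": [], "Rain": ["Cloudy"], "WetGrass": ["Rain"]}
cpt = {
    # key: (variable, value, tuple of parent values) -> probability
    ("Cloudy", True, ()): 0.5,                ("Cloudy", False, ()): 0.5,
    ("Rain", True, (True,)): 0.8,             ("Rain", False, (True,)): 0.2,
    ("Rain", True, (False,)): 0.1,            ("Rain", False, (False,)): 0.9,
    ("WetGrass", True, (True,)): 0.9,         ("WetGrass", False, (True,)): 0.1,
    ("WetGrass", True, (False,)): 0.0,        ("WetGrass", False, (False,)): 1.0,
}

def joint_prob(assignment):
    # One entry of the joint distribution, computed by multiplying the
    # CPT entry P(x_i | parents(X_i)) for every variable -- no other
    # information about the joint is needed.
    p = 1.0
    for var, val in assignment.items():
        pvals = tuple(assignment[pa] for pa in parents[var])
        p *= cpt[(var, val, pvals)]
    return p

# Product of three CPT entries: 0.5 * 0.8 * 0.9
print(joint_prob({"Cloudy": True, "Rain": True, "WetGrass": True}))
```

Summing `joint_prob` over all 2^3 assignments gives 1, which is a quick sanity check that the CPTs define a valid distribution.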
Polynomial curve fitting
[Figure: polynomial fits of increasing order M to noisy data, illustrating "overfitting"; example from Bishop (2006), Pattern Recognition and Machine Learning]
•Model: y(x, w) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j
•Error function: E(w) = \frac{1}{2} \sum_{n=1}^{N} [y(x_n, w) - t_n]^2

General principle
•As its complexity increases, the model is able to better classify the training data.
•Performance on the test data initially increases, but then falls as the model overfits, i.e., becomes specialized to classifying the noise in the training data.
•The complexity in decision trees is the number of free parameters, i.e., the number of nodes.
[Figure: % correct classification vs. complexity of model (e.g., size of tree); classification performance on the training data keeps rising while performance on the test data peaks and then falls in the region of overfitting]

Decision trees: classifying from a set of attributes
Predicting credit risk:

  <2 years at current job? | missed payments? | defaulted?
  N                        | N                | N
  Y                        | N                | Y
  N                        | N                | N
  N                        | N                | N
  N                        | Y                | Y
  Y                        | N                | N
  N                        | Y                | N
  N                        | Y                | Y
  Y                        | N                | N
  Y                        | N                | N

Decision tree: the root (bad: 3, good: 7) splits on "missed payments?": N → (bad: 1, good: 6), Y → (bad: 2, good: 1). The N branch splits on "<2 years at current job?": N → (bad: 0, good: 3), Y → (bad: 1, good: 3).
•each level splits the data according to different attributes
•goal: achieve perfect classification with a minimal number of decisions
-not always possible due to noise or inconsistencies in the data
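The polynomial model and its sum-of-squares error can be evaluated directly from the two formulas on the curve-fitting slide. A minimal sketch; the data points and weight vectors below are invented toy values, not Bishop's actual dataset:

```python
# y(x, w) = sum_j w_j x^j   and   E(w) = (1/2) sum_n [y(x_n, w) - t_n]^2

def y(x, w):
    # Polynomial of order M = len(w) - 1 evaluated at x.
    return sum(wj * x**j for j, wj in enumerate(w))

def E(w, data):
    # Sum-of-squares error over training pairs (x_n, t_n).
    return 0.5 * sum((y(x, w) - t) ** 2 for x, t in data)

data = [(0.0, 0.0), (0.5, 1.0), (1.0, 0.0)]  # toy (x_n, t_n) pairs
w_line = [0.0, 0.0]      # M = 1, the zero polynomial
print(E(w_line, data))   # residuals 0, 1, 0  ->  E = 0.5
```

Minimizing E(w) over w (e.g., by solving the normal equations) gives the fitted curve; increasing M drives the training error toward zero, which is exactly the overfitting behavior described on the next slide.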
Entropy
•How many bits does it take to specify the attribute 'defaulted?'
-P(defaulted = Y) = 3/10
-P(defaulted = N) = 7/10
  H(Y) = -\sum_{i \in \{Y, N\}} P(Y = y_i) \log_2 P(Y = y_i) = -0.3 \log_2 0.3 - 0.7 \log_2 0.7 = 0.8813
•How much can we reduce the entropy (or uncertainty) of 'defaulted' by knowing the other attributes?
•Ideally, we could reduce it to zero, in which case we classify perfectly.
(Refer to the "Predicting credit risk" table above.)

Conditional entropy
•H(Y|X) is the remaining entropy of Y given X, or: the expected (or average) entropy of P(y|x).
•H(Y|X=x) is the specific conditional entropy, i.e. the entropy of Y knowing the value of a specific attribute x.
  H(Y|X) \equiv -\sum_x P(x) \sum_y P(y|x) \log_2 P(y|x)
         = -\sum_x P(x) \sum_y P(Y = y|X = x) \log_2 P(Y = y|X = x)
         = \sum_x P(x) H(Y|X = x)

Back to the credit risk example
(Refer again to the "Predicting credit risk" table: attributes "<2 yrs" and "missed", outcome "def?")
  H(Y|X) \equiv -\sum_x P(x) \sum_y P(y|x) \log_2 P(y|x)
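The worked example is truncated in this copy, but H(Y) and H(Y|X) are fully determined by the ten table rows. A sketch computing both (rows transcribed from the slide; the function names are my own):

```python
import math

# Rows from the credit-risk table:
# (<2 years at current job?, missed payments?, defaulted?)
rows = [
    ("N", "N", "N"), ("Y", "N", "Y"), ("N", "N", "N"), ("N", "N", "N"),
    ("N", "Y", "Y"), ("Y", "N", "N"), ("N", "Y", "N"), ("N", "Y", "Y"),
    ("Y", "N", "N"), ("Y", "N", "N"),
]

def entropy(labels):
    # H = -sum_i p_i log2 p_i over the label frequencies.
    n = len(labels)
    counts = {v: labels.count(v) for v in set(labels)}
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def conditional_entropy(col):
    # H(defaulted | X) = sum_x P(X = x) H(defaulted | X = x),
    # where X is the attribute in column `col`.
    h = 0.0
    for v in {r[col] for r in rows}:
        subset = [r[2] for r in rows if r[col] == v]
        h += (len(subset) / len(rows)) * entropy(subset)
    return h

h_y = entropy([r[2] for r in rows])
print(round(h_y, 4))                            # 0.8813, matching the slide
print(round(conditional_entropy(1), 4))         # H(defaulted | missed payments)
print(round(h_y - conditional_entropy(1), 4))   # information gain of the split
```

The attribute with the larger information gain H(Y) - H(Y|X) is the better first split, which is how the tree on the earlier slide ends up splitting on "missed payments?" at the root.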