CS 559: Machine Learning Fundamentals and Applications (Midterm Recap)
Instructor: Philippos Mordohai
Webpage: www.cs.stevens.edu/~mordohai
E-mail: [email protected]
Office: Lieb 215

Midterm
• October 21 (first part of the class)
• Open books/notes
• No graphing calculators
(Pattern Classification, Chapter 2)

Outline
• Probability theory
• Bayes decision theory
• Maximum-likelihood and Bayesian parameter estimation
• Expectation maximization
• Non-parametric techniques
• Hidden Markov models
• Principal component analysis

Pairs of Discrete Random Variables
• Let x and y be two discrete random variables
• For each possible pair of values, we can define a joint probability p_ij = Pr[x = x_i, y = y_j]
• We can also define a joint probability mass function P(x, y), which offers a complete characterization of the pair of random variables
• Marginal distributions:
  P_x(x) = Σ_y P(x, y)
  P_y(y) = Σ_x P(x, y)
• Note that P_x and P_y are different functions

Conditional Probability
• When two random variables are not independent, knowing one allows a better estimate of the other (e.g. outside temperature and season)
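The joint, marginal, and conditional definitions above can be sketched in a few lines of NumPy. This is a minimal illustration, not from the slides; the pmf values are made up and chosen only so the table sums to 1:

```python
import numpy as np

# Hypothetical joint pmf: rows index values of x, columns index values of y.
P_xy = np.array([[0.10, 0.20],
                 [0.30, 0.40]])

# Marginals: P_x(x) = sum_y P(x, y) and P_y(y) = sum_x P(x, y)
P_x = P_xy.sum(axis=1)   # sum over columns (y) -> [0.30, 0.70]
P_y = P_xy.sum(axis=0)   # sum over rows (x)    -> [0.40, 0.60]

# Conditional: Pr[x = x_i | y = y_j] = Pr[x = x_i, y = y_j] / Pr[y = y_j]
# Broadcasting divides each column j of the joint table by P_y[j].
P_x_given_y = P_xy / P_y

print(P_x, P_y)
print(P_x_given_y)
```

Each column of `P_x_given_y` sums to 1, as a conditional distribution over x must for every fixed y.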
  Pr[x = x_i | y = y_j] = Pr[x = x_i, y = y_j] / Pr[y = y_j]
• If independent, P(x | y) = P(x)

Law of Total Probability
• If an event A can occur in m different ways, and these m different ways are mutually exclusive, then the probability of A occurring is the sum of the probabilities of the sub-events:
  P(X = x_i) = Σ_j P(X = x_i | Y = y_j) P(Y = y_j)

Bayes Rule
  P(x | y) = P(y | x) P(x) / P(y)
  posterior = likelihood × prior / evidence
• x is the unknown cause
• y is the observed evidence
• Denominator often omitted (maximum a posteriori solution)
• Bayes rule shows how the probability of x changes after we have observed y

Normal (Gaussian) Distribution
• Central Limit Theorem: under various conditions, the distribution of the sum of d independent random variables approaches a limiting form known as the normal distribution
  p(x) = (1 / (√(2π) σ)) exp(−(x − μ)² / (2σ²)),  i.e. x ~ N(μ, σ²)

Bayes Decision Theory
• Probability distributions for the categories are known
• Do not need training data
• Can design optimal classifier
• Very rare in real life

Decision Rules
• Decision rule with only the prior information
  – Decide ω_1 if P(ω_1) > P(ω_2); otherwise decide ω_2
  – Prior comes from prior knowledge; no data have been seen yet
  – If there is a reliable source of prior knowledge, it should be used
• Use of the class-conditional information: p(x | ω_1)
and p(x | ω_2) describe the difference in lightness between populations

Decision Using Posteriors
• Decision given the posterior probabilities: x is an observation for which
  if P(ω_1 | x) > P(ω_2 | x), true state of nature = ω_1
  if P(ω_1 | x) < P(ω_2 | x), true state of nature = ω_2
• Therefore, whenever we observe a particular x, the probability of error is:
  P(error | x) = P(ω_1 | x) if we decide ω_2
  P(error | x) = P(ω_2 | x) if we decide ω_1

Minimizing Risk
• Let {ω_1, ω_2, …, ω_c} be the set of c states of nature (or "categories")
• Let {α_1, α_2, …, α_a} be the set of possible actions
• Let λ(α_i | ω_j) be the loss incurred for taking action α_i when the state of nature is ω_j

Overall Risk
• R is the expected loss associated with a given decision rule
• Minimizing R: minimize R(α_i | x) for i = 1, …, a (select the action that minimizes the risk as a function of x)
  R(α_i | x) = Σ_{j=1}^{c} λ(α_i | ω_j) P(ω_j | x),  for i = 1, …, a
• Select the action α_i for which R(α_i | x) is minimum; R in this case is called the Bayes risk = best performance that can be achieved

Conditional Risk
• Two-category classification
  α_1: decide ω_1
  α_2: decide ω_2
  λ_ij = λ(α_i | ω_j): loss incurred for deciding ω_i when the true state of nature is ω_j
• Conditional risk:
  R(α_1 | x) = λ_11 P(ω_1 | x) + λ_12 P(ω_2 | x)
  R(α_2 | x) = λ_21 P(ω_1 | x) + λ_22 P(ω_2 | x)

Decision Rule
• Our rule is the following:
  if R(α_1 | x) < R(α_2 | x), take action α_1: decide ω_1
• This results in the equivalent rule: decide ω_1 if
  (λ_21 − λ_11) P(x | ω_1) P(ω_1) > (λ_12 − λ_22) P(x | ω_2) P(ω_2)
  and decide ω_2 otherwise
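The two-category minimum-risk rule above can be sketched directly from the conditional-risk formulas. This is a hypothetical example, not from the slides; the loss matrix and posterior values are made up to show how an asymmetric loss can overturn the max-posterior choice:

```python
import numpy as np

# Hypothetical loss matrix: lam[i, j] = loss for taking action alpha_i
# when the true state of nature is omega_j (zero loss on correct decisions).
lam = np.array([[0.0, 2.0],   # alpha_1 (decide omega_1): costly if omega_2 is true
                [1.0, 0.0]])  # alpha_2 (decide omega_2): cheaper mistake

def min_risk_action(posteriors):
    """Conditional risks R(alpha_i | x) = sum_j lam[i, j] * P(omega_j | x);
    return the index of the action with minimum risk, plus the risks."""
    risks = lam @ posteriors
    return int(np.argmin(risks)), risks

# Made-up posteriors P(omega_1 | x), P(omega_2 | x) for some observation x.
action, risks = min_risk_action(np.array([0.6, 0.4]))
# R(alpha_1 | x) = 0*0.6 + 2*0.4 = 0.8
# R(alpha_2 | x) = 1*0.6 + 0*0.4 = 0.6
# Minimum risk picks alpha_2 (decide omega_2) even though P(omega_1 | x) > P(omega_2 | x).
```

The example shows why the loss differences (λ_21 − λ_11) and (λ_12 − λ_22) appear in the equivalent decision rule: they weight the two posteriors before comparison.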
Likelihood Ratio
• The preceding rule is equivalent to the following rule: decide ω_1 if
  P(x | ω_1) / P(x | ω_2) > [(λ_12 − λ_22) / (λ_21 − λ_11)] · P(ω_2) / P(ω_1)
• Then take action α_1 (decide ω_1); otherwise take action α_2 (decide ω_2)

The Zero-One Loss Function
• Zero-one loss function:
  λ(α_i | ω_j) = 0 if i = j, 1 if i ≠ j,  for i, j = 1, …, c
• Therefore, the conditional risk is:
  R(α_i | x) = Σ_j λ(α_i | ω_j) P(ω_j | x) = Σ_{j≠i} P(ω_j | x) = 1 − P(ω_i | x)
• The risk corresponding to this loss function is the average probability of error
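The zero-one loss identity R(α_i | x) = 1 − P(ω_i | x) means that minimizing risk reduces to picking the class with the largest posterior. A minimal sketch, with a made-up posterior vector and three classes for illustration:

```python
import numpy as np

c = 3  # number of classes (illustrative)

# Zero-one loss: 0 on the diagonal (correct decision), 1 off the diagonal.
lam = np.ones((c, c)) - np.eye(c)

# Hypothetical posteriors P(omega_j | x) for some observation x.
posteriors = np.array([0.2, 0.5, 0.3])

# Conditional risks under zero-one loss: R(alpha_i | x) = 1 - P(omega_i | x).
risks = lam @ posteriors

# Minimizing the risk is the same as maximizing the posterior.
best = int(np.argmin(risks))
```

Here `best` coincides with `np.argmax(posteriors)`, the maximum-a-posteriori decision, which is why zero-one loss yields the minimum-error-rate classifier.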