##
This **preview** shows page *1-2-3-19-20-38-39-40*
out of 40 **pages**.

*View Full Document*

End of preview. Want to read all 40 pages?

Upload your study docs or become a GradeBuddy member to access this document.

View Full Document**Unformatted text preview:**

15-381: Artificial IntelligenceDensity estimationInference methods: SummaryLinear in the numberof samples drawn(N)ApproximateUse randomsamples from fulljoint to compute thepartial jointStochasticinferenceWorth caseexponentialExactReusecomputationsVariableeliminationExponential in thenumber ofunobservedvariablesExactLoop over allunobservedvariablesEnumerationRunning timeSolutionBasic ideaMethodConditional Probability Tables(CPT)AJ MB EP(B)=.05P(E)=.1P(A|B,E) )=.95P(A|B,¬E) = .85P(A| ¬ B,E) )=.5P(A| ¬ B, ¬ E) = .05P(J|A) )=.7P(J|¬A) = .05P(M|A) )=.8P(M|¬A) = .15But where do we getthem?Density Estimation• A Density Estimator learns a mapping from a set ofattributes to a ProbabilityDensityEstimatorProbabilityInput data for avariable or a set ofvariablesDensity estimation• Estimate the distribution (or conditional distribution) of arandom variable• Types of variables: - Binary coin flip, alarm - Discrete dice, car model year - Continuous height, weight, temp.,Not just for Bayesian networks …• Density estimators can do many good things…– Can sort the records by probability, and thus spotweird records (anomaly detection)– Can do inference: P(E1|E2)Medical diagnosis / Robot sensors– Ingredient for Bayes networksDensity estimation• Binary and discrete variables:• Continuous variables:Easy: Just count!Harder (but just a bit): Fita modelLearning a density estimatorrecords ofnumber total ][in which records#)][(ˆuixuixP===A trivial learning algorithm!Course evaluation120135501330249031713191EvaluationSizeSummer?P(summer) = #Summer / # records= 23/151 = 0.15P(Evaluation = 1) = #Evaluation=1/ # records= 49/151 = 0.32P(Evaluation = 1 | summer) =P(Evaluation = 1 & summer) /P(summer) = 2/23 = 0.09But why do we count?Computing the joint likelihood ofthe data120135501330249031713191EvaluationSizeSummer?P(summer) = #Summer / # records= 23/151 = 0.15P(Evaluation = 1) = #Evaluation=1/ # records= 49/151 = 0.32P(Evaluation = 1 | summer) =P(Evaluation = 1 & summer) /P(summer) = 2/23 = 0.09!==""=RkkR|MP|MP|MP121)(ˆ)(ˆ)dataset(ˆxxxx KThe next slide presents one of the mostimportant ideas in probabilistic inference. Ithas a huge number of applications in manydifferent and diverse problemsMaximum Likelihood Principle• We can fit models by maximizing the probability ofgenerating the observed samples:L(x1, … ,xn | Θ) = p(x1 | Θ) … p(xn | Θ)• The samples (rows in the table) are assumed to beindependent)• For a binary random variable A with P(A=1)=q argmaxq = #1/#samples• Why?•For a binary random variable A with P(A=1)=q argmaxq = #1/#samples• Why?Data likelihood:We would like to find:Maximum Likelihood Principle21)1()|(nnqqMDP !=21)1(maxargnnqqq !Data likelihood:We would like to find:Maximum Likelihood Principle21)1()|(nnqqMDP !=21)1(maxargnnqqq !211211212111121112110)1(0))1(()1(0)1()1(0)1()1()1(212121212121nnnqqnqnnqnqnqnqnqqqnqqqnqqnqqqnqqqnnnnnnnnnnnn+=!+=!=""!="""!="""!=##"""="##""""""Log ProbabilitiesWhen working with products, probabilities ofentire datasets often get too small. A possiblesolution is to use the log of probabilities, oftentermed ‘log likelihood’!"====RkkRkk|MP|MP|MP11)(ˆlog)(ˆlog)dataset(ˆlog xxLog valuesbetween 0 and 1Density estimation• Binary and discrete variables:• Continuous variables:Easy: Just count!Harder (but just a bit): Fita modelBut what if weonly have veryfew samples?The danger of joint densityestimation120135501330249031713191EvaluationSizeSummer?P(summer & size > 20 & evaluation = 3)= 0- No such example in our datasetNow lets assume we are given anew (often called ‘test’) dataset. Ifthis dataset contains the lineSummer Size Evaluation 1 30 3Then the probability we wouldassign to the entire dataset is 0Naïve Density EstimationThe problem with the Joint Estimator is that it justmirrors the training data.We need something which generalizes more usefully.The naïve model generalizes strongly:Assume that each attribute is distributedindependently of any of the other attributes.Joint estimation, revisited120135501330249031713191EvaluationSizeSummer?Assuming independence we cancompute each probability independentlyP(Summer) = 0.15P(Evaluation = 1) = 0.32P(Size > 20) = 0.63How do we do on the joint?P(Summer & Evaluation = 1) = 0.09P(Summer)P(Evaluation = 1) = 0.05P(size > 20 & Evaluation = 1) = 0.23P(size > 20)P(Evaluation = 1) = 0.20Not bad !Joint estimation, revisited120135501330249031713191EvaluationSizeSummer?Assuming independence we cancompute each probability independentlyP(Summer) = 0.15P(Evaluation = 1) = 0.32P(Size > 20) = 0.63How do we do on the joint?P(Summer & Size > 20) = 0.026P(Summer)P(Size > 20) = 0.094We must be careful when using the Naïvedensity estimatorContrastGiven 100 records and 10,000multivalued attributes will be fineGiven 100 records and more than 6Boolean attributes will screw upbadlyOutside Naïve’s scopeNo problem to model “C is a noisycopy of A”Can model only very boringdistributionsCan model anythingNaïve DEJoint DEDealing with small datasets• We just discussed one possibility: Naïve estimation• There is another way to deal with small number ofmeasurements that is often used in practice.• Assume we want to compute the probability of heads in acoin flip - What if we can only observe 3 flips? - 25% of the times a maximum likelihood estimator willassign probability of 1 to either the heads or tailsPseudo counts- What if we can only observe 3 flips?- 25% of the times a maximum likelihood estimator will assign probability of 1 toeither the heads or tails• In these cases we can use prior belief about the‘fairness’ of most coins to influence the resulting model.• We assume that we have observed 10 flips with 5 tailsand 5 heads• Thus p(heads) = (#heads+5)/(#flips+10)• Advantages: 1. Never assign a probability of 0 to an event 2. As more data accumulates we can get very close to the realdistribution (the impact of the pseudo counts will diminish rapidly)Pseudo counts- What if we can only observe 3 flips?- 25% of the times a maximum likelihood estimator will assign probability of 1 toeither the heads or tails• In these cases we can use prior belief about the‘fairness’ of most coins to influence the resulting model.• We assume that we have observed 10 flips with 5 tailsand 5 heads• Thus p(heads) = (#heads+5)/(#flips+10)• Advantages: 1. Never assign a probability of 0 to an event 2. As more data

View Full Document