CMU CS 15381 - Density estimation

15-381: Artificial Intelligence
Density estimation

Inference methods: Summary

  Method                Basic idea                                         Solution     Running time
  Enumeration           Loop over all unobserved variables                 Exact        Exponential in the number of unobserved variables
  Variable elimination  Reuse computations                                 Exact        Worst case exponential
  Stochastic inference  Use random samples from the full joint             Approximate  Linear in the number of samples drawn (N)
                        to compute the partial joint

Conditional Probability Tables (CPT)

(Burglary network: B and E are parents of A; A is the parent of J and M)

  P(B) = .05        P(E) = .1
  P(A|B,E) = .95    P(A|B,¬E) = .85
  P(A|¬B,E) = .5    P(A|¬B,¬E) = .05
  P(J|A) = .7       P(J|¬A) = .05
  P(M|A) = .8       P(M|¬A) = .15

But where do we get them?

Density Estimation

• A Density Estimator learns a mapping from a set of attributes to a probability:

  input data for a variable or a set of variables → Density Estimator → probability

Density estimation

• Estimate the distribution (or conditional distribution) of a random variable
• Types of variables:
  - Binary: coin flip, alarm
  - Discrete: dice, car model year
  - Continuous: height, weight, temperature

Not just for Bayesian networks …

• Density estimators can do many good things…
  - Can sort the records by probability, and thus spot weird records (anomaly detection)
  - Can do inference: P(E1|E2) (medical diagnosis / robot sensors)
  - Ingredient for Bayes networks

Density estimation

• Binary and discrete variables: easy, just count!
• Continuous variables: harder (but just a bit), fit a model

Learning a density estimator

  P̂(x[i] = u) = (# records in which x[i] = u) / (total number of records)

A trivial learning algorithm!

Course evaluation

(toy dataset of 151 course records with attributes Summer?, Size, Evaluation; the individual rows did not survive extraction)

  P(summer) = #Summer / #records = 23/151 = 0.15
  P(Evaluation = 1) = #(Evaluation = 1) / #records = 49/151 = 0.32
  P(Evaluation = 1 | summer) = P(Evaluation = 1 & summer) / P(summer) = 2/23 = 0.09

But why do we count?

Computing the joint likelihood of the data

(this slide repeats the course-evaluation table and the three estimates above)
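The "just count" estimator above can be sketched in a few lines. The records below are hypothetical stand-ins (the slide's actual course-evaluation rows were lost); each is a (summer, size, evaluation) tuple.

```python
def p_hat(records, pred):
    """Empirical probability: fraction of records satisfying pred."""
    return sum(1 for r in records if pred(r)) / len(records)

def p_cond(records, pred_a, pred_b):
    """Conditional probability P(A | B) = P(A and B) / P(B), by counting."""
    return (p_hat(records, lambda r: pred_a(r) and pred_b(r))
            / p_hat(records, pred_b))

# Hypothetical records: (summer, size, evaluation)
records = [(1, 20, 1), (1, 15, 2), (0, 40, 3), (0, 30, 1)]

p_summer = p_hat(records, lambda r: r[0] == 1)          # 2/4 = 0.5
p_eval1_given_summer = p_cond(records,
                              lambda r: r[2] == 1,      # A: evaluation = 1
                              lambda r: r[0] == 1)      # B: summer
```

With these four records, one of the two summer courses has evaluation 1, so the conditional estimate comes out to 0.5.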
  P̂(dataset | M) = P̂(x1 ∧ x2 ∧ … ∧ xR | M) = ∏_{k=1..R} P̂(xk | M)

The next slide presents one of the most important ideas in probabilistic inference. It has a huge number of applications in many different and diverse problems.

Maximum Likelihood Principle

• We can fit models by maximizing the probability of generating the observed samples:

  L(x1, …, xn | Θ) = p(x1 | Θ) · … · p(xn | Θ)

• The samples (rows in the table) are assumed to be independent
• For a binary random variable A with P(A=1) = q:  argmax_q = #1 / #samples
• Why?

Maximum Likelihood Principle

Data likelihood (n1 ones and n2 zeros observed):

  P(D | M) = q^n1 (1 - q)^n2

We would like to find:

  argmax_q  q^n1 (1 - q)^n2

Setting the derivative to zero:

  d/dq [ q^n1 (1 - q)^n2 ] = n1 q^(n1-1) (1 - q)^n2 - n2 q^n1 (1 - q)^(n2-1) = 0
  ⇒ q^(n1-1) (1 - q)^(n2-1) [ n1 (1 - q) - n2 q ] = 0
  ⇒ n1 (1 - q) - n2 q = 0
  ⇒ n1 = (n1 + n2) q
  ⇒ q = n1 / (n1 + n2)

Log Probabilities

When working with products, probabilities of entire datasets often get too small. A possible solution is to use the log of probabilities, often termed 'log likelihood':

  log P̂(dataset | M) = log ∏_{k=1..R} P̂(xk | M) = ∑_{k=1..R} log P̂(xk | M)

(each P̂(xk | M) is a value between 0 and 1, so each log term is negative)

Density estimation

• Binary and discrete variables: easy, just count!
• Continuous variables: harder (but just a bit), fit a model

But what if we only have very few samples?

The danger of joint density estimation

(course-evaluation dataset again)

  P(summer & size > 20 & evaluation = 3) = 0
  - No such example in our dataset

Now let's assume we are given a new (often called 'test') dataset.
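This failure is easy to reproduce. A minimal sketch of a pure joint (counting) estimator, again with hypothetical records since the slide's table was lost: any attribute combination absent from the training data gets probability exactly 0.

```python
def joint_p(records, query):
    """Joint counting estimate: fraction of records matching
    every (attribute_index, value) pair in query."""
    matches = sum(1 for r in records
                  if all(r[i] == v for i, v in query.items()))
    return matches / len(records)

# Hypothetical training records: (summer, size, evaluation)
train = [(1, 20, 1), (1, 15, 2), (0, 40, 3), (0, 30, 1)]

p_seen = joint_p(train, {0: 1, 2: 1})    # (summer, evaluation = 1) occurs once: 0.25
p_unseen = joint_p(train, {0: 1, 2: 3})  # (summer, evaluation = 3) never occurs: 0.0
```

Because the whole-dataset likelihood is a product of per-record probabilities, a single such zero wipes out the likelihood of everything else.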
If this dataset contains the record

  Summer = 1, Size = 30, Evaluation = 3

then the probability we would assign to the entire dataset is 0.

Naïve Density Estimation

The problem with the Joint Estimator is that it just mirrors the training data. We need something which generalizes more usefully.

The naïve model generalizes strongly: assume that each attribute is distributed independently of any of the other attributes.

Joint estimation, revisited

Assuming independence we can compute each probability independently:

  P(Summer) = 0.15
  P(Evaluation = 1) = 0.32
  P(Size > 20) = 0.63

How do we do on the joint?

  P(Summer & Evaluation = 1) = 0.09      P(Summer) · P(Evaluation = 1) = 0.05
  P(Size > 20 & Evaluation = 1) = 0.23   P(Size > 20) · P(Evaluation = 1) = 0.20

Not bad!

But:

  P(Summer & Size > 20) = 0.026          P(Summer) · P(Size > 20) = 0.094

We must be careful when using the naïve density estimator.

Contrast

  Joint DE                                        Naïve DE
  Can model anything                              Can model only very boring distributions
  No problem to model "C is a noisy copy of A"    Outside Naïve's scope
  Given 100 records and more than 6 Boolean       Given 100 records and 10,000 multivalued
  attributes, will screw up badly                 attributes, will be fine

Dealing with small datasets

• We just discussed one possibility: naïve estimation
• There is another way to deal with a small number of measurements that is often used in practice.
• Assume we want to compute the probability of heads in a coin flip
  - What if we can only observe 3 flips?
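The naïve estimator is a small change: multiply per-attribute marginals instead of counting the full combination. The sketch below (hypothetical records again) also reproduces the warning above: on strongly correlated attributes the product of marginals can be far from the true joint frequency.

```python
def marginal(records, i, v):
    """P(attribute i = v), estimated by counting."""
    return sum(1 for r in records if r[i] == v) / len(records)

def naive_p(records, query):
    """Naive estimate: product of independent per-attribute marginals,
    for a query dict of {attribute_index: value}."""
    p = 1.0
    for i, v in query.items():
        p *= marginal(records, i, v)
    return p

# Two perfectly correlated binary attributes:
train = [(1, 1), (1, 1), (0, 0), (0, 0)]

p_joint = sum(1 for r in train if r == (1, 1)) / len(train)  # true frequency: 0.5
p_naive = naive_p(train, {0: 1, 1: 1})                       # 0.5 * 0.5 = 0.25
```

The naïve estimate halves the true joint probability here, because independence is exactly the assumption the data violates.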
  - 25% of the time a maximum likelihood estimator will assign probability 1 to either heads or tails

Pseudo counts

• In these cases we can use prior belief about the 'fairness' of most coins to influence the resulting model.
• We assume that we have observed 10 flips with 5 tails and 5 heads
• Thus p(heads) = (#heads + 5) / (#flips + 10)
• Advantages:
  1. Never assigns a probability of 0 to an event
  2. As more data accumulates we can get very close to the real distribution (the impact of the pseudo counts will diminish rapidly)
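The pseudo-count rule, side by side with the plain maximum-likelihood estimate; the default prior counts follow the slide's choice of 10 imagined flips with 5 heads.

```python
def mle_heads(heads, flips):
    """Maximum likelihood estimate of P(heads)."""
    return heads / flips

def pseudo_heads(heads, flips, prior_heads=5, prior_flips=10):
    """Pseudo-count estimate: (#heads + 5) / (#flips + 10), per the slide."""
    return (heads + prior_heads) / (flips + prior_flips)

mle_heads(3, 3)           # 1.0: the pathology after three lucky flips
pseudo_heads(3, 3)        # 8/13 ~ 0.615: pulled toward the fair-coin prior
pseudo_heads(5200, 10000) # 5205/10010 ~ 0.52: with lots of data the prior
                          # washes out (the MLE is also 0.52)
```

The same idea appears under names like Laplace or additive smoothing; the prior counts act like data that fades in influence as real observations accumulate.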

