CMU CS 10601 - Lecture: Density Estimation

Density Estimation

A density estimator learns a mapping from a set of attributes to a probability:

    input data (a variable or a set of variables)  ->  density estimator  ->  probability

Density estimation: estimate the distribution, or conditional distribution, of a random variable.

Types of variables:
- Binary: coin flip, alarm
- Discrete: die, car model year
- Continuous: height, weight, temperature

When do we need to estimate densities?

Density estimators can do many good things:
- Sort the records by probability, and thus spot weird records (anomaly detection).
- Do inference, P(E_1 | E_2): medical diagnosis, robot sensors.
- Serve as an ingredient for Bayes networks and other types of ML methods.

Density estimation:
- Binary and discrete variables: easy, just count.
- Continuous variables: harder, but just a bit: fit a model.

Learning a density estimator for discrete variables

    \hat{P}(x_i = u) = (number of records in which x_i = u) / (total number of records)

A trivial learning algorithm. But why is this true?

Maximum Likelihood Principle

We can define the likelihood of the data given the model as follows:

    P(\text{dataset} \mid M) = P(x_1, x_2, \ldots, x_n \mid M) = \prod_{k=1}^{n} P(x_k \mid M)

M is our model, usually a collection of parameters. For example, M is the probability of heads for a coin flip, the probabilities of observing each of the faces of a die, etc.

Our goal is to determine the values for the parameters in M. We can do this by maximizing the probability of generating the observed samples. For example, let \theta be the parameters of a coin flip. Then

    L(x_1, \ldots, x_n) = p(x_1) \cdots p(x_n),

where the observations (different flips) are assumed to be independent. For such a coin flip with P(H) = q, the best assignment is

    \hat{q} = \arg\max_q L = \#H / \#\text{samples}.

Why?

Maximum Likelihood Principle: binary variables

For a binary random variable A with P(A = 1) = q, the maximum-likelihood estimate is \hat{q} = \#1 / \#\text{samples}. Why? If the data contain n_1 ones and n_2 zeros, the data likelihood is

    P(D \mid M) = q^{n_1} (1 - q)^{n_2},

and we would like to find

    \hat{q} = \arg\max_q \, q^{n_1} (1 - q)^{n_2}.

Setting the derivative with respect to q to zero:

    \frac{\partial}{\partial q} \left[ q^{n_1} (1 - q)^{n_2} \right]
        = n_1 q^{n_1 - 1} (1 - q)^{n_2} - q^{n_1} n_2 (1 - q)^{n_2 - 1} = 0.

Dividing through by q^{n_1 - 1} (1 - q)^{n_2 - 1} gives

    n_1 (1 - q) - q n_2 = 0
    n_1 = q (n_1 + n_2)
    \hat{q} = n_1 / (n_1 + n_2).

Log Probabilities

When working with products, probabilities of entire datasets often get too small. A possible solution is to use the log of the probabilities, often termed the log-likelihood:

    \log P(\text{dataset} \mid M) = \log \prod_{k=1}^{n} P(x_k \mid M) = \sum_{k=1}^{n} \log P(x_k \mid M).

Maximizing this function is the same as maximizing P(dataset | M), because log is monotonically increasing: the probabilities are values between 0 and 1, so their logs are negative, but the ordering is preserved. In some cases moving to log space also makes computation easier, for example by removing exponents.
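To make the counting estimate and the log-likelihood argument concrete, here is a minimal sketch (mine, not from the lecture; the flip sequence is made up) that computes the closed-form MLE for a coin and verifies numerically that it maximizes the log-likelihood:

```python
import numpy as np

# Hypothetical data: 1 = heads, 0 = tails (illustrative, not from the slides).
flips = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
n1 = flips.sum()        # number of heads
n2 = len(flips) - n1    # number of tails

# Closed-form MLE derived above: q_hat = n1 / (n1 + n2).
q_mle = n1 / (n1 + n2)

def log_likelihood(q):
    # log P(dataset | q) = n1 log q + n2 log(1 - q)
    return n1 * np.log(q) + n2 * np.log(1 - q)

# A grid search over q confirms the closed-form maximizer.
grid = np.linspace(0.01, 0.99, 99)
q_grid = grid[np.argmax(log_likelihood(grid))]

print(f"closed-form MLE: {q_mle:.2f}, grid maximizer: {q_grid:.2f}")  # both 0.70
```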
Density estimation, then:
- Binary and discrete variables: easy, just count.
- Continuous variables: harder, but just a bit: fit a model.

But what if we only have very few samples?

The danger of joint density estimation

Consider the dataset below. P(Summer, Size > 20, Evaluation = 3) = 0: there is no such example in our dataset.

    Summer | Size | Evaluation
       1   |  19  |     3
       1   |  17  |     3
       0   |  49  |     2
       0   |  33  |     1
       0   |  55  |     3
       1   |  20  |     1

Now let's assume we are given a new (often called test) dataset. If this dataset contains the record (Summer = 1, Size = 30, Evaluation = 3), then the probability we would assign to the entire test dataset is 0.

Naive Density Estimation

The problem with the joint estimator is that it just mirrors the training data. We need something which generalizes more usefully. The naive model generalizes strongly: assume that each attribute is distributed independently of every other attribute. If two variables are independent, then p(A, B) = p(A) p(B).

Joint estimation revisited

Assuming independence, we can compute each probability independently. For the dataset

    Summer | Size | Evaluation
       1   |  19  |     3
       1   |  17  |     3
       0   |  49  |     2
       0   |  33  |     1
       0   |  55  |     2
       1   |  21  |     1

the marginals are P(Summer) = 0.5, P(Evaluation = 1) = 0.33, P(Evaluation = 3) = 0.33, and P(Size > 20) = 0.66.

How do we do on the joint?
- P(Summer, Evaluation = 1) = 0.16, and P(Summer) P(Evaluation = 1) = 0.16. Not bad!
- P(Size > 20, Evaluation = 1) = 0.33, but P(Size > 20) P(Evaluation = 1) = 0.22.
- P(Summer, Evaluation = 3) = 0.33, but P(Summer) P(Evaluation = 3) = 0.16.

We must be careful when using the naive density estimator. (A numeric check follows the contrast below.)

Contrast:

    Joint DE                                   | Naive DE
    Can model anything                         | Can model only very boring distributions
    No problem to model "C is a noisy          | Outside naive's scope
      copy of A"                               |
    Given 100 records and more than 6          | Given 100 records and 10,000 multivalued
      Boolean attributes, will screw up badly  |   attributes, will be fine
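The numbers above are easy to reproduce. Here is a small sketch (mine, not the lecture's) that computes both the joint estimates and the naive products of marginals from the second table:

```python
# Records from the "Joint estimation revisited" table: (summer, size, evaluation).
records = [(1, 19, 3), (1, 17, 3), (0, 49, 2),
           (0, 33, 1), (0, 55, 2), (1, 21, 1)]

def p(pred):
    # Empirical probability: fraction of records satisfying the predicate.
    return sum(pred(r) for r in records) / len(records)

# Marginals used by the naive estimator.
p_summer = p(lambda r: r[0] == 1)  # 3/6 = 0.50
p_eval1  = p(lambda r: r[2] == 1)  # 2/6 = 0.33
p_eval3  = p(lambda r: r[2] == 3)  # 2/6 = 0.33
p_big    = p(lambda r: r[1] > 20)  # 4/6 = 0.66

# Joint estimate vs. naive product of marginals.
print(p(lambda r: r[0] == 1 and r[2] == 1), p_summer * p_eval1)  # 0.17 vs 0.17: not bad
print(p(lambda r: r[1] > 20 and r[2] == 1), p_big * p_eval1)     # 0.33 vs 0.22
print(p(lambda r: r[0] == 1 and r[2] == 3), p_summer * p_eval3)  # 0.33 vs 0.17
```

(The slides round 1/6 to 0.16; the code prints 0.1667.)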
Dealing with small datasets

We just discussed one possibility: naive estimation. There is another way to deal with a small number of measurements that is often used in practice. Assume we want to compute the probability of heads in a coin flip, but we can only observe 3 flips. Then 25% of the time a maximum-likelihood estimator will assign a probability of 1 to either heads or tails (for a fair coin, the chance that all 3 flips land the same way is 2 × (1/2)^3 = 1/4).

Pseudo-counts

In these cases we can use a prior belief about the fairness of most coins to influence the resulting model: we pretend that we have also observed 10 flips, with 5 tails and 5 heads. Thus

    p(\text{heads}) = (\#\text{heads} + 5) / (\#\text{flips} + 10).

Advantages:
1. We never assign a probability of 0 to an event.
2. As more data accumulate we get very close to the real distribution; the impact of the pseudo-counts diminishes rapidly.

Some distributions, for example the Beta distribution, can incorporate pseudo-counts as part of the model.
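A quick simulation (mine, not from the lecture) illustrates both claims: the plain MLE degenerates to 0 or 1 about 25% of the time with 3 flips of a fair coin, while the pseudo-count estimator never does:

```python
import random

random.seed(0)  # arbitrary seed, for reproducibility

def smoothed(heads, flips):
    # Pseudo-count estimate from the notes: pretend we also observed
    # 10 extra flips, 5 heads and 5 tails.
    return (heads + 5) / (flips + 10)

# Simulate many 3-flip experiments with a fair coin and count how often
# the MLE (heads / flips) assigns probability 1 to heads or to tails.
trials = 100_000
extreme = sum(
    sum(random.random() < 0.5 for _ in range(3)) in (0, 3)
    for _ in range(trials)
)
print(f"MLE degenerate in {extreme / trials:.1%} of trials (theory: 25%)")

# The smoothed estimate stays strictly between 0 and 1, e.g. after
# observing 3 heads in 3 flips:
print(f"smoothed estimate: {smoothed(3, 3):.3f}")  # (3 + 5) / (3 + 10) = 0.615
```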
Beta distribution

The …
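The preview cuts off here. As a hedged sketch of the connection the notes allude to, and assuming (my assumption, not the slide's) that the +5/+10 pseudo-counts correspond to a Beta(5, 5) prior, the pseudo-count estimate equals the posterior mean of that Beta model:

```python
# With prior Beta(a, b) and data (h heads, t tails), the posterior is
# Beta(a + h, b + t), whose mean is (a + h) / (a + b + h + t).

def posterior_mean(h, t, a=5, b=5):
    # Posterior mean of P(heads) under a Beta(a, b) prior.
    return (a + h) / (a + b + h + t)

def pseudo_count_estimate(h, t):
    # The estimator from the notes: (#heads + 5) / (#flips + 10).
    return (h + 5) / (h + t + 10)

# The two coincide, e.g. for 2 heads and 1 tail:
assert posterior_mean(2, 1) == pseudo_count_estimate(2, 1)
print(posterior_mean(2, 1))  # 0.538...
```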