CS 59000 Statistical Machine Learning
Lecture 8
Alan Qi

Outline
• Review of the exponential family
• Non-informative priors
• Nonparametric methods
• Linear regression with basis functions

The Exponential Family
$$p(x|\eta) = h(x)\, g(\eta) \exp\{\eta^\top u(x)\},$$
where $\eta$ is the natural parameter and
$$g(\eta) \int h(x) \exp\{\eta^\top u(x)\}\, dx = 1,$$
so $g(\eta)$ can be interpreted as a normalization coefficient.

ML Estimation for the Exponential Family
Given a data set $X = \{x_1, \dots, x_N\}$, the likelihood function is given by
$$p(X|\eta) = \Big(\prod_{n=1}^N h(x_n)\Big)\, g(\eta)^N \exp\Big\{\eta^\top \sum_{n=1}^N u(x_n)\Big\}.$$
Setting the gradient of the log likelihood to zero, we have
$$-\nabla \ln g(\eta_{\mathrm{ML}}) = \frac{1}{N} \sum_{n=1}^N u(x_n).$$
The quantity $\sum_n u(x_n)$ is the sufficient statistic.

Conjugate Priors
For any member of the exponential family, there exists a prior
$$p(\eta|\chi, \nu) = f(\chi, \nu)\, g(\eta)^\nu \exp\{\nu\, \eta^\top \chi\}.$$
Combining with the likelihood function, we get
$$p(\eta|X, \chi, \nu) \propto g(\eta)^{N+\nu} \exp\Big\{\eta^\top \Big(\sum_{n=1}^N u(x_n) + \nu \chi\Big)\Big\}.$$
The prior corresponds to $\nu$ pseudo-observations with value $\chi$.

Noninformative Priors (1)
With little or no information available a priori, we might choose a non-informative prior.
• $\lambda$ discrete, K-nomial: $p(\lambda) = 1/K$.
• $\lambda \in [a, b]$ real and bounded: $p(\lambda) = 1/(b - a)$.
• $\lambda$ real and unbounded: improper!
A constant prior may no longer be constant after a change of variable; consider $p(\lambda)$ constant and $\lambda = \eta^2$: then $p_\eta(\eta) = p_\lambda(\eta^2)\,|d\lambda/d\eta| \propto \eta$, which is not constant.

Noninformative Priors (2)
Translation-invariant priors. Consider a density of the form
$$p(x|\mu) = f(x - \mu).$$
For a corresponding prior over $\mu$, we require
$$\int_A^B p(\mu)\, d\mu = \int_{A-c}^{B-c} p(\mu)\, d\mu = \int_A^B p(\mu - c)\, d\mu$$
for any $A$ and $B$. Thus $p(\mu) = p(\mu - c)$, and $p(\mu)$ must be constant.

Noninformative Priors (3)
Example: the mean of a Gaussian, $\mu$; the conjugate prior is also a Gaussian,
$$p(\mu|\mu_0, \sigma_0^2) = \mathcal{N}(\mu|\mu_0, \sigma_0^2).$$
As $\sigma_0^2 \to \infty$, this becomes constant over $\mu$.

Noninformative Priors (4)
Consider $p(x|\sigma) = \frac{1}{\sigma} f(x/\sigma)$. It is scale invariant, since changing variables with a scale $c$ ($\hat{x} = cx$, $\hat{\sigma} = c\sigma$) leaves the form of the density unchanged. For a prior over $\sigma$, we require
$$\int_A^B p(\sigma)\, d\sigma = \int_{A/c}^{B/c} p(\sigma)\, d\sigma = \int_A^B p(\sigma/c)\, \frac{1}{c}\, d\sigma$$
for any $A$ and $B$. (Why does the second equality hold? Substitute $\sigma \to \sigma/c$ in the middle integral.) Thus $p(\sigma) \propto 1/\sigma$, and so this prior is improper too. Note that this corresponds to $p(\ln \sigma)$ being constant.

Noninformative Priors (5)
Example: for the variance of a Gaussian, $\sigma^2$, we have
$$\mathcal{N}(x|\mu, \sigma^2) \propto \frac{1}{\sigma} \exp\Big\{-\frac{(\tilde{x}/\sigma)^2}{2}\Big\}, \qquad \tilde{x} = x - \mu,$$
which is a scale-invariant density. Consider the prior: if $\lambda = 1/\sigma^2$ and $p(\sigma) \propto 1/\sigma$, then $p(\lambda) \propto 1/\lambda$.
We know that the conjugate distribution for $\lambda$ is the Gamma distribution, $\mathrm{Gam}(\lambda|a_0, b_0)$. A noninformative prior is obtained when $a_0 = 0$ and $b_0 = 0$ (see the numerical check at the very end of these notes).

Nonparametric Methods (1)
Parametric distribution models are restricted to specific forms, which may not always be suitable; for example, consider modelling a multimodal distribution with a single, unimodal model. Nonparametric approaches make few assumptions about the overall shape of the distribution being modelled.

Nonparametric Methods (2)
Histogram methods partition the data space into distinct bins with widths $\Delta_i$ and count the number of observations, $n_i$, in each bin:
$$p_i = \frac{n_i}{N \Delta_i}.$$
• Assume a uniform distribution inside each bin.
• Often, the same width is used for all bins, $\Delta_i = \Delta$.
• $\Delta$ acts as a smoothing parameter.
• In a D-dimensional space, using M bins in each dimension will require $M^D$ bins!

Nonparametric Methods (3)
Assume observations are drawn from a density $p(x)$ and consider a small region $R$ containing $x$ such that
$$P = \int_R p(x)\, dx.$$
The probability that K out of N observations lie inside $R$ is $\mathrm{Bin}(K|N, P)$.

Nonparametric Methods (4)
If N is large, the binomial distribution is sharply peaked, so $K \simeq NP$. If the volume of $R$, $V$, is sufficiently small, $p(x)$ is approximately constant over $R$, so $P \simeq p(x)\, V$. Thus
$$p(x) \simeq \frac{K}{N V}.$$
What is the relation to the histogram method? The histogram applies exactly this estimate with $R$ a fixed bin of volume $V = \Delta^D$.

Nonparametric Methods (5)
Kernel Density Estimation: fix $V$, estimate $K$ from the data. Let $R$ be a hypercube centred on $x$ and define the kernel function (Parzen window)
$$k(u) = \begin{cases} 1, & |u_i| \le 1/2, \; i = 1, \dots, D, \\ 0, & \text{otherwise.} \end{cases}$$
It follows that $K = \sum_{n=1}^N k\big((x - x_n)/h\big)$ and hence
$$p(x) = \frac{K}{N h^D} = \frac{1}{N} \sum_{n=1}^N \frac{1}{h^D}\, k\Big(\frac{x - x_n}{h}\Big).$$
What is the relation to the histogram method, and what is its drawback? It behaves like a histogram with a bin centred on every query point; the hard cube edges make the estimate discontinuous.

To avoid discontinuities in $p(x)$, use a smooth kernel, e.g. a Gaussian:
$$p(x) = \frac{1}{N} \sum_{n=1}^N \frac{1}{(2\pi h^2)^{D/2}} \exp\Big\{-\frac{\|x - x_n\|^2}{2 h^2}\Big\}.$$
Any kernel $k(u)$ such that
$$k(u) \ge 0, \qquad \int k(u)\, du = 1$$
will work. $h$ acts as a smoother.
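As a concrete illustration of the Gaussian kernel estimator above, here is a minimal NumPy sketch. The function name gaussian_kde, the bandwidth value, and the toy bimodal data set are illustrative choices, not from the slides:

```python
import numpy as np

def gaussian_kde(x_query, data, h):
    """Evaluate p(x) = (1/N) sum_n (2*pi*h^2)^(-D/2) exp(-||x - x_n||^2 / (2 h^2))."""
    x_query = np.atleast_2d(x_query)   # (M, D) evaluation points
    data = np.atleast_2d(data)         # (N, D) observed samples
    N, D = data.shape
    # Pairwise squared distances ||x - x_n||^2, shape (M, N).
    sq = ((x_query[:, None, :] - data[None, :, :]) ** 2).sum(axis=-1)
    norm = (2.0 * np.pi * h ** 2) ** (D / 2.0)
    return np.exp(-sq / (2.0 * h ** 2)).sum(axis=1) / (N * norm)

# Bimodal sample: a single parametric Gaussian would miss the two modes,
# while the kernel estimate recovers them for a sensible bandwidth h.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2.0, 0.5, 100),
                       rng.normal(2.0, 0.5, 100)])[:, None]
grid = np.linspace(-4.0, 4.0, 9)[:, None]
print(np.round(gaussian_kde(grid, data, h=0.3), 3))
```

As the slides note for $\Delta$ and $h$, a small bandwidth gives a spiky estimate and a large one over-smooths; the value 0.3 here is just a reasonable middle ground for this toy data.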
Nonparametric Methods (6)
Nearest-Neighbour Density Estimation: fix $K$, estimate $V$ from the data. Consider a hypersphere centred on $x$ and let it grow to a volume, $V^\star$, that includes K of the given N data points. Then
$$p(x) \simeq \frac{K}{N V^\star}.$$
K acts as a smoother.

K-Nearest-Neighbours for Classification (1)
Given a data set with $N_k$ data points from class $C_k$ and $\sum_k N_k = N$, draw a sphere around $x$ containing K points, $K_k$ of which belong to class $C_k$. We have
$$p(x|C_k) = \frac{K_k}{N_k V}$$
and correspondingly
$$p(x) = \frac{K}{N V}.$$
Since $p(C_k) = N_k / N$, Bayes' theorem gives
$$p(C_k|x) = \frac{p(x|C_k)\, p(C_k)}{p(x)} = \frac{K_k}{K}.$$
Then how do we classify the data points? Assign $x$ to the class with the largest posterior $K_k/K$, i.e. by majority vote among the K nearest neighbours (a runnable sketch appears near the end of these notes).

K-Nearest-Neighbours for Classification (2)
(Figures: classification results for K = 1 and K = 3.)

K-Nearest-Neighbours for Classification (3)
• K acts as a smoother.
• For $N \to \infty$, the error rate of the 1-nearest-neighbour classifier is never more than twice the optimal error (obtained from the true conditional class distributions).

Nonparametric vs Parametric
Nonparametric models (histograms aside) require storing and computing with the entire data set. Parametric models, once fitted, are much more efficient in terms of storage and computation.

Linear Regression
The linear regression model is linear in its parameters $w$ but can be made nonlinear in the input through basis functions.

Basis Functions
$$y(x, w) = \sum_{j=0}^{M-1} w_j\, \phi_j(x) = w^\top \phi(x),$$
where $\phi_0(x) = 1$ so that $w_0$ acts as a bias.

Examples of Basis Functions
Polynomial $\phi_j(x) = x^j$; Gaussian $\phi_j(x) = \exp\{-(x - \mu_j)^2 / (2 s^2)\}$; sigmoidal $\phi_j(x) = \sigma\big((x - \mu_j)/s\big)$. (The preview cuts off here.)
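Since the preview ends before the regression slides' own examples, the following is a generic least-squares sketch using the Gaussian basis functions listed above, not the lecture's example. The function name design_matrix, the nine basis centres, the width s = 0.2, and the sin(2πx) toy target are all illustrative assumptions:

```python
import numpy as np

def design_matrix(x, centres, s):
    """Phi[n, j] = phi_j(x_n): Gaussian basis functions plus a constant phi_0 = 1."""
    phi = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2.0 * s ** 2))
    return np.hstack([np.ones((x.shape[0], 1)), phi])

# Noisy samples of a nonlinear target: t = sin(2*pi*x) + noise.
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, 50)
t = np.sin(2.0 * np.pi * x) + rng.normal(0.0, 0.1, 50)

centres = np.linspace(0.0, 1.0, 9)   # mu_j spread over the input range
Phi = design_matrix(x, centres, s=0.2)
# Least-squares weights w = (Phi^T Phi)^{-1} Phi^T t, computed stably via lstsq.
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)

x_new = np.array([0.25, 0.5, 0.75])
print(design_matrix(x_new, centres, s=0.2) @ w)   # approx sin(2*pi*x_new): ~[1, 0, -1]
```

The model stays linear in $w$ even though the fitted curve is nonlinear in $x$, which is why ordinary least squares suffices.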

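As promised in the k-NN classification section, here is a minimal sketch of the $K_k/K$ majority-vote rule. The function name knn_classify and the two-cluster toy data are illustrative, not from the slides:

```python
import numpy as np

def knn_classify(x_query, data, labels, K):
    """Classify x by majority vote among its K nearest neighbours,
    i.e. the arg-max of the posterior estimate p(C_k|x) = K_k / K."""
    # Euclidean distances from the query to every training point.
    dists = np.linalg.norm(data - x_query, axis=1)
    nearest = np.argsort(dists)[:K]        # indices of the K closest points
    votes = np.bincount(labels[nearest])   # K_k for each class
    return np.argmax(votes)                # class with the largest K_k / K

# Two 2-D classes; the query point sits nearer the class-1 cluster.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0.0, 1.0, (50, 2)),
                  rng.normal(4.0, 1.0, (50, 2))])
labels = np.array([0] * 50 + [1] * 50)
print(knn_classify(np.array([3.5, 3.5]), data, labels, K=3))   # -> 1
```

Note how this matches the slides' caveat on nonparametric methods: the entire training set must be stored and scanned for every query.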

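Finally, a numerical check of the claim in Noninformative Priors (5) that $a_0 = b_0 = 0$ is noninformative. For a Gaussian with known mean, the standard conjugate update for the Gamma prior on the precision $\lambda$ is $a_N = a_0 + N/2$, $b_N = b_0 + \frac{1}{2}\sum_n (x_n - \mu)^2$; this update is assumed here from the general conjugacy result, it is not shown in the preview. In the noninformative limit the posterior mean of $\lambda$ coincides with the maximum-likelihood precision:

```python
import numpy as np

# Data from a Gaussian with known mean mu = 0 and true precision 1/sigma^2 = 4.
rng = np.random.default_rng(3)
mu, sigma = 0.0, 0.5
x = rng.normal(mu, sigma, 1000)

# Conjugate Gamma posterior for the precision: Gam(lambda | a_N, b_N).
a0, b0 = 0.0, 0.0            # the noninformative limit from the slides
a_N = a0 + x.size / 2.0
b_N = b0 + 0.5 * np.sum((x - mu) ** 2)

post_mean = a_N / b_N        # E[lambda] under Gam(a_N, b_N)
ml_precision = 1.0 / np.mean((x - mu) ** 2)
print(post_mean, ml_precision)   # identical when a0 = b0 = 0; both near 4
```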