UB CSE 555 - Maximum-Likelihood and Bayesian Parameter Estimation

Topics: Sufficient Statistics; Common Probability Distributions; Problems of Dimensionality.

Problems of Dimensionality

- Practical problems may involve 50 or 100 (binary-valued) features.
- Classification accuracy depends upon the dimensionality and on the amount of training data.
- For two classes that are multivariate normal with the same covariance, the Bayes error is

  P(error) = \frac{1}{\sqrt{2\pi}} \int_{r/2}^{\infty} e^{-u^2/2} \, du

  where r^2 = (\mu_1 - \mu_2)^t \Sigma^{-1} (\mu_1 - \mu_2) is the squared Mahalanobis distance between the class means, and \lim_{r \to \infty} P(error) = 0.

Error Rate and Dimensionality

- If the features are independent, then \Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_d^2) and

  r^2 = \sum_{i=1}^{d} \left( \frac{\mu_{i1} - \mu_{i2}}{\sigma_i} \right)^2

- The most useful features are those for which the difference between the means is large relative to the standard deviation.
- It has frequently been observed in practice that, beyond a certain point, the inclusion of additional features leads to worse rather than better performance.

[Figure: non-overlapping class distributions in three dimensions, for which the Bayes error vanishes. Projected onto the x1-x2 subspace, or onto x1 alone, the distributions overlap more, and the Bayes error is correspondingly greater. A companion plot shows the error rate decreasing as features are added.]

Computational Complexity

- Design methodology is affected by computational difficulty.
- "Big oh" notation: f(x) = O(h(x)), read "big oh of h(x)", if

  \exists\, c, x_0 \text{ such that } |f(x)| \le c\,|h(x)| \quad \forall x > x_0

  (An upper bound on f(x) grows no worse than h(x) for sufficiently large x.)
- Example: f(x) = 2 + 3x + 4x^2 and h(x) = x^2. Choosing c = 9 and x_0 = 1 gives 2 + 3x + 4x^2 \le 9x^2 for all x > 1. Therefore f(x) = O(x^2).

Big Theta Notation

- "Big oh" is not unique: f(x) = O(x^2), but equally f(x) = O(x^3) and f(x) = O(x^4).
- Thus we introduce "big theta" notation: f(x) = \Theta(h(x)) if

  \exists\, c_1, c_2, x_0 \text{ such that } 0 \le c_1 h(x) \le f(x) \le c_2 h(x) \quad \forall x > x_0

- For the example above, c_1 = 4, c_2 = 9, x_0 = 1 give 4x^2 \le 2 + 3x + 4x^2 \le 9x^2, so f(x) = \Theta(x^2); but f(x) \ne \Theta(x^3).

Complexity of ML Estimation

- Gaussian priors in d dimensions, with n training samples for each of c classes.
- For each category we must compute the discriminant function

  g(x) = -\frac{1}{2}(x - \hat{\mu})^t \hat{\Sigma}^{-1} (x - \hat{\mu}) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\hat{\Sigma}| + \ln P(\omega)

  where estimating \hat{\mu} costs O(dn), estimating \hat{\Sigma} costs O(d^2 n), and the remaining terms are cheaper.
- Total = O(d^2 n); total for c classes = O(c\,d^2 n) \cong O(d^2 n).
- The cost grows quickly when d and n are large!

Overfitting

- Frequently the number of available samples is inadequate. How should we proceed?
- Solutions (sketched after these notes):
  - Reduce the dimensionality: redesign the feature extractor, or use a subset of the features.
  - Assume all c classes share the same covariance matrix, and pool the available data.
  - Assume statistical independence, i.e., set all off-diagonal elements of the covariance matrix to zero.
- Such a constrained classifier performs better than one that overfits the data. Why?
- Example: training data drawn from a quadratic function plus Gaussian noise, f(x) = ax^2 + bx + c + \epsilon with p(\epsilon) \sim N(0, \sigma^2). A tenth-order polynomial fits the training data perfectly, but the second-order fit performs better on new samples.

Capacity of a Separating Plane

- The problem of an overdetermined solution is as significant for classification as it is for estimation.
- Task: partition a d-dimensional feature space by a hyperplane, given n samples in general position, each labelled either \omega_1 or \omega_2.
- Of the 2^n possible dichotomies of n points in d dimensions, the fraction f(n,d) that are linear dichotomies is

  f(n,d) = \begin{cases} 1 & n \le d+1 \\ \dfrac{2}{2^n} \displaystyle\sum_{i=0}^{d} \binom{n-1}{i} & n > d+1 \end{cases}

- Example: 5 points can be labelled as either class in 32 ways, and f(5,2) = 11/16, so 22 of those 32 labellings are linearly separable.
- A hyperplane is therefore not overconstrained when classifying d+1 or fewer points.

The sketches below work through several of the formulas above numerically.
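The Bayes-error formula is a one-dimensional Gaussian tail integral, so it is straightforward to evaluate. Below is a minimal Python sketch, assuming NumPy and SciPy are available; the class means and covariance are made-up illustrative values, and the function name is my own.

```python
import numpy as np
from scipy.stats import norm

def bayes_error(mu1, mu2, cov):
    """P(error) = 1/sqrt(2*pi) * integral from r/2 to infinity of exp(-u^2/2) du,
    where r is the Mahalanobis distance between the class means."""
    diff = np.asarray(mu1, dtype=float) - np.asarray(mu2, dtype=float)
    # r^2 = (mu1 - mu2)^t Sigma^{-1} (mu1 - mu2)
    r2 = diff @ np.linalg.solve(np.asarray(cov, dtype=float), diff)
    return norm.sf(np.sqrt(r2) / 2.0)  # standard-normal upper tail at r/2

# Made-up two-class problem with a shared diagonal covariance.
mu1, mu2 = [0.0, 0.0], [1.0, 2.0]
sigmas = [1.0, 2.0]
cov = np.diag(np.square(sigmas))

print(bayes_error(mu1, mu2, cov))

# Independent-feature form: r^2 = sum_i ((mu_i1 - mu_i2) / sigma_i)^2
r2 = sum(((a - b) / s) ** 2 for a, b, s in zip(mu1, mu2, sigmas))
print(norm.sf(np.sqrt(r2) / 2))  # matches the matrix computation above
```

As r grows, i.e. as the means separate relative to the spread, the tail probability and hence the Bayes error tend to zero, which is the limit quoted in the notes.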
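To make the O(d^2 n) bookkeeping concrete, here is a minimal sketch of ML estimation of one Gaussian followed by a single evaluation of the discriminant g(x), with the cost of each step marked in comments. The data, function name, and variable names are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def gaussian_discriminant(X, prior, x):
    """ML-estimate a Gaussian from n samples X (n x d), then evaluate
    g(x) = -1/2 (x-mu)^t Sigma^{-1} (x-mu) - d/2 ln(2 pi) - 1/2 ln|Sigma| + ln P(w)."""
    n, d = X.shape
    mu = X.mean(axis=0)                # sample mean: O(d n)
    Xc = X - mu
    sigma = (Xc.T @ Xc) / n            # sample covariance: O(d^2 n), the dominant term
    diff = x - mu
    # Quadratic form (x-mu)^t Sigma^{-1} (x-mu): O(d^3) for the solve, O(d^2) per query
    maha = diff @ np.linalg.solve(sigma, diff)
    _, logdet = np.linalg.slogdet(sigma)   # ln |Sigma|: independent of n
    return -0.5 * maha - 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet + np.log(prior)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))   # made-up data: n = 1000 samples, d = 5 features
print(gaussian_discriminant(X, prior=0.5, x=np.zeros(5)))
```

For n much larger than d, the covariance estimate dominates, so the per-class total is O(d^2 n), matching the slide.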
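The second and third workarounds in the Overfitting list amount to constraining the covariance estimate. A minimal sketch of both follows, under the assumption that each class's samples arrive as an (n x d) array; the function names are mine.

```python
import numpy as np

def pooled_covariance(class_samples):
    """Assume all c classes share one covariance: pool the centered data."""
    centered = [X - X.mean(axis=0) for X in class_samples]
    Z = np.vstack(centered)
    return (Z.T @ Z) / len(Z)

def diagonal_covariance(X):
    """Assume statistically independent features: keep only per-feature variances,
    i.e. zero all off-diagonal elements of the covariance matrix."""
    return np.diag(X.var(axis=0))

rng = np.random.default_rng(2)
# Made-up small-sample data for two classes in 4 dimensions.
classes = [rng.normal(m, 1.0, size=(20, 4)) for m in (0.0, 1.0)]
print(pooled_covariance(classes))
print(diagonal_covariance(classes[0]))
```

Both constraints reduce the number of parameters to estimate, which is exactly why they help when samples are scarce.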
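The quadratic-plus-noise example is easy to reproduce. In this sketch the coefficients a, b, c, the noise level, and the sample sizes are all made-up; with 11 training points a tenth-order polynomial interpolates them exactly, while the second-order fit generalizes better.

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, c, sigma = 1.0, -2.0, 0.5, 0.3          # made-up quadratic and noise level
f = lambda x: a * x**2 + b * x + c

x_train = np.linspace(-1, 1, 11)
y_train = f(x_train) + rng.normal(0, sigma, x_train.shape)  # f(x) + eps, eps ~ N(0, sigma^2)

x_test = np.linspace(-1, 1, 200)              # fresh points from the same interval
for degree in (2, 10):
    coef = np.polyfit(x_train, y_train, degree)               # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coef, x_test) - f(x_test)) ** 2)
    print(degree, train_mse, test_mse)
# The 10th-order fit drives the training error to (near) zero by tracking the noise,
# but its error against the true quadratic on new points exceeds the 2nd-order fit's.
```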
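Finally, the fraction f(n,d) of linear dichotomies is a short exact computation. This sketch (function name mine) reproduces the f(5,2) = 11/16 figure and the n <= d+1 case quoted above.

```python
from math import comb
from fractions import Fraction

def linear_dichotomy_fraction(n, d):
    """Fraction of the 2^n dichotomies of n points in general position
    in d dimensions that are realizable by a hyperplane."""
    if n <= d + 1:
        return Fraction(1)   # a hyperplane is not overconstrained by d+1 or fewer points
    return Fraction(2 * sum(comb(n - 1, i) for i in range(d + 1)), 2 ** n)

print(linear_dichotomy_fraction(5, 2))   # 11/16: of the 32 labellings of 5 points, 22 are linear
print(linear_dichotomy_fraction(3, 2))   # 1: any labelling of 3 points in the plane is linear
```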

