Machine Learning and Probabilistic Models of Vision

Alan Yuille and Xuming He

Abstract

This paper draws connections between Probabilistic Methods and Machine Learning Approaches. In particular, we show that many Support Vector Machine criteria (binary, multi-class, and latent) can be obtained as upper-bound approximations to standard probabilistic formulations. The advantage of these 'machine learning bounds' is that they greatly simplify the computation and may yield greater robustness. These connections enable us to take complex models formulated in terms of probability distributions defined over graph structures and approximate or bound them by machine learning techniques. We illustrate this with examples from the literature which are applied to a range of vision problems, including image labeling, object detection and parsing, and motion estimation, and which achieve state-of-the-art results.

I. INTRODUCTION

Machine learning techniques are sometimes proposed as radical alternatives to standard probabilistic methods. In this paper we show how many machine learning methods can be obtained as bounds of probabilistic approaches. In particular, this enables us to apply machine learning methods to problems that can be formulated as probabilistic inference over graphical models. This is particularly important for vision applications, because these graphical models are able to capture the spatial relations which are often neglected in machine learning approaches (since non-visual data often lacks such relations). We will illustrate the advantages by describing state-of-the-art results from the literature for image labeling, object detection and parsing, and motion estimation.

The advance is conceptual. It enables us to relate practical machine learning methods to the sophisticated representations of probabilistic models defined on graph structures. This may help probabilistic theories by supplying good classes of approximations and bounds. Conversely, it may enable machine learning methods to be extended to a larger range of problems. Our results also show that machine learning intuitions, such as the desirability of maximizing margins to ensure generalization, correspond to fitting a probabilistic model while encouraging large variance (which also helps prevent over-fitting).

Papers by Hastie, Friedman, and Tibshirani [1] have helped illuminate some relations between machine learning models and standard probabilistic estimation, e.g. by showing that AdaBoost learning converges to a posterior distribution in the limit of large amounts of data. Poggio and Smale [2] formulate learning as function approximation using regularization, state that the maximum-margin criterion can be re-expressed in terms of regularization, and point out connections to Bayesian formulations. There are some limitations, however: they do not have a separate loss function and probability model (whereas we have the freedom to specify them separately), and they do not have models with hidden variables. Nevertheless, the vast majority of the literature starts from the standard perspective based on margins and empirical loss functions. (CHECK: Lafferty and Lebanon!!)

We stress that our goal is to relate probabilistic models defined over graphs to machine learning methods based on maximizing margins and using support vector methods.
There is, of course, a large 'machine learning' literature, such as Boltzmann machines and deep belief networks, which is already formulated in terms of probability distributions defined on graphs (REFS!!).

II. BACKGROUND: SUPPORT VECTOR MACHINES WITH MULTI-VARIATE AND HIDDEN VARIABLES

We first discuss binary classification and the relationships between support vector machine methods and probabilistic approaches. It will be helpful to show our results for this important special case: it enables us to introduce our results, and their intuitions, without being burdened by too much notation, and it lets us introduce notation which extends easily to the more complex cases involving multiple variables and hidden variables.

Binary classification starts with training data $\{(y_i, x_i) : i = 1, \ldots, N\}$, where the output $y_i \in \{-1, +1\}$ is binary-valued and the input $x_i$ can be anything (for vision applications, it may be the intensity values defined over an image window).

The goal of binary classification is to divide the input data into two classes by a decision rule of the form
$$\hat{y} = \arg\max_{y} \; y \, w \cdot \phi(x).$$
This classifies input data $x$ as positive if $w \cdot \phi(x) > 0$ and negative otherwise. An important special case is $\phi(x) = (1, x)$ with $w = (w_0, w_1)$; the decision rule is then specified by whether the input $x$ lies above or below the hyperplane $w_0 + w_1 \cdot x = 0$. The use of more general functions $\phi(x)$ enables the use of the kernel trick (ref!!).

Support Vector Machines specify a criterion to determine the optimal $w$ from the training dataset. The standard criterion requires minimizing
$$L(w) = \frac{1}{2}|w|^2 + C \sum_{i=1}^{N} \max\{0, 1 - y_i \, w \cdot \phi(x_i)\}.$$
This criterion maximizes the margin $1/|w|$ while requiring that data points on the wrong side of the margin, i.e. those with $y_i \, w \cdot \phi(x_i) < 1$, pay a penalty $C\{1 - y_i \, w \cdot \phi(x_i)\} > 0$.

This can be re-expressed in terms of a function $L(w, \zeta : \alpha, \gamma)$ which includes slack variables $\{\zeta_i\}$ and Lagrange parameters $\{\alpha_i, \gamma_i\}$. This function should be minimized with respect to $w, \zeta$ and maximized with respect to $\alpha, \gamma$, and it takes the form
$$L(w, \zeta : \alpha, \gamma) = \frac{1}{2}|w|^2 + C \sum_{i=1}^{N} \zeta_i - \sum_{i=1}^{N} \alpha_i \left( y_i \, w \cdot \phi(x_i) - 1 + \zeta_i \right) - \sum_{i=1}^{N} \gamma_i \zeta_i.$$
The second term penalizes the total amount of slack used, the third term ensures that data points are on the correct side of the margin after 'borrowing' the slack variable, and the fourth term constrains the slack variables to be non-negative. We can recover the original formulation by observing that the slack variable $\zeta_i$ should be zero unless the data point $(x_i, y_i)$ is on the wrong side of the discriminant, in which case it takes the value $1 - y_i \, w \cdot \phi(x_i)$.

The re-formulation is useful because: (I) it leads directly to the concept of support vectors: differentiating with respect to $w$ yields $w = \sum_{i=1}^{N} \alpha_i y_i \phi(x_i)$, hence $w$ is expressed in terms of those data points $(x_i, y_i)$, the support vectors, for which $\alpha_i \neq 0$. (II) We can eliminate $w, \zeta$ and obtain a dual formulation which is a function of $\alpha$ only; hence we can solve by maximizing the dual to estimate $\alpha$ and then use the result above to solve for $w$. (III) The form ...
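To make point (II) concrete, the block below sketches the standard derivation of the dual from the Lagrangian above, assuming the bias is absorbed into $\phi$ (as in the special case $\phi(x) = (1, x)$). It is supplied here for clarity and is a reconstruction consistent with the formulas above, not a quotation from the paper.

```latex
% Stationarity of L(w, \zeta : \alpha, \gamma) in w and \zeta_i gives
\begin{align}
  \frac{\partial L}{\partial w} = 0
    &\;\Rightarrow\; w = \sum_{i=1}^{N} \alpha_i y_i \phi(x_i), \\
  \frac{\partial L}{\partial \zeta_i} = 0
    &\;\Rightarrow\; C - \alpha_i - \gamma_i = 0
    \;\Rightarrow\; 0 \le \alpha_i \le C \quad (\text{since } \gamma_i \ge 0).
\end{align}
% Substituting back eliminates w and \zeta, leaving a function of \alpha alone:
\begin{equation}
  \max_{0 \le \alpha_i \le C} \;
  \sum_{i=1}^{N} \alpha_i
  \;-\; \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N}
    \alpha_i \alpha_j \, y_i y_j \, \phi(x_i) \cdot \phi(x_j).
\end{equation}
% Only the inner products \phi(x_i) \cdot \phi(x_j) appear, which is what
% permits the kernel trick mentioned above.
```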


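For readers who prefer to see the primal criterion numerically, the following minimal Python sketch evaluates $L(w)$ and minimizes it by subgradient descent on a toy one-dimensional dataset. The data, feature map, step size, and function names are illustrative assumptions rather than material from the paper, and in practice a quadratic-programming solver would replace the hand-rolled optimizer.

```python
import numpy as np

# Minimal sketch of the primal SVM criterion L(w) and a plain subgradient
# minimizer.  phi(x) = (1, x) appends a bias feature, matching the special
# case discussed in the text.  All names and data here are illustrative.

def phi(x):
    """Feature map phi(x) = (1, x): prepend a constant so the bias lives in w."""
    return np.concatenate(([1.0], np.atleast_1d(x)))

def primal_objective(w, X, y, C):
    """L(w) = 0.5*|w|^2 + C * sum_i max(0, 1 - y_i * w . phi(x_i))."""
    margins = np.array([yi * np.dot(w, phi(xi)) for xi, yi in zip(X, y)])
    return 0.5 * np.dot(w, w) + C * np.sum(np.maximum(0.0, 1.0 - margins))

def train_svm_subgradient(X, y, C, n_iters=1000, lr=0.01):
    """Minimize L(w) by subgradient descent (a simple stand-in for a QP solver)."""
    w = np.zeros(len(phi(X[0])))
    for _ in range(n_iters):
        grad = w.copy()                      # gradient of 0.5*|w|^2
        for xi, yi in zip(X, y):
            if yi * np.dot(w, phi(xi)) < 1:  # point violates the margin
                grad -= C * yi * phi(xi)     # subgradient of the hinge term
        w -= lr * grad
    return w

# Toy 1-D example: positives to the right of the origin, negatives to the left.
X = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([-1, -1, -1, +1, +1, +1])
w = train_svm_subgradient(X, y, C=1.0)
print("w =", w, "objective =", primal_objective(w, X, y, C=1.0))
print("predictions:", [int(np.sign(np.dot(w, phi(x)))) for x in X])
```

Running the sketch prints the learned weight vector, the value of the primal objective, and the sign predictions on the training points.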