Learning to Localize Objects with Structured Output Regression
Matthew B. Blaschko and Christoph H. Lampert
Max Planck Institute for Biological Cybernetics, Tübingen, Germany
October 13th, 2008

Object (Category) Recognition
Is there a cow in this picture?
Binary classification problem: classifiers do the job.

Object (Category) Localization
Where in the picture is the cow?
Possible answers: center point, segmentation, outline, bounding box.
Many possible answers, none is binary.
How can we build a trainable system to localize objects?

From Classification to Localization: Sliding Window
Most common approach: binary training + sliding window.
Train a binary classifier f : {all images} → R:
- training images: positives are object class images, negatives are background regions
- train a classifier, e.g. a support vector machine, a Viola-Jones cascade, …
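The classifier-plus-sliding-window pipeline on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation; the grayscale image, window size, stride, and scoring function are all assumptions made for the example:

```python
import numpy as np

def sliding_window_localize(image, score_fn, window=(8, 8), stride=4):
    """Score every window position with the classifier's real-valued
    output and return the best-scoring box (left, top, right, bottom)."""
    h, w = image.shape
    wh, ww = window
    best_score, best_box = -np.inf, None
    for top in range(0, h - wh + 1, stride):
        for left in range(0, w - ww + 1, stride):
            s = score_fn(image[top:top + wh, left:left + ww])
            if s > best_score:
                best_score, best_box = s, (left, top, left + ww, top + wh)
    return best_box, best_score
```

With a toy "classifier" that scores mean brightness, the window snaps onto a bright patch. Note that this is exactly the scheme criticized below: f is trained only for its sign, yet used here as a real-valued quality measure.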
The resulting decision function f : {images} → R:
- f > 0 means "image shows the object."
- f < 0 means "image does not show the object."
Use f in a sliding window procedure:
- apply f to many subimages
- the subimage with the largest score is the object location

Problems of Sliding Window Localization
Binary training + sliding window is inconsistent:
1) We learn for the wrong task.
- Training: only the sign of f matters.
- Testing: f is used as a real-valued quality measure.
2) The training samples do not reflect the test situation.
- Training: samples show either the complete object or no object.
- Testing: many subregions show object parts.

Learning to Localize Objects
Ideal setup:
- Learn a true localization function g : {all images} → {all boxes} that predicts the object location from the image.
- Train in a consistent, end-to-end way.
- The training distribution reflects the test distribution.

Object Localization as Structured Output Regression
Regression task:
- training pairs (x₁, y₁), …, (xₙ, yₙ) ∈ X × Y, where the xᵢ are images and the yᵢ are bounding boxes
- learn a mapping g : X → Y that generalizes from the given examples: g(xᵢ) ≈ yᵢ for i = 1, …, n
- prefer smooth mappings to avoid overfitting
This is regression, but not R → R: input and output are structured spaces.
- inputs are 2D images
- outputs are 4-tuples y = [left, top, right, bottom] ∈ R⁴ that must be predicted jointly

Alternatives: Predicting with Structured Outputs
Independent training?
- Learn independent functions g_left, g_top, g_right, g_bottom : X → R.
- Unable to integrate constraints and correlations.
Nearest neighbor?
- Store all examples (xᵢ, yᵢ) as prototypes.
- For a new image x, predict the box yᵢ with i = argmin_{i=1,…,n} dist(x, xᵢ).
- No invariance, e.g. against translation; requires a lot of training data.
Probabilistic modeling?
- Build a model of p(x, y) or p(y | x), e.g. a Gaussian mixture.
- Difficult to integrate invariances, e.g. against scale changes.
- Requires a lot of training data to cover the 4D output space.
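The nearest-neighbor alternative above is easy to state in code; a minimal sketch, assuming images as NumPy arrays and raw-pixel Euclidean distance (both choices are illustrative, not from the slides):

```python
import numpy as np

def nearest_neighbor_box(x, train_images, train_boxes):
    """Predict the bounding box of the stored prototype closest to x.
    Raw-pixel distance has no translation invariance: a shifted copy
    of a training image can match the wrong prototype entirely."""
    dists = [np.linalg.norm(x - xi) for xi in train_images]
    return train_boxes[int(np.argmin(dists))]
```

This makes the drawbacks concrete: the predicted box is always one of the training boxes, so the method can only work if the training set densely covers both image appearance and box positions.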
Background: Structured Output Support Vector Machine
The structured output SVM [Tsochantaridis et al., 2005] generalizes SVMs to arbitrary output domains.
Goal: a prediction function g : X → Y.
- Learn a compatibility function F : X × Y → R and define g(x) := argmax_{y∈Y} F(x, y).
- g(x) is learned discriminatively.
- Non-linearity and domain knowledge are included by kernelizing F.
I. Tsochantaridis, T. Joachims, T. Hofmann, Y. Altun: Large Margin Methods for Structured and Interdependent Output Variables, JMLR, 2005.

Structured Output Support Vector Machine
Setup:
- Define a positive definite kernel k : (X × Y) × (X × Y) → R. k(·, ·) induces a reproducing kernel Hilbert space H and an implicit feature map φ : X × Y → H.
- Define a loss function ∆ : Y × Y → R.
SO-SVM training: solve the convex optimization problem
  min_{w,ξ} ½‖w‖² + C Σᵢ₌₁ⁿ ξᵢ
subject to margin constraints with loss function ∆:
  ∆(yᵢ, y) + ⟨w, φ(xᵢ, y)⟩ − ⟨w, φ(xᵢ, yᵢ)⟩ ≤ ξᵢ
for all y ∈ Y \ {yᵢ} and i = 1, …, n.

Structured Output Support Vector Machine
The unique solution w ∈ H defines the compatibility function
  F(x, y) := ⟨w, φ(x, y)⟩,
which is linear in w ∈ H but nonlinear in X and Y.
F(x, y) measures how well the output y fits the input x.
- Analogue in a probabilistic model: F(x, y) ≙ log p(y | x).
- But: F(x, y) is max-margin trained, not probabilistic!
The best prediction for a new x is the most compatible y:
  g(x) := argmax_{y∈Y} F(x, y).
Evaluating g : X → Y is like a generalized sliding window: for fixed x, we evaluate a quality function for every box y ∈ Y.
- approximate: sliding window
- exact: branch-and-bound [Lampert & Blaschko, CVPR 2008]
- other parameterizations: min-cut, (loopy) BP, …
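A concrete choice for the loss ∆ : Y × Y → R on bounding boxes is one minus the area overlap (intersection over union); this particular ∆ is an assumption of the sketch, not taken from the slides:

```python
def box_loss(y, y_prime):
    """∆(y, y') = 1 − area(y ∩ y') / area(y ∪ y') for boxes given as
    (left, top, right, bottom): 0 for identical boxes, 1 for disjoint ones."""
    il = max(y[0], y_prime[0]); it = max(y[1], y_prime[1])
    ir = min(y[2], y_prime[2]); ib = min(y[3], y_prime[3])
    inter = max(0, ir - il) * max(0, ib - it)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(y) + area(y_prime) - inter
    return 1.0 - inter / union
```

Unlike a 0/1 classification loss, this ∆ grades predictions continuously, so a slightly shifted box is penalized less than a completely wrong one.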
SO-SVM Training: Interpretation
SO-SVM training revisited, hard-margin case: solve the convex optimization problem
  min_w ½‖w‖²
subject to margin constraints with loss function ∆ ≥ 0:
  ⟨w, φ(xᵢ, yᵢ)⟩ − ⟨w, φ(xᵢ, y)⟩ ≥ ∆(yᵢ, y)
for all y ∈ Y \ {yᵢ} and i = 1, …, n.
Since ⟨w, φ(xᵢ, yᵢ)⟩ = F(xᵢ, yᵢ) and ⟨w, φ(xᵢ, y)⟩ = F(xᵢ, y), this implies
  F(xᵢ, yᵢ) > F(xᵢ, y) for all y ∈ Y \ {yᵢ}.
Because g(x) := argmax_y F(x, y), this means
  g(xᵢ) = yᵢ for all training pairs (xᵢ, yᵢ).

SO-SVM Training: Interpretation
SO-SVM training revisited, general case: solve the convex optimization problem
  min_{w,ξ} ½‖w‖² + C Σᵢ₌₁ⁿ ξᵢ
subject to margin constraints
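The training problem has one constraint per pair (i, y), far too many to enumerate for realistic box sets. Cutting-plane methods therefore repeatedly add only the most violated constraint, i.e. the y maximizing ∆(yᵢ, y) + F(xᵢ, y). A minimal sketch of that inner search over a finite candidate set (the toy F, ∆, and candidates in the test are assumptions for illustration):

```python
def most_violated_output(xi, yi, candidates, F, loss):
    """Return argmax over y ≠ yi of ∆(yi, y) + F(xi, y): the output
    whose margin constraint the current model violates most, as used
    in cutting-plane training of a structured output SVM."""
    return max((y for y in candidates if y != yi),
               key=lambda y: loss(yi, y) + F(xi, y))
```

In the localization setting, Y is the set of all boxes, so this loss-augmented argmax is itself solved by sliding-window or branch-and-bound search rather than by enumeration.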