Fast Discriminative Visual Codebooks using Randomized Clustering Forests
Frank Moosmann, Bill Triggs, and Frederic Jurie
Presented by: Andrew F. Dreher
CS 395T - Spring 2007

Contributions
1) Creating visual "words" using classification trees
2) Small ensembles of randomized trees can outperform k-means clustering
   - Uses stochasticity to improve accuracy

Trees as "Words"

Visual "words"
1) High-dimensional vectors; typically extracted features, or clusters of features summarized at a point
2) Cluster formation is usually performed with k-means clustering
3) Used with "bag of words" methods derived from text processing

Trees as "words"
1) Trees are trained as classifiers
2) Leaves are used as "words"
   - Each represents a classified cluster of visual features
   - Provides spatial information and intuition lacking in k-means
3) Classification is a separate stage (an SVM over the leaves)

Information Gain with Entropy
1) Useful with a limited number of values
2) Often prefers "pure" nodes
   - Randomizing thresholds helps create different splits and trees
   - Paper parameters: Smin (minimum accepted split score) and Tmax (maximum number of random trials)
   - (Smin, Tmax) = (0, 1): completely random trees
   - (Smin, Tmax) = (1, D): fully discriminative trees (classic ID3)

Basic Example of Entropy
[worked entropy example on slides]

Experiments

General Overview
1) Descriptors - dataset dependent
   - HSV color (768-D vector)
   - Wavelet (768-D vector), created from HSV using the Haar transform
   - SIFT (128-D vector)
2) Performance metrics
   - Receiver Operating Characteristic (ROC)
   - Equal Error Rate (EER)

Haar Wavelet
1) The first known wavelet
2) Not continuous or differentiable
3) Described as:

   f(x) =  1   for 0 <= x < 1/2
          -1   for 1/2 <= x < 1
           0   otherwise

   Source: Wikipedia (http://en.wikipedia.org/wiki/Haar_wavelet)

Specific Parameters
1) Descriptors: color wavelet
2) Tree parameters: Smin = 0.5; Tmax ≈ 50
3) Dataset: GRAZ-02
   - Three categories
   - 300 images from each category
   - Half for training; half for testing

Spatial Results
Figure 3: 'Bike' visual words for 4 different images.
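The Smin/Tmax node-splitting scheme from the slides can be sketched in code. The following is a minimal Python sketch, not the paper's implementation: function names are mine, and plain information gain in bits stands in for the paper's normalized score. Each node draws random (feature, threshold) tests and accepts the first whose score reaches Smin, keeping the best of at most Tmax trials.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(labels, left_mask):
    """Entropy reduction achieved by splitting `labels` with a boolean mask."""
    n = len(labels)
    n_left = int(left_mask.sum())
    if n_left == 0 or n_left == n:
        return 0.0  # degenerate split: no information gained
    h_left = entropy(labels[left_mask])
    h_right = entropy(labels[~left_mask])
    return entropy(labels) - (n_left / n) * h_left - ((n - n_left) / n) * h_right

def pick_split(X, y, s_min=0.5, t_max=50, rng=None):
    """Try up to t_max random (feature, threshold) tests; accept as soon as
    the score reaches s_min, otherwise keep the best test seen."""
    rng = np.random.default_rng() if rng is None else rng
    best_score, best_feat, best_thr = -1.0, None, None
    for _ in range(t_max):
        f = int(rng.integers(X.shape[1]))          # random descriptor dimension
        thr = float(rng.uniform(X[:, f].min(), X[:, f].max()))  # random threshold
        score = information_gain(y, X[:, f] < thr)
        if score > best_score:
            best_score, best_feat, best_thr = score, f, thr
        if best_score >= s_min:
            break
    return best_score, best_feat, best_thr
```

With (s_min, t_max) = (0, 1) the single random test is always accepted, giving completely random trees; pushing s_min toward 1 and t_max toward the number of available tests makes the trees increasingly discriminative, as in the slide's ID3 limit.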
The brightness denotes the posterior probability for the visual word at the given image position to be labelled 'bike'.

Tmax is chosen using a validation set. For the 768-D Color Wavelet Descriptor on the GRAZ-02 dataset, Tmax ≈ 50.

Our algorithm's ability to produce meaningful visual words is illustrated in figure 3 (c.f. [16]). Each white dot corresponds to the center of an image sub-window that reached an unmixed leaf node for the given object category (i.e. all of the training vectors belonging to the leaf are labeled with that category). Note that even though the vocabulary has been learned on entire images without object segmentation, it is discriminative enough to detect local structures in the test images that correspond well with representative object fragments, as illustrated in figure 2 (right).

The tests here were for individual object categories versus negatives (N). We took 300 images from each category, using images with even numbers for training and ones with odd numbers for testing. For Setting 1 tests we trained on the whole image, as in [19], while for Setting 2 we used the segmentation masks provided with the images to train on the objects alone, without background. For the GRAZ-02 database the wavelet descriptors gave the best performance, so we report results for these on the two hardest categories, bikes and cars. For B vs. N we achieve an average EER classification rate of 84.4% for Setting 1 and 84.1% for Setting 2, compared to 76.5% from Opelt et al. [19]. For C vs. N the respective figures are 79.9%, 79.8% and 70.7%. Remarkably, using segmentation masks during training does not improve the image classification performance, which suggests that the method is able to pick out the relevant information from a significant amount of clutter.

Comparing ERC-Forests with k-means and kd-clustering trees.
Unless otherwise stated, 20 000 features (67 per image) were used to learn 1000 spatial bins per tree for 5 trees, and 8000 patches were sampled per image to build the resulting 5000-D histograms. The histograms are binarized using trivial thresholding at count 1 before being fed to the global linear SVM image classifier. We also tested with histograms normalized to total sum 1, and with thresholding by maximizing the mutual information of each dimension, but neither yielded better results for ERC-Forests.

Fig. 4 gives some quantitative results on the bikes category (B vs. N). Fig. 4(a) shows the clear difference between our method and classical k-means for vocabulary construction. Note that we were not able to extend the k-means curve beyond 20 000 windows per image owing to prohibitive execution times. The figure also shows results for 'unsupervised trees': ERC-Forests built without using the class labels during tree construction. The algorithm remains the same, but the node scoring function is defined as the ratio between the splits so as to encourage balanced trees, similar to randomized KD-trees. If only a few patches are sampled, this is as good as k-means and much faster. However, the spatial partition is so poor that with additional test windows the binarized histogram vectors become almost entirely filled with ones, so discrimination suffers. As the dotted line shows, using binarization thresholds that maximize the mutual information can fix this problem, but the results are still far below ERC-Forests. This comparison clearly shows the advantage of using supervision during clustering.

Fig. 4(b) shows that codebooks with around 5000 entries (1000 per tree) suffice for good results. Fig. 4(c) shows that when the number of features used to build the codebooks is increased, the
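The histogram construction described above (1000 leaf bins per tree for 5 trees, concatenated into a 5000-D vector and binarized by trivial thresholding at count 1) can be sketched as follows. This is a minimal illustration assuming the per-patch leaf indices have already been computed by descending each tree; the function name is mine, not the paper's.

```python
import numpy as np

def bow_histogram(leaf_ids, bins_per_tree=1000, binarize=True):
    """Build the concatenated bag-of-words vector: one leaf-occupancy
    histogram per tree, stacked (5 trees x 1000 bins = 5000-D in the paper),
    then optionally binarized at count 1.
    `leaf_ids` has shape (n_patches, n_trees): the leaf index each sampled
    patch reached in each tree."""
    n_trees = leaf_ids.shape[1]
    hist = np.zeros(bins_per_tree * n_trees)
    for t in range(n_trees):
        counts = np.bincount(leaf_ids[:, t], minlength=bins_per_tree)
        hist[t * bins_per_tree:(t + 1) * bins_per_tree] = counts
    if binarize:
        hist = (hist >= 1).astype(float)  # present/absent, as in the paper
    return hist
```

The resulting vectors would then be fed to a global linear SVM, one histogram per image.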
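The Equal Error Rate used to report the classification results above can be computed directly from classifier scores. A minimal sketch (the function name and the nearest-threshold approximation are choices of this note, not the paper's): the EER is the ROC operating point where the false-positive rate equals the false-negative rate.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER: sweep the decision threshold from high to low and return the
    error rate at the point where |FPR - FNR| is smallest."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    labels = np.asarray(labels)[order]
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    tpr = np.cumsum(labels) / n_pos       # true-positive rate per threshold
    fpr = np.cumsum(1 - labels) / n_neg   # false-positive rate per threshold
    fnr = 1.0 - tpr
    i = int(np.argmin(np.abs(fpr - fnr)))
    return float((fpr[i] + fnr[i]) / 2.0)
```

A perfectly separating classifier gives an EER of 0; the 84.4% figure quoted for B vs. N corresponds to an average EER classification rate (1 - EER) over the test images.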