sac04 (6 pages)

Previewing pages 1, 2 of 6 page document View the full content.
View Full Document

sac04



Previewing pages 1, 2 of actual document.

View the full content.
View Full Document
View Full Document

21 views

Unformatted text preview:

Similarity between Euclidean and cosine angle distance for nearest neighbor queries Gang Qian Shamik Sural Department of Computer Science and Engineering Michigan State University East Lansing MI 48824 USA qiangang guyuelon pramanik cse msu edu ABSTRACT Understanding the relationship among di erent distance measures is helpful in choosing a proper one for a particular application In this paper we compare two commonly used distance measures in vector models namely Euclidean distance EUD and cosine angle distance CAD for nearest neighbor NN queries in high dimensional data spaces Using theoretical analysis and experimental results we show that the retrieval results based on EUD are similar to those based on CAD when dimension is high We have applied CAD for content based image retrieval CBIR Retrieval results show that CAD works no worse than EUD which is a commonly used distance measure for CBIR while providing other advantages such as naturally normalized distance Keywords Vector model Euclidean distance Cosine angle distance Content based image retrieval Inter feature normalization 1 INTRODUCTION Distance measure is an important part of a vector model Among all distance measures that are proposed in the literature some have very similar behaviors in similarity queries while others may behave quite di erently Understanding the relationship among distance measures can help us to choose a proper distance measure for a particular application One way of comparing distance measures is to study their retrieval performance in terms of precision and recall in a particular application area such as content based image retrieval CBIR 18 and video copy detection 6 One concern in choosing a particular distance measure is the impact of computational overhead on system performance When feature vectors are large some distance measures may consume more computing resources than the others One possible approximation of EUD is proposed in 5 On the other Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page To copy otherwise to republish to post on servers or to redistribute to lists requires prior specific permission and or a fee SAC 04 March 14 17 2004 Nicosia Cyprus Copyright 2004 ACM 1 58113 812 1 03 04 5 00 Yuelong Gu Sakti Pramanik School of Information Technology Indian Institute of Technology Kharagpur 721302 India shamik cse iitkgp ernet in hand it is also important to choose a similarity measure that is consistent with human ideas of similarity The authors of 17 have proposed a similarity measure based on noise distribution of the image database In 16 a mathematical analysis of EUD and CAD has been done It was shown that CAD has a special property to favor relatively larger component in a vector In this paper we compare two commonly used distance measures in vector models namely Euclidean distance EUD and cosine angle distance CAD for nearest neighbor NN queries in high dimensional data spaces From theoretical analysis and experimental results we nd that the retrieval results based on EUD are similar to those based on CAD We use a high dimensional geometrical model to analyze how similar these two distance measures are under the assumption of uniform data distribution We nd that the NN of EUD is also ranked high by CAD when dimension is high We de ne NN as the rst nearest neighbor of the query Our experimental results have corroborated the correctness of our model We have also compared these two distance measures experimentally using normalized datasets and clustered datasets Our conclusions are that 1 In high dimensional data spaces the NN query results by EUD and CAD are very similar 2 For clustered data the NN query results by EUD and CAD are more similar 3 When vectors are normalized by its size the NN query results by EUD and CAD are also more similar One application of the above properties is to combine features that are semantically di erent e g color and texture in CBIR as explained below As EUD is often used as a distance measure for individual features in CBIR inter feature normalization is needed to combine EUD values of di erent scales into an over all score for an image Based on the property that NN query results of EUD and CAD are very similar in high dimensional spaces we propose to use CAD instead of EUD for individual features As CAD values are naturally normalized by norm there is no need for further inter feature normalization Thus the distance value from di erent features can be summed up directly as the nal score for an image in the database Our experimental results show that our proposal works not only no worse than other commonly used methods but also has some favorable advantages There are a number of methods proposed for combining features in CBIR They can be divided into two categories rank based 8 10 and distance value based 14 15 Rank based methods are also called voting methods in 12 In rank based methods the rank of an image for di erent features are calculated rst then these ranks are summed to derive the nal rank of the image Distance value based methods use distance value of individual features directly Since distance values from di erent features have di erent scales inter feature normalization is necessary for computing the nal rank of an image One simple method in this category is to divide all distance values of a feature by the maximum distance value 15 Another widely used method for combining features are based on the assumption that distance values have a Gaussian distribution 14 Some discussion and comparisons of methods for combining features can be found in 3 7 14 Using CAD as a normalized distance measure was mentioned in 4 but no analysis or experimental results were presented in the context of vector models or CBIR We have done experiments to compare the CAD based method we proposed with two widely used methods namely rank based method by EUD and distance value based method with Gaussian assumption The rest of this paper is organized as follows Section 2 presents the theoretical analysis of NN queries by EUD and CAD using a geometrical model in high dimensions Section 3 and 4 present experimental results for comparing EUD and CAD and combining multiple features respectively Section 5 discusses the conclusion and future


Access the best Study Guides, Lecture Notes and Practice Exams

Loading Unlocking...
Login

Join to view sac04 and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view sac04 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?