DOC PREVIEW
sac04

This preview shows page 1-2 out of 6 pages.

Save
View full document
View full document
Premium Document
Do you want full access? Go Premium and unlock all 6 pages.
Access to all documents
Download any document
Ad free experience
View full document
Premium Document
Do you want full access? Go Premium and unlock all 6 pages.
Access to all documents
Download any document
Ad free experience
Premium Document
Do you want full access? Go Premium and unlock all 6 pages.
Access to all documents
Download any document
Ad free experience

Unformatted text preview:

Similarity between Euclidean and cosine angle distance fornearest neighbor queriesGang Qian†Shamik Sural‡Yuelong Gu†Sakti Pramanik††Department of Computer Science and Engineering‡School of Information TechnologyMichigan State University Indian Institute of TechnologyEast Lansing, MI 48824, USA Kharagpur 721302, India{qiangang,guyuelon,pramanik}@cse.msu.edu [email protected] the relationship among different distance mea-sures is helpful in choosing a proper one for a particular ap-plication. In this paper, we compare two commonly useddistance measures in vector models, namely, Euclidean dis-tance (EUD) and cosine angle distance (CAD), for nearestneighbor (NN) queries in high dimensional data spaces. Us-ing theoretical analysis and experimental results, we showthat the retrieval results based on EUD are similar to thosebased on CAD when dimension is high. We have appliedCAD for content based image retrieval (CBIR). Retrieval re-sults show that CAD works no worse than EUD, which is acommonly used distance measure for CBIR, while providingother advantages, such as naturally normalized distance.KeywordsVector model, Euclidean distance, Cosine angle distance,Content based image retrieval, Inter-feature normalization1. INTRODUCTIONDistance measure is an important part of a vector model.Among all distance measures that are proposed in the litera-ture, some have very similar behaviors in similarity queries,while others may behave quite differently. Understandingthe relationship among distance measures can help us tochoose a proper distance measure for a particular applica-tion.One way of comparing distance measures is to study theirretrieval performance in terms of precision and recall in aparticular application area, such as content-based image re-trieval (CBIR) [18] and video copy detection [6]. One con-cern in choosing a particular distance measure is the impactof computational overhead on system performance. Whenfeature vectors are large, some distance measures may con-sume more computing resources than the others. One possi-ble approximation of EUD is proposed in [5]. On the otherPermission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies arenot made or distributed for profit or commercial advantage and that copiesbear this notice and the full citation on the first page. To copy otherwise, torepublish, to post on servers or to redistribute to lists, requires prior specificpermission and/or a fee.SAC’04, March 14-17, 2004, Nicosia, CyprusCopyright 2004 ACM 1-58113-812-1/03/04 ...$5.00.hand, it is also important to choose a similarity measurethat is consistent with human ideas of similarity. The au-thors of [17] have proposed a similarity measure based onnoise distribution of the image database. In [16], a mathe-matical analysis of EUD and CAD has been done. It wasshown that CAD has a special property to favor relativelylarger component in a vector.In this paper, we compare two commonly used distancemeasures in vector models, namely, Euclidean distance (EUD)and cosine angle distance (CAD), for nearest neighbor (NN)queries in high dimensional data spaces. From theoreticalanalysis and experimental results, we find that the retrievalresults based on EUD are similar to those based on CAD.We use a high dimensional geometrical model to analyzehow similar these two distance measures are under the as-sumption of uniform data distribution. We find that theNN of EUD is also ranked high by CAD when dimensionis high. We define NN as the first nearest neighbor of thequery. Our experimental results have corroborated the cor-rectness of our model. We have also compared these twodistance measures experimentally using normalized datasetsand clustered datasets. Our conclusions are that:1. In high dimensional data spaces, the NN query resultsby EUD and CAD are very similar.2. For clustered data, the NN query results by EUD andCAD are more similar.3. When vectors are normalized by its size, the NN queryresults by EUD and CAD are also more similar.One application of the above properties is to combine fea-tures that are semantically different (e.g. color and texture)in CBIR as explained below. As EUD is often used as a dis-tance measure for individual features in CBIR, inter-featurenormalization is needed to combine EUD values of differ-ent scales into an over-all score for an image. Based on theproperty that NN query results of EUD and CAD are verysimilar in high dimensional spaces, we propose to use CADinstead of EUD for individual features. As CAD values arenaturally normalized by norm, there is no need for furtherinter-feature normalization. Thus, the distance value fromdifferent features can be summed up directly as the finalscore for an image in the database. Our experimental re-sults show that our proposal works not only no worse thanother commonly used methods but also has some favorableadvantages.There are a number of methods proposed for combiningfeatures in CBIR. They can be divided into two categories:rank based [8, 10] and distance value based [14, 15]. Rankbased methods are also called “voting methods” in [12]. Inrank based methods, the rank of an image for different fea-tures are calculated first, then these ranks are summed toderive the final rank of the image. Distance value basedmethods use distance value of individual features directly.Since distance values from different features have differentscales, inter-feature normalization is necessary for comput-ing the final rank of an image. One simple method in thiscategory is to divide all distance values of a feature by themaximum distance value [15]. Another widely used methodfor combining features are based on the assumption thatdistance values have a Gaussian distribution [14]. Some dis-cussion and comparisons of methods for combining featurescan be found in [3, 7, 14].Using CAD as a normalized distance measure was men-tioned in [4], but no analysis or experimental results werepresented in the context of vector models or CBIR. Wehave done experiments to compare the CAD based methodwe proposed with two widely used methods, namely, rankbased method by EUD and distance value based methodwith Gaussian assumption.The rest of this paper is organized as follows. Section 2presents the theoretical analysis of NN queries by EUD andCAD using a geometrical model in high dimensions. Section3 and 4 present


sac04

Download sac04
Our administrator received your request to download this document. We will send you the file to your email shortly.
Loading Unlocking...
Login

Join to view sac04 and access 3M+ class-specific study document.

or
We will never post anything without your permission.
Don't have an account?
Sign Up

Join to view sac04 2 2 and access 3M+ class-specific study document.

or

By creating an account you agree to our Privacy Policy and Terms Of Use

Already a member?