Efficient Similarity Search in Sequence Databases Rakesh Agrawal, Christos Faloutsos and Arun SwamiSimilaritySimilarity QueriesExtracting Features from SequencesDiscrete Fourier TransformDFT Concept IDFT Concept IIDFT CharacteristicsProposed TechniqueEuclidean distance featuresSlide 11Number of Fourier coefficientsPerformance ExperimentsSlide 14Different Sequence Set SizeVarying Sequence LengthDiscussionSummaryEfficient Similarity Search in Efficient Similarity Search in Sequence DatabasesSequence Databases Rakesh Agrawal, Christos Faloutsos and Arun Swami Rakesh Agrawal, Christos Faloutsos and Arun SwamiLeila KaghazianSimilaritySimilarity•Exact queries •Similarity–Identify companies with similar pattern of growth–Determine products with similar selling pattern–Discover stocks with similar movement in stock prices.Similarity QueriesSimilarity Queries•Whole Matching. The sequences to be compared have the same length n.–Range Query. Given a query sequences that are similar within distance “e”.–All-Pairs queries. Given N sequences, find the pairs of sequences that are within “e” of each other.•Subsequence Matching. The query sequence is smaller; we look for a subsequence in the large sequence that best matches the query sequence.Extracting Features from Extracting Features from SequencesSequences•For numerical sequences, extracting K features, mapping it to k-dimensional space and using multidimensional index methods (R*-tree, R-tree,grid-files,…) to store and search these points.•Completeness of feature extracting•Dimensionality “curse”Discrete Fourier TransformDiscrete Fourier Transform•All periodic waves can be generated by combining Sin and Cos waves of different frequencies•Number of Frequencies may not be finite•Fourier Transform Decomposes a Periodic Wave into its Component FrequenciesDFT Concept IDFT Concept IDFT Concept IIDFT Concept IIDFT CharacteristicsDFT Characteristics•Completeness of feature extracting•Dimensionality curse•Parseval theorem gives that Euclidean distance between two signals x and y in the time domain is the same as their Euclidean distance in the frequency domainProposed TechniqueProposed Technique•Obtain the coefficients of DFT of each sequence in the database•Build a multidimensional index (F-index) using the first fc (<5)Fourier coefficients.•For a range query, obtain the first fc Fourier coefficients of the query.•For an all-pairs query, doing a spatial join using the F-index (superset of the answer set)•The actual answer set is obtained in a post-processing stepEuclidean distance featuresEuclidean distance features•Euclidean distance is useful in many cases•It can be used with any other type of similarity measure•Euclidean distance is the optimal distance measure of estimation if signals are corrupted by Gaussian additive noise•It is preserved under orthonormal transformsDFT CharacteristicsDFT Characteristics•Preserves the distance•Is easy to compute •Concentrate the energy of the signal in few coefficients•It’s a orthonormal transform•The data dependent ones–+ better performance–- expensive data reorganization if data set evolves over time•Data independent ones(DFT, DCT, wavelet)Number of Fourier coefficients Number of Fourier coefficients •Worst-case signal is White noise when xt is completely independent of its neighbors.–It has the same energy in every frequency means all frequency are equally important. This is bad for F-index.•Random walks (brown noise)–Stock movements and exchange rates•Primary and secondary trends correspond to strong, low frequency signals while minor trends corresponds to weak, high frequency signalsPerformance ExperimentsPerformance Experiments•How to choose the number of Fourier coefficients to be retained (cut-off frequency fc) in the F-index method. –A larger fc•reduces the false hits•increases the search time.•How does the search time grow as a function of number of sequences in the database?•How does the length n of the sequences affect the performance?Range QueriesAll-Pairs QueriesNumber of Fourier coefficients Number of Fourier coefficientsDifferent Sequence Set SizeDifferent Sequence Set SizeAll-Pairs QueriesRange QueriesVarying Sequence LengthVarying Sequence LengthAll-Pairs QueriesRange QueriesDiscussion Discussion •The minimum execution time for both range and all-pairs queries is achieved for a small number of fc•Increasing the number of sequences in the database results in higher gains for this method•Increasing the length of the sequence n also results in higher gain for the methodSummary Summary •Use DFT to extract sequence features•Only first few coefficient is strong enough •DFT is orthonotmal•Use R*-tree for indexing•Use Euclidean distance •Complexity is
View Full Document