The VLDB Journal (2000) 9: ♣–♣

One-dimensional and multi-dimensional substring selectivity estimation

H.V. Jagadish (1), Olga Kapitskaia (2), Raymond T. Ng (3), Divesh Srivastava (4)

(1) University of Michigan, Ann Arbor; E-mail: [email protected]
(2) Pôle Universitaire Léonard de Vinci; E-mail: [email protected]
(3) University of British Columbia; E-mail: [email protected]
(4) AT&T Labs – Research, 180 Park Avenue, Bldg 103, Florham Park, NJ 07932, USA; E-mail: [email protected]

Edited by M.P. Atkinson. Received: April 28, 2000 / Accepted: July 11, 2000

Abstract. With the increasing importance of XML, LDAP directories, and text-based information sources on the Internet, there is an ever-greater need to evaluate queries involving (sub)string matching. In many cases, matches need to be on multiple attributes/dimensions, with correlations between the multiple dimensions. Effective query optimization in this context requires good selectivity estimates. In this paper, we use pruned count-suffix trees (PSTs) as the basic data structure for substring selectivity estimation. For the 1-D problem, we present a novel technique called MO (Maximal Overlap). We then develop and analyze two 1-D estimation algorithms, MOC and MOLC, based on MO and a constraint-based characterization of all possible completions of a given PST. For the k-D problem, we first generalize PSTs to multiple dimensions and develop a space- and time-efficient probabilistic algorithm to construct k-D PSTs directly. We then show how to extend MO to multiple dimensions. Finally, we demonstrate, both analytically and experimentally, that MO is both practical and substantially superior to competing algorithms.

Key words: String selectivity – Maximal overlap – Short memory property – Pruned count-suffix tree

1 Introduction

One often wishes to obtain a quick estimate of the number of times a particular substring occurs in a database. A traditional application is for optimizing SQL queries with the like predicate (e.g., name like %jones%).
Such predicates are pervasive in data warehouse queries, because of the presence of "unclean" data [HS95]. With the growing importance of XML, LDAP directories, and other text-based information stores on the Internet, substring queries are becoming increasingly common.

Furthermore, in many situations with these applications, a query may specify substrings to be matched on multiple alphanumeric attributes or dimensions. The query [(name like %jones%) AND (tel like 973360%) AND (mail like %research.att.com)] is one example. Often the attributes mentioned in these kinds of multi-dimensional queries may be correlated. In the above example, because of the geographical location of the research labs, people that satisfy the query (mail like %research.att.com) may have an unexpectedly high probability of satisfying the query (tel like 973360%). In such situations, assuming attribute independence and estimating the selectivity of the query as the product of the selectivities of the individual dimensions can lead to gross inaccuracy.

1.1 The data structure

A natural question that arises is which data structure one should use for substring selectivity estimation. Histograms have long been used for selectivity estimation in databases (see, e.g., [SAC+79, MD88, LN90, Ioa93, IP95, PIHS96, JKM+98]). They have been designed to work well for numeric attribute value domains. For the string domain, one could continue to use such "value-range" histograms by sorting substrings based on the lexicographic order and computing the appropriate counts.
However, in this case, a histogram bucket that includes a range of consecutive lexicographic values is not likely to produce a good approximation, since the number of times a string occurs as a substring is likely to be very different for lexicographically successive substrings. As a result, we look for a different solution, one that is suitable for the string domain.

A commonly used data structure for indexing substrings in a database is the suffix tree [Wei73, McC76], which is a trie that satisfies the following property: whenever a string α is stored in the trie, then all suffixes of α are stored in the trie as well. Given a substring query, one can locate all the desired matches using the suffix tree. Krishnan et al. [KVI96] proposed an interesting variation of the suffix tree: the pruned count-suffix tree (PST), which maintains a count, Cα, for each substring α in the tree and retains only those substrings α (and their counts) for which Cα exceeds some pruning threshold. In this paper, following [KVI96], we use PSTs as the basic summary data structure for substring selectivity estimation.

To estimate substring selectivity in multiple dimensions, we need to generalize the PST to multiple dimensions. Here, only those k-D substrings (α1, ..., αk) for which the count exceeds the pruning threshold are maintained in the tree (along with their counts).

1.2 The problem

The substring selectivity estimation problem can be formally stated as follows:

Given a pruned count-suffix tree T, and a (1-D or k-D) substring query q, estimate the fraction Cq/N, where N is the count associated with the root of T.

The 1-D version of the above problem considers the situation when the pruned tree T is created for a single attribute (e.g., name).
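As a concrete illustration, the 1-D pruned count-suffix tree described above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation; it adopts the convention that each node's count is the number of suffixes passing through it (i.e., the number of occurrences of that node's substring in the database), with N at the root being the total number of suffixes inserted:

```python
class Node:
    def __init__(self):
        self.children = {}   # one child per next character
        self.count = 0       # C_alpha: occurrences of this node's substring

def build_pst(strings, threshold):
    """Insert every suffix of every string, so each node's count is the
    number of occurrences of its substring; then prune low-count nodes."""
    root = Node()
    for s in strings:
        for i in range(len(s)):       # each suffix s[i:]
            root.count += 1           # every suffix passes through the root
            node = root
            for ch in s[i:]:
                node = node.children.setdefault(ch, Node())
                node.count += 1
    prune(root, threshold)
    return root

def prune(node, threshold):
    """Discard every subtree whose count falls below the pruning threshold."""
    for ch in list(node.children):
        if node.children[ch].count < threshold:
            del node.children[ch]
        else:
            prune(node.children[ch], threshold)

def lookup(root, q):
    """Return C_q if q is fully retained in the pruned tree, else None."""
    node = root
    for ch in q:
        if ch not in node.children:
            return None
        node = node.children[ch]
    return node.count
```

With a pruning threshold of 1 nothing is pruned and every substring count is exact; raising the threshold saves space but makes some queries fall off the tree, which is precisely the accuracy gap the estimation algorithms must bridge.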
The k-D version of the problem considers the case when T is set up for multiple attributes (e.g., name and tel).

What we gain in space by pruning a count-suffix tree, we lose in accuracy in the estimation of the selectivities of those strings that are not completely retained in the pruned tree. Our main challenge, then, is: given a pruned tree, to estimate as accurately as possible the selectivity of such strings.

1.3 Our contributions

We begin by describing the 1-D problem and its solution first (in Sects. 4–6), and then go on to generalize our results to multiple dimensions (in Sects. 7–10). Specifically, we make the following contributions:

– In Sect. 4, for the 1-D problem, we present a novel selectivity estimation algorithm MO (Maximal Overlap), which estimates the selectivity of the query string σ based on all maximal substrings, βi, of σ in the 1-D PST. We demonstrate that MO is provably better than KVI, the independence-based estimation technique developed in [KVI96] using a greedy parsing of σ, under the natural assumption that strings exhibit the so-called short memory property. We also
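To make the contrast between the two 1-D estimators concrete, their logic can be sketched as follows. For illustration only, the pruned tree is flattened into a dict mapping each retained substring to its count; the counts in the usage example are hypothetical, and this is a sketch of the estimators' structure, not the paper's code:

```python
def kvi_estimate(counts, N, q):
    """KVI [KVI96]: greedily parse q into the longest retained pieces and
    multiply their selectivities, assuming the pieces are independent."""
    est, i = 1.0, 0
    while i < len(q):
        j = len(q)
        while j > i and q[i:j] not in counts:   # longest retained prefix of q[i:]
            j -= 1
        if j == i:
            return 0.0   # no retained piece starts here (a real system would back off)
        est *= counts[q[i:j]] / N
        i = j
    return est

def mo_estimate(counts, N, q):
    """MO: combine ALL maximal retained substrings of q, dividing out the
    selectivity of the overlap between each pair of consecutive pieces."""
    spans = []
    for i in range(len(q)):
        j = len(q)
        while j > i and q[i:j] not in counts:
            j -= 1
        if j > i and not (spans and spans[-1][1] >= j):  # skip subsumed matches
            spans.append((i, j))
    if not spans:
        return 0.0
    i0, j0 = spans[0]
    est = counts[q[i0:j0]] / N
    for (i1, j1), (i2, j2) in zip(spans, spans[1:]):
        est *= counts[q[i2:j2]] / N
        overlap = q[i2:j1]          # shared region of consecutive pieces
        if overlap:                 # retained, since any substring of a retained
            est /= counts[overlap] / N  # string has at least as large a count
    return est
```

On a query such as "abc" where "ab" and "bc" are retained but "abc" is not, KVI greedily parses into "ab" and "c" and multiplies their raw selectivities, while MO uses both maximal pieces "ab" and "bc" and conditions the second on the shared "b"; it is this conditioning that the short memory property justifies.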