The VLDB Journal (2000) 9: ♣–♣

One-dimensional and multi-dimensional substring selectivity estimation

H.V. Jagadish (1), Olga Kapitskaia (2), Raymond T. Ng (3), Divesh Srivastava (4)

(1) University of Michigan, Ann Arbor; E-mail: [email protected]
(2) Pôle Universitaire Léonard de Vinci; E-mail: [email protected]
(3) University of British Columbia; E-mail: [email protected]
(4) AT&T Labs – Research, 180 Park Avenue, Bldg 103, Florham Park, NJ 07932, USA; E-mail: [email protected]

Edited by M.P. Atkinson. Received: April 28, 2000 / Accepted: July 11, 2000

Abstract. With the increasing importance of XML, LDAP directories, and text-based information sources on the Internet, there is an ever-greater need to evaluate queries involving (sub)string matching. In many cases, matches need to be on multiple attributes/dimensions, with correlations between the multiple dimensions. Effective query optimization in this context requires good selectivity estimates. In this paper, we use pruned count-suffix trees (PSTs) as the basic data structure for substring selectivity estimation. For the 1-D problem, we present a novel technique called MO (Maximal Overlap). We then develop and analyze two 1-D estimation algorithms, MOC and MOLC, based on MO and a constraint-based characterization of all possible completions of a given PST. For the k-D problem, we first generalize PSTs to multiple dimensions and develop a space- and time-efficient probabilistic algorithm to construct k-D PSTs directly. We then show how to extend MO to multiple dimensions. Finally, we demonstrate, both analytically and experimentally, that MO is both practical and substantially superior to competing algorithms.

Key words: String selectivity – Maximal overlap – Short memory property – Pruned count-suffix tree

1 Introduction

One often wishes to obtain a quick estimate of the number of times a particular substring occurs in a database. A traditional application is for optimizing SQL queries with the like predicate (e.g., name like %jones%).
Such predicates are pervasive in data warehouse queries, because of the presence of "unclean" data [HS95]. With the growing importance of XML, LDAP directories, and other text-based information stores on the Internet, substring queries are becoming increasingly common.

Furthermore, in many situations with these applications, a query may specify substrings to be matched on multiple alphanumeric attributes or dimensions. The query [(name like %jones%) AND (tel like 973360%) AND (mail like %research.att.com)] is one example. Often the attributes mentioned in these kinds of multi-dimensional queries may be correlated. In the above example, because of the geographical location of the research labs, people that satisfy the query (mail like %research.att.com) may have an unexpectedly high probability of satisfying the query (tel like 973360%). In such situations, assuming attribute independence and estimating the selectivity of the query as the product of the selectivities of the individual dimensions can lead to gross inaccuracy.

1.1 The data structure

A natural question that arises is which data structure one should use for substring selectivity estimation. Histograms have long been used for selectivity estimation in databases (see, e.g., [SAC+79, MD88, LN90, Ioa93, IP95, PIHS96, JKM+98]). They have been designed to work well for numeric attribute value domains. For the string domain, one could continue to use such "value-range" histograms by sorting substrings based on the lexicographic order and computing the appropriate counts.
However, in this case, a histogram bucket that includes a range of consecutive lexicographic values is not likely to produce a good approximation, since the number of times a string occurs as a substring is likely to be very different for lexicographically successive substrings. As a result, we look for a different solution, one that is suitable for the string domain.

A commonly used data structure for indexing substrings in a database is the suffix tree [Wei73, McC76], which is a trie that satisfies the following property: whenever a string α is stored in the trie, then all suffixes of α are stored in the trie as well. Given a substring query, one can locate all the desired matches using the suffix tree. Krishnan et al. [KVI96] proposed an interesting variation of the suffix tree: the pruned count-suffix tree (PST), which maintains a count, Cα, for each substring α in the tree and retains only those substrings α (and their counts) for which Cα exceeds some pruning threshold. In this paper, following [KVI96], we use PSTs as the basic summary data structure for substring selectivity estimation.

To estimate substring selectivity in multiple dimensions, we need to generalize the PST to multiple dimensions. Here, only those k-D substrings (α1, ..., αk) for which the count exceeds the pruning threshold are maintained in the tree (along with their counts).

1.2 The problem

The substring selectivity estimation problem can be formally stated as follows:

Given a pruned count-suffix tree T, and a (1-D or k-D) substring query q, estimate the fraction Cq/N, where N is the count associated with the root of T.

The 1-D version of the above problem considers the situation when the pruned tree T is created for a single attribute (e.g., name).
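As a concrete illustration, the 1-D pruned count-suffix tree described above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation; it adopts the convention that each node's count is the number of suffixes passing through it (i.e., the number of occurrences of that node's substring in the database), with N at the root being the total number of suffixes inserted:

```python
class Node:
    def __init__(self):
        self.children = {}   # one child per next character
        self.count = 0       # C_alpha: occurrences of this node's substring

def build_pst(strings, threshold):
    """Insert every suffix of every string, so each node's count is the
    number of occurrences of its substring; then prune low-count nodes."""
    root = Node()
    for s in strings:
        for i in range(len(s)):       # each suffix s[i:]
            root.count += 1           # every suffix passes through the root
            node = root
            for ch in s[i:]:
                node = node.children.setdefault(ch, Node())
                node.count += 1
    prune(root, threshold)
    return root

def prune(node, threshold):
    """Discard every subtree whose count falls below the pruning threshold."""
    for ch in list(node.children):
        if node.children[ch].count < threshold:
            del node.children[ch]
        else:
            prune(node.children[ch], threshold)

def lookup(root, q):
    """Return C_q if q is fully retained in the pruned tree, else None."""
    node = root
    for ch in q:
        if ch not in node.children:
            return None
        node = node.children[ch]
    return node.count
```

With a pruning threshold of 1 nothing is pruned and every substring count is exact; raising the threshold saves space but makes some queries fall off the tree, which is precisely the accuracy gap the estimation algorithms must bridge.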
The k-D version of the problem considers the case when T is set up for multiple attributes (e.g., name and tel).

What we gain in space by pruning a count-suffix tree, we lose in accuracy in the estimation of the selectivities of those strings that are not completely retained in the pruned tree. Our main challenge, then, is: given a pruned tree, to estimate as accurately as possible the selectivity of such strings.

1.3 Our contributions

We begin by describing the 1-D problem and its solution first (in Sects. 4–6), and then go on to generalize our results to multiple dimensions (in Sects. 7–10). Specifically, we make the following contributions:

– In Sect. 4, for the 1-D problem, we present a novel selectivity estimation algorithm MO (Maximal Overlap), which estimates the selectivity of the query string σ based on all maximal substrings, βi, of σ in the 1-D PST. We demonstrate that MO is provably better than KVI, the independence-based estimation technique developed in [KVI96] using a greedy parsing of σ, under the natural assumption that strings exhibit the so-called short memory property. We also
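To make the contrast between the two 1-D estimators concrete, their logic can be sketched as follows. For illustration only, the pruned tree is flattened into a dict mapping each retained substring to its count; the counts in the usage example are hypothetical, and this is a sketch of the estimators' structure, not the paper's code:

```python
def kvi_estimate(counts, N, q):
    """KVI [KVI96]: greedily parse q into the longest retained pieces and
    multiply their selectivities, assuming the pieces are independent."""
    est, i = 1.0, 0
    while i < len(q):
        j = len(q)
        while j > i and q[i:j] not in counts:   # longest retained prefix of q[i:]
            j -= 1
        if j == i:
            return 0.0   # no retained piece starts here (a real system would back off)
        est *= counts[q[i:j]] / N
        i = j
    return est

def mo_estimate(counts, N, q):
    """MO: combine ALL maximal retained substrings of q, dividing out the
    selectivity of the overlap between each pair of consecutive pieces."""
    spans = []
    for i in range(len(q)):
        j = len(q)
        while j > i and q[i:j] not in counts:
            j -= 1
        if j > i and not (spans and spans[-1][1] >= j):  # skip subsumed matches
            spans.append((i, j))
    if not spans:
        return 0.0
    i0, j0 = spans[0]
    est = counts[q[i0:j0]] / N
    for (i1, j1), (i2, j2) in zip(spans, spans[1:]):
        est *= counts[q[i2:j2]] / N
        overlap = q[i2:j1]          # shared region of consecutive pieces
        if overlap:                 # retained, since any substring of a retained
            est /= counts[overlap] / N  # string has at least as large a count
    return est
```

On a query such as "abc" where "ab" and "bc" are retained but "abc" is not, KVI greedily parses into "ab" and "c" and multiplies their raw selectivities, while MO uses both maximal pieces "ab" and "bc" and conditions the second on the shared "b"; it is this conditioning that the short memory property justifies.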